Interoperability Analysis of Tomato Fruit Detection Models for Images Taken at Different Facilities, Cultivation Methods, and Times of the Day

Abstract: This study investigated the interoperability of a tomato fruit detection model trained using nighttime images from two greenhouses. The goal was to evaluate the performance of the models in different environments, including different facilities, cultivation methods, and imaging times. An innovative imaging approach is introduced to eliminate the background, highlight the target plants, and test the adaptability of the model under diverse conditions. The results demonstrate that the tomato fruit detection accuracy improves when the domain of the training dataset contains the test environment. The quantitative results showed high interoperability, achieving an average precision (AP50) of 0.973 in the same greenhouse and a stable performance of 0.962 in another greenhouse. The imaging approach controlled the lighting conditions, effectively eliminating the domain-shift problem. However, training on a dataset with low diversity, or running inference on plant appearances not represented in the training dataset, decreased the average precision to approximately 0.80, revealing the need for new approaches to overcome fruit occlusion. Importantly, these findings have practical implications for the application of automated tomato fruit set monitoring systems in greenhouses to enhance agricultural efficiency and productivity.


Introduction
The daily monitoring of tomato fruit sets in greenhouses provides essential information for accurately predicting yields and harvest times, understanding maturity, and making crop management and marketing decisions. The tomato fruit set in a greenhouse fluctuates over time and place owing to various factors, including climate, environmental controls, the nutritional status of the crop, and the grower's management of harvesting and picking operations. Automatic fruit monitoring using images is an effective means of collecting information on tomato fruit sets in greenhouses. Monitoring cases using robotics such as automatic carts [1,2] and drones [3] have been reported. In addition, the automatic identification and location of mature tomato fruits are essential for research on labor-saving harvesting robots [4]. Therefore, the automatic detection of tomato fruits in greenhouses has become a crucial technology for driving innovation in agricultural production.
The technological basis for tomato fruit detection is the general object detection technique in computer vision, which is positioned as a task for detecting specific objects in an image. Over the past two decades, the technology for detecting particular objects in images, including fruits, has progressed dramatically in accuracy and speed since the advent of deep learning [5]. Deep learning-based detection methods are increasingly applied in agriculture. Early applications to agriculture are reviewed in detail by Kamilaris and Prenafeta-Boldú [6]. In addition to fruit detection, other applications such as crop type classification in remote sensing [7], phenotyping for stress tolerance [8], weed detection [9], and plant disease classification [10] have been developed in a wide range of agricultural fields.
Fruit detection in agriculture has evolved significantly along with general object recognition trends, notably with the shift from traditional techniques [11] to deep learning approaches [12]. Initially, research focused on handcrafted features and machine learning to identify fruits, approaches that were resource-intensive and slow. The introduction of deep learning has markedly improved fruit detection, with advancements in general object detection models rapidly integrated into this field since the late 2010s, thereby reducing the time lag in the adoption of new technologies. In this body of research, tomatoes are among the most frequently targeted fruit crops. This is probably because tomatoes are commercially successful crops with large markets. As with other fruit detection methods, the direction of technological development in tomato fruit detection has changed since the advent of deep learning. Before deep learning, machine learning-based methods were used to detect tomatoes [13][14][15][16][17]. Since then, deep learning has replaced these traditional techniques as the dominant approach. Early studies mainly reported the use of the following frameworks: early single-shot models, such as YOLO v3 [18,19] and SSD [3,20]; and double-shot models, such as Faster R-CNN [2,21] and Mask R-CNN [1,22]. From 2022 onward, in addition to the use of state-of-the-art deep learning architectures, improving parts of the network and optimizing it for tomato detection has become a central concern for researchers; examples include YOLO v4 [20,23,24], YOLO v5 [25,26], YOLO X [27], YOLO v8 [28], and CornerNet [29].
One of the problems with image-based fruit detection is that detection accuracy is not guaranteed when operating in a production greenhouse [30]. This is generally recognized as a domain-shift problem. In other words, the performance is fully realized in the environment where the fruit detection model is trained (source domain), but the effectiveness is reduced in the environment where the system is deployed (target domain). Many machine learning algorithms assume that the source and target data are independent and identically distributed (i.i.d.) [31]. However, it is challenging to learn models that perform well on new out-of-distribution (OOD) data that are not seen during training. In general, the environment in a tomato production greenhouse is not constant. The appearance of the fruit varies greatly depending on the time of day, season, and the photographic equipment. In addition, because different growers use different materials and cultivation methods, it can be inferred that the images of the fruits and their background information are diverse. Therefore, it is necessary to understand how a model trained in one environment performs under other conditions [30]. This is an unavoidable problem when commercializing fruit detection systems for use in greenhouses.
Lighting and occlusion conditions are frequently cited as problematic examples of the impact of different environments on fruit detection accuracy [11]. Previous studies compared accuracy under different lighting and occlusion conditions using constructed fruit detection models [15,17,18,22,24,29,32]. The ripening state of the fruit also affects the detection accuracy because of the change in coloration during the ripening period, and accuracy has been compared between mature and immature fruits [20,33,34,35]. Accuracy comparisons between flowers and fruits [28,36,37] have also been performed. The influence of fruit set position within the plant has also been evaluated [1,14]. In these previous studies, the training and test data were essentially from the same greenhouse, and the cultivation methods were similar. This makes it difficult to verify whether the constructed fruit detection models work correctly in other environments.
Haggag et al. [30] were the first in tomato fruit detection to evaluate the generalization performance of a model by measuring the fruit detection accuracy on a dataset in which the training and test data were collected in different environments. They investigated the effects of varying camera angles, dataset sizes, and light environment conditions on fruit detection performance. They reported that fruit detection in environments with significantly different light conditions and contrasts resulted in limited performance because the model did not adequately learn the representation of the inference environment from the training data. Thus, although tomato fruit detection performance is strongly influenced by the interrelationship between the training and test environments, previous studies have not fully elucidated this effect. For example, different greenhouses (including materials and structures), different plant appearances depending on the cultivation method, and the time of day of imaging can affect the accuracy of tomato fruit detection.
Therefore, the following objectives were established for this study:
• The first objective was to build a fruit detection model using training datasets taken at night in two greenhouses with different degrees of leaf-picking and to test its interoperability in other environments. The models were tested in environments with various facilities, cultivation methods, and times of day to evaluate fruit detection performance;
• The second objective was to propose an imaging approach and validate its effectiveness. This approach captures only the target plant body by removing the background, which makes it possible to test the interoperability of fruit detection models in different greenhouse environments.
The remainder of this paper is structured as follows. In the materials and methods described in Section 2, we explain the fruit set monitoring system (Section 2.1) and the fruit datasets (Section 2.2) used in the study. We then describe the t-SNE approach used to visually observe the variability (domain) of each dataset (Section 2.3), followed by how the fruit detection models were trained (Section 2.4) and validated (Section 2.5) using deep learning. Section 3.1 compares the dataset similarities visualized by t-SNE. Section 3.2 describes the process of training the fruit detection model and the selection of the best model. Section 3.3 presents the main results of this study, namely the fruit detection accuracy. In Section 4, we compare the results obtained in this study with those of previous reports to discuss the factors contributing to fruit detection accuracy in different environments. We also discuss the application of the proposed imaging approach to other greenhouse environments. We summarize the findings of this study in Section 5.

Fruit Image Collection Method
Figure 1a shows the fruit set monitoring system used in this study. The system consisted of a plant-scanning device and an automatic trolley. The device automatically ran on a pipe rail at a set time at night, as shown in Figure 1b. A light source illuminated the plants during the run, and a camera captured video of the crop rows.
The onboard plant-scanning device generates images with the background removed. Figure 2a shows a conceptual diagram of the fruit set monitoring system for scanning in a greenhouse, visualized from the side. Tomato plants in a high-wire system are usually planted in two rows with a work path between them. The automatic trolley operates on the hot-water pipe rails of the work path. In the fruit set monitoring system, when the plants in the front (shown in red) are the measurement targets, the plants in the back (shown in purple) are the background. The plants in the back can also be a source of error in tomato fruit detection. Therefore, it is desirable to obtain an image in which only the plants in front appear. The following describes how a panoramic image with the background removed is generated from the captured video. As shown in Figure 1a, the plant-scanning device consists of a light source and a shading plate that adjusts the illumination range. Adjusting the angle of the shading plate generates a difference in the light intensity before and after the intersection point on the optical axis of the camera, as shown in Figure 2b. The device scans at a constant speed parallel to the cultivation rows while capturing video. Subsequently, by joining the center columns (columns on the optical axis) of each frame, indicated by the red dashed line, in the scanning direction, a panoramic image, as shown in Figure 3c, is generated. The image in Figure 3c does not show the background plants visible in Figure 3a,b; only the plants in the front are visible.
The automatic trolley was built by modifying a commercially available hot-tube pest control machine (AUTO-4WAES; Arimitsu Industry Co., Osaka, Japan). The motor driver (Dual MC33926 Motor Driver for Raspberry Pi, Pololu Corporation, Las Vegas, NV, USA) was connected to the GPIO signal of a single-board computer (Raspberry Pi 3 Model B, RS Components, Corby, UK) via a DC. The car body was programmed to move forward and backward at an arbitrary speed by controlling the DC brush motors via the GPIO signals of the Raspberry Pi. A digital time switch (H5S, OMRON Corporation, Kyoto, Japan) was used to activate the power supplies of the trolley and scanning device at regular intervals so that the automatic trolley operated at a fixed time every night. The Raspberry Pi referred to here was only used to control the running of the automatic trolley and not for the image analysis described below.
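The center-column joining described above can be sketched in a few lines. This is a minimal illustration and not the authors' program: it assumes grayscale frames held as nested lists, takes the middle column of each frame as the column on the optical axis, and uses the hypothetical names `center_column` and `stitch_panorama`.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # rows x columns of grayscale pixel values


def center_column(frame: Frame) -> List[int]:
    """Return the middle column of a frame (assumed to lie on the optical axis)."""
    mid = len(frame[0]) // 2
    return [row[mid] for row in frame]


def stitch_panorama(frames: Sequence[Frame]) -> List[List[int]]:
    """Place the center column of each frame side by side, in scan order."""
    if not frames:
        raise ValueError("at least one frame is required")
    columns = [center_column(frame) for frame in frames]
    height = len(columns[0])
    # Transpose so that panorama[y][t] is the center pixel of row y in frame t.
    return [[column[y] for column in columns] for y in range(height)]


# Toy input: three 2 x 3 frames whose pixel values encode the frame index.
frames = [
    [[t * 10 + 0, t * 10 + 1, t * 10 + 2],
     [t * 10 + 3, t * 10 + 4, t * 10 + 5]]
    for t in range(3)
]
panorama = stitch_panorama(frames)
print(panorama)  # [[1, 11, 21], [4, 14, 24]]
```

Because each frame contributes a single column, the panorama width equals the number of frames, so the horizontal scale depends on the scanning speed and the frame rate.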
AgriEngineering 2024, 6, FOR PEER REVIEW

The light source, camera, and video recorder unit of the plant-scanning device differed between the research greenhouse and the commercial greenhouse where the plant images were collected (both are described below). In the research greenhouse, four LED bar lights (LEDSC980-W, MISUMI Group Inc., Tokyo, Japan) were used as the light sources, two on each side, and modular cameras (F1005-E, Axis Communications AB, Lund, Sweden) were used, two on each side, for a total of four cameras. The modular cameras have a fixed focal length of 2.8 mm and an aperture value of f/2.0. The horizontal angle of view is 113° and the vertical angle of view is 62°. The video was recorded using a dedicated recorder (F44 Dual Audio Input Main Unit, Axis Communications AB, Lund, Sweden) and stored on an SD card (i200-64GB, Micron Technology, Inc., Boise, ID, USA). During the test period, the stored data were periodically retrieved, and a batch-processing program written in Python was used to generate the panoramic images of the plant bodies.
The light source used in the commercial greenhouse was a white LED tape (ZFS-155000-CW; JKL Components Corporation, Los Angeles, CA, USA). An industrial area camera (UI-3360CP-C-HQ, IDS Imaging Development Systems GmbH, Obersulm, Germany) with a C-mount lens (HF6XA-5M, FUJIFILM Corporation, Tokyo, Japan) on one side was used. The camera system has a focal length of 6.23 mm, and the aperture value is fixed at f/1.9. The horizontal angle of view is 74.7° and the vertical angle of view is 58.1°. The plant-scanning device is inverted to switch the imaging direction. Area camera signals were collected on an internal PC (Jetson TX2, NVIDIA Corporation, Santa Clara, CA, USA) via a USB 3.0 cable, and a running image processing program extracted and combined the center column of each frame in real time to generate a panoramic plant body image. The panoramic images were temporarily stored in a built-in storage area and saved via Wi-Fi to pre-prepared online storage.
In both experiments, the panoramic images collected in the greenhouses were first transferred to an image processing computer in a laboratory, without any real-time fruit detection. Fruit detection was conducted by batch processing after the panoramic imaging operation. The panoramic images were recombined and cropped on the image processing computer to produce input images for the fruit detection model. The image sizes for each dataset are described in Section 2.2.

Composition of the Fruit Data Set
Table 1 lists the datasets used in this study. The datasets are broadly classified into two categories: training/validation datasets (DS) for building the fruit detection models, and test datasets (TESTDS) for evaluating the detection performance of the trained models. In Table 1, (a) MIXED_DS and (a') MIXED_TESTDS, and (b) DELEAFING_DS and (b') DELEAFING_TESTDS, are data pairs collected in the same greenhouse. The split ratios of the training, validation, and test datasets are 66%, 22%, and 12%, respectively. The data were split randomly. (e) GLOBAL_DS is a training/validation dataset used to assess the performance of combining the local datasets (a) and (b); it contains a randomly selected half of the data from each. The (c') and (d') test datasets were collected in a different greenhouse from the MIXED and DELEAFING datasets mentioned above. These datasets are used for testing only. Example test dataset images are shown in Figure 4. The tomato varieties imaged differ between datasets for the most part, although some are common, and one or two varieties are included per dataset. The training and validation datasets, as well as the test datasets, were annotated with the VGG Image Annotator (https://www.robots.ox.ac.uk/~vgg/software/via/, accessed on 30 April 2024) using polygonal regions around the fruit.

Train and Validation Dataset: (a) MIXED_DS
MIXED_DS is an image dataset collected at night in a high-elevation research greenhouse at the National Agriculture and Food Research Organization (NARO). These images are similar to the test images shown in Figure 4a'. Two cameras captured the images of the harvested fruit bunches and approximately two upper bunches. Thereafter, the images before and after leaf-picking were combined, depending on the shooting position. The images were captured several times at night from 7:00 p.m. to 4:00 a.m. The plant-scanning device was operated to automatically collect tomato plant images. The image size was 1506 × 1024 pixels.

Train and Validation Dataset: (b) DELEAFING_DS
DELEAFING_DS is an image dataset collected nightly from a commercial tomato greenhouse. These images are similar to the test images shown in Figure 4b'. A single camera captured the images, mainly during the harvesting of fruit bunches. Almost all the images were obtained after leaf-picking and contained very little leaf area. After sunset, the plant-scanning system was operated from 6:30 p.m. to 8:00 p.m. to generate real-time panoramic images of the tomato plants. The image size was 2048 × 2048 pixels.

Train and Validation Dataset: (e) GLOBAL_DS
GLOBAL_DS is a dataset that randomly combines equal amounts of the images collected from the NARO greenhouse (a) and the commercial tomato greenhouse (b). GLOBAL_DS contains the features of the local datasets (a) and (b) and was built with the expectation of interoperability beyond the local models. The scanning system, experiment date, image size, and tomato cultivars are the same as those of the respective local datasets.

Test Dataset: (a') MIXED_TESTDS
Images of a section of plants that was not used for the training data, collected in the same experiment as MIXED_DS, were used as test data. The image size was also 1506 × 1024 pixels.

Test Dataset: (b') DELEAFING_TESTDS
The images from one cultivation row, collected in the same experiment as DELEAFING_DS, were excluded from the training data and used as test data. The image size was 2048 × 2048 pixels.

Test Dataset: (c') LEAFING_TESTDS
The images were captured at night in a greenhouse at NARO. Examples of these images are shown in Figure 4c'. Owing to the low-node-order pinching system without leaf-picking, the images in this dataset frequently show fruits occluded by leaves. The scanning device was the same as that used for MIXED_DS. The image size was 1330 × 1024 pixels.

Test Dataset: (d') DAY_TESTDS
This dataset of images from a high-wire system was captured during the daytime in a research greenhouse at NARO. The images were collected using an action camera (GoPro HERO4 Session, GoPro, Inc., San Mateo, CA, USA) rather than a plant-scanning device. The GoPro was mounted on the automatic trolley described above at a height of 550 mm from the pipe rail, and a video was captured while driving. Frames were cut from the captured video at regular time intervals, and randomly extracted images were used as the dataset (Figure 4d'). The image size was 2160 × 3840 pixels.

Visualization of Image Similarity by t-SNE
We investigated the relative similarities between the datasets used in this study to interpret the differences in fruit detection performance across the datasets. As a measure of similarity, we used the t-distributed stochastic neighbor embedding (t-SNE) method proposed by Maaten and Hinton [38], which has been used in previous studies. t-SNE transforms high-dimensional image data into low-dimensional data. As an example of using t-SNE to discuss the domain (distributional variation) of an image dataset, Shu et al. [39] visualized the similarity between digit images using t-SNE. They also discussed the effects of domain adaptation on image classification tasks. In this study, we followed this approach and attempted to explain the differences in detection accuracy based on the similarity between datasets.
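For reference, the standard t-SNE formulation can be summarized as below. The notation is ours rather than the paper's: x_i denotes a high-dimensional image feature, y_i its low-dimensional embedding, n the number of images, and σ_i the per-point bandwidth set by the perplexity. The feature representation and perplexity used in this study are not stated in this section, so they are left unspecified.

```latex
p_{j\mid i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
                   {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}

C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The embedding minimizes C, so images with similar features (large p_ij) are pulled close together in the low-dimensional map, which is why distance in the map can be read as relative similarity between dataset images.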

Methods for Training Fruit Detection Models
In this study, three fruit detection models were constructed using the MIXED_DS, DELEAFING_DS, and GLOBAL_DS datasets for training and validation. Datasets of different sizes were prepared in advance to evaluate the impact of dataset size on the detection accuracy. Specifically, the datasets were prepared with a maximum of 452 images (100%) and downscaled to 75% (n = 339), 50% (n = 226), and 25% (n = 113). The split ratio of the training and validation data was 3:1 for all the datasets.
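The subset sizes and the 3:1 split above can be reproduced with a short sketch. It is illustrative only: the file names are placeholders, the helper names `subsample` and `split_train_val` are hypothetical, and it assumes each reduced dataset is an independent random sample of the full set, which the text does not specify.

```python
import random
from typing import List, Sequence, Tuple


def subsample(items: Sequence[str], fraction: float, rng: random.Random) -> List[str]:
    """Draw a random subset containing the given fraction of the items."""
    count = round(len(items) * fraction)
    return rng.sample(list(items), count)


def split_train_val(items: Sequence[str], rng: random.Random,
                    train_parts: int = 3, val_parts: int = 1) -> Tuple[List[str], List[str]]:
    """Shuffle the items and split them train:val = train_parts:val_parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_parts // (train_parts + val_parts)
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
images = [f"img_{i:03d}.png" for i in range(452)]  # placeholder file names

sizes = {}
for fraction in (1.0, 0.75, 0.5, 0.25):
    subset = subsample(images, fraction, rng)
    train, val = split_train_val(subset, rng)
    sizes[fraction] = (len(subset), len(train), len(val))

print(sizes[1.0])  # (452, 339, 113)
```

When the subset size is not divisible by four, the integer division assigns the remainder to the validation set; rounding the other way would be an equally reasonable choice.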
Instance segmentation, which simultaneously estimates the location of each fruit and its pixel-level region, was used to detect tomato fruits. The tomato detection model was trained, validated, and tested on the Mask R-CNN architecture running on TensorFlow and Keras (https://github.com/matterport/Mask_RCNN; accessed on 30 April 2024). In this study, Mask R-CNN uses a backbone network to extract features, a Region Proposal Network (RPN) to propose object candidate regions, and RoI Align to convert these regions into fixed-size feature maps. The final outputs are obtained through heads for classification, bounding box regression, and mask prediction. ResNet-50, which has 50 layers, was used as the backbone network. The algorithm was run on a graphics card (GeForce GTX 1080 Ti, NVIDIA Corporation, Santa Clara, CA, USA), CPU (Xeon processor E5-1650 v4 (6 cores, 3.60 GHz), Intel Corporation, Santa Clara, CA, USA), 64 GB RAM, and 64-bit Ubuntu 18.04 LTS OS. A ResNet-50 convolutional neural network model pretrained on the MS COCO dataset was used for transfer learning. Each epoch consisted of as many iterations as there were training images, and training ran for 300 epochs for each dataset. The learning rate was set to 0.001, and the IoU threshold was set to 0.50. During the learning process, the loss function of the training data (TrainLoss), the loss function of the validation data (ValLoss), and the average precision (AP) of the validation data were calculated for each epoch. The model from the epoch and training dataset size with the highest AP was adopted as the optimal model for subsequent validation.
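The model selection rule in the last sentence amounts to a search over dataset sizes and epochs for the highest validation AP. The sketch below illustrates it under stated assumptions: the function name `select_best` is hypothetical, and the AP values are made-up placeholders, not results from this study.

```python
from typing import Dict, List, Tuple


def select_best(ap_history: Dict[int, List[float]]) -> Tuple[int, int, float]:
    """Return (dataset_size, epoch, ap) with the highest validation AP.

    ap_history maps a training dataset size to its per-epoch validation AP,
    with epoch 1 first. Ties keep the first entry encountered.
    """
    best = None
    for size, aps in ap_history.items():
        for epoch, ap in enumerate(aps, start=1):
            if best is None or ap > best[2]:
                best = (size, epoch, ap)
    if best is None:
        raise ValueError("ap_history contains no AP values")
    return best


# Made-up validation AP curves for three dataset sizes (three epochs each).
history = {
    113: [0.60, 0.71, 0.69],
    226: [0.65, 0.80, 0.78],
    452: [0.70, 0.88, 0.91],
}
best_size, best_epoch, best_ap = select_best(history)
print(best_size, best_epoch, best_ap)  # 452 3 0.91
```

Selecting on validation AP rather than on the final epoch guards against the overfitting trend that appears when validation loss rises while training loss keeps falling.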

Fruit Detection Accuracy Verification Method
The three fruit detection models were evaluated for detection accuracy using test datasets from different facilities, cultivation methods, and times of day to confirm interoperability in different environments. The following four types of local model validations were performed:

Validation 1: Inference on the Same Facility Images as the Training DS
The detection accuracy of the two fruit detection models trained with MIXED_DS and DELEAFING_DS was evaluated using test data from the same facility where the respective training dataset was collected. Specifically:
• The model trained with MIXED_DS was tested against MIXED_TESTDS;
• The model trained with DELEAFING_DS was tested against DELEAFING_TESTDS.

Validation 2: Inference on the Different Facility Images
The two models were also evaluated using test data from the facility other than the one where their training dataset was collected. Specifically:
• The model trained with MIXED_DS was tested against DELEAFING_TESTDS;
• The model trained with DELEAFING_DS was tested against MIXED_TESTDS.

Validation 3: Inference on the Different Cultivation Method Images
Both MIXED_DS and DELEAFING_DS are image datasets of high-wire systems in which the leaves were picked partially or completely. This validation evaluated the inference performance on images of a low-node-order pinching system with unpicked leaves. Specifically:
• The model trained with MIXED_DS was tested against LEAFING_TESTDS;
• The model trained with DELEAFING_DS was tested against LEAFING_TESTDS.

Validation 4: Inference on the Daytime Images
Both MIXED_DS and DELEAFING_DS are image datasets collected at night. This validation evaluated the inference performance on images collected during the daytime. Specifically:
• The model trained with MIXED_DS was tested against DAY_TESTDS;
• The model trained with DELEAFING_DS was tested against DAY_TESTDS.
In addition to the validation of the local model described above, two types of global model validations were performed:

Validation 5: Inference on the Local Dataset Images
In this validation, the model trained on the global dataset was assessed for its inference performance on test images from a local dataset collected in a single greenhouse. Specifically:
• The model trained with GLOBAL_DS was tested against MIXED_TESTDS;
• The model trained with GLOBAL_DS was tested against DELEAFING_TESTDS.

Validation 6: Inference on the Leafing and Daytime Images
In this validation, the model trained on the global dataset was evaluated for its inference performance in environments not included in the training data (images without leaf-picking and daytime images). Specifically:
• The model trained with GLOBAL_DS was tested against LEAFING_TESTDS;
• The model trained with GLOBAL_DS was tested against DAY_TESTDS.
The precision and recall of the model were calculated, and a precision-recall curve was generated. The area under the precision-recall curve (AUC) was taken as the AP. In this study, the IoU threshold was set to 0.50. For the combinations of datasets used in the above validations, the interoperability under different environmental conditions was evaluated using the average precision (AP50).
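The AP computation described above can be sketched as follows. This is a simplified illustration, not the evaluation code used in the study: it assumes axis-aligned bounding boxes rather than instance masks, handles a single image, matches each detection greedily to the best unmatched ground truth, and uses the hypothetical names `iou` and `average_precision`.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: Sequence[Tuple[float, Box]],
                      truths: Sequence[Box], iou_thr: float = 0.5) -> float:
    """Area under the precision-recall curve at the given IoU threshold."""
    if not truths:
        return 0.0
    matched = [False] * len(truths)
    hits: List[bool] = []
    # Process detections from highest to lowest confidence score.
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, truth in enumerate(truths):
            if not matched[j]:
                value = iou(box, truth)
                if value > best_iou:
                    best_iou, best_j = value, j
        is_hit = best_j >= 0 and best_iou >= iou_thr
        if is_hit:
            matched[best_j] = True
        hits.append(is_hit)  # True = true positive, False = false positive
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / len(truths))
    # Make precision non-increasing in recall before integrating.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (21, 21, 30, 30))]
ap50 = average_precision(detections, truths)
print(round(ap50, 4))  # 0.8333
```

In this toy case the second detection overlaps no fruit and counts as a false positive, while the third overlaps its ground truth with IoU 0.81 and counts as a true positive, giving an AP of 5/6.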
In addition to the above validation, the fruit detection results were visualized using a confusion matrix. The confusion matrix is a tool used in classification problems to evaluate the performance of a model. In this case, the number of correct detections (true positives, TP), the number of false detections (false positives, FP), and the number of undetected fruits (false negatives, FN) were displayed.

Visualization of Image Similarity by t-SNE
We visualized the relative similarity of the images to examine the differences in the detection accuracy on the test data. The distribution of each dataset visualized using t-SNE is shown in Figure 5. Differences in color and marker indicate images from different datasets, and closer distances on the coordinates indicate greater similarity between the images. The clusters formed are outlined in black and labeled to aid in interpreting the results.
The results were classified into one large cluster, A; two clusters derived from it, B and C; and three small clusters, D, E, and F, which were distant from the large cluster.
The large cluster A consists mainly of the images taken at night, primarily the images after leaf-picking in MIXED_DS and DELEAFING_DS. Among the MIXED_DS images, those that were not leaf-picked were classified as cluster B, and those in which the growth bags were captured were classified as cluster C. Some of the nighttime LEAFING_TESTDS images were included in the MIXED_DS cluster, but most were scattered, had low mutual similarity, and did not constitute a cluster under the present conditions.
The other small clusters consisted primarily of the DAY_TESTDS images captured during the daytime. Cluster D contained the images taken in rows bordering the sides of the house. Cluster E contained relatively bright images, and cluster F contained relatively dark images; these formed different clusters owing to differences in light conditions.

Fruit Detection Model Training Process
The fruit detection model was trained using MIXED_DS, DELEAFING_DS, and GLOBAL_DS as the training data. Figure 6 shows the loss on the training data (TrainLoss), the loss on the validation data (ValLoss), and the average precision (AP) on the validation data for each training epoch. To facilitate visual comparison between dataset sizes, the curves are displayed after applying a three-epoch moving average. In MIXED_DS, TrainLoss decreased monotonically up to about 80 epochs, while ValLoss remained in the range of 0.5~0.7 from about 20 epochs onward. In DELEAFING_DS, TrainLoss continued to decrease up to 100 epochs. However, ValLoss behaved differently, reaching a minimum at 10~70 epochs, followed by a slight increase. The AP also reached a maximum at 10~70 epochs, followed by a gradual decrease, indicating a trend toward overfitting. In GLOBAL_DS, TrainLoss, ValLoss, and AP showed properties intermediate between those of MIXED_DS and DELEAFING_DS.
A common trend with respect to dataset size was observed for the three datasets. In all the datasets, TrainLoss was similar regardless of the dataset size. AP tended to increase with dataset size, but the gain diminished as the dataset grew. The performance of n113 was significantly inferior to that of the other dataset sizes, whereas the performances of n339 and n452 were almost the same.
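The smoothing and curve reading described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a trailing three-epoch window (the text does not say whether the window is trailing or centered), and the function names and sample values are hypothetical.

```python
def moving_average(values, window=3):
    """Trailing moving average; early epochs average over the epochs available so far."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def epoch_of_minimum(losses):
    """1-based epoch at which a loss curve is lowest (first one on ties)."""
    return min(range(len(losses)), key=lambda i: (losses[i], i)) + 1


def epoch_of_best_ap(val_ap):
    """1-based epoch with the highest validation AP (first one on ties)."""
    return max(range(len(val_ap)), key=lambda i: (val_ap[i], -i)) + 1


# Hypothetical per-epoch values, for illustration only (not data from the study).
val_loss = [0.90, 0.70, 0.60, 0.55, 0.58, 0.62, 0.66]
val_ap = [0.70, 0.85, 0.92, 0.95, 0.94, 0.93, 0.92]

smoothed_loss = moving_average(val_loss)
print("smoothed ValLoss minimum at epoch", epoch_of_minimum(smoothed_loss))
print("highest validation AP at epoch", epoch_of_best_ap(val_ap))
```

A validation loss that rises after its minimum while the training loss keeps falling is the overfitting pattern described for DELEAFING_DS; the epoch with the highest validation AP is the natural checkpoint to keep.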
The model with the highest AP was used in the subsequent analyses. MIXED_DS used the model at 25 epochs trained on the n339 dataset (AP = 0.941). DELEAFING_DS used the model at 62 epochs trained on the n339 dataset (AP = 0.979). GLOBAL_DS used the model at 45 epochs trained on the n226 dataset (AP = 0.952). The inference time measured on both (a') MIXED_TESTDS and (b') DELEAFING_TESTDS using the optimal model trained on GLOBAL_DS was 0.3326 ± 0.0777 s per image.

Differences in Fruit Detection Performance

Validation 1: Inference on the Same Facility Images as the Training DS
The models trained with MIXED_DS and DELEAFING_DS were evaluated for their detection accuracy against the test data collected at the same facility where the training dataset was constructed. The precision-recall curves obtained from the validation are shown in Figure 7a, with AP50 values of 0.918 and 0.973 for MIXED_DS and DELEAFING_DS, respectively. When the training and test datasets were collected at the same facility, DELEAFING_DS, with its highly homogeneous plant appearance, showed a higher AP50 than MIXED_DS, which contained various plant appearances.

Validation 2: Inference on the Different Facility Images from the Training DS
The models trained with MIXED_DS and DELEAFING_DS were used to evaluate the detection accuracy at a facility different from the one where the training dataset was constructed. The precision-recall curves are shown in Figure 7b. The AP50 values of the models trained using MIXED_DS and DELEAFING_DS were 0.962 and 0.822, respectively. In contrast to validation 1, the model trained with DELEAFING_DS exhibited a lower detection performance than that trained with MIXED_DS. Notably, the model trained with MIXED_DS had a high detection accuracy of AP50 = 0.962 for DELEAFING_TESTDS, which was higher than that at the same facility (AP50 = 0.918 on MIXED_TESTDS in validation 1).
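For readers unfamiliar with the metric, the sketch below shows one common way to obtain an AP50 value from a ranked list of detections. It is an illustration under stated assumptions, not the evaluation code used in the study: it matches axis-aligned boxes rather than instance masks, uses all-point interpolation of the precision-recall curve (evaluation toolkits differ on this), and all names and sample boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (image_id, score, box); ground_truth: {image_id: [box, ...]}."""
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    if total_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    is_tp = []
    # Visit detections from most to least confident; each fruit can be matched once.
    for image_id, _score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(image_id, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_threshold and not matched[image_id][best_j]:
            matched[image_id][best_j] = True
            is_tp.append(1)
        else:
            is_tp.append(0)
    # Cumulative precision and recall along the ranked list.
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(is_tp, start=1):
        tp += flag
        precisions.append(tp / rank)
        recalls.append(tp / total_gt)
    # Make precision non-increasing from right to left, then integrate over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


# Hypothetical toy example: two fruits, three detections (one false positive).
gt = {"img1": [(0, 0, 10, 10), (20, 20, 30, 30)]}
dets = [("img1", 0.9, (0, 0, 10, 10)),
        ("img1", 0.8, (50, 50, 60, 60)),
        ("img1", 0.7, (21, 21, 30, 30))]
print(round(average_precision(dets, gt), 3))
```

With an IoU threshold of 0.5, a detection counts as correct only if it overlaps an unmatched ground-truth fruit by at least half of their combined area, which is why heavy occlusion tends to lower AP50.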

Validation 3: Inference on the Different Cultivation Method Images
The models trained with MIXED_DS and DELEAFING_DS were used to evaluate the detection accuracy for low-node-order pinching system (LEAFING_TESTDS) images. The precision-recall curves for both models are shown in Figure 7c.
The AP50 values of MIXED_DS and DELEAFING_DS were 0.795 and 0.805, respectively, so the detection accuracy was slightly higher for the DELEAFING_DS model. However, the AP50 was 0.02 to 0.17 lower than in the validation 1 and 2 results.

Validation 4: Inference on the Daytime Images
The models trained with MIXED_DS and DELEAFING_DS were used to evaluate the detection accuracy for the images captured during the daytime (DAY_TESTDS). The precision-recall curves for both models are shown in Figure 7d. The AP50 values for MIXED_DS and DELEAFING_DS were 0.558 and 0.574, respectively. These values were almost equal, with the DELEAFING_DS model showing a slightly higher detection accuracy. Nevertheless, the daytime test images yielded the lowest AP50 of all the validations in this study. In particular, the AP50 was 0.27 to 0.42 lower than in the validation 1 and 2 results. The detection accuracy was also inferior to that for the low-node-order pinching system images (validation 3), with the AP50 decreasing by 0.22-0.25.

Validation 5: Inference on the Local Dataset Images
The model trained with GLOBAL_DS was used to evaluate the detection accuracy for the local dataset test images (MIXED_TESTDS and DELEAFING_TESTDS). The precision-recall curves for both test datasets are shown in Figure 7e.
The AP50 values for MIXED_TESTDS and DELEAFING_TESTDS were 0.891 and 0.994, respectively, and the detection accuracy was slightly lower than that of the same-facility inference with the local datasets (validation 1). Compared with the different-facility inference with the local datasets (validation 2), the detection accuracy was inferior to that of the MIXED_DS model but better than that of the DELEAFING_DS model.

Validation 6: Inference on the Leafing and Daytime Images
The model trained with GLOBAL_DS was used to evaluate the detection accuracy for images from different environments (LEAFING_TESTDS and DAY_TESTDS). The precision-recall curves for both test datasets are shown in Figure 7f.
The AP50 values for LEAFING_TESTDS and DAY_TESTDS were 0.797 and 0.640, respectively, and the detection accuracy was lower than that obtained with the global dataset model on the local test images (validation 5). Looking at the individual test datasets, the detection accuracy of the model trained on GLOBAL_DS was similar to the local dataset results (validation 3) for LEAFING_TESTDS. Interestingly, for DAY_TESTDS, the detection accuracy improved compared with the local dataset results (validation 4).

Confusion Matrix Results of Each Validation
Figure 8 shows the confusion matrix for the result with the highest F-score, calculated from precision and recall. In each panel, the bottom-right cell corresponds to TP, the top-right to FP, and the bottom-left to FN. Since this is a fruit detection task, the top-left cell representing TN (true negatives) is displayed as 0.
The number of true positives (TP) was high, especially for the models whose training and test datasets were in the same domain (a, b, e). False positives (FP) and false negatives (FN) were relatively low, indicating that most of the fruits were detected correctly. On the other hand, FPs and FNs increased when inference was performed on test data that differed from the training dataset (c, d, f).
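The relationship between the confusion matrix cells and the F-score can be sketched as follows. This is a minimal illustration with hypothetical counts (not values read from Figure 8), assuming the operating point is chosen by sweeping a confidence threshold and keeping the one with the highest F-score; the function names are illustrative.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall computed from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confusion_matrix(tp, fp, fn):
    """Layout described in the text: top-left TN (fixed at 0), top-right FP,
    bottom-left FN, bottom-right TP."""
    return [[0, fp],
            [fn, tp]]


def best_operating_point(counts_by_threshold):
    """counts_by_threshold: {confidence_threshold: (tp, fp, fn)} -> (threshold, F-score)."""
    scored = [(t, f_score(*counts)) for t, counts in counts_by_threshold.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical counts at three confidence thresholds, for illustration only.
counts = {0.3: (95, 30, 5), 0.5: (90, 10, 10), 0.7: (70, 2, 30)}
threshold, score = best_operating_point(counts)
print(threshold, round(score, 3))
print(confusion_matrix(*counts[threshold]))
```

True negatives are not defined for detection because there is no finite set of "non-fruit" candidates to count, which is why that cell is shown as 0.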

Discussion
In this study, three fruit detection models were built for two greenhouses using image datasets collected at night. We then evaluated the differences in detection accuracy for test data collected at different facilities, under different cultivation methods, and during the daytime.
First, the t-SNE visualization of the relative similarity of the images (Figure 5) showed that MIXED_DS included images with a more diverse plant appearance than DELEAFING_DS. This may be because MIXED_DS included images taken before and after leaf-picking, whereas DELEAFING_DS consisted almost exclusively of images taken after leaf-picking. DELEAFING_DS tended to overfit more than MIXED_DS during training (Figure 6), presumably because it forced the model to learn from a limited number of patterns, which did not improve the generalization performance. It has been reported that the diversity of the training datasets can affect the occurrence of overfitting [40]. The results for training and test data drawn from the same dataset (validation 1, Figure 7a) show that DELEAFING_DS had a higher detection accuracy than MIXED_DS. Higher similarity between the training and test data (a smaller spread of the domain on t-SNE) increased the fruit detection performance.
However, the results for training and test data drawn from different datasets (validation 2, Figure 7b) showed a higher detection accuracy when the model was trained on MIXED_DS, with its lower similarity (larger domain on t-SNE), than when it was trained on DELEAFING_DS, with its higher similarity. This result is interesting and suggests that the difference in diversity between the training and test data affects the detection accuracy. When training with MIXED_DS, which has a high diversity of plant appearances, the test data of DELEAFING_DS were sufficiently covered. However, when training with DELEAFING_DS, the test data of MIXED_DS lay outside the training distribution, causing a domain shift that is thought to have resulted in inadequate performance. In machine learning, including deep learning, it is recommended that the training data be selected to have sufficient variation for the context of the deployment destination [12]. The results of the present study are consistent with this recommendation.
The results of validation 3 (Figure 7c) and validation 4 (Figure 7d) showed that for the test data with lower similarity to the training dataset (LEAFING_TESTDS and DAY_TESTDS), the fruit detection accuracy was lower because these data were not included in the domain of the training data. However, the trend in detection accuracy was consistent with the degree of similarity between the domains. The decrease in detection accuracy was more pronounced for DAY_TESTDS, which had a lower similarity to the training data, than for LEAFING_TESTDS, which had a higher similarity (closer domain). The reason why neither the MIXED_DS nor the DELEAFING_DS model performed well on DAY_TESTDS, which consists of daytime images, is thought to be that the inference became an extrapolation problem owing to the domain shift. In contrast, the domain of LEAFING_TESTDS was closer than that of DAY_TESTDS because it was captured at night and was partially similar to the images with leaves in MIXED_DS and DELEAFING_DS, so inference by interpolation was partially possible. This may have contributed to the detection accuracy. However, in LEAFING_TESTDS, many fruits were hidden owing to the lack of leaf-picking, which was presumably why the fruit detection accuracy was lower than that in validations 1 and 2.
The GLOBAL_DS results presented in validation 5 (3.3.5) and validation 6 (3.3.6) provide findings that differ from the local dataset results discussed above. At the beginning of the study, it was assumed that the model trained on GLOBAL_DS (the global model) would perform better than any of the local dataset models, but this was not the case. This result may be explained by the domains shown in Figure 5. In this study, GLOBAL_DS was constructed by combining equal amounts of MIXED_DS and DELEAFING_DS, so the domain itself was not expanded. Rather, half of the data came from the narrower domain of DELEAFING_DS, which may have slightly narrowed the spread of the domain covered by MIXED_DS. This is considered to have made the GLOBAL_DS model inferior to the MIXED_DS model (validation 5), as it was unable to learn the full diversity that MIXED_DS possesses. Note that this result is based on the present dataset combination, and the global dataset is expected to perform better when the domains of the local datasets are more clearly separated.
In summary, the results of this study indicated that the accuracy of tomato fruit detection improves when the domain of the training dataset contains the test environment. However, when inference was performed on a new type of image not seen during training, the accuracy decreased significantly, and this tendency was generally related to the similarity observed in the t-SNE visualization.
The detection accuracy of tomato fruits obtained in this study was AP50 = 0.973 for the evaluation in the same greenhouse and AP50 = 0.962 for the evaluation in a different greenhouse, both of which were the highest values obtained when inferring on DELEAFING_TESTDS. The accuracy of this study was very close to that of the state-of-the-art tomato fruit detection models reported at the time of writing: mAP = 0.969 (modified based on YOLO v5s [25]) and mAP = 0.953 (modified based on YOLO v5s [26]). Previous reports using the same type of two-stage detector as in this study gave AP50 = 0.878 (Faster RCNN [21]), F1 = 0.94 (Mask RCNN [1]), accuracy = 0.902 (Faster RCNN [2]), and F1 = 0.920 (Mask RCNN [22]), all of which are lower than the accuracy of this study. It should be noted that the results of this study were obtained at night and that some of the images were taken after leaf-picking, when the fruit was more easily visible. Nevertheless, consistently achieving this accuracy in different greenhouse environments would be helpful for the commercial use of the monitoring system.
To use a fruit detection model built in one greenhouse in another, the system must be sufficiently robust to allow accurate tomato fruit detection in different environments. To this end, we examined the factors that can affect the detection accuracy in tomato production greenhouses. According to a review of general object detection by Liu et al. [41], the detection accuracy for tasks that target a single category, such as tomato fruit detection, is affected by "intrinsic factors" and "imaging conditions". The former are error factors caused by variations in the instance itself. In the case of tomato fruits, although there are differences in shape and color depending on the variety and maturity state, this is not considered a particular problem. In the latter category of "imaging conditions", variations in lighting (dawn, midday, dusk, and indoor), weather, camera, background, illumination, occlusion, and viewing distance can dramatically affect the appearance of the fruit.
It is necessary to address the differences in "imaging conditions" to obtain robust tomato fruit detection in different greenhouse environments. Two approaches can be considered. The first is to increase the diversity of the training dataset and expand the source domain. For example, Riou et al. [42] proposed a data expansion strategy that bridges the gap between the source domain during training and the distribution of the target domains obtained in a greenhouse to improve the robustness of cucumber fruit detection. To ensure robustness in various greenhouse environments, it is necessary either to build a comprehensive dataset reflecting all the conditions or to adapt the fruit detection model to each new environment through fine-tuning. However, creating such a dataset is challenging in terms of both cost and effort owing to the diverse backgrounds of tomato greenhouses.
The second way to deal with differences in "imaging conditions" is to suppress the variation using an optimized vision system. In particular, if the light source can be kept constant and the background information that causes variation can be eliminated, the variation in the target domain can be reduced and the task of fruit detection made easier. The plant-scanning device used in this study increased the similarity of the images by removing the background information and capturing only the plant bodies, thus enabling them to be clustered into one large domain (Figure 5). Consequently, interoperability between the different facilities was achieved with a very high AP50 of 0.962. The approach of suppressing variation using a vision system is common and has been proposed in previous studies, such as those by Arad et al. [43] and Afonso et al. [1]. These innovations in vision systems are strategies that attempt to control the diversity of tomato plant images in a greenhouse and are realistic solutions for improving the interoperability of a trained fruit detection model.
Thus, the imaging approach proposed in this study effectively improved the robustness of tomato fruit detection in different greenhouse environments. However, several issues remain to be resolved. The first is the difference in plant appearance, as shown by the difference between clusters A and B in Figure 5. The fruits were severely occluded prior to leaf-picking, which decreased the detection accuracy (to approximately 0.80 in Figure 7). To mitigate such occlusions, exploiting the difference in the detection rate depending on the shooting direction, as in Haggag et al. [30] and Hemming et al. [44], and indirectly estimating the fruits hidden by occlusions [45] are considered effective. It is also desirable to develop a new cultivation system that achieves both freedom from occlusion and high yield. The second is the presence of unexpected objects in the image, such as the growth bags shown in cluster C in Figure 5 or other attractants, which may reduce the robustness of tomato fruit detection. Therefore, when initiating vision-based applications in production facilities, close attention must be paid to installations other than the tomato plants that are the subject of the application. In addition, this study, like other similar studies [30,42], assessed only a single type of model architecture (Mask RCNN). This approach was chosen to control for the variability of the models used and to focus on the effects of dataset variability. However, the potential for other architectures and model optimizations to influence the present results should be fully evaluated in future research.

Conclusions
This study highlights the critical role of dataset diversity and of a specialized imaging approach that controls lighting conditions in improving the adaptability and accuracy of tomato fruit detection models under diverse greenhouse conditions. Using nighttime images of two greenhouses with different levels of leaf-picking, we found that models trained on datasets with more varied plant appearances, combined with a background-removal imaging approach, could significantly reduce domain shift problems. The model achieved a high detection accuracy, with an AP50 of 0.973 in the same greenhouse environment and 0.962 when tested in a different greenhouse. These results highlight the feasibility and potential efficiency of implementing such a model in a commercial environment and promise to significantly improve the accuracy and adaptability of tomato fruit set monitoring systems. Future challenges include improving the robustness and interoperability of the model in various greenhouse environments by addressing fruit occlusion caused by differences in cultivation methods and the presence of objects not seen during training.

Patents
The imaging technique presented in this publication was based on the patent [46].

Funding: This research was funded by a grant from a commissioned project study on "AI-based optimization of environmental control and labor management for large-scale greenhouse production" by the Ministry of Agriculture, Forestry, and Fisheries, Japan.

Figure 1. The design of the fruit set monitoring system. (a) The system appearance and names of each part; (b) an image showing nighttime photography.

Figure 2. Fruit set monitoring system diagram. (a) Side view: The plants facing the pipe rails are photographed; the opposite side's plants are background. (b) Upward view: The shading plate angle creates different light intensities between the target and background plants, highlighted by the red frame. The images in this area are stitched together to remove background information.

Figure 3. Differences in images using various methods. (a) Daytime image without light source, stitched. (b) Night image without shading plate. (c) Night image with shading plate. Background plants are visible in (a,b), but removed in (c), showing only target plants.
(a) MIXED_DS and (a') MIXED_TESTDS, and (b) DELEAFING_DS and (b') DELEAFING_TESTDS, are data pairs collected in the same greenhouse. The split ratios of the train, validation, and test datasets are 66%, 22%, and 12%, respectively. The data were split randomly. The (e) GLOBAL_DS is a Train/Val. dataset used to assess the performance of combining the local datasets (a) and (b); it contains a randomly selected half of the data from each of them. The (c') and (d') test datasets were collected in a different house from the MIXED and DELEAFING datasets mentioned above. These datasets are used for testing only. An example test dataset image is shown in Figure 4. The tomato varieties imaged in each dataset are basically different, although some are common, and one or two varieties are included per dataset. The training and validation datasets, as well as the test datasets, were annotated with the VGG Image Annotator tool (https://www.robots.ox.ac.uk/~vgg/software/via/, accessed on 30 April 2024) using polygonal regions around the fruit.
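The random split and the half-and-half combination described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' procedure: the seed, the rounding rule, the function names, and the file names are all hypothetical.

```python
import random


def split_dataset(items, ratios=(0.66, 0.22, 0.12), seed=0):
    """Randomly split items into train/validation/test parts by the given ratios."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def combine_half(dataset_a, dataset_b, seed=0):
    """Draw a random half of each local dataset and concatenate them."""
    rng = random.Random(seed)
    half_a = rng.sample(list(dataset_a), len(dataset_a) // 2)
    half_b = rng.sample(list(dataset_b), len(dataset_b) // 2)
    return half_a + half_b


# Hypothetical image file names, for illustration only.
mixed = [f"mixed_{i:03d}.jpg" for i in range(100)]
deleafing = [f"deleafing_{i:03d}.jpg" for i in range(100)]

train, val, test = split_dataset(mixed)
print(len(train), len(val), len(test))

global_ds = combine_half(mixed, deleafing)
print(len(global_ds))
```

Fixing the seed keeps the split reproducible, and slicing a single shuffled list guarantees that no image appears in more than one part.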

2.5.2. Validation 2: Inference on the Different Facility Images from the Training DS
Validation 2 combinations were interchanged and validated to evaluate fruit detection accuracy in different facilities. Specifically:

Figure 5. Results of the qualitative evaluation of the similarity between the datasets visualized using t-SNE. Clusters A, B, and C mainly consist of MIXED_DS and DELEAFING_DS images. A is the largest cluster and contains images taken after leaf-picking. B contains images that were not leaf-picked. C is a cluster of images that include the growth bags. The smaller clusters D, E, and F consist primarily of the DAY_TESTDS images. D contains images of the rows along the sides of the house. E and F contain relatively bright and dark images, respectively.

3.3. Differences in Fruit Detection Performance
3.3.1. Validation 1: Inference on the Same Facility Images as the Training DS

Figure 7. Comparison of fruit detection accuracy for the test data under different conditions. (a) Inference for the same greenhouse; (b) inference for the different greenhouse; (c) inference for the leafing plant images; (d) inference for the daytime images; (e) inference for the local dataset images; (f) inference for the leafing and daytime images.

3.3.2. Validation 2: Inference on the Different Facility Images from the Training DS

Figure 8. Confusion matrix results for the fruit detection models under different test datasets. (a) Inference for the same greenhouse; (b) inference for the different greenhouse; (c) inference for the leafing plant images; (d) inference for the daytime images; (e) inference for the local dataset images; (f) inference for the leafing and daytime images.

Table 1. Training, validation, and test datasets in this study.

Type | DS Name | Total Images * (Train/Val) | Total Fruits (Train/Val) | Time of the Day | Camera | Experiment Date (Start and End Date) | Cultivar | Cultivation Method
* The maximum quantity of the datasets. The downsized dataset is described in Section 2.4.