Automatic Phenotyping of Tomatoes in Production Greenhouses Using Robotics and Computer Vision: From Theory to Practice

Abstract: High-throughput phenotyping is playing an increasingly important role in many areas of agriculture. Breeders use it to obtain values for the traits of interest so that they can estimate genetic value and select promising varieties; growers may be interested in having predictions of yield well in advance of the actual harvest. In most phenotyping applications, image analysis plays an important role, drastically reducing the dependence on manual labor while being non-destructive. An automatic phenotyping system combines a reliable acquisition system, a high-performance segmentation algorithm for detecting fruits in individual images, and a registration algorithm that brings the images (and the corresponding detected plants or plant components) into a coherent spatial reference frame. Recently, significant advances have been made in the fields of robotics, image registration, and especially image segmentation, each of which has individually improved the prospect of developing a fully integrated automatic phenotyping system. However, so far no complete phenotyping system has been reported for routine use in a production environment. This work catalogs the outstanding issues that remain to be resolved by describing a prototype phenotyping system for a production tomato greenhouse, which is for many reasons a challenging environment.


Introduction
Plant breeders are developing new varieties with a diverse set of goals, such as maximizing yield to be able to feed a rapidly growing global population, resistance to new diseases, and adaptation to climate change, to name only a few [1]. To be able to select the most suitable genetic variety according to these criteria, breeders need to be able to characterize these varieties by traits such as root morphology, biomass, leaf and fruit characteristics, and yield. This is, in a nutshell, the goal of plant phenotyping [1,2]. Automation of this process would be an important advance, as it would reduce labor costs, be more time-efficient, and reduce errors. An additional benefit would be the potential to monitor plant development continuously throughout the growing season. Moreover, genetic analyses have now become automated to such an extent that many individual plants can be analyzed quickly and at low cost, often making phenotyping the bottleneck in genotype-phenotype analyses. Several approaches to automated phenotyping have been developed, most of which rely on imaging, which offers a fast, noninvasive, and non-destructive way of obtaining phenotypic information from plants. Recent advances in computer vision show huge potential for analyzing these image data [2,3], in particular Deep Learning [4], which has been rapidly adopted in agricultural applications [5][6][7].
However, different crops and cultivation environments (open field, greenhouse, or laboratory growth chamber) pose different challenges for an automated phenotyping pipeline. For open field crops, drones can be used because of the freedom of movement available [8,9]. At the other extreme, small plants such as Arabidopsis can be placed in beds in growth chambers or closed conveyor belt systems and can thus be phenotyped automatically in large numbers by characterizing the whole plant [10][11][12]. Such settings have the advantages of controllable illumination and imaging settings, and individual plants can easily be separated from their neighbors. In a greenhouse, the deployment of drones is more challenging because space is more restricted and crops are positioned closer to each other. The deployment of robots on the ground also faces several challenges: space between the rows is generally limited to the bare minimum needed to operate the greenhouse, while horticultural crops, such as tomato, sweet pepper, and cucumber, often grow extensively in both horizontal and vertical directions. This necessitates the use of several cameras to fully capture these plants. The fields of view of these cameras will show considerable parallax with respect to each other, as the cameras are positioned close to the plants, which complicates their joint analysis. Moreover, day-time measurements face varying illumination conditions because of the influence of outside lighting (clouds, and daily and seasonal variations in the intensity of the sunlight).
Next to the aforementioned challenges, in a breeding application a large number of different varieties are grown in a single greenhouse compartment for selection purposes. The different varieties show large variation in morphology, which requires characterization by the phenotyping tool. Often, the plants of each variety are grouped consecutively in experimental plots. As plants often show extensive growth patterns in horizontal and vertical directions, the delineation of each plot along the row, and thereby the assignment of plant characteristics and fruits to each corresponding variety, is complicated.
It is therefore clear that the development of an automated phenotyping system faces significant challenges, which largely depend on the practical setting in which such a system will be deployed. The following section reviews previous efforts to develop components of such a phenotyping system and the extent to which integrated phenotyping systems have already been developed. It concludes by outlining the main aim of this paper.

Related Work
Owing to the high cost of human labor and the need to produce food more efficiently on a larger scale, robots are increasingly being used in agricultural tasks such as harvesting [13,14]. For instance, in [15,16], a sweet pepper harvesting robot was presented, with fruit detection using color, shape, and texture features, and obstacle avoidance using deep learning semantic segmentation [17,18]. A cucumber harvesting robot using the multi-path convolutional neural network to detect the cucumbers was presented in [19]. Robots are also being used for spraying [20] and pruning [21][22][23].
A robot for in-field phenotyping of crops was proposed in [24]. This system uses RGB and spectral cameras and a GPS sensor to localize individual plants. In [25], a robotic platform to automatically measure characteristics of pepper plants in greenhouses was presented. This device used multiple cameras and features such as plant height, leaf area index, and other statistical features computed from the RGB images. A greenhouse phenotyping platform for soya beans was presented in [26]. In [27], a mobile robotic phenotyping platform for growth chamber settings, based on a Kinect RGBD camera and with a moving arm capable of probing individual leaves, was proposed. A field phenotyping robot for rice, using laser and light sensors in addition to an RGB camera, was presented in [28]. In [8], the use of different sensors (RGB and multispectral) and different platforms, robots, and drones was proposed to be able to deal with larger variations in the field of study. Another multi-sensor system for phenotyping of field crops such as wheat was developed by LemnaTec GmbH [29]; it used RGB, hyperspectral, and laser cameras, among other sensors, to obtain top-view images of the canopies of standing crops.
The authors of [30] studied phenotyping of internode length in cucumber plants imaged with an industrial machine vision camera from multiple viewpoints. They achieved a relative error of 5.8% on plants at fixed positions in a climate chamber, at a distance that prevented occlusion between the plants.
In [31], a robot acquired images of apple trees in an orchard, from which the ripe apple fruits were detected using watershed segmentation and the circular Hough transform with an F1 score of 0.86. In [32], mango fruits were detected from monocular camera images using FasterRCNN, and were then tracked across successive images using motion tracking and structure from motion. Lidar and GPS locations were used to match these fruits to individual trees. Tensorflow's object detection application programming interface (API) (https://github.com/tensorflow/models/tree/master/research/object_detection, accessed on 16 July 2021) [33] was used in [34] to detect tomatoes, from images in which an entire plant was captured. This API uses either FasterRCNN or SSD, for a trade-off between speed and accuracy.
An early attempt at predicting tomato yield from aerial images [35] used the normalized difference vegetation index (NDVI) to build a prediction model, which was found to have a prediction root mean square error of 6%. Aerial images taken by unmanned aerial vehicles (UAVs/drones) were used in [36] to calculate features such as canopy cover, height, volume, and Excessive Greenness Index which, along with weather information, was used to train an artificial neural network regression model to predict the harvested yield. UAV images were also used in [9] to obtain such features as plant area, border length, width, and length, that were used to train a random forest predictor for fresh shoot mass, fruit numbers, and yield mass per plant. Color features of tomato fruits extracted using colorspace transforms in a post-harvest setting have been reported to be informative about the genetic variation [37].
Note that in [9,35,36], the cultivation was on open fields rather than in a greenhouse, and thus the separation of plants or plots was relatively simple. In more complex situations (such as production greenhouses), it is necessary to map each fruit detected in each image to a harvest unit (plant, plot, or row). This requires integration of the individual images (and their corresponding detected fruits) into a coherent spatial reference frame in which the relevant unit of analysis (plant, plot, or row) can then also be situated. In [38], LiDAR was used to match mangoes detected in 2D images using FasterRCNN to their respective trees. An incremental Structure-from-Motion (SfM) method for 3D reconstruction from unordered image collections was proposed in [39]. This method does not require depth information, but was developed for relatively large distances from the camera to the imaged object. In [40], a method was proposed to obtain a wide-area mosaic image of a tomato cultivation lane in a greenhouse. Point correspondences were obtained using the infrared images, and depth information was used for background elimination. Photogrammetry and feature matching were used in [41] to register images from a multi-camera system, for detecting citrus fruits. SfM was used for 3D localization of mangoes using only monocular cameras in [32,42], by tracking the fruits detected by deep learning, using prediction models such as the Hungarian algorithm or landmark matching. A combination of RGBD-based visual SLAM (Simultaneous Localization And Mapping) and semantic segmentation using SegNet [43] was used in [44] to generate 3D semantic maps of greenhouses for robot path planning. In [45], a method was presented for detecting apples in 3D, by generating 3D point clouds of apple trees from 2D images using structure from motion.
This improved the precision compared to detecting apples in 2D images using MaskRCNN alone, as in 3D it is possible to discard more false positives by combining the 2D detection results.
In our previous work [46], MaskRCNN was used for detecting tomato fruits from RealSense RGBD images, with precision, recall, and F1 metrics of 0.94. In [47], we used colorspace transformations and morphological operations to detect tomato flowers, obtaining a recall of 0.79 and precision of 0.77.

Goals of this Paper
The previous sections have shown that significant advances have been realized in the fields of image acquisition, registration, and segmentation, but that the challenge of developing an integrated phenotyping system that performs in a practical growing environment largely remains. This paper confronts these challenges by describing the Phenobot, a robotic system for phenotyping tomatoes in a production greenhouse. It consists of an autonomous robot that can navigate a greenhouse at a preset time and acquire images of the plants, and an image analysis pipeline. The latter consists of computer vision algorithms for fruit and ribbon detection at the image level, image registration to create a spatial reference frame for the full row, including fruits, ribbons, and thereby plot positions within this reference frame, and prediction of plot-level yield from the average fruit radius per plot. The aims of this paper are as follows:

1. To describe the development of an integrated phenotyping system from an acquisition system (robot, commercially available cameras), a set of high-performance segmentation algorithms for tomato and ribbon detection, and an adaptation of a well-known image registration algorithm.

2. To evaluate the potential of this integrated phenotyping system in a realistic production environment.

3. To outline the challenges that this system faces in such a complex environment.

Hardware
The robotic platform is based on the IRIS! scout robot, a fully autonomous robot built by the Dutch companies Metazet-Formflex and Micothon, with embedded processing developed by the Canadian company Ecoation. This robot is capable of navigating autonomously through a greenhouse along the heating pipes, and can perform path changes without user intervention. RFID tags placed at the start of each row were used to ensure that the robot is at the right position, with the end of the row determined by setting a distance for how far the robot can go into the row. The battery life permits two runs over the whole greenhouse on a single charge.
The imaging system consists of four low-cost Intel RealSense D435 cameras which are stereo depth cameras. These cameras are mounted on the trolley of the robot, placed at heights of 930, 1630, 2300, and 3000 mm from the ground, in landscape mode. They are roughly at a distance of 0.5 m from the plants, which partially informed the selection of the RealSense D435 cameras, as they have a wide field of view. This low-cost solution also made it possible to replace cameras when they started malfunctioning, which is likely to happen in a humid and warm environment such as a production greenhouse. Data from the top camera were discarded as they contained predominantly foliage. A lighting system consisting of eight EFFISMART 36 light-emitting diodes (LEDs) is used to provide illumination for runs at night. The full setup is shown in Figure 1.
The on-board image acquisition software was developed in C# and makes use of the RealSense SDK. The robot was programmed to stop every 40 centimeters, over a row length of 50 m, and take a set of images with the 4 cameras at that position. The cameras were configured to produce pixel aligned RGB and depth images, of size 720 × 1280 pixels.

Data
The tomato greenhouse consists of 14 rows, each 50 m long. The plants, all truss tomatoes, are grouped by variety, in plots consisting of four plants. Plots are demarcated by ribbons attached to each first plant of the plot. There are 22 plots per row for a total of 308 plots. A complete run yields around 10,000 measurements, each consisting of an RGB and depth image pair. Acquisition was performed at night, to reduce the variability in lighting conditions and to minimize interference with day-to-day operations in the greenhouse.
The data for this paper were measured on 26 June, 2019 and on 28 June, two days later. On the intermediate day (27 June), all ripe tomatoes were harvested. The ground-truth harvest data consist of the number of tomatoes per plot and the total weight in kilograms per plot. The data are measured twice to allow for a derivation of harvest yield by detecting the missing tomatoes from the post-harvest data (when compared to the pre-harvest data) and by only using these in further computations. A sequence of images taken by all 4 cameras over a few consecutive stops is shown in Figure 2. The date and time of image capture, camera position, and distance covered read from the odometer are encoded in the image filenames.
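Metadata encoded this way can be recovered directly from the filenames. The layout below is hypothetical (the actual naming scheme of the acquisition software may differ), but the parsing approach is the same:

```python
import re
from datetime import datetime

# Hypothetical filename layout: <date>_<time>_cam<k>_<odometer-mm>.png
# The real encoding used by the Phenobot acquisition software may differ.
FILENAME_RE = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_cam(?P<cam>\d)_(?P<odo>\d+)\.png$"
)

def parse_capture_filename(name: str) -> dict:
    """Extract timestamp, camera index, and odometer reading (mm)."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name}")
    ts = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S")
    return {"timestamp": ts, "camera": int(m["cam"]), "odometer_mm": int(m["odo"])}
```

Keeping this metadata in the filename makes each image self-describing, so the registration steps later in the pipeline can order images by stop position without a separate index file.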

Image Analysis Pipeline
As plots are used to test and select new tomato varieties for commercialization, they are the unit of interest here. Figure 3 provides a schematic overview of the building blocks necessary to process image-level data to create plot-level predictions. It can be understood as a stylized representation of the plots shown in Figure 2. This processing pipeline starts with the following steps:
1. Tomato detection in the individual images
2. Ribbon detection in the individual images
3. Image registration (both vertically and horizontally, using a Discrete Fourier Transform (DFT)-based registration)
4. Creation of a unified reference frame corresponding to a full row
5. Positioning of tomatoes and ribbons within this reference frame and assignment of tomatoes to plots

These steps are performed both for the pre- and for the post-harvest row images. We investigate the performance of this setup by comparing the ground-truth harvest yield (average weight per plot) with two predictors: the first based only on the pre-harvest plot-level average tomato radius, and the second based on a combination of an estimate of which tomatoes were harvested (obtained by comparing pre- and post-harvest data) and the average tomato radius of the harvested tomatoes only. The first comparison is closest to a setup in which tomatoes are continuously monitored to predict harvest yield, as here only current pre-harvest data can be used, while the second comparison provides the closest comparison with the ground truth. The second comparison involves two extra analysis steps:
6. Truss detection in both the pre- and post-harvest row reference frames
7. Identification of harvested trusses by comparison between pre- and post-harvest data
The Deep Learning algorithms (MaskRCNN, FasterRCNN) were run on a Linux Mint system with an NVIDIA Titan XP 12 GB GPU, while the rest of the processing pipeline was implemented as a set of MATLAB scripts and run on Intel i5-based laptops running either Windows 10 or Linux Mint. The following sections detail the processing steps outlined above.

Fruit Detection
For detecting fruits in individual images, we use Detectron MaskRCNN [48], a deep learning object detector which we have previously used in [46]. This software was chosen because it detects not only the bounding boxes of the fruits, but also the instance pixel masks. The model, based on the 101-layer ResNeXt [49] backbone, was trained on a set of manually annotated images taken in May and June 2019, before the data sets used for this study were acquired. The training set, which is available online (https://data.4tu.nl/articles/dataset/Rob2Pheno_Annotated_Tomato_Image_Dataset/13173422, accessed on 16 July 2021), is relatively small with 123 images, and contains images taken between two weeks and one month before the images analyzed in the present paper. In all cases, the images were taken at night with the LED flash illumination system, to keep the illumination settings consistent.
This detection model is applied to all the images from each of the pre- and post-harvest data sets. For each detected tomato, the center coordinates and radius in pixels are estimated by fitting a circle to the tomato circumference. In the case of occluded tomatoes, only the longest circular portion of the object contour is used for this fitting. The resulting center coordinates and radii, in both pixels and millimeters, are saved in a CSV file by a MATLAB script for the next parts of the processing pipeline.
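A circle fit of this kind can be sketched with an algebraic (Kåsa) least-squares fit; this is a minimal illustration of the idea in Python, not the MATLAB implementation used in the pipeline:

```python
import numpy as np

def fit_circle(points: np.ndarray):
    """Algebraic (Kasa) least-squares circle fit to contour points.

    points: (N, 2) array of (x, y) coordinates, e.g. the longest
    circular portion of a partially occluded tomato contour.
    Returns (cx, cy, r). Uses the linear model
    x^2 + y^2 = 2*cx*x + 2*cy*y + c, with r = sqrt(c + cx^2 + cy^2).
    """
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x ** 2 + y ** 2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    return cx, cy, r
```

Because the model is linear in its parameters, the fit works on a partial arc as well, which is what makes it usable for occluded fruits.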

Ribbon Detection
The plots of four plants of a single variety are separated by blue-white or yellow-black ribbons. Separation of these plots therefore requires ribbon detection. We use the FasterRCNN deep learning object detector [50], trained on two classes, one for each ribbon type. The training data for ribbon detection consisted of one entire row, which was annotated manually by drawing bounding boxes around the ribbons. After ribbon detection, the center coordinates of the detected ribbons are saved in a separate CSV file. As the sequence of plot varieties is known, the tomatoes between a pair of ribbons can be matched to the corresponding plot.
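Once ribbon and tomato positions are available in a common horizontal coordinate, assigning tomatoes to plots reduces to a binning step. A minimal sketch (function and variable names are ours, not those of the pipeline scripts):

```python
import numpy as np

def assign_to_plots(tomato_x, ribbon_x):
    """Assign each tomato to the plot delimited by consecutive ribbons.

    tomato_x: horizontal positions of tomato centers in the row
    reference frame; ribbon_x: positions of the detected ribbons, one
    at the start of each plot. Returns a plot index per tomato
    (-1 for tomatoes located before the first ribbon).
    """
    edges = np.sort(np.asarray(ribbon_x))
    # searchsorted counts, for each tomato, the ribbons to its left;
    # plot k spans the interval [edges[k], edges[k+1]).
    return np.searchsorted(edges, np.asarray(tomato_x), side="right") - 1
```

With the known sequence of varieties per row, the returned plot index maps directly to a variety.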

Image Registration
The acquired images display a large degree of overlap both horizontally and vertically. It is therefore necessary to create a scene for each row which contains all detected tomatoes and ribbons. The use of stitching algorithms based on feature detection and matching has proven difficult, as the objects of interest (tomatoes) are placed relatively close to the camera which, in combination with the background at greater distance, causes substantial parallax problems. We have therefore used a combination of depth-masking, tomato detection and an intensity-based registration algorithm as will be explained below. This resulted in the creation of a unified spatial reference frame in which all tomatoes and ribbons from a row are consistently positioned. The starting (x = 0, y = 0) coordinate corresponds to the lower left corner of the first image in the row. Figure 4 illustrates the registration steps necessary to combine the images to form a unified spatial reference frame for the tomatoes and ribbons. As outlined in Section 2.2, each row is covered by a set of images that overlap both vertically and horizontally. The registration procedure starts by integrating images from the three different cameras (C1R1, C1R2, and C1R3 in Figure 4) vertically into column images (C1 and C2 in Figure 4). These column images are then again registered in the horizontal direction. The resulting transformations are then used to position the ribbon and tomato center coordinates into the unified reference frame. We use a mixture of the original RGB images and the segmented tomato images as a basis for registration. This mixture is heavily biased towards the tomato segmentations, such that, if tomatoes are present, the segmented tomatoes will dominate the registration, but when few tomatoes are present (such as in some post-harvest images), the background provides sufficient extra information to perform successful registration. 
This has allowed the image registration to be based mainly on the image data that is most important (segmented tomatoes and ribbons).
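Such a biased mixture can be sketched as a simple weighted blend of the grayscale image and the binary segmentation; the weight of 0.9 is illustrative, not the value used in our pipeline:

```python
import numpy as np

def registration_image(gray, tomato_mask, w_mask=0.9):
    """Blend a normalized grayscale image with the binary tomato
    segmentation, heavily weighted towards the segmentation. Where
    tomatoes are present they dominate the registration signal; where
    they are absent, the background still contributes. The weight 0.9
    is an illustrative choice."""
    gray = gray.astype(float) / max(float(gray.max()), 1.0)
    return w_mask * tomato_mask.astype(float) + (1.0 - w_mask) * gray
```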
We use a Discrete Fourier Transform (DFT)-based registration algorithm [51], an intensity-based algorithm that finds the optimal transformation of the moving image with respect to the target image, with the correlation between the images as objective function. We further constrain this algorithm by only allowing translations (movements in the horizontal and vertical directions, no rotations or scaling), and by only allowing limited horizontal and vertical translations. These constraints were necessary, as the sparse nature of the available image data caused substantial mismatches when running the DFT algorithm without them.
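The combination of translation-only registration and a constrained search range can be sketched with plain FFT-based cross-correlation; this is a simplified stand-in for the algorithm of [51], not our implementation, and assumes single-channel images of equal size:

```python
import numpy as np

def constrained_shift(target, moving, max_dy, max_dx):
    """Estimate the integer translation (dy, dx) of `moving` relative
    to `target` via FFT-based cross-correlation, with the search
    restricted to |dy| <= max_dy and |dx| <= max_dx. Positive (dy, dx)
    means `moving` equals `target` rolled down by dy and right by dx."""
    F = np.conj(np.fft.fft2(target)) * np.fft.fft2(moving)
    corr = np.fft.fftshift(np.real(np.fft.ifft2(F)))  # peak at center + shift
    cy, cx = corr.shape[0] // 2, corr.shape[1] // 2
    # Only search the allowed window around zero shift.
    window = corr[cy - max_dy:cy + max_dy + 1, cx - max_dx:cx + max_dx + 1]
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    return dy - max_dy, dx - max_dx
```

Restricting the argmax to a small window around zero shift is what prevents the sparse, tomato-dominated images from locking onto a spurious correlation peak far from the true displacement.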
As we expect the images from each vertical position on the robot to have a consistent position with respect to each other at each horizontal stopping position along the row, we only performed vertical registration on a limited subset of images and applied the median of the resulting vertical registration parameters to the full set of images. This creates the column images in Figure 4, for which we then perform horizontal registration for each separate (column) image pair.

Truss Detection
We perform truss detection to identify the harvested trusses in a later stage. A truss in a tomato segmentation image is characterized by a large contiguous area of (partially) overlapping tomatoes. We therefore performed a simple connected-components analysis using an 8-connected structuring element, with a small minimum component size threshold to remove spurious trusses.
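A minimal version of this analysis can be written with `scipy.ndimage` (our pipeline uses MATLAB; the size threshold below is illustrative):

```python
import numpy as np
from scipy import ndimage

def detect_trusses(tomato_mask, min_size=50):
    """Label trusses as 8-connected components of the binary tomato
    segmentation mask, discarding components smaller than `min_size`
    pixels as spurious. Returns the label image and the truss count."""
    # A 3x3 all-ones structuring element gives 8-connectivity.
    labels, n = ndimage.label(tomato_mask, structure=np.ones((3, 3)))
    sizes = ndimage.sum(tomato_mask, labels, index=np.arange(1, n + 1))
    small = np.isin(labels, np.flatnonzero(sizes < min_size) + 1)
    labels[small] = 0
    return labels, int((sizes >= min_size).sum())
```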

Detection of Harvested Trusses
So far, we have outlined the part of the image analysis pipeline that segments tomatoes and ribbons and places them in a unified reference frame for each row. We perform these steps separately for each row's pre-and post-harvest acquisition. The following steps identify trusses in the pre-harvest row reference frame that are missing in the post-harvest row reference frame. We start by assuming that the position of the tomato trusses relative to the ribbons does not change between pre-and post-harvest acquisitions. This allows us to create an image of trusses for each plot pre-and post-harvest, with the starting position of the ribbon where the plot starts as the x = 0 coordinate. The overlap between pre-and post-harvest trusses then determines whether a pre-harvest truss has a corresponding post-harvest truss. The pre-harvest trusses that do not have a corresponding post-harvest truss are assumed to be harvested.
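Treating each truss as an interval along the plot-relative horizontal axis, the matching step can be sketched as follows (the overlap threshold is an illustrative choice, not the exact criterion of our pipeline):

```python
def harvested_trusses(pre, post, min_overlap=0.5):
    """Label pre-harvest trusses with no matching post-harvest truss
    as harvested. Trusses are (start, end) intervals along the row
    axis, measured relative to the ribbon at the start of the plot;
    a match requires the overlap to cover at least `min_overlap` of
    the pre-harvest truss."""
    def frac(a, b):
        # Fraction of interval a covered by interval b.
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        return inter / (a[1] - a[0])
    return [t for t in pre if all(frac(t, u) < min_overlap for u in post)]
```

Anchoring both acquisitions at the same ribbon position is what makes the pre- and post-harvest intervals directly comparable.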

Comparison of Plot-Level Average Tomato Radius to Harvest Yield
The image analysis pipeline concludes with a statistical analysis of the average radius of tomatoes within harvested trusses (as detected by the algorithm) and its correspondence with the plot-level average tomato weight, measured as the total weight of the tomatoes harvested per plot divided by the total number of tomatoes harvested per plot.
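The quantities being compared can be made explicit in a short sketch (names are ours; the statistical analysis itself was done in MATLAB):

```python
import numpy as np

def plot_level_stats(radii_by_plot, weight_kg_by_plot, count_by_plot):
    """Per-plot predictor (mean detected radius of harvested tomatoes)
    and ground truth (harvested weight divided by harvested count),
    plus their Pearson correlation. All inputs are per-plot lists."""
    mean_radius = np.array([np.mean(r) for r in radii_by_plot])
    mean_weight = np.asarray(weight_kg_by_plot) / np.asarray(count_by_plot)
    r = np.corrcoef(mean_radius, mean_weight)[0, 1]
    return mean_radius, mean_weight, r
```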

Fruit Segmentation
An example of fruit detection using MaskRCNN is shown in Figure 5. In [46], we reported that the precision, recall, and F1 metrics were above 0.9 for both the single fruit class and two ripeness classes. Figure 5 also shows an example of segmentation based on a more classical computer vision algorithm using color space transforms and shape fitting. These results are shown in Figure 5A,D as red or green contours indicating the detected fruits. MaskRCNN clearly outperforms the alternative algorithm, missing fewer fruits and detecting fewer false positives. For more examples, please refer to the work in [46] and the associated supplementary material.

Figure 6 shows the detection of the plot separator ribbons using FasterRCNN. It is interesting to note that, even when a ribbon is missed in one image due to its orientation with respect to the camera, it is detected in neighboring images. The ribbon detection still required manual correction to remove large ribbons from the row behind the current row; the depth images did not always provide enough information to distinguish between ribbons in the foreground and ribbons in the background.

Image Registration
As outlined in Section 2.4, the registration algorithm starts by vertically integrating the images of the three cameras at each position in the row into column images. Figure 7 shows an example of the results of this vertical integration. The integration is good; compare, for instance, the circled truss in the middle left image and the corresponding circled truss in the lower left image, and how they are combined in the column image.

Figure 9 shows the results of combining the horizontal and vertical registration over the full row as a heat map showing the overlap of the segmented tomatoes from the individual images when transformed to the row-based reference frame. This procedure results in clearly separated trusses, which can then be used for further processing. This result might suggest that segmentation could simply be performed on the integrated RGB image of the full row. However, a careful inspection of the trusses in such images shows that there are too many discontinuities for the segmentation algorithm to perform well.

Overall Performance
For an assessment of the overall performance, we concentrate on image estimates of yield, expressed as average fruit radii per plot. Figure 10 shows a scatter plot of the predicted average fruit radius per plot against the average fruit weight as measured after harvesting the fruits. Here, we have only analyzed the pre-harvest images and have therefore included all tomatoes in the analysis, regardless of whether they were harvested or not. This plot shows a reasonable correspondence between predicted fruit radius and measured fruit weight (r = 0.43, p = 5 × 10^-11), albeit with an increased variance at higher fruit weights. An analysis using both pre- and post-harvest images, including truss detection and identification of harvested trusses, yielded very similar results. A breakdown of the relationship between plot-level average tomato radius and measured weight per row (not shown) indicates that performance can vary substantially across rows. We should note, however, that the limited number of data points per row precludes any firm conclusions.

Discussion
This project and related work have demonstrated that image acquisition is effective: the combination of a relatively low-tech robot, which uses the heating pipes already present in the greenhouse, and low-cost commercially available cameras proved, after initial setup issues, to be reliable and to deliver images of sufficient quality for the segmentation algorithm. The night-time acquisition schedule (in combination with the LED-based illumination system) is advantageous because it ensures constant illumination conditions, which in turn ensures constant performance of the segmentation algorithm. In addition, it does not interfere with day-to-day operations in the greenhouse. Segmentation, as described in [46], showed high performance, which is especially remarkable given that the training data set used in this paper is relatively modest. Adaptation to a new environment should therefore be relatively painless.
Image registration and the construction of a row-based reference system for the detected tomatoes were more complicated. This step is necessary when it is not possible to image the whole plant or unit of harvest, as was done in [31], in which entire apple trees were covered in one image. Both vertical and horizontal integration were necessary: vertical integration to combine the views of the three different cameras, and horizontal integration to combine the views from different stops of the robot along the row. Both types of integration were performed sequentially (vertical first, then horizontal) by a heavily modified version of a fast DFT-based registration algorithm [51]. The registration approach relies on both segmentation results and the RGB images themselves: registration based mainly on the segmentation results makes the registration much less complex and reduces the parallax problem considerably, while the addition of the RGB background was necessary to register images with only few detected tomatoes, which is the case in many post-harvest images. Apart from this process of feature adaptation for the registration algorithm, we also heavily constrained the DFT-based registration optimization procedure to only allow solutions in a narrow range of horizontal and vertical shifts.
Although the integration of images into a coherent row-based reference frame was successful, this does not completely solve the problem of placing detected fruits into this reference frame: many fruits will be present in multiple images, which leads to double counting. Extensive experimentation with rule-based double-counting solutions, based on the fact that tomatoes that are present in different images should show high overlap when creating the row-based representation, did not improve the total results.
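One family of such rule-based strategies can be sketched as a greedy overlap-based merge; this is an illustration of the general idea, not the exact rules we experimented with:

```python
import numpy as np

def deduplicate(centers, radii, overlap_thresh=0.8):
    """Greedy removal of duplicate detections in the row reference
    frame: a detection whose center lies within `overlap_thresh` times
    the larger radius of an already-kept tomato is treated as the same
    fruit seen in another image. Returns indices of kept detections."""
    order = np.argsort(-np.asarray(radii))  # keep larger detections first
    kept = []
    for i in order:
        c, r = np.asarray(centers[i], dtype=float), radii[i]
        if all(np.hypot(*(c - np.asarray(centers[j], dtype=float)))
               > overlap_thresh * max(r, radii[j]) for j in kept):
            kept.append(int(i))
    return sorted(kept)
```

The difficulty noted above remains: registration errors of even a fraction of a fruit radius blur the distinction between "same fruit, different image" and "two adjacent fruits", which is why such rules did not improve our totals.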
We use the fruit radius as predicted by Deep Learning to compare with the fruit weight as measured during harvest. The fruits were already selected to fall within a limited depth range, to focus only on fruits from the row immediately in front of the camera. Hence, depth information is used implicitly in determining fruit radii. Because of the limited depth resolution of the cameras, no further improvement in the precision of the radius estimates is to be expected: prediction accuracy is dominated by other aspects, as detailed in the following paragraphs.
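The depth-based selection can be sketched as a simple median-depth test per detection; the depth band below is illustrative, chosen to match the roughly 0.5 m camera-to-plant distance of our setup:

```python
import numpy as np

def in_front_row(depth_image, det_mask, near_mm=300, far_mm=800):
    """Keep a detection only if the median depth over its pixel mask
    falls inside the band occupied by the row in front of the camera.
    The 300-800 mm band is an assumption for illustration; the actual
    thresholds depend on the greenhouse geometry."""
    d = depth_image[det_mask]
    d = d[d > 0]  # RealSense-style depth images encode missing depth as 0
    return bool(d.size > 0 and near_mm <= np.median(d) <= far_mm)
```

Using the median over the mask, rather than a single pixel, makes the test robust to missing depth values and to background pixels leaking into the mask.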
Another difficulty is occlusion. Tomatoes are not fully visible in all images: tomatoes within trusses already show heavy overlap. Other occlusions are caused by objects such as pipes, tomato stems and the ribbons used to delineate plots. There will always be fruits for which accurate measurements are impossible because of occlusion. This, however, is one of the reasons behind high-throughput phenotyping: if a greenhouse can be completely covered every night, or every other night, then errors will cancel out and a consistent trend will emerge from the data.
The aforementioned occlusion problem caused by the ribbons introduces another problem when assigning tomatoes to plots: for tomatoes that are occluded by ribbons it is unclear whether to assign those tomatoes to the plot anterior or posterior to the ribbon. Moreover, in some cases tomato trusses from a tomato plant anterior to a ribbon are posterior to the same ribbon, because of the diagonal alignment of the tomato plants, leading to further erroneous assignments. The introduction of ribbons to delineate plots can therefore be improved upon, for instance by marking each truss or even tomato with a label indicating its corresponding plot. However, these labels would not be visible in all images and would moreover introduce another labor-intensive step into the phenotyping process, which is undesirable.
Ultimately, one would want to go beyond plot-based comparisons and be able to segment and identify tomatoes and their corresponding plants. This would alleviate the above mentioned assignment problem as well, as each plant could easily be coded by a label at the pot, and trusses and tomatoes could be identified by their distance from the pot along the stem. However, stem tracking in typical greenhouse setups is complicated because stems of the same row more often than not overlap and/or cross.
In conclusion, our work shows that, although significant progress has been made on the individual components of an automatic phenotyping system, the setting in which such a system is deployed, in our case a breeding greenhouse, will often present significant challenges. Meeting these challenges goes beyond improvements of the individual components and requires careful consideration of the experimental setup. Further improvements in camera systems (e.g., better depth information or wider fields of view) and decreasing hardware costs (making it possible to employ, say, eight cameras instead of four, so that parallax problems are minimized and image registration becomes much easier because of the increased overlap) will undoubtedly help.