2. Materials and Methods
2.1. Overview of the Demonstration Trials
In this study, three demonstration trials targeting Lisianthus (Eustoma grandiflorum) were conducted in Namie Town, Fukushima Prefecture, between 2024 and 2025. These trials aimed to develop and validate a node-count estimation algorithm that utilizes zero-shot vision models. To evaluate the performance of the estimation models from multiple perspectives, growth images were collected under differing cropping seasons, growers, and cultivar conditions.
2.1.1. Cultivation Conditions
The demonstration trials were conducted in 2024 and 2025 in greenhouses managed by floricultural growers in Namie Town, Futaba District, Fukushima Prefecture, Japan. All greenhouses were pipe-frame structures covered with PO film and equipped with natural ventilation through side windows. They were equipped with circulation fans to promote air mixing and ensure a uniform temperature. No heating or cooling systems were installed, and temperature management relied on ambient outdoor conditions. Photoperiod control was implemented in all trial plots using supplemental lighting.
Irrigation was primarily applied through tube irrigation, supplemented with hand watering when necessary, based on pF meters or soil moisture sensors installed by each grower. Compost application and bed shaping were performed 1–2 months prior to transplanting, and the bed height was standardized at 15–20 cm. The planting layout consisted of a four-row configuration with one row removed from the center, with 75 cm bed spacing (65 cm in Trial 2 only), 12 cm row spacing, and 12 cm plant spacing. Black-and-white mulch was used in all trial plots. Pest and disease control were conducted as needed, depending on the incidence, and weeding was performed manually.
2.1.2. Summary of the Trials
An overview of each trial is presented in
Table 1. The trials targeted two cropping types, seasonal cropping (spring transplanting and summer harvesting) and retarded cropping (early-summer transplanting and autumn harvesting). The number of participating growers ranged from one to four, with considerable variation in cultivation experience. Grower A was the most experienced producer and played a leading role in the local community. Grower B was a mid-career producer with the second-highest level of experience. In contrast, Grower C was in their second year of cultivation as of 2025, and Grower D was a newly established grower beginning cultivation that same year.
Five cultivars were used: Julius Lavender (JL; KANEKO SEEDS CO., LTD, Maebashi, Gunma, Japan), Celebrich White (CW; Sumika Agrotech Co., Ltd., Osaka, Osaka, Japan), Happiness White (HW; MIYOSHI & CO., LTD., Setagaya, Tokyo, Japan), Largo Marine (LM; MIYOSHI & CO., LTD., Setagaya, Tokyo, Japan), and NF Antique Pink (AP; NAKASONE Lisianthus Inc., Chikuma, Nagano, Japan), with different cultivars selected according to the cropping type. According to the maturity characteristics reported by the respective seed suppliers, JL is classified as early–medium, CW as medium, HW as medium–late, and LM as late. AP is primarily used for retarded cultivation in this region.
The three demonstration trials conducted in this study had distinct objectives. The purpose of each trial was as follows.
Trial 1: Development of a Node-Estimation Model (2024 Seasonal Cropping Type)
Trial 1 was conducted to design a leaf-counting algorithm using zero-shot vision models and to optimize the parameters of the node-estimation formula. Fixed cameras installed in greenhouses in Namie Town were used to capture time-series images of Lisianthus growth, and an algorithm for detecting and counting leaf regions was constructed. In this trial, a basic analytical pipeline was established by integrating model components such as background separation, candidate leaf-region detection, and individual-leaf identification. Based on the resulting leaf-count data, regression analysis was performed to estimate the node numbers. The final model and optimized parameters were used as the baseline for Trials 2 and 3.
Trial 2: Evaluation of Estimation Accuracy (2024 Retarded Cropping Type)
Trial 2 was conducted to evaluate the generalization performance of the model developed in Trial 1 across different cropping types and cultivars. In this trial, node-count estimation was applied to images of other cultivars under the retarded cropping type, and the estimation accuracy was validated through comparison with manual measurements.
Trial 3: Verification of Reproducibility and Operational Suitability (2025 Seasonal Cropping Type)
Trial 3 was conducted the following year using the same seasonal cropping type to verify the reproducibility of the estimation method established in Trial 1. In this trial, node-count estimation was performed not only for the cultivar used in model development (HW) but also for three additional cultivars (JL, CW, and LM) and under conditions including the participation of a new grower (Grower D). This allowed us to assess the stability of the model across different cultivars and growers. Additionally, an operational test was conducted to share the estimation results with growers in real-time, enabling an examination of the model’s practicality and potential for implementation in real-world production settings.
2.2. Automatic Image Acquisition System for Growth Monitoring
In this study, an automated image acquisition system was developed to capture and record the growth process of
Lisianthus using fixed cameras installed in the field. The overall system configuration is illustrated in
Figure 1. The system was designed to automate the entire sequence of processes from image capture to storage and cloud transfer, and to stably collect image data usable for AI analysis at regular time intervals.
In each greenhouse, 1–4 compact network cameras (ATOM Cam2; ATOM Tech Inc., Yokohama, Kanagawa, Japan) were installed. The cameras were fixed at positions that captured side views of the target plants across the work aisle, ensuring that images were taken from a consistent angle throughout the trial period. The distance between the camera and the plants corresponded to a bed spacing of 75 cm (65 cm in Trial 2), and the camera height was fixed at 25 cm above the substrate surface. Although 10–13 plants were included in each imaging plot, cropping during pre-processing, as described later, standardized the final analysis area to the field of view corresponding to ten plants.
Each camera was connected via Wi-Fi to a local network and was configured to be accessible from a Raspberry Pi (Models 3B, 3B+, and 3A+; Raspberry Pi Ltd., Cambridge, UK) on the same network. The IP address of each camera was polled from the Raspberry Pi, and still images were retrieved via the RTSP protocol. The hourly scheduled image capture was automated using the cron scheduler in Raspberry Pi OS (Raspberry Pi Ltd., Cambridge, UK). Images were saved in JPEG format at a resolution of 1920 × 1080 pixels (Full HD). Natural daylight was used for illumination, without artificial lighting. To mitigate the effects of diurnal and day-to-day variations in lighting intensity, representative daytime images were selected at a fixed time each day, as described in
Section 2.3.1. In addition, camera parameters including exposure, shutter speed, ISO sensitivity, and white balance were maintained at default automatic settings to adapt to gradual changes in ambient light conditions. Foggy or severely hazy conditions were rare during the experimental periods and were not treated as a separate factor in the analysis. However, under such low-contrast conditions, leaf detection performance may be degraded, which is recognized as a limitation of the current system and discussed in
Section 4.8.
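As a concrete illustration of the scheduled capture step, the following sketch builds an hourly RTSP still-capture command; the stream path, the use of ffmpeg, and the folder layout are assumptions for illustration, not the authors' actual script.

```python
import datetime
from pathlib import Path

def build_capture_command(camera_ip: str, out_root: str,
                          now: datetime.datetime) -> list:
    """Build an ffmpeg command that grabs one RTSP frame as a Full-HD JPEG.

    The stream path ("/live") and the folder layout are illustrative
    assumptions; only the JPEG format and 1920x1080 resolution come
    from the paper.
    """
    out_dir = Path(out_root) / now.strftime("%Y-%m-%d")   # date-based folder
    out_file = out_dir / now.strftime("img_%H%M%S.jpg")
    return [
        "ffmpeg", "-y",
        "-i", f"rtsp://{camera_ip}/live",  # hypothetical stream URL
        "-frames:v", "1",                  # capture a single still frame
        "-s", "1920x1080",                 # Full HD, as in the paper
        str(out_file),
    ]

# A crontab entry on Raspberry Pi OS for hourly execution might look like:
#   0 * * * * /usr/bin/python3 /home/pi/capture.py
```

In practice the command list would be passed to `subprocess.run()` once per camera on each cron invocation.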
Captured images were saved in automatically generated date-based folders for each camera and subsequently uploaded to a designated directory on Google Drive (Google LLC, Mountain View, CA, USA). The Google Drive API (Google LLC, Mountain View, CA, USA) was used to automate the upload and cloud transfer of data from the Raspberry Pi [
34]. This enabled the centralized cloud-based management of image data collected across different growers’ fields and provided an environment for direct access and processing from Google Colaboratory (Google LLC, Mountain View, CA, USA). Filenames were automatically assigned camera IDs and timestamps to facilitate the matching of grower, cultivar, and capture times during subsequent leaf-counting and node-estimation processes.
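A minimal sketch of the ID-and-timestamp naming scheme described above is shown below; the exact naming convention is not published, so the format here is a hypothetical example.

```python
import datetime
from pathlib import PurePosixPath

def drive_upload_path(camera_id: str,
                      captured_at: datetime.datetime) -> str:
    """Compose a Google Drive destination path for one image.

    Encodes the camera ID and capture timestamp in the filename so that
    grower, cultivar, and capture time can be matched during later
    leaf-counting and node-estimation steps. The specific pattern
    (camID/date/camID_YYYYmmdd_HHMMSS.jpg) is an illustrative
    assumption, not the authors' documented convention.
    """
    folder = PurePosixPath(camera_id) / captured_at.strftime("%Y-%m-%d")
    name = f"{camera_id}_{captured_at.strftime('%Y%m%d_%H%M%S')}.jpg"
    return str(folder / name)
```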
This image acquisition system enabled continuous recording of plant growth without manual intervention and established a foundation for the high-frequency and high-accuracy data collection necessary for comparative analysis across different fields and cultivation environments.
2.3. Leaf Counting Method Using Zero-Shot Models
In this study, we constructed a hybrid image-analysis pipeline that combines zero-shot vision models with an existing deep-learning classifier to automatically estimate leaf numbers from Lisianthus growth images. Leaf counting was performed on time-lapse images obtained from fixed cameras in the following three processing stages:
Extraction of representative daytime images
Preprocessing via lens distortion correction and cropping
Leaf-region detection and counting using a hybrid approach
These processes were implemented in a Python notebook executable on Google Colaboratory (Runtime version: 2025.10; Python 3.12.12) using an NVIDIA Tesla T4 GPU (NVIDIA Corp., Santa Clara, CA, USA). The notebook was designed to directly read and write image data stored on Google Drive using the automatic image acquisition system described in
Section 2.2, enabling efficient cloud-based analysis.
2.3.1. Extraction of Representative Daytime Images
Representative daytime images were automatically extracted from hourly time-lapse images captured using each camera in the field. First, the image directory on Google Drive was referenced, and nighttime images were filtered based on timestamp information. An image captured at a predetermined time (e.g., 10:00 a.m.) was selected as the daily representative image. The extraction process was automated using a Python script and organized using the camera ID (folder name). Consequently, the light condition variability caused by changes in solar elevation was minimized, enabling the construction of a time-series dataset suitable for comparison.
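The selection logic above can be sketched as follows, assuming filenames embed a timestamp in a `camID_YYYYmmdd_HHMMSS.jpg` pattern (an illustrative convention, not necessarily the authors' own):

```python
from datetime import datetime

def pick_daily_images(filenames, target_hour=10):
    """Select one representative image per day from hourly captures.

    Nighttime frames are excluded implicitly: only the frame captured
    at `target_hour` (e.g. 10:00 a.m.) is retained for each calendar
    day, yielding a time series with comparable lighting conditions.
    """
    chosen = {}
    for name in filenames:
        stamp = name.rsplit(".", 1)[0].split("_", 1)[1]  # 'YYYYmmdd_HHMMSS'
        ts = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
        if ts.hour == target_hour:
            chosen[ts.date()] = name                     # one image per day
    return [chosen[d] for d in sorted(chosen)]
```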
2.3.2. Preprocessing of Extracted Images
Image preprocessing aimed to correct camera-dependent imaging conditions and generate a uniform dataset comparable across the time series. To achieve this, lens distortion correction and field-of-view normalization were applied. These steps reduced the influence of differences in shooting positions or angles and ensured stable evaluation of daily growth changes. An outline of the processing is provided below.
Lens Calibration and Distortion Correction
Using the pre-acquired internal parameters of each camera (e.g., focal length, principal point) and distortion coefficients, lens distortion was corrected with the cv2.undistort() function in OpenCV (v4.12.0; OpenCV Foundation, Palo Alto, CA, USA). This mitigated the distortion that occurred from the center toward the periphery of the image and improved the stability of shape recognition for nodes and leaf regions located near the image edges.
Cropping and Field-of-View Normalization
Because each camera image contained multiple plants, the image width was automatically scaled based on the number of captured plants and standardized to the equivalent width of ten plants. Cropping was performed using the lower center as the reference point, and the resulting region was rescaled to the original resolution (1920 × 1080 pixels). This procedure reduced daily and camera-based variations in the field of view and generated standardized images suitable for leaf counting and node estimation.
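A sketch of the crop step, under the assumption that plants are evenly spaced across the frame width, is shown below; the final rescale back to 1920 × 1080 (e.g., with `cv2.resize`) is omitted for brevity.

```python
import numpy as np

def crop_to_ten_plants(img: np.ndarray, n_plants: int,
                       target_plants: int = 10) -> np.ndarray:
    """Crop the frame to the width equivalent of ten plants.

    The crop is centered horizontally and anchored at the bottom edge,
    mirroring the paper's lower-center reference point, and keeps the
    original aspect ratio so the result can later be rescaled to
    1920x1080. Even plant spacing across the frame is assumed.
    """
    h, w = img.shape[:2]
    crop_w = min(int(round(w * target_plants / n_plants)), w)
    x0 = (w - crop_w) // 2                 # horizontal center
    crop_h = int(round(h * crop_w / w))    # preserve aspect ratio
    y0 = h - crop_h                        # anchor at the lower edge
    return img[y0:, x0:x0 + crop_w]
```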
2.3.3. Leaf Counting Procedure
Leaf counting was performed using a hybrid method that integrated text-guided zero-shot object detection, auxiliary classification, and depth-estimation models. The overall process flow is illustrated in
Figure 2. By combining leaf region detection via Grounding DINO (v1.0; IDEA Research, Shenzhen, Guangdong, China), region filtering using a YOLO (v8.0; Ultralytics Inc., Frederick, MD, USA) classifier, and monocular depth estimation via MiDaS (v3.1; Intel Corporation, Santa Clara, CA, USA), we constructed a pipeline capable of estimating leaf numbers for
Lisianthus with high accuracy and stability.
First, the candidate leaf regions in the images were detected in a zero-shot manner using Grounding DINO [
30]. Grounding DINO is characterized by its ability to recognize general leaf morphological features independent of the crop species, owing to large-scale pretraining. The text prompt “individual leaf on plant body” was used to specify the detection target. During the preliminary prompt design stage, the use of a single term such as “leaf” caused frequent misdetections in which the entire plant canopy was recognized as a leaf region. To suppress such overgeneralized detections, the phrase “on plant body” was explicitly added to constrain the spatial context of the target. In addition, because subsequent filtering steps focused on selecting individual leaves, the term “individual” was included to emphasize the detection of single-leaf structures rather than aggregated leaf regions. No few-shot learning or fine-tuning was performed for domain adaptation, and the model was used solely for zero-shot inference.
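As a sketch of this detection step, the following uses the Hugging Face `transformers` port of Grounding DINO; the checkpoint name, thresholds, and wrapper structure are assumptions for illustration (only the text prompt comes from the paper), and the score filter is separated into a pure helper.

```python
def detect_leaf_regions(image, prompt="individual leaf on plant body.",
                        box_threshold=0.3, text_threshold=0.25):
    """Zero-shot leaf-region detection with Grounding DINO.

    Grounding DINO text prompts are conventionally lowercase and
    period-terminated, hence the trailing "." appended to the paper's
    prompt. Thresholds are illustrative defaults, not the authors'
    settings. `image` is a PIL image.
    """
    import torch
    from transformers import (AutoModelForZeroShotObjectDetection,
                              AutoProcessor)

    model_id = "IDEA-Research/grounding-dino-base"  # assumed checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    result = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids,
        box_threshold=box_threshold, text_threshold=text_threshold,
        target_sizes=[image.size[::-1]],  # (height, width) of the image
    )[0]
    return list(zip(result["boxes"].tolist(), result["scores"].tolist()))


def keep_confident(detections, min_score):
    """Pure helper: keep (box, score) pairs scoring at least `min_score`."""
    return [(box, score) for box, score in detections if score >= min_score]
```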
Subsequently, a YOLOv8-based classifier [
35] was introduced to categorize each detected region into three classes: “single leaf”, “multiple leaves”, and “non-leaf”. Only the regions classified as single leaves were counted as leaf instances. Regions with classification scores below a certain threshold were removed to prevent over-detection. Because the detected candidate regions included duplicates, non-maximum suppression (NMS) was applied to integrate redundant bounding boxes.
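The NMS step mentioned above can be sketched as a standard greedy IoU-based suppression; the IoU threshold used in the actual pipeline is not reported, so 0.5 here is a conventional default rather than the authors' setting.

```python
import numpy as np

def iou(a, b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes.

    Boxes are processed in descending score order; any lower-scoring
    box overlapping a kept box by IoU >= iou_thr is discarded.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = int(order[0])
        keep.append(i)
        order = np.array(
            [j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thr],
            dtype=int,
        )
    return keep
```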
Furthermore, MiDaS was used to estimate a depth map for the entire image and evaluate the three-dimensional foreground–background relationships among the candidate regions [
36]. MiDaS is a deep learning model that estimates the relative depth from monocular RGB images and exhibits high zero-shot generalizability owing to its training on combined depth-estimation datasets. As MiDaS outputs relative depth values, outlier regions were removed by excluding those whose depth values fell outside ±1σ of the depth distribution. This reduces false detections caused by adjacent plants, beds, and background structures, such as stakes or windbreak nets. Although a wider threshold such as ±3σ was effective in suppressing distant background regions, it was insufficient to remove regions from adjacent plant rows; therefore, a ±1σ threshold was empirically adopted.
Finally, the outputs from the three models were integrated to visualize the valid leaf regions and automatically calculate the leaf numbers for each image. The calculated leaf counts were organized as data corresponding to ten plants per image and were used to compare daily growth progression, as well as to serve as input for the node-estimation model described in
Section 2.4.
2.4. Node Estimation and Accuracy Evaluation
In this study, the node number of Lisianthus (Eustoma grandiflorum) was automatically calculated based on the leaf count estimates obtained in the previous section, and the estimation accuracy was evaluated. Node number is a key indicator representing a plant’s growth stage and is widely used in this region as a criterion for judging growth progress and determining the appropriate timing for irrigation and temperature management. Traditionally, node counting relied on manual surveys or visual assessments by growers and extension officers. In this study, we propose a new method that statistically estimates node numbers from leaf counts, enabling nondestructive and continuous monitoring.
2.4.1. Node Estimation
Node estimation was performed using the leaf counts obtained with the hybrid approach. Lisianthus is a dicotyledonous plant with an opposite phyllotaxy, forming one pair of opposite leaves at each node as the stem elongates. Therefore, the node number exhibits a linear relationship with the leaf number. In this study, the relationship between the leaf number and node number was expressed using the linear approximation shown in Equation (1):

N̂ = a × L̄ + b  (1)

where N̂ is the estimated node number, L̄ is the 5-day moving average of the leaf number, and a and b are regression coefficients. Because the leaf counts obtained in
Section 2.3.3 contained short-term noise caused by variations in lighting conditions and detection instability, the 5-day moving average (L̄) was used to smooth the leaf count time series. The 5-day window was determined in Trial 1 to be the optimal balance, as longer windows obscured short-term developmental changes, whereas shorter windows caused unstable fluctuations. This window length is also consistent with the biophysical growth characteristics of
Lisianthus; based on recorded node count data for HW, one node increase required approximately 6–7 days during the target growth stage, indicating that node progression occurs on a multi-day time scale rather than daily. This smoothing suppresses irregular day-to-day fluctuations and enables a more stable representation of continuous leaf development.
The coefficients of the linear equation (0.09 and 1.03) were derived through least-squares regression using manually surveyed node data collected from the fields of the three growers in Trial 1. The observed node number was calculated as the mean of 36 plants per grower. Field surveys were conducted weekly, with ten surveys performed during the monitoring period.
In this study, node number refers specifically to the number of bolting nodes. The number of bolting nodes is defined as the number of nodes formed after stem elongation (bolting), with an internode length of at least 1 cm and fully developed leaves [
24].
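The smoothing and least-squares steps of Section 2.4.1 can be sketched as follows; the function names and synthetic data are illustrative, and the regression coefficients are kept symbolic rather than fixed to the reported values (0.09 and 1.03).

```python
import numpy as np

def moving_average(series, window=5):
    """5-day moving average of daily leaf counts (valid region only)."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

def fit_node_model(leaf_smoothed, nodes_observed):
    """Least-squares fit of the linear node model N = a*L + b."""
    a, b = np.polyfit(leaf_smoothed, nodes_observed, deg=1)
    return a, b

def estimate_nodes(leaf_smoothed, a, b):
    """Apply the fitted linear model to smoothed leaf counts."""
    return a * np.asarray(leaf_smoothed) + b
```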
2.4.2. Node Number Validation
To evaluate the accuracy of the node-estimation model, manual node measurements were conducted in Trials 2 and 3 and compared with the automated estimates determined using Equation (1). The survey plants were selected from plots that exhibited moderate growth within each field. In Trial 2, the mean value of 36 plants was used for each cultivar–grower combination, and in Trial 3, the mean value of 15 plants was used. These values reflect the growth of representative plants.
Field surveys were conducted every two weeks, three times in Trial 2 and five times in Trial 3. The collected observational node data were used for comparative analysis of the estimated values. Because the number of data points in Trial 2 was extremely small (n = 3), performing a statistically meaningful regression analysis was deemed inappropriate; therefore, only the mean absolute error (MAE) and root mean square error (RMSE) between the estimated and observed values were reported. In Trial 3, considering the possibility that the estimation accuracy might vary by grower or cultivar, both an overall regression analysis, including all data, and separate regression analyses by grower and cultivar were conducted.
For each survey date, the estimated node number for the same day was matched with the observed value, and three evaluation metrics, the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2), were calculated. MAE represents the magnitude of the difference between the estimated and observed values, with smaller values indicating smaller errors. RMSE provides additional information by penalizing larger errors more strongly and is therefore sensitive to occasional large deviations between estimated and observed values. R2 expresses the degree to which the estimated values explain the variation in the observed values and serves as an indicator of how well the model captures growth progression. Using MAE, RMSE, and R2 together, the model performance was comprehensively evaluated from the perspectives of absolute error and reproducibility of growth dynamics.
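The three metrics have standard definitions, sketched below for completeness:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of estimation error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error: penalizes large deviations more strongly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: variance explained by the estimates."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```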
2.5. Evaluation of Accuracy–Complexity Trade-Off
In addition to estimation accuracy, the computational cost of the proposed pipeline was quantitatively evaluated to clarify the trade-off between accuracy and computational complexity. Because the proposed method integrates multiple large-scale zero-shot models, its computational burden is expected to be substantially higher than that of conventional single-stage detectors. Therefore, a baseline method based on a pure YOLOv5 detector (v5.0; Ultralytics Inc., Frederick, MD, USA) was introduced for benchmarking.
For both YOLOv5 and the proposed method, computational complexity was assessed using four indicators: the number of model parameters, floating-point operations (FLOPs), inference latency, and peak GPU memory usage. All benchmarks were conducted with an input resolution of 1280 × 1280 pixels and batch size 1. To ensure a fair comparison, inference latency was measured as the model-only execution time, excluding disk input/output operations. Each model was warmed up before measurement, and the reported values represent averages over 50 images. FLOPs were obtained using the PyTorch Profiler (PyTorch v2.8.0; Meta Platforms Inc., Menlo Park, CA, USA) by executing the full inference pipeline, and peak GPU memory was recorded.
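A framework-agnostic sketch of the latency measurement (warmup followed by averaging over 50 images, model-only timing with disk I/O excluded) is shown below; FLOPs and peak GPU memory would be collected separately (e.g., with `torch.profiler` and `torch.cuda.max_memory_allocated`), and those steps are omitted here as an assumption-free harness is not possible without the actual models.

```python
import statistics
import time

def benchmark_latency(infer, images, n_warmup=5, n_runs=50):
    """Measure model-only inference latency in milliseconds.

    `infer` is any callable that runs the model on one pre-loaded
    image, so disk input/output is excluded from the timing. Warmup
    iterations discard one-time costs (JIT, cache population) before
    the timed runs are averaged.
    """
    for img in images[:n_warmup]:      # warmup, not timed
        infer(img)
    times_ms = []
    for img in images[:n_runs]:
        t0 = time.perf_counter()
        infer(img)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(times_ms)
```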
The leaf-count and node-estimation accuracies obtained in
Section 2.3 and
Section 2.4 were then analyzed together with these computational metrics. This enabled a direct comparison of how much additional computational cost was required to achieve a given improvement in estimation accuracy, providing a quantitative basis for evaluating the practical trade-off between model complexity and performance in real-world greenhouse monitoring.
2.6. Sharing of Growth Information
In this study, an automated notification system was developed using the online communication platform Discord (Discord Inc., San Francisco, CA, USA) to share node-estimation results among growers, extension officers, and researchers for growth management and technical guidance [
37]. The system was implemented using a Python script, which was automatically executed upon completion of the node-estimation process. After each daily analysis, graphs showing node number progression for each cultivar and representative images were posted to a designated Discord channel, enabling stakeholders to access the same information immediately.
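A minimal sketch of the posting step is shown below. The paper does not state whether a webhook or a bot account was used; a webhook is assumed here as the simplest mechanism, the message format is hypothetical, and graph/image attachments (which require multipart uploads) are omitted.

```python
import json
import urllib.request

def build_notification(grower: str, cultivar: str,
                       estimated_nodes: float) -> dict:
    """Compose the daily message body (pure helper; format is illustrative)."""
    return {
        "content": (
            f"Node estimate for grower {grower}, cultivar {cultivar}: "
            f"{estimated_nodes:.1f} nodes"
        )
    }

def post_to_discord(webhook_url: str, payload: dict) -> None:
    """Send a message through a Discord webhook.

    `webhook_url` is issued by the channel administrator. Error
    handling and image attachments are omitted in this sketch.
    """
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```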
This notification function was integrated into a remote greenhouse-monitoring system that was previously developed and operated by the authors [
33]. In this system, environmental data, such as greenhouse air temperature, relative humidity, soil temperature, and solar radiation, are automatically delivered to Discord. Consequently, both growth information (node number trends and leaf count images) and environmental information can be viewed in a unified interface, allowing stakeholders to provide guidance and make management decisions based on the relationship between environmental fluctuations and plant responses.
The shared Discord channel was designed to facilitate real-time communication through comment threads, enabling discussion of growth conditions and adjustments to management strategies. These functions substantially improve the efficiency of on-site growth monitoring and information sharing, and further contribute to strengthening knowledge exchange and technical support among growers within the region.
3. Results
3.1. Construction of the Node Count Estimation Model (Trial 1: Seasonal Crop 2024)
In the 2024 seasonal cropping trial, a node-count estimation model based on zero-shot vision models was constructed, and its initial performance was evaluated using data from three fields in Namie Town (Growers A, B, and C). In this trial, a regression analysis was performed to model the relationship between the counted leaf number obtained by the proposed system and the manually observed node number. The relationship between the counted leaf number using the system (horizontal axis) and the manually observed node number (vertical axis) is shown in
Figure 3, and the estimation accuracy metrics are summarized in
Table 2. A strong linear relationship was observed with a coefficient of determination of
R2 = 0.930 and a mean absolute error (MAE) of 0.73 (n = 26). Trends were generally consistent across growers, and the model successfully reproduced the pattern of increase in node number, even during the later growth stages when leaf density increased. This indicates that the model effectively functioned as a quantitative indicator of the growth stage.
Examining errors by grower, Growers B and C showed MAE values of 0.68 and 0.57, respectively, indicating deviations within ±0.7 nodes from the observed values. In contrast, Grower A exhibited a slightly higher MAE of 0.83, with a tendency to underestimate the number of nodes, particularly in the later growth stages.
These results demonstrate that node-estimation using zero-shot vision models can achieve sufficient accuracy without requiring additional domain adaptation, such as few-shot learning or fine-tuning, and that a stable estimation performance can be maintained across growers under different imaging conditions. The model constructed in this trial was subsequently applied to accuracy verification trials (Trials 2 and 3), enabling further evaluation of reproducibility across diverse production environments.
3.2. Verification of Node-Estimation Accuracy (Trial 2: Retarded Crop 2024)
In Trial 2, the accuracy of the node-estimation model constructed in Trial 1 was evaluated using a dataset from a different cropping type (retarded cropping). This trial targeted Grower A, and the model parameters obtained in Trial 1 were applied without retraining, allowing assessment of the model's generalization performance across seasonal and cultivar changes. The mean absolute error (MAE) was 0.45, indicating that the estimation errors remained well within one node and that the estimated values were closely aligned with the observed measurements (RMSE = 0.46). In particular, the model consistently reproduced the node numbers during the intermediate growth stage at approximately four to six nodes.
These results show that the zero-shot-based node-estimation model developed in Trial 1 can maintain a stable performance across different cropping seasons without additional training. Based on these findings, Trial 3 expanded the number of growers and cultivars to evaluate the reproducibility under a broader range of conditions.
3.3. Re-Evaluation of Node-Estimation Accuracy (Trial 3: Seasonal Crop 2025)
In Trial 3, the node-estimation model constructed and validated in Trial 1 during the previous year was applied directly to the data from the following year to verify its reproducibility and generalization performance across multiple growers and cultivars.
As shown in
Figure 4, the overall estimated values exhibited good agreement with the observed node numbers, yielding
R2 = 0.768 and MAE = 1.14 (n = 46). Although the sample size increased relative to Trials 1 and 2 and the coefficient of determination decreased, estimation errors generally remained within approximately ±1 node, indicating that the model retained a certain level of reproducibility. However, in the early growth stages (approximately 1–4 nodes), the stability of leaf detection decreased, causing the estimated node numbers to cluster around one and thereby tend toward underestimation. When the Trial 3 data were stratified by growth stage, the early stage (observed node number <4, n = 14) exhibited larger relative errors (MAE = 1.16 nodes, RMSE = 1.30 nodes, mean absolute percentage error (MAPE) = 43.4%) than the later stage (≥4 nodes, n = 32; MAE = 1.11 nodes, RMSE = 1.38 nodes, MAPE = 16.5%).
The regression results for each grower are shown in
Figure 5, and the accuracy metrics are listed in
Table 3. Although an underestimation was observed in the early growth stages for all growers, the estimated values for Growers A and D increased in alignment with the 1:1 line during the mid- to late-growth stages, maintaining an approximately linear relationship. During the late growth stages, variability, likely attributable to cultivar differences, was observed. For Grower C, the estimated number of nodes tended to be lower throughout the entire period. In the latter half of the image analysis period (12 June), the average plant height of Grower C was 25.7 cm, compared with 30.2 cm for Grower A and 38.4 cm for Grower D (four-cultivar mean). Although initial plant height did not differ markedly at transplanting, Grower C consistently exhibited weaker growth during the analysis period, which likely contributed to lower node estimates. Overall, the MAE values for each grower ranged from 0.91 to 1.32 nodes, with errors for all growers remaining within approximately ±1 node.
Furthermore, cultivar-specific regression analyses for the four cultivars (JL, CW, HW, and LM) are presented in
Figure 6, and the accuracy metrics are listed in
Table 4. All cultivars exhibited a clear positive correlation between the estimated and observed node numbers, confirming that the linear model-based estimation approach was effective across cultivars. The MAE values for all cultivars were within 1.3 nodes, and no extreme degradation in performance attributable to cultivar characteristics was observed. For JL and CW, the MAE values were 1.30 and 1.17, respectively, indicating moderate accuracy. HW, the cultivar used to determine the model parameters, achieved an MAE of 1.11, indicating high consistency. LM, a late-maturing cultivar, was the only cultivar with an MAE below 1.0 (MAE = 0.93), and it exhibited the most stable performance among the four cultivars. Comparing the cultivar maturity groups, late-maturing cultivars tended to show lower errors and more stable estimates than mid-maturing cultivars (
Table 4). These findings suggest a possible association between node-estimation accuracy and cultivar maturity characteristics.
Finally, the node estimates obtained in Trial 3 were automatically integrated into the growth information–sharing system using Discord and utilized for feedback among growers, extension officers, and researchers.
3.4. Benchmark Comparison with a Single-Stage YOLO Detector
To characterize the computational and practical properties of the proposed pipeline, we compared it with a pure YOLOv5 detector. YOLOv5 directly estimates leaf counts within a single-stage detection framework, whereas the proposed method employs a multi-stage pipeline integrating Grounding DINO, a YOLO-based classifier, and MiDaS.
As shown in
Table 5, YOLOv5 is computationally lightweight, with 7.0 million parameters, 63 GFLOPs, and an average inference time of 34 ms per image. In contrast, the proposed method requires 518 million parameters and 860 GFLOPs, resulting in an inference time of 1811 ms and peak GPU memory usage of 4.8 GB. This difference mainly arises from the transformer-based architectures of Grounding DINO and MiDaS and the repeated classification of multiple candidate regions (on average 88 per image).
Despite the higher computational cost, the proposed method achieved substantially better node-estimation accuracy. As shown in
Table 6, in Trial 1 the coefficient of determination increased from 0.723 to 0.930 and the RMSE decreased from 1.73 to 0.91, corresponding to an approximately 47% reduction in error. In Trial 3, the proposed method also showed slightly higher
R2 and lower RMSE and MAE than YOLOv5, indicating comparable or better generalization.
Overall, these results reveal a clear accuracy–complexity trade-off: although the proposed method is much more computationally demanding than YOLOv5, it provides more accurate and robust node estimation under diverse field conditions.
3.5. Sharing of Growth Information via Discord Notifications
In Trial 3, an automated notification system using Discord was employed to share daily node-estimation data among growers, extension officers, and researchers. The notification content was automatically generated by a Python script and sent immediately after the analysis on Google Colaboratory was completed. Each notification included a graph of the node-number progression up to the current day and analytical images showing the leaf detection results, allowing members of the channel to promptly grasp the growth status of each grower (Figure 7). The notifications displayed the node-estimation results for each grower and cultivar in graphical form, enabling simultaneous confirmation of both numerical indicators and visual information.
Notifications were issued twice a week (every 2–3 days), reflecting the updated node estimates derived from camera images in each field. Following each notification, growers and extension officers posted comments and questions on the channel. Based on the observed progression of growth in the images, these exchanges led to discussions directly related to growth management, such as irrigation volume and method, timing of additional fertilization, and pest and disease control strategies. Additionally, among growers cultivating the same cultivar across multiple fields, graphical comparison of growth progress served as an opportunity for mutual learning, allowing them to exchange information on differences in cultivation practices such as irrigation and temperature management.
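A notification of this kind can be assembled in a few lines of Python. The sketch below builds only the JSON message body; the message format and field values are hypothetical, not the script used in the trial, and the deployment-specific webhook URL is omitted:

```python
import json

def build_notification(grower, cultivar, node_count, date):
    """Assemble a Discord webhook message body for one day's estimate.

    Only the JSON payload is built here; sending it is an HTTP POST of
    this body to the channel's webhook URL.
    """
    content = (
        f"[{date}] Grower {grower} / {cultivar}: "
        f"estimated node count = {node_count:.1f}"
    )
    return json.dumps({"content": content})

payload = build_notification("A", "HW", 12.34, "2025-06-01")
```

In Discord's webhook API, image attachments such as the progression graphs are sent alongside the JSON payload as multipart form data.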
4. Discussion
4.1. Effectiveness of the Zero-Shot Approach for Node Estimation
The node-estimation model constructed in this study demonstrated the effectiveness of the zero-shot approach centered on Grounding DINO, yielding strong performance in Trial 1 with R² = 0.930 and an MAE of 0.73 nodes. Furthermore, the three-model hybrid architecture, which combines Grounding DINO with MiDaS for depth estimation and a YOLO classifier, effectively suppressed greenhouse-specific noise (background structures, overlapping plants, and variations in lighting conditions) and contributed to stable leaf counting. In addition, the morphological characteristics of Lisianthus, whose opposite phyllotaxy is favorable for leaf count–based estimation, also supported the accuracy of the model. The fact that estimation errors generally remained within ±1 node suggests that the proposed method is suitable for practical use as a growth-stage indicator in production environments.
A previous study involving 3D modeling analysis reported a node-estimation error of 1.2 nodes [24]. In Trial 3 of the present study, the overall analysis across all growers yielded an MAE of 1.14 nodes, and grower-specific analyses showed MAE values ranging from 0.91 to 1.32 nodes. These results indicate that the performance obtained in this study is comparable to or, under certain conditions, may exceed that of previous research.
4.2. Generalization Performance Across Cropping Seasons and Growers
The node-estimation model constructed in this study exhibited strong generalization performance across different cropping seasons and growers. When the model developed using the seasonal cropping data in Trial 1 was applied to the retarded cropping dataset in Trial 2 without additional training, it maintained high accuracy (MAE = 0.45). Moreover, the model could be reused as-is in the seasonal cropping trial conducted the following year (Trial 3), where it achieved an accuracy of approximately ±1 node (MAE = 1.14) even under more diverse conditions involving multiple growers and cultivars. This demonstrates the practical applicability of the model for real-world deployment.
Furthermore, the fact that the estimation accuracy did not markedly deteriorate among growers with different levels of cultivation experience suggests that the model may be capable of absorbing a certain degree of variability in plant growth arising from differences in grower practices.
4.3. Influence of Growth Stage Differences on Node-Estimation Error
In Trial 3, node numbers tended to be underestimated during the early growth stage owing to delayed bolting and limited stem elongation. Previous studies have reported that, in young leaves, the accuracy of leaf segmentation decreases because of variations in leaf shape, orientation, and occlusion caused by overlapping tissues [38]. Another study noted that dramatic morphological differences between juvenile and mature leaves could complicate the construction of unified models across different growth stages [39]. In the present study, morphological bottlenecks, such as small leaf size, overlapping leaf structures, and short internode length in the early stages (below approximately four nodes), likely contributed to reduced estimation accuracy. Consistent with this interpretation, quantitative evaluation showed that relative errors were markedly higher during the early growth stage, whereas MAPE decreased substantially once plants exceeded approximately four nodes. Because node-based management becomes practically relevant mainly after bolting, this indicates that the proposed method is primarily reliable from the mid-growth stage onward, while estimates during the earliest stages should be interpreted with caution.
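The stage-dependent behaviour of MAPE can be made concrete with a small numerical sketch (the pairs below are hypothetical; because MAPE divides by the observed value, small early-stage node counts inflate the relative error even for modest absolute errors):

```python
def mape(pairs):
    """Mean absolute percentage error over (estimated, observed) pairs."""
    return 100.0 * sum(abs(e - o) / o for e, o in pairs) / len(pairs)

# Hypothetical (estimated, observed) node pairs across the season.
pairs = [(1.0, 2), (2.5, 3), (6.1, 6), (9.8, 10), (13.2, 13)]

# Split at the ~4-node threshold discussed above.
early = [(e, o) for e, o in pairs if o <= 4]
later = [(e, o) for e, o in pairs if o > 4]

early_mape = mape(early)  # small denominators inflate relative error
later_mape = mape(later)
```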
In addition, Trial 3 revealed that Grower C consistently showed lower estimated node numbers throughout the period and that variability increased in the later growth stages owing to cultivar-specific traits. In the case of Grower C, reduced growth vigor, characterized by lower plant height and limited stem elongation, likely increased leaf occlusion and decreased the visibility of individual leaves, leading to systematic underestimation of node numbers. These findings indicate that differences in growth characteristics contribute to node-estimation errors. This represents a practical limitation of the system when applied to real production environments, and highlights an important issue to be addressed in future model improvements.
4.4. Effects of Cultivar Differences and Maturity Characteristics
In phenotyping research, it is well-established that differences in leaf morphology and plant architecture across cultivars influence model performance [40,41]. Consistent with these findings, the cultivar-specific analyses in the present study demonstrated that inherent variations in leaf morphology and plant form among Lisianthus cultivars can affect node-estimation accuracy.
A particularly notable result was that the late-maturing cultivar Largo Marine (LM) exhibited the smallest error (MAE ≈ 0.9). This may be attributable to the greater number of days required to form each node and the comparatively longer internodes of late-maturing cultivars, which likely reduced leaf overlap and stabilized leaf recognition in the images. Prior research on other crops has reported that maturity class influences leaf emergence and internode elongation [42,43].
In contrast, the early–medium cultivar Julius Lavender (JL) and the medium-maturing Celebrich White (CW) exhibited larger errors than Largo Marine (LM), likely owing to occlusion caused by greater leaf overlap, which destabilized leaf detection. The cultivar Happiness White (HW), which was used in model construction, also showed comparatively high accuracy in the subsequent year's trial, suggesting that growth characteristics similar to those in the training environment contributed to improved performance.
Overall, these findings suggest a relationship between cultivar maturity class (early–mid–late) and node-estimation accuracy; late-maturing cultivars tend to exhibit more stable leaf detection and lower estimation errors.
4.5. Influence of Imaging Conditions on Estimation Accuracy
In Trial 1, the node-estimation error for Grower A was slightly larger than that for Growers B and C. One plausible explanation is the influence of the camera's field of view, which is determined by its installation height and angle. In Trial 1, the camera used by Grower A was installed at a slightly more downward-facing angle than those of the other growers, which limited its ability to capture the increased number of leaves during later growth stages. This likely contributed to the underestimation of node numbers, particularly in the late growth period of Grower A.
In actual production fields, maintaining a constant field of view with fixed cameras can be challenging, owing to physical and spatial constraints. Nevertheless, when comparing across fields, standardizing imaging conditions as much as possible, along with applying preprocessing steps, such as lens distortion correction and cropping images to a “ten-plant equivalent width,” as implemented in this study, can partially compensate for differences in camera installation and improve consistency across growers.
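Given the 12 cm plant spacing used in the trials, the "ten-plant equivalent width" crop reduces to simple pixel arithmetic once the image scale is known; a minimal sketch (the pixels-per-centimetre factor is hypothetical and would in practice come from camera calibration after distortion correction):

```python
def ten_plant_crop(image_width_px, px_per_cm, plant_spacing_cm=12, n_plants=10):
    """Centered horizontal crop spanning n_plants at the given spacing.

    Returns (left, right) pixel columns of the crop window.
    """
    crop_px = round(n_plants * plant_spacing_cm * px_per_cm)
    if crop_px > image_width_px:
        raise ValueError("image is narrower than the requested crop")
    left = (image_width_px - crop_px) // 2
    return left, left + crop_px

# e.g. a 1920 px wide frame at a hypothetical scale of 10 px/cm
left, right = ten_plant_crop(1920, 10)
```

Normalizing every camera to the same physical width in this way is what makes leaf counts comparable across growers despite differing installations.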
4.6. Accuracy–Complexity Trade-Off of the Proposed Pipeline
Based on the benchmark results in Section 3.4, the proposed multi-stage pipeline exhibits a clear trade-off between estimation accuracy and computational cost. Compared with the single-stage YOLOv5 detector, the proposed method requires substantially more parameters, FLOPs, and GPU memory, and therefore shows much higher inference latency. However, this increased computational burden yields a pronounced improvement in node-estimation accuracy, particularly in Trial 1, where the coefficient of determination increased from 0.723 to 0.930 and the RMSE was reduced by nearly half. In Trial 3, which involved different growers and cultivars in the following year, the proposed method also achieved slightly better R² and lower error metrics than YOLOv5, indicating that the accuracy gains were not limited to the training season.
This trade-off suggests that the proposed method is best suited for high-precision phenotyping and monitoring scenarios in which robustness and accuracy are prioritized over real-time performance. Owing to the zero-shot nature of Grounding DINO and MiDaS, the pipeline can be applied to new cultivars, growers, and cropping conditions without retraining the zero-shot models. Only minimal additional training is required for the auxiliary classifier, which is a major advantage for deployment in heterogeneous agricultural environments. In contrast, for applications with strict time or resource constraints, such as real-time field robotics or embedded systems, lightweight detectors such as YOLOv5 remain a more appropriate choice.
4.7. Practical Implementation and the Effects of Information Sharing on Communication
The growth monitoring system developed in this study demonstrated stable performance even with a low-cost hardware configuration combining a Raspberry Pi and an ATOM Cam2. Cloud integration via Google Drive facilitated seamless operations, from image capture to analysis and data storage, substantially reducing the barriers to adoption in production settings.
Moreover, the automated sharing of growth information via Discord enabled growers, extension officers, and researchers to reference the same data simultaneously. Bidirectional communication through the comment features improved the accuracy of growth assessment. In particular, the near elimination of data transmission delays allowed stakeholders to monitor daily growth conditions in near real time, and expert growers, extension officers, and researchers could remotely observe the cultivation status and provide timely guidance when necessary.
Integrating node-number trends with environmental data collected from the existing environmental monitoring system [5,33]—such as temperature, humidity, and solar radiation—further enhanced the interpretation of growth delays and management differences, accelerating decision-making. These outcomes suggest that the system improves information-sharing efficiency among stakeholders and functions effectively as a practical field-ready technology for real-world implementation. From a practical perspective, the proposed system can be applied to real-world farming scenarios such as remote monitoring of crop growth progression, early identification of growth delays, and comparative assessment of cultivation practices across multiple fields or growers. By providing objective and continuous growth indicators derived from daily images, the system has the potential to reduce reliance on subjective visual assessments and to support growers with limited experience. In addition, the low-cost and remote-based nature of the system makes it particularly suitable for small-scale or labor-constrained farming operations, as well as for regions where farms are geographically dispersed and direct, frequent on-site technical guidance is difficult, such as areas undergoing agricultural revitalization.
The data-sharing system constructed in this study is small-scale, decentralized, and flexible, offering advantages in enabling farmers to make decisions with enhanced transparency [44]. However, the volume of collected data is substantial, and challenges remain regarding how to organize the accumulated big data and share it in a form that is easily interpretable for growers [45]. While the present study focuses on demonstrating the system implementation and operational feasibility, quantitative evaluation of the effectiveness of information sharing—such as its impact on skill acquisition by new growers—would require psychological and behavioral analyses, which are beyond the scope of this study and should be addressed in future work.
4.8. Limitations of the Demonstration Trials and Future Challenges
This study is based on three demonstration trials conducted in a specific region, Namie Town, Fukushima Prefecture, and therefore requires further validation in other regions, cropping types, and crop species. Additionally, the cultivation periods examined were limited to spring–summer and autumn, and data are lacking for winter conditions, where low temperatures may influence imaging characteristics and plant growth behavior.
Moreover, the proposed system assumes conditions under which plants generally maintain normal growth. The behavior of this model under abnormal growth conditions—such as lodging, excessive elongation, chemical injury, or pest and disease damage—has not been sufficiently evaluated. Future studies should not only examine the limits of node estimation under such conditions, but also consider integrating disease-detection algorithms and assessing the reliability of node estimates by comparing normal and abnormal growth patterns. Indeed, in Trial 3, analysis for Grower B was not possible because of severe growth suppression.
Because the experiments were conducted in greenhouse environments, the effects of extreme outdoor weather events were not directly investigated. Moreover, although the study site experiences relatively stable natural light conditions, the logic for selecting representative daytime images that maximizes analytical accuracy remains undeveloped. More robust methods for extracting and correcting images suitable for analysis under real-world production environments are required. In addition, rare visibility-degrading phenomena inside facilities, such as fog or haze, were not explicitly evaluated and remain important subjects for further examination, as they may affect image contrast and detection stability under certain conditions.
Extension of the proposed framework to other crop species also represents an important future challenge. The authors conducted preliminary trials applying the same leaf-counting pipeline to stock (Matthiola incana (L.) R. Br.), another floricultural crop, and confirmed that the leaf-counting process functioned effectively regardless of crop species. However, node estimation required further improvement owing to occlusion effects specific to plant architecture. These observations suggest that, while the proposed approach has a certain degree of generality, estimation accuracy must be carefully evaluated for each crop species.
Given these limitations, additional validation across diverse regions, seasons, crop species, and environmental conditions is essential to improve the implementation and generalizability of the system.
5. Conclusions
This study presented a hybrid zero-shot–based framework for estimating node numbers and sharing growth information in Lisianthus cultivation under real greenhouse conditions in Fukushima. By integrating a vision–language model (Grounding DINO), depth estimation (MiDaS), and a lightweight classifier (YOLO), the proposed system enabled robust leaf detection and node estimation without crop-specific retraining of the zero-shot models, while effectively suppressing background noise and occlusion.
Through three demonstration trials conducted across different cropping seasons, years, growers, and cultivars, the system achieved practical accuracy, with typical estimation errors within approximately ±1 node and stable generalization performance. Additional analyses showed that estimation reliability improves after the early growth stage (around four nodes), providing useful guidance for practical deployment.
Beyond phenotyping accuracy, the study demonstrated a low-cost, field-deployable information-sharing platform using consumer-grade cameras, Raspberry Pi devices, and cloud services. This infrastructure allowed growers, extension officers, and researchers to simultaneously monitor growth dynamics and exchange feedback in near real time, offering a practical foundation for digital support in floricultural production, particularly in regions with dispersed farms and limited on-site advisory capacity.
While further validation across regions, crop species, and stress conditions is required, the proposed framework provides a practical, generalizable, and extensible data-driven crop growth monitoring framework that can support real-world floricultural production. Ultimately, it can also help rebuild resilient agricultural communities that sustain people living and working in post-disaster regions.