Semi-Self-Supervised Domain Adaptation: Developing Deep Learning Models with Limited Annotated Data for Wheat Head Segmentation

Precision agriculture involves the application of advanced technologies to improve agricultural productivity, efficiency, and profitability while minimizing waste and environmental impact. Deep learning approaches enable automated decision-making for many visual tasks. However, in the agricultural domain, variability in growth stages and environmental conditions, such as weather and lighting, presents significant challenges to developing deep learning-based techniques that generalize across different conditions. The resource-intensive nature of creating extensive annotated datasets that capture these variabilities further hinders the widespread adoption of these approaches. To tackle these issues, we introduce a semi-self-supervised domain adaptation technique based on deep convolutional neural networks with a probabilistic diffusion process, requiring minimal manual data annotation. Using only three manually annotated images and a selection of video clips from wheat fields, we generated a large-scale computationally annotated dataset of image-mask pairs and a large dataset of unannotated images extracted from video frames. We developed a two-branch convolutional encoder-decoder model architecture that uses both synthesized image-mask pairs and unannotated images, enabling effective adaptation to real images. The proposed model achieved a Dice score of 80.7% on an internal test dataset and a Dice score of 64.8% on an external test set, composed of images from five countries and spanning 18 domains, indicating its potential to develop generalizable solutions that could encourage the wider adoption of advanced technologies in agriculture.


Introduction
Precision agriculture refers to the use of advanced technologies, including GPS guidance, control systems, sensors, robotics, drones, autonomous vehicles, and variable rate technology, to optimize farm management. It aims to reduce costs and increase yield in farming while ensuring sustainability and environmental protection [1]. Deep learning (DL) methodologies can offer efficient, automated, and data-driven decision-making in agricultural practices. DL has demonstrated significant advancements in visual data analysis across various tasks, including image classification [2,3], object detection [4,5], semantic segmentation [6,7,8,9], and instance segmentation [10,11,12]. Automated visual monitoring of agricultural fields can aid in the early identification of issues such as pest infestations, diseases, and nutrient deficiencies while optimizing resource utilization. Consequently, this approach facilitates timely interventions, which can lead to enhanced crop yields, improved quality of harvests, and increased overall operational efficiency, thereby ensuring sustainability.
However, the wide adoption of DL approaches for crop monitoring faces significant challenges. Agricultural fields are constantly changing environments. For example, a crop field in the early growth stages is substantially different from the same field in the later stages of growth. Additionally, these systems must operate accurately under various weather and lighting conditions. These variations challenge the generalizability of DL models because models trained on data from a specific growth stage of a field might not generalize well to the same crop at different growth stages. In the DL context, this phenomenon is known as a distribution shift [13,14]. One could develop a large-scale dataset encompassing various crop growth stages and environmental conditions to alleviate this issue. However, this approach presents challenges in data collection and annotation. Collecting data from various growth stages of crop fields under various weather and lighting conditions is time-consuming. Further, annotating agricultural images is particularly challenging, often requiring pixel-level annotation. These images frequently contain numerous objects of interest (e.g., wheat spikes in a wheat field), making the data annotation process laborious. Domain adaptation techniques refer to the methodologies used to alleviate distribution shift and can be divided into supervised, semi-supervised, self-supervised, and unsupervised approaches depending on whether or not annotated data is used for domain adaptation [15,16]. Supervised domain adaptation relies on labeled data from both source and target domains. Semi-supervised domain adaptation combines labeled data from the source domain with both labeled and unlabeled data from the target domain. Finally, unsupervised domain adaptation operates solely with unlabeled data in the target domain, focusing on learning features applicable across domains.
In this paper, we develop a semi-self-supervised domain adaptation approach based on a probabilistic diffusion model that operates without the need for manual annotation of data from the target domain. This alleviates the need for further manual data annotation, accelerating the model development process. Self-supervised learning refers to methodologies where supervisory signals are generated computationally from the input data, enabling learning data representation and extracting informative features without the need for manual annotation [17]. This approach allows for utilizing large-scale but unannotated datasets and has recently shown substantial progress in various image-processing tasks [17,18]. Deep diffusion probabilistic models, as self-supervised techniques, have shown remarkable results in image-processing tasks, especially in Generative AI [19,20,21]. In the following, we provide a mathematical description of these models.

Background
A Markov chain is a sequence of random variables $X_1, X_2, X_3, \ldots$ that satisfies the Markov property, which states that the probability of the system transitioning to the next state depends solely on the current state of the system and not on the preceding states. The Markov property can be expressed mathematically as follows:

$$P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)$$

In this equation, $P$ denotes a probability distribution for the state of the system, $X_i$ is a vector-valued variable representing the state at the $i$-th step, and $x, x_1, x_2, \ldots, x_n$ are specific state vector values of the system.
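Given a data point $x_0$ from the actual data distribution, we can define a Markov chain, referred to as a forward diffusion process, by iteratively adding a multivariate Gaussian noise vector to $x_0$. More specifically, at state $X_{t-1}$, where the system has a specific state of $x_{t-1}$, we add Gaussian noise, characterized by a mean vector $\mu_t$ and a covariance matrix $\Sigma_t$, to $x_{t-1}$ to transition to a state $X_t$ with a specific value of $x_t$. The probability distribution for this transition can be described as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \mu_t = \sqrt{1-\beta_t}\,x_{t-1},\ \Sigma_t = \beta_t I\right),$$

where $\beta_t$ is a scalar; $I$ is the $m \times m$ identity matrix; and $m$ is the cardinality of $x_i$ ($0 \le i \le T$). Considering the Markov chain property, we have

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),$$

where $q(x_{1:T} \mid x_0)$ is a short notation for $q(x_1, x_2, \ldots, x_T \mid x_0)$. A state $x_t$ can be reached by iteratively applying the Gaussian noise to $x_0$, $t$ times. Instead of an iterative process, this can also be achieved in a single step using a reparametrization trick, in which $x_t$ is rewritten as

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon, \tag{2}$$

where $\epsilon$ is an $m$-dimensional vector with standard normal distribution $\mathcal{N}(0, I)$. It can be inferred that this reparametrization preserves the distribution of $x_t$, i.e., $x_t \sim \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$. By changing the notation as $\alpha_t = 1 - \beta_t$, Equation 2 can be written as:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon \tag{3}$$

As Equation 3 is recursive, we can expand it further by substituting the same expression for $x_{t-1}$:

$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1})}\,\epsilon_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t, \tag{4}$$

where $\epsilon_{t-1}$ and $\epsilon_t$ are independent noise vectors. Since each has a standard normal distribution, the two noise terms are independent zero-mean Gaussians with covariances $\alpha_t(1-\alpha_{t-1}) I$ and $(1-\alpha_t) I$, and their sum is a zero-mean Gaussian with covariance $(1 - \alpha_t \alpha_{t-1}) I$. Therefore, Equation 4 can be written as:

$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\epsilon \tag{5}$$

By expanding Equation 5 further, we obtain the following equation:

$$x_t = \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_1}\,x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \cdots \alpha_1}\,\epsilon \tag{6}$$

Defining $\bar{\alpha}_t = \alpha_t \alpha_{t-1} \cdots \alpha_1$, Equation 6 can be written concisely as follows:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon \tag{7}$$

Given $x_0$, Equation 7 allows generating $x_t$ in a computationally efficient manner in just one step, rather than iteratively adding noise in $t$ steps. This is achieved by sampling from the Gaussian distribution $\mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right)$. The $\beta_i$ values are often defined using a linear [22] or cosine scheduler [22,23], with the cosine schedule leading to superior results as it smoothly transitions the noise added to the data across the diffusion steps, which is essential for controlling the quality and characteristics of the generated data in diffusion models. In the cosine scheduler, $\beta_t$ represents the noise level at step $t$ (higher $\beta_t$ values correspond to higher noise levels); $t = 1$ and $t = T$ result in the minimum and maximum noise levels, $\beta_{\min}$ and $\beta_{\max}$, respectively; $T$ is the total number of steps in the diffusion process; and $t$ is the current time step. All experiments in this study were conducted using $\beta_{\min} = 0.0001$ and $\beta_{\max} = 0.02$.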
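As an illustration of the single-step sampling in Equation 7, the following is a minimal sketch in plain Python. It is not the implementation used in this study. The half-cosine interpolation between the minimum and maximum noise levels is an assumed schedule that merely satisfies the stated boundary behavior (minimum noise at t = 1, maximum at t = T); the value T = 1000, the 16-element input, and all function names are hypothetical.

```python
import math
import random


def cosine_beta_schedule(num_steps, beta_min=1e-4, beta_max=0.02):
    """Return [beta_1, ..., beta_T], rising from beta_min to beta_max.

    Assumption: a half-cosine interpolation between the two bounds.
    """
    betas = []
    for t in range(1, num_steps + 1):
        frac = (t - 1) / (num_steps - 1) if num_steps > 1 else 0.0
        weight = 0.5 * (1.0 - math.cos(math.pi * frac))  # goes from 0 to 1
        betas.append(beta_min + (beta_max - beta_min) * weight)
    return betas


def cumulative_alphas(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T], the running product of (1 - beta_i)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in one step (Equation 7); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


betas = cosine_beta_schedule(num_steps=1000)  # T = 1000 is a hypothetical choice
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [0.5] * 16  # hypothetical flattened image patch
x_t = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Because the cumulative product shrinks monotonically, later timesteps retain less of the original signal and more noise, which is the behavior the forward process relies on.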
As $T$ approaches infinity, $x_T$ converges to an isotropic Gaussian distribution, meaning that all components of $x_T$ are normally distributed and decorrelated. Therefore, by having a procedure for learning the reverse distribution $q(x_{t-1} \mid x_t)$, we can start with sampling $x_T$ from a normal distribution and follow the reverse distribution to arrive at $x_0$. It should be noted that in the forward diffusion process, noise is added at each step; that is, $x_t$ is derived from adding noise to $x_{t-1}$ according to a distribution of $q(x_t \mid x_{t-1})$. In the backward diffusion process, however, the aim is to reverse the forward process by removing the noise added to $x_t$ to recover $x_{t-1}$. This is achieved by a reverse transition using a distribution denoted as $q(x_{t-1} \mid x_t)$.
In practical terms, the true reverse distribution $q(x_{t-1} \mid x_t)$ is unknown and intractable to compute directly, as accurately estimating it would require complex calculations involving the entire data distribution. Instead, we approximate $q(x_{t-1} \mid x_t)$ using a parameterized model, a deep neural network, denoted as $p_\theta(x_{t-1} \mid x_t)$. Given that $q(x_{t-1} \mid x_t)$ is also Gaussian for small enough values of $\beta_t$ in the forward diffusion process, we can choose $p_\theta(x_{t-1} \mid x_t)$ to be a Gaussian distribution. In this case, the neural network is trained to parameterize the mean and variance of this Gaussian distribution.
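For reference, this Gaussian parameterization is commonly written as shown below. The symbols $\mu_\theta$ and $\Sigma_\theta$ are names introduced here for the mean and covariance produced by the network; they are an assumption of this sketch rather than notation defined in the text above.

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p(x_T) = \mathcal{N}(x_T;\ 0,\ I)$$

Under this form, generation would start from $x_T$ drawn from the standard normal distribution and apply the learned reverse transition $T$ times to arrive at $x_0$.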

Data and Methodology
This section provides an in-depth explanation of the data used in this research, along with detailed descriptions of the preprocessing methods, the model architecture, and the processes for model training and evaluation. We also use two sets, Ψ and Γ, as the internal and external test sets, as introduced by Najafian et al. [8]. The set Ψ comprises 100 image frames, which were randomly selected from a video clip of a wheat field and manually annotated to serve as an internal test set. The set Γ is a subset of 365 manually annotated images from the GWHD dataset [24], which includes images from five countries and spans 18 domains, covering various growth stages of wheat.

Model Architecture
Figure 3 illustrates the convolutional neural network (CNN)-based model architecture employed in this research. This architecture is designed to facilitate representation learning through a diffusion process, in addition to performing the primary task of segmentation. The segmentation task is executed using computationally synthesized images along with their corresponding masks. Meanwhile, the representation learning aspect enables adaptation to real images using solely unannotated data, thereby reducing the reliance on manual annotation.
We use an encoder to develop a joint representation of both synthesized and real images. Additionally, the model architecture comprises an image decoder and a mask decoder. The former is designed to reconstruct the image from the output of the encoder, thereby enforcing learning features from real images. The mask decoder, on the other hand, is tasked with producing a segmentation mask given the output of the encoder.
In implementing the encoder and decoders, we used residual building blocks [25], each comprising two pairs of convolution and GroupNorm layers, supported by the Swish activation function [26] and integrated skip connections. Figure 4 shows a residual building block.
Figure 5 illustrates the encoder, consisting of one convolutional layer followed by 12 ResNet building blocks organized in six levels. Additionally, two skip-connection-free ResNet blocks serve as the network bottleneck. Down-sampling operations are performed after each pair of ResNet blocks using a convolution layer with a kernel size of three and a stride of two. The Swish activation function [26] is consistently applied as the non-linearity across the encoder layers. Synthetic images undergo a series of image augmentations [27] and are then fed into the encoder. The outputs of the encoder are subsequently passed to a mask decoder to generate masks corresponding to the input images. To calculate the segmentation error, we use a linear combination of Binary Cross Entropy (BCE) loss [28] and Dice loss [28] (see Equation 8).
The segmentation loss is the weighted sum

$$\mathcal{L}_{seg} = \lambda_{BCE}\,\mathcal{L}_{BCE} + \lambda_{Dice}\,\mathcal{L}_{Dice}, \tag{8}$$

where $\mathcal{L}_{BCE}$ and $\mathcal{L}_{Dice}$ are the BCE and Dice losses, and $\lambda_{BCE}$ and $\lambda_{Dice}$ are their weights. In these losses, $y_i$ is the true label of pixel $i$, with a value of 1 if the pixel belongs to a wheat head and 0 otherwise; $\hat{y}_i$ represents the model prediction for pixel $i$ being from a wheat head; $N$ is the number of pixels per image; and $B$ is the batch size. $M^s_j$ is the ground truth mask for the synthetic image $I^s_j$, and $\hat{M}^s_j$ is the model prediction for $I^s_j$. We also generate noise-augmented images by applying Gaussian noise to the real image $I_j$ over $t$ ($1 \le t \le T$) timesteps, as detailed in Equation 7. These noise-augmented images are subsequently routed through the encoder, followed by the image decoder, aiming to reconstruct the original images. The error for this branch is calculated by comparing the original image with the reconstructed image using a linear combination of MSE [29], SSIM [30], and a perceptual loss function [31]. The perceptual loss function calculates the mean absolute error between the pretrained ResNet18 output feature maps for the original image $I_j$ and the reconstructed image $\hat{I}_j$. We used the ResNet18_Weights.IMAGENET1K_V1 weights from the Torchvision Python package [32]. We used the feature maps generated by the convolutional layer before the final average pooling layer of the ResNet18 model.
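The segmentation loss of Equation 8 can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes the standard per-pixel BCE and a soft Dice loss, and the clamping constant `eps`, the smoothing term `smooth`, the function names, and the toy masks are all hypothetical choices made for the example.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all pixels of all masks in the batch."""
    total, count = 0.0, 0
    for mask, pred in zip(y_true, y_pred):
        for y, p in zip(mask, pred):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count


def dice_loss(y_true, y_pred, smooth=1.0):
    """One minus the soft Dice coefficient, averaged over the batch."""
    losses = []
    for mask, pred in zip(y_true, y_pred):
        intersection = sum(y * p for y, p in zip(mask, pred))
        denom = sum(mask) + sum(pred)
        losses.append(1.0 - (2.0 * intersection + smooth) / (denom + smooth))
    return sum(losses) / len(losses)


def segmentation_loss(y_true, y_pred, lambda_bce=1.0, lambda_dice=1.0):
    """Weighted sum of BCE and Dice losses (the form of Equation 8)."""
    return (lambda_bce * bce_loss(y_true, y_pred)
            + lambda_dice * dice_loss(y_true, y_pred))


# Toy batch of two flattened 4-pixel masks (1 = wheat head, 0 = background).
masks = [[1, 1, 0, 0], [0, 1, 0, 1]]
good = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.9, 0.2, 0.7]]
bad = [[0.1, 0.2, 0.9, 0.8], [0.9, 0.1, 0.8, 0.3]]
loss_good = segmentation_loss(masks, good)
loss_bad = segmentation_loss(masks, bad)
```

Predictions close to the ground truth yield a smaller combined loss than predictions that invert it, since both terms penalize disagreement: BCE at the level of individual pixels and Dice at the level of mask overlap.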
In the reconstruction loss terms, $B$ is the batch size; $N$ is the number of pixels in an image $I_j$; $x_{ji}$ is the actual value of the $i$-th pixel of $I_j$; $\hat{x}_{ji}$ is the predicted value for the $i$-th pixel of $I_j$; and $\mu_{I_j}$ and $\mu_{\hat{I}_j}$ are the mean intensity values of the images $I_j$ and $\hat{I}_j$, respectively. These losses focus on variations between the reconstructed image and the original noise-free image from different perspectives: MSE loss focuses on pixel-level variations, SSIM loss focuses on local variations, and perceptual loss focuses on high-level global variations. The reconstruction loss is calculated as a linear combination of these loss functions:

$$\mathcal{L}_{rec} = \lambda_{MSE}\,\mathcal{L}_{MSE} + \lambda_{SSIM}\,\mathcal{L}_{SSIM} + \lambda_{Perceptual}\,\mathcal{L}_{Perceptual}, \tag{9}$$

where $\lambda_{MSE}$, $\lambda_{SSIM}$, and $\lambda_{Perceptual}$ are constant values and model hyperparameters.
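The following plain-Python sketch shows how the MSE and SSIM terms of Equation 9 might be combined for a single flattened image. It rests on several assumptions that are not stated in the text: SSIM is computed from global image statistics rather than local windows, the SSIM loss is taken as one minus the SSIM value, the stabilizing constants use the values conventional for intensities in [0, 1], and the perceptual term is passed in as a precomputed number because it requires a pretrained ResNet18. All function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def mse(x, x_hat):
    """Mean squared error between original and reconstructed pixels."""
    return mean([(a - b) ** 2 for a, b in zip(x, x_hat)])


def global_ssim(x, x_hat, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-image means, variances, and covariance (assumed global form)."""
    mu_x, mu_y = mean(x), mean(x_hat)
    var_x = mean([(a - mu_x) ** 2 for a in x])
    var_y = mean([(b - mu_y) ** 2 for b in x_hat])
    cov = mean([(a - mu_x) * (b - mu_y) for a, b in zip(x, x_hat)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def reconstruction_loss(x, x_hat, perceptual=0.0,
                        lambda_mse=1.0, lambda_ssim=1.0, lambda_perceptual=1.0):
    """Weighted sum of MSE, SSIM, and perceptual terms (the form of Equation 9)."""
    ssim_loss = 1.0 - global_ssim(x, x_hat)  # assumed conversion of SSIM to a loss
    return (lambda_mse * mse(x, x_hat)
            + lambda_ssim * ssim_loss
            + lambda_perceptual * perceptual)


original = [0.1, 0.4, 0.5, 0.9]       # hypothetical pixel intensities in [0, 1]
reconstructed = [0.2, 0.3, 0.7, 0.6]
loss_same = reconstruction_loss(original, original)
loss_diff = reconstruction_loss(original, reconstructed)
```

A perfect reconstruction gives an MSE of zero and an SSIM of one, so the combined loss vanishes; any deviation raises both terms.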

Model training and evaluation
Considering the size of the datasets, we train our models over a total of 50 training epochs. For each epoch, we iterate over all the data in the datasets. At each iteration, we select a batch of 32 synthesized image-mask pairs and a separate batch of 32 real images from our datasets. The image-mask pairs are fed into the segmentation branch of our model, which consists of an encoder and a mask decoder. The segmentation loss is then calculated according to Equation 8. Similarly, the real images are fed into the reconstruction branch of our model, comprising the encoder (shared with the segmentation branch) and an image decoder, where the reconstruction loss is calculated as per Equation 9. The total loss is computed by summing the segmentation loss and the reconstruction loss. Figure 3 illustrates this process.
This dual-stream approach ensures that the encoder effectively attends to both types of data, with each decoder specializing in the task corresponding to its branch, namely segmentation or reconstruction. Model parameters are updated using the AdamW optimizer [33], with a learning rate of 0.0001. For all experiments, the coefficients $\lambda_{MSE}$, $\lambda_{SSIM}$, $\lambda_{Perceptual}$, $\lambda_{BCE}$, and $\lambda_{Dice}$ are set to 1. After training for 50 epochs, we selected the model with the highest Dice score as the best model. During the evaluation process, we assessed the model performance using the Dice score and IoU. As a baseline for comparison, we used the model developed by Najafian et al. [8].
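The two evaluation metrics mentioned above can be sketched for binary masks as follows. This is a minimal illustration using the standard definitions of the Dice score and IoU; the function names, the convention of returning 1.0 when both masks are empty, and the toy masks are assumptions made for this example.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * intersection / total if total else 1.0  # assumed empty-mask convention


def iou_score(pred, target):
    """IoU = |pred AND target| / |pred OR target| for binary masks."""
    intersection = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return intersection / union if union else 1.0  # assumed empty-mask convention


# Toy flattened masks: 1 = wheat head pixel, 0 = background.
pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
dice = dice_score(pred, target)
iou = iou_score(pred, target)
```

For a single pair of masks the two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU, which is consistent with the Dice scores in this study being higher than the corresponding IoU values.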

Results
Table 1 presents a quantitative evaluation of the performance of our developed models, namely, model $F_{\eta+\rho_1}$, trained on $D_\eta$ and $D_{\rho_1}$, and model $F_{\eta+\zeta+\rho}$, trained on $D_{\eta+\zeta}$ and $D_\rho$. These models were trained using both synthesized image-mask pairs and unannotated real images extracted from video clips of wheat fields. The performance of the baseline model S from Najafian et al. [8] is also reported in Table 1. The results demonstrate that the developed models, $F_{\eta+\rho_1}$ and $F_{\eta+\zeta+\rho}$, consistently outperformed the baseline model S. These models rely on computationally annotated synthetic images and real unannotated images extracted from the video frames of wheat fields.
Table 1: The performance of the trained models, evaluated on our internal and external test sets using the IoU and Dice scores. Model $F_{\eta+\rho_1}$ was trained on $D_\eta$ and $D_{\rho_1}$, and model $F_{\eta+\zeta+\rho}$ is the result of fine-tuning model $F_{\eta+\rho_1}$ on datasets $D_{\eta+\zeta}$ and $D_\rho$. We also compared the performance of these two models with the model developed in [8]. All of these models rely exclusively on synthesized image-mask pairs and/or unannotated images.
Table 2 presents the performance of the proposed models across various domains of the GWHD dataset. The results indicate that the proposed models generally outperform model S in most domains, exhibiting lower variance and greater stability.

Model Evaluation
Table 2: The performance of the models on each of the 18 domains of the GWHD dataset. Results for model S, which was trained using the synthesized dataset, are taken from [8]. Model $F_{\eta+\rho_1}$ was developed using dataset $D_\eta$, consisting of 8,000 computationally annotated synthesized images, and dataset $D_{\rho_1}$, which includes 5,296 unannotated real images. Model $F_{\eta+\zeta+\rho}$ was developed using dataset $D_{\eta+\zeta}$, consisting of 16,000 computationally annotated synthesized images, and dataset $D_\rho$, which includes 10,592 unannotated real images. We trained model $F_{\eta+\rho_1}$ from scratch, while model $F_{\eta+\zeta+\rho}$ resulted from fine-tuning model $F_{\eta+\rho_1}$ using dataset $D_{\eta+\zeta}$ and dataset $D_\rho$.

Discussion
Precision agriculture aims to integrate advanced technologies to address various challenges in the agricultural sector, enhancing productivity, efficiency, and profitability while minimizing waste and environmental impact. Deep learning approaches play a crucial role in enabling automated decision-making capabilities. By automating the processing of visual data from agricultural fields, these decision-making processes can be significantly improved and scaled up.
However, the development of DL-based techniques encounters challenges stemming from the dynamic nature of agricultural fields, characterized by diverse growth stages, inconsistent weather conditions, and variable lighting. As such, models developed based on one snapshot of these fields often are not generalizable across different growth stages or environmental conditions. Furthermore, developing large-scale annotated datasets across various growth stages and environmental conditions is time-consuming and expensive. These challenges hinder the development of generalizable DL-based solutions for various agricultural tasks. Consequently, developing methodologies that allow DL-based solutions to generalize across different conditions (domains) could facilitate the widespread adoption of these technologies in the agricultural sector.
In response to these challenges, we developed a semi-self-supervised domain adaptation technique that employs deep convolutional neural networks and a probabilistic diffusion process. This approach was developed with minimal data annotation, requiring only three manually annotated images. We classify this method as semi-supervised because it leverages a small amount of annotated data (three images) alongside a significant volume of unannotated data. Additionally, this method can be considered self-supervised as it utilizes computationally annotated data for model development.
The proposed model demonstrated substantial improvements in performance compared to the baseline model from recent work by Najafian et al. [8], while also showing lower variance in model performance across 18 different domains from the GWHD dataset. These 18 domains represent various growth stages of wheat fields and different environmental conditions.
In this study, we devised a two-phase model training process. In the first phase, model $F_{\eta+\rho_1}$ was developed using a large number of computationally annotated samples derived from a single sample for image synthesis and validation, showing nearly a 20% improvement over recent work by Najafian et al. [8] on the target domain. In the second phase, model $F_{\eta+\zeta+\rho}$ leveraged additional computationally generated data, further increasing the performance difference to 28%. In this research, we adopted an extreme strategy by utilizing only two annotated images for training and one for validation. However, we recommend incorporating more manually annotated samples from diverse domains to further enhance the robustness of the proposed approach. Additionally, in this research, we primarily relied on default hyperparameters. Tuning model hyperparameters could lead to further improvements in model performance.

Conclusion
This research introduces a semi-self-supervised domain adaptation methodology characterized by a dual-stream encoder-decoder model architecture, specifically designed for the downstream task of semantic segmentation of wheat heads. We synthesized a large-scale, computationally annotated dataset from three manually annotated images. We also used unannotated image frames extracted from wheat field videos captured at two different growth stages. The model architecture and training strategy, utilizing both image segmentation and reconstruction, were designed to mitigate challenges associated with domain shift while minimizing the need for extensive data annotation. The external evaluation of our proposed approach on a subset of the GWHD dataset demonstrated a substantial improvement in model performance over a recent work. This underscores the utility of our proposed approach in alleviating domain shift, allowing for the development of generalizable models with minimal manual data annotation, which, in turn, could enable the widespread adoption of DL-based approaches in the agricultural sector.

Figure 1: Three manually annotated image-mask pairs were utilized for data synthesis. We developed two training sets by synthesizing computationally annotated images using manually annotated images from the left ($I_\eta$) and the middle ($I_\zeta$), producing 8,000 images based on $I_\eta$ and 8,000 images based on $I_\zeta$. Hereafter, we refer to the 8,000 images developed based on $I_\eta$ as dataset $D_\eta$. We refer to the set comprising all 16,000 images as $D_{\eta+\zeta}$. Additionally, we created a validation set by synthesizing 4,000 images, with 2,000 from the image on the right ($I_\tau$) and 2,000 images based on $I_\zeta$. Hereafter, we refer to this set of 4,000 images as $D_{\zeta+\tau}$. Dataset $D_{\zeta+\tau}$ was made to allow for a balanced representation of wheat field images from the early and late growth stages. All computationally annotated samples were synthesized following the methodology described by Najafian et al. [8].

Figure 1 illustrates the three manually annotated images ($I_\eta$, $I_\zeta$, and $I_\tau$) used for synthesizing computationally annotated datasets. Utilizing the methodology presented in our previous work [8], we computationally synthesize three datasets: $D_\eta$, a set of 8,000 images derived from $I_\eta$; $D_{\eta+\zeta}$, a set of 16,000 images derived from $I_\eta$ and $I_\zeta$; and $D_{\zeta+\tau}$, a set of 4,000 images derived from $I_\zeta$ and $I_\tau$. Figure 2 illustrates examples of synthesized images and their corresponding segmentation masks. In addition to these manually and computationally annotated images, for the training process, we utilize 10,592 unannotated image frames extracted from two video clips of wheat fields, with 5,296 frames from each. Hereafter, we refer to these datasets as $D_{\rho_1}$ and $D_{\rho_2}$. We refer to the dataset resulting from combining $D_{\rho_1}$ and $D_{\rho_2}$ as $D_\rho$.

Figure 2: Examples of computationally synthesized images and their corresponding segmentation masks.

Figure 3: Schematic representation of the model architecture. The encoder focuses on developing a joint image representation for both synthesized and real images, while the mask decoder aims at generating segmentation masks, and the image decoder aims at reconstructing the real images, forcing the encoder to adapt to the real images.

Figure 5: The encoder model architecture is designed by combining convolutional layers, ResNet blocks, and GroupNorm layers. Also, in each of the two decoding streams, we utilize concatenation instead of addition.

$\sigma^2_{I_j}$ and $\sigma^2_{\hat{I}_j}$ are the variances of the images $I_j$ and $\hat{I}_j$, respectively; $\sigma_{I_j \hat{I}_j}$ is the covariance between images $I_j$ and $\hat{I}_j$; $c_1$ and $c_2$ are constants introduced to stabilize the division. $\theta(I_j)$ and $\theta(\hat{I}_j)$ are ResNet18 features extracted from images $I_j$ and $\hat{I}_j$, respectively.

Figure 6 presents a qualitative evaluation of the performance of model $F_{\eta+\zeta+\rho}$, identified as the best-performing model, and model S on several different domains. In general, model $F_{\eta+\zeta+\rho}$ demonstrates superior performance compared to model S. We also report a case, shown in the middle column, where model S performs better.

Figure 6: Showcasing the prediction performance of model $F_{\eta+\zeta+\rho}$ (highlighted in a red box in the upper row) in comparison with the results obtained by model S [8] (highlighted in a blue box in the lower row) on samples from the Global Wheat Head Detection dataset [24].
Figure 4: The ResNet block comprises three groups of operations, including convolution, GroupNorm layers, and the Swish activation function for nonlinearity. It also incorporates skip connections to enhance feature propagation.

As shown in Table 1, our final model, $F_{\eta+\zeta+\rho}$, shows a 28.1% improvement in Dice score and a 25.2% improvement in IoU when compared to model S from Najafian et al. [8]. The increased performance showcases the utility of the proposed semi-self-supervised domain adaptation technique.