1. Introduction
The water hyacinth (*Eichhornia crassipes*) is a species indigenous to South America. However, owing to its attractive flowers, it has been introduced to regions across multiple continents, including Asia, Europe, Africa, and the Americas, where it is now considered an invasive species because it proliferates rapidly in warm, nutrient-rich lakes [1,2]. The rapid proliferation of water hyacinths in non-native ecosystems can have detrimental effects, as they out-compete other aquatic plants for resources and disrupt the food chain, leading to declines in fish populations [3]. The dense growth of this plant forms a thick layer that prevents light from penetrating and lowers oxygen levels in the water, hindering the growth of other aquatic life [4].
The most common problems caused by water hyacinth infestation include reduced water clarity, decreased phytoplankton production, lower oxygen levels, reduced light penetration in the water, the formation of floating mats on the water surface, and negative impacts on aquatic biodiversity [5,6,7]. The rapid growth of the water hyacinth in Cangkuang Lake has decreased the area available for bamboo rafting, hindered transportation, and reduced its visual appeal [2]. In Lake Tana, Ethiopia, water hyacinth has created numerous problems, not only for ships that are unable to navigate but also for the local population living near the lake who rely on it for fishing, tourism, navigation, and irrigation. Lake Tana is the largest freshwater lake in Ethiopia and has been declared a World Heritage Site due to its biodiversity [8,9,10]. Due to its negative impact on navigation, irrigation, and power generation, water hyacinth has earned the reputation of being the most problematic invasive aquatic plant in the world [11]. Minychl et al. [7] identified environmental factors such as nutrient-rich sedimentation, wind direction, and water depth as key drivers of this species' expansion in Lake Tana, highlighting the importance of monitoring these variables to design appropriate control and management strategies. Human activities such as agriculture and deforestation near Lake Tana contribute to the generation of sediments in the lake. Approximately 1.09 million tons of this sediment are transported out of the lake each year [12]. This sediment, originating from the surrounding basins and deposited in Lake Tana, is rich in nutrients, which intensifies the eutrophication process in the lake.
The problems caused by these invasive species must be studied to develop different solutions. Tadesse et al. [1] demonstrated the effectiveness of remote sensing with Sentinel-2 satellites for monitoring both water quality and the proliferation of water hyacinth in Lake Tana. These findings emphasize the importance of using advanced monitoring technologies in managing ecosystems affected by invasive species. These platforms yield valuable insights for researchers investigating environmental impacts and devising potential solutions to challenges reflected in the acquired data [13]. However, the spectral information derived from satellite imagery is constrained in its capacity to address specific challenges due to the limited range of available resolutions. For instance, the Sentinel-2 satellite, operated by the European Space Agency, offers three spatial resolutions (10, 20, or 60 m/pixel) across a spectrum of wavelengths extending from the visible (VIS) to the shortwave infrared (SWIR) [14]. Although satellite remote sensing has the potential to monitor rivers and lakes, its effectiveness is limited by factors such as low spatial resolution, long revisit times, and cloud obstruction. Unmanned aerial vehicles (UAVs), on the other hand, offer a means of acquiring high-resolution imagery while affording precise control over both the spatial and temporal dimensions of data collection [15].
In recent years, the use of drones has expanded to more areas, from facial recognition [16] to agriculture [17]. Shahi et al. [18] conducted a survey reviewing progress in crop disease detection using unmanned aerial vehicles. The study concludes that most research pairs multispectral sensors with statistics-based and ML-based methods, while RGB sensors are more commonly used with DL-based methods; however, DL-based methods show better performance in crop disease estimation. One specific task performed with drones and multispectral images is plant recognition. Narmilan et al. [19] used a drone with a multispectral camera to detect the mouse-ear hawkweed plant. They used machine learning algorithms such as eXtreme Gradient Boosting (XGB), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN), achieving an accuracy of approximately 100%. However, before training, they used software for image pre-processing and employed a MicaSense multispectral camera, which is expensive. Additionally, not having an onboard computer to perform the entire detection task, as in [20], increases detection time, as the images must be processed and inferred on a desktop computer after being collected by the drone. Therefore, this is not a viable option due to the high cost and processing time. Drones are also used to detect invasive species or plants, as they are an optimal tool for monitoring large areas. László et al. [21] proposed the identification and monitoring of two plant species using only RGB images and image processing techniques through different software and libraries. However, they concluded that, although the overall accuracy is close to 99%, this figure can be misleading due to the large number of background pixels classified as correct. Monitoring plants with drones also allows for the detection of plant diseases [22,23] using multispectral images. Peihua et al. [24] implemented and trained existing DL models to detect plants infected with pine wilt disease, utilizing Sentinel-2 satellite spectral images and RGB drone images. However, the accuracy is below 90%, which may be due to many factors, such as the use of different image sources or the particular deep learning algorithms employed. The detection of invasive species with multispectral images has many benefits, since more information can be extracted from the non-visible wavelengths of sunlight that plants reflect. Elena et al. [25] used multispectral images from the Sentinel-2 satellite and trained machine learning algorithms such as K-means, Random Forest, and Convolutional Neural Networks (CNNs). The CNN achieved the highest accuracy, but the computational cost and running time are not reported, which are important considerations for implementation.
Water hyacinth detection is crucial for addressing the problems caused by invasive species. Recent studies have highlighted the potential of unmanned aerial vehicles (UAVs) equipped with multispectral cameras to enhance the identification and monitoring of aquatic plants across diverse ecosystems. These advancements align with the purpose of this study: to develop robust methods for aquatic species recognition using deep learning techniques and multispectral data.
The study by Bonggeun Song and Kyunghun Park [26] employed a UAV equipped with a MicaSense RedEdge multispectral camera, which captures information in five spectral bands (blue, green, red, near-infrared, and RedEdge). Vegetation indices such as NDVI and GNDVI proved effective in distinguishing aquatic plants from other surfaces in a reservoir. This approach underscores the precision provided by multispectral imaging for detecting and mapping aquatic plants, particularly in areas inaccessible to traditional methods.
Kevin Musungu et al. [27] implemented a UAV-based system with a Parrot Sequoia multispectral camera, which includes bands such as near-infrared, red, green, and RedEdge. This system facilitated species classification in a wetland within the Fynbos Biome using algorithms such as Random Forest and SVM, achieving high accuracy in species discrimination. Spectral indices like NDWI and CIRE played a crucial role in this process, demonstrating the utility of multispectral cameras for mapping biodiverse habitats.
Similarly, Md. Abrar Istiak et al. [28] introduced the AqUavplant dataset, created using a DJI Mavic 3 Pro drone equipped with a professional triple-lens camera. This configuration enabled the capture of high-resolution images with a GSD of 0.04–0.05 cm/pixel, facilitating the identification and semantic segmentation of 31 aquatic plant species. The dataset includes both native and invasive species, serving as a foundation for training machine learning models for automated mapping of aquatic biodiversity.
Finally, António J. Abreu et al. [29] developed the LudVision system to detect the invasive species *Ludwigia peploides* in a reservoir in Portugal. They utilized a DJI P4 Multispectral drone, which integrates six sensors (RGB and five monochromatic bands). This system generated precise maps through semantic segmentation, demonstrating the capability of multispectral cameras to monitor and mitigate the impact of invasive species on aquatic ecosystems.
These studies consolidate the role of UAVs equipped with multispectral cameras as scalable and precise tools to address challenges in aquatic ecosystem management, including invasive species detection and biodiversity conservation. The integration of these techniques into our work expands the horizon of applications in aquatic ecosystem management through artificial intelligence and multispectral remote sensing solutions.
In [30], the use of UAVs equipped with multispectral cameras is highlighted as an effective method for detecting and monitoring invasive aquatic plants like water hyacinth, demonstrating how advanced technology can aid in managing these ecological threats. Recently, there has been an increase in research utilizing onboard computers. Anis et al. [20] employed an Nvidia Xavier NX onboard computer along with a cloud-based system. The onboard computer, together with the Pixhawk flight controller, is used to detect and track different objects using DeepSort [31], while the cloud system handles data manipulation and storage. This combination is very practical for automating remote sensing and semi-automatic tracking tasks; although the framework targets a variety of tasks and specific tasks require particular solutions, its overall approach can be adapted to other applications.
In this study, we implemented a low-cost multispectral camera using four cameras: one RGB camera and three cameras with specific filters that capture only infrared light at different wavelengths. The objective is to precisely detect water hyacinths and to explain the class activation maps that contributed to the semantic segmentation using the normalized difference vegetation index (NDVI) method. To understand how features are selected from the RGB images and the infrared component images, we customized a U-Net for semantic segmentation, detailed in Section 4. This architecture was adapted to train the network with multiple channels and obtain more precise outputs. This study was conducted in a controlled environment in Shiga Prefecture, Japan, where 250 images were captured. These images were divided into training, testing, and validation sets. Unlike previous studies that use more expensive multispectral cameras limited to certain applications, our low-cost solution allows for greater flexibility in detecting invasive plants like water hyacinths. Additionally, by combining multiple channels of infrared light with a deep learning approach, we improved the accuracy of semantic segmentation, surpassing the limitations observed in works that relied on more expensive hardware or less advanced techniques such as NDVI or PCA analysis. In summary, semantic segmentation with U-Net provides a more robust and precise solution for detecting water hyacinths, leveraging the capabilities of multichannel image processing and the ability to capture fine details in images, which overcomes the limitations of other traditional methods. The main contributions of this research are as follows:
Develop a customized multispectral camera for capturing multispectral images of water hyacinths for analysis.
Develop a U-Net-based model for the semantic segmentation of multispectral images of water hyacinth.
Provide analytical observations and an explanation of the features extracted from water hyacinth as observed with the multispectral camera.
The importance of this work is that it can be adapted not only for detecting vegetation or objects, but also for studying which infrared wavelengths to incorporate into multispectral cameras intended for segmentation tasks. This approach can help compare the performance of candidate infrared channels when designing such cameras, especially those to be mounted on UAVs.
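To illustrate the multichannel adaptation mentioned above, the first encoder block of a standard U-Net can simply be widened to accept six input channels (RGB + NoIR + 720 nm + 850 nm) instead of the usual three. The following is a minimal sketch, assuming PyTorch; the channel count, filter width, and class name are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class MultispectralUNetStem(nn.Module):
    """Sketch of only the first U-Net encoder block, widened from the
    usual 3 RGB input channels to 6 multispectral channels
    (R, G, B, NoIR, 720 nm, 850 nm). The remaining encoder stages,
    decoder, and skip connections stay unchanged."""

    def __init__(self, in_channels: int = 6, base_filters: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, base_filters, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base_filters, base_filters, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# A batch of two 6-channel 128x128 multispectral images.
x = torch.randn(2, 6, 128, 128)
features = MultispectralUNetStem()(x)
print(features.shape)  # torch.Size([2, 64, 128, 128])
```

Because the padding preserves spatial dimensions, only the channel axis changes; the rest of a stock U-Net can consume these feature maps without modification.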
We organized this paper as follows: Section 2 describes the relevance of our work and some related works, showing the differences and the new approach we used. Section 3 presents the approaches of the works and their implementation. Section 4 presents the experiments carried out to analyze the performance of the model with the various components of the multispectral images. Section 5 discusses the results and findings, and finally, the paper is concluded in Section 6.
2. Related Works
Multispectral imaging is a technique that involves capturing images with a specialized camera across specified wavelength ranges within the electromagnetic spectrum. By using filters to isolate different wavelengths of light, multispectral cameras enable the extraction of spectral features that can identify materials or objects based on their absorption or reflection characteristics. Unlike traditional RGB imaging, which is limited to the visible spectrum, multispectral imaging divides the spectrum into multiple bands based on their wavelengths, revealing information invisible to the human eye. Multispectral imaging is useful for applications that require detailed spectral information for diverse tasks.
The utility of multispectral imaging extends across numerous fields. In medical imaging, for instance, it has been used for the segmentation and detection of pathological conditions, showcasing its effectiveness in extracting meaningful features for classification and detection tasks. Similarly, its adoption has surged in areas such as vegetation mapping [32,33], food quality control, and even facial recognition, where the ability to analyze data across different wavelengths has proven advantageous [34,35]. In remote sensing, multispectral imaging has become a cornerstone for analyzing satellite imagery to study weather patterns, vegetation health, and water bodies. Its ability to preserve spectral information while capturing spatial structures and topographic variations makes it especially useful in environmental monitoring.
Recently, to precisely detect the vegetation cover of a specific region, multispectral images from either a satellite or a drone have been employed, since surface radiance differs among plant species. The reflectance index offers a clear distinction and the ability to perform detection or segmentation tasks accurately. As shown in Figure 1, the reflectance properties of vegetation, influenced by leaf structure and chlorophyll, help distinguish vegetation types and monitor productivity. Assmann et al. [36] highlight multispectral drone sensors like the Parrot Sequoia, which capture the Normalized Difference Vegetation Index (NDVI) for vegetation analysis while addressing challenges like solar angle and calibration to ensure data consistency.
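For reference, NDVI is computed per pixel from the near-infrared and red bands as (NIR − Red) / (NIR + Red). A minimal NumPy sketch (the reflectance values below are illustrative, not measured data):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    Values near +1 indicate dense healthy vegetation; values near 0 or
    below indicate water, soil, or built surfaces. eps avoids 0/0."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

# Illustrative reflectances: a healthy leaf (high NIR, low red) vs. water.
nir_band = np.array([[0.50, 0.02]])
red_band = np.array([[0.10, 0.02]])
result = ndvi(nir_band, red_band)  # first pixel ~0.67, second ~0.0
```

NDVI exploits the sharp rise in vegetation reflectance beyond the red edge, which is precisely the region the 720 nm and 850 nm channels discussed later in this paper are designed to capture.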
However, the use of multispectral imagery is not without challenges. The high dimensionality of the data, stemming from multiple spectral bands, complicates feature extraction. Identifying the most relevant bands for a given application often requires sophisticated algorithms to avoid redundancy and minimize data loss. This step is critical in ensuring that the extracted features effectively represent the desired objects or materials. For example, methods like wavelet-based fractal analysis have been used to detect mineral deposits from multispectral images taken by drones [37].
When applied to vegetation mapping and invasive species detection, multispectral imaging has demonstrated significant potential. Bayable et al. [38] leveraged multispectral satellite data to detect water hyacinths in Lake Tana, Ethiopia, using machine learning algorithms such as Random Forest and SVM. These approaches, while achieving high accuracy, primarily relied on spectral indices like NDVI and lacked the integration of advanced segmentation methods such as U-Net. Similarly, Flores-Rojas et al. [39] mapped the seasonal distribution of water hyacinths in San Jose Dam, Mexico, using the NDVI from Landsat 8 imagery. This study highlighted the value of multispectral data for understanding seasonal variations but did not incorporate deep learning models or the analysis of RGB and infrared bands in tandem. Building on these studies, the integration of deep learning models like U-Net offers the potential to overcome these limitations by providing pixel-level segmentation capabilities and enhancing accuracy. Furthermore, combining RGB and infrared bands can enrich feature extraction, enabling a more robust analysis of vegetation structures and dynamics.
Traditionally, multispectral imaging has required extensive pre-processing to manage the vast amounts of data generated by multiple spectral bands. Compression has been a cornerstone of these pre-processing steps, particularly for satellite-based multispectral images, where the balance between storage efficiency and preserving essential spectral information is critical. These images are often heavily compressed to facilitate transmission and storage, but this compression introduces significant challenges that hinder their effectiveness in precision-demanding applications.
In a review by Paoletti et al. [40], some algorithms for processing satellite imagery aim to determine which bands are less critical and discard them entirely. However, this approach is highly undesirable, as it amounts to throwing away valuable data that could hold spectral information for other types of evaluation. To extract meaningful features from multispectral images, feature extraction strategies that consider all relevant data captured across the various wavelengths are required.
To effectively utilize multispectral images for object detection and recognition, feature extraction must focus on isolating the correct information from images taken at different wavelengths. This process has been explored in applications like wavelet-based fractal and multifractal analysis for detecting mineral deposits using drone-acquired multispectral images [37]. However, extracting relevant features from a multispectral dataset remains challenging, as these datasets consist of multiple image layers captured at different wavelengths, each contributing unique but complex spectral information.
Furthermore, methods like PCA combined with spatial transforms (DCT or wavelets), or more recent techniques such as inter-band prediction and HEVC compression, aim to reduce the enormous file sizes generated by satellite sensors. While these methods succeed in lowering storage and transmission costs, they often sacrifice spectral fidelity, disrupting the correlation between bands and diminishing the quality of extracted features [41,42,43]. For instance, Seltsam et al. [41] highlighted that PCA-VVC compression fails to preserve spectral quality when handling images with high variability in brightness and color.
Beyond compression challenges, satellite images face inherent limitations such as misalignment between spectral bands caused by calibration and registration discrepancies, and lower spatial resolution. These factors negatively impact precision in applications like invasive species detection and crop monitoring [43,44]. Finelli et al. pointed out that the non-linear dependencies across spectral bands in satellite images make linear prediction-based encoding strategies inefficient [45]. Similarly, methods that combine sparse coding with spectral unmixing often introduce spectral distortions, reducing the utility of reconstructed images [44].
These challenges have driven a shift towards alternative data sources that can overcome the limitations of satellite imagery. One emerging solution is the use of multispectral images captured by drones. Drones offer higher spatial resolution, better control over data acquisition conditions, and reduced dependence on compression of the captured data, thus preserving both spectral fidelity and inter-band correlations. Drone-acquired data not only eliminate the need for algorithms that discard entire bands but also provide greater flexibility in tailoring feature extraction techniques to specific application needs.
However, the processing of multispectral image data faces challenges not only related to spectral fidelity and band alignment but also concerning the selection of appropriate algorithms for feature extraction and classification. In this context, machine learning (ML) and deep learning (DL) approaches have played a crucial role in the analysis of multispectral images and the detection of invasive species such as *Eichhornia crassipes*.
Traditional ML methods, such as Random Forest (RF) and Support Vector Machines (SVMs), have been widely employed due to their ability to handle high-dimensional data and deliver accurate results with limited datasets. For instance, Bayable et al. [38] successfully detected *Eichhornia crassipes* infestations using Sentinel-2 satellite images and RF, achieving accuracy rates exceeding 95%. Nevertheless, these methods rely heavily on manual feature selection and are less effective in scenarios with high spectral variability or where more complex analyses are required [30,46].
On the other hand, deep learning (DL) approaches have demonstrated greater efficacy in segmentation tasks and real-time detection. Models such as YOLOv5 have enabled the integration of data from multiple sources, including UAVs and satellites, enhancing accuracy in the detection and temporal monitoring of invasive species [46]. Furthermore, combining UAV and satellite imagery has proven to be an effective tool for mapping dynamic invasions and evaluating control strategies [47].
Although DL approaches, such as YOLOv5, require larger datasets and greater computational capacity, they have shown remarkable superiority in terms of precision and flexibility. This makes them the preferred option for large-scale applications and real-time monitoring, particularly when combining data from UAVs and satellites. Conversely, ML methods like Random Forest remain a viable alternative in resource-constrained contexts or for local studies, achieving accurate results with smaller datasets [47,48]. This progression towards more advanced methods suggests that the integration of DL with multispectral and UAV data represents a promising direction for addressing the challenges of invasive species detection and environmental monitoring. A comparative table of the various methods for detecting water hyacinths is presented in Section 7.
In this study, we leverage multispectral images captured by a custom-built drone camera. By retaining higher spectral fidelity and preserving inter-band correlations, drone-acquired data enable more accurate feature extraction and improve the performance of segmentation tasks. This approach addresses the shortcomings of satellite imagery and represents a significant step forward in achieving precision in environmental monitoring and vegetation analysis.
U-Net has emerged as a robust and versatile architecture for image segmentation across a wide array of domains, owing to its encoder–decoder design and skip connections that facilitate the learning of rich hierarchical features. Originally proposed by Ronneberger et al. [49], U-Net was designed for biomedical image segmentation but has since demonstrated exceptional adaptability to diverse applications, achieving state-of-the-art results in various tasks.
One of the key strengths of U-Net is its ability to effectively learn features that enhance segmentation performance, even in challenging scenarios with limited training data. For instance, the architecture’s success in high-resolution satellite image segmentation has been well documented: U-Net has been employed to extract building footprints [50] and road networks [51], and has been adapted for segmentation tasks in medical imaging [52], achieving high accuracy in complex and densely populated environments.
Beyond remote sensing, U-Net has also been applied in real-time agricultural applications to distinguish crops from weeds, enabling advancements in precision agriculture [53]. Similarly, its capability to segment underwater images for marine life detection [54] and historical map images for digital archiving [55] further highlights its adaptability to non-medical domains.
The architecture has also proven valuable in industrial and commercial contexts. For example, U-Net has been used to segment defects in industrial products during quality inspections, significantly improving the automation of manufacturing processes [56]. Additionally, its application in fashion image segmentation has facilitated virtual try-on systems and e-commerce solutions by accurately segmenting clothing items [57].
Moreover, U-Net’s contributions to disaster response have been noteworthy. The architecture has been employed to segment aerial images of disaster-affected regions, enabling the efficient identification of impacted areas and aiding in response planning [58]. Its success in these diverse applications underscores the architecture’s ability to learn and generalize features effectively, even in scenarios with significant variations in data distribution and image characteristics.
In conclusion, U-Net’s flexible design and advanced feature learning capabilities make it an exemplary segmentation method across domains ranging from medical imaging to remote sensing, agriculture, industrial inspection, and more. Its widespread adoption and continual innovation reflect its potential to address increasingly complex segmentation challenges in the future.
3. Water Hyacinth Segmentation with Multispectral Imaging
This section outlines the capture of multispectral images using a custom-built multispectral camera mounted on a drone and the semantic segmentation tasks for detecting water hyacinth. Our custom-built camera can detect variations in surface reflectance across plants and vegetation, but it necessitates advanced image processing to accurately identify regions covered by water hyacinths. Furthermore, we detail the application of a U-Net segmentation model to enhance feature extraction and provide interpretability of the detection process.
We developed the study using MMEngine [59], adapting it for the processing of the curated dataset, segmentation tasks, and evaluation of the models. MMEngine is a library for training deep learning models based on PyTorch. It provides well-designed pipelines for developing deep learning research studies.
3.1. Multispectral Camera
We designed and constructed a unique multispectral camera setup consisting of four units: an RGB camera, a NoIR camera without an infrared cut filter, and two infrared cameras equipped with filters for the 720 nm and 850 nm wavelengths, respectively, as shown in Figure 2. The selection of these two wavelengths is based on the available filters and the amount of light each filter allows to pass: the 720 nm IR filter passes more light than the 850 nm IR filter. This configuration allows for the capture of high-resolution, multispectral images across six distinct channels, offering significant flexibility and precision over traditional imaging systems. The multispectral camera system comprises four distinct units with a resolution of pixels each. The RGB camera captures visible light in three channels (red, green, and blue), while the NoIR camera operates without an infrared filter, capturing a broader range of wavelengths. The two infrared cameras are equipped with band-pass filters centered at 720 nm and 850 nm, respectively. These wavelengths were chosen for their effectiveness in detecting vegetation reflectance in the near-infrared spectrum, particularly for distinguishing water hyacinth from other aquatic vegetation and non-vegetative surfaces: the 720 nm band is particularly sensitive to photosynthetically active vegetation, effectively capturing reflectance associated with healthy vegetation such as water hyacinth, while the 850 nm band enhances differentiation in regions where vegetation mixes with water or other non-vegetative elements. Together, these bands provide comprehensive spectral coverage, enabling robust segmentation and identification of water hyacinth, even in complex environmental conditions. When combined with the RGB and NoIR channels, this configuration ensures improved precision for segmentation tasks.
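Once the four captures are spatially aligned, the six-channel composite described above can be assembled by stacking the three RGB planes with the NoIR, 720 nm, and 850 nm grayscale planes. A minimal NumPy sketch with synthetic arrays (the frame size and channel ordering are illustrative assumptions, not the camera's native values):

```python
import numpy as np

H, W = 480, 640  # illustrative frame size, not the cameras' native resolution

# Synthetic stand-ins for the four aligned captures.
rgb    = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)  # R, G, B planes
noir   = np.random.randint(0, 256, (H, W), dtype=np.uint8)     # no IR-cut filter
ir_720 = np.random.randint(0, 256, (H, W), dtype=np.uint8)     # 720 nm band-pass
ir_850 = np.random.randint(0, 256, (H, W), dtype=np.uint8)     # 850 nm band-pass

# Stack into a single (H, W, 6) multispectral cube:
# channels 0-2 = R, G, B; 3 = NoIR; 4 = 720 nm; 5 = 850 nm.
cube = np.dstack([rgb, noir, ir_720, ir_850])
print(cube.shape)  # (480, 640, 6)
```

`np.dstack` promotes each 2D plane to depth 1 before concatenating along the channel axis, so mixing the 3-channel RGB image with single-channel infrared planes yields the six-channel input directly.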
To capture the multispectral images, we utilized a custom-built multispectral camera system mounted on a drone. The setup comprises four distinct cameras strategically arranged in a triangular configuration with a centrally placed camera. This arrangement was designed to maximize the intersection over union (IoU) among all four cameras, ensuring optimal overlap of imaging areas for seamless integration of data across channels. The triangular positioning also minimizes occlusions and distortions, facilitating precise spatial alignment of the captured images.
As depicted in Figure 3, the system includes four cameras: an RGB camera, a NoIR camera, and two near-infrared cameras equipped with filters for wavelengths of 720 nm and 850 nm. Each captured image contains a circular marker at the center of its field of view, serving as a reference point for alignment and calibration across channels. These markers are essential for synchronizing the spatial representation of data from all cameras, enabling consistent multispectral analysis.
To address potential challenges caused by the triangular arrangement of cameras, corrections were applied to minimize perspective distortion and ensure spatial alignment between channels. While the cameras were designed and positioned to remain as parallel as possible using a custom 3D-printed housing, slight variations in alignment were managed through pre-processing steps. Specifically, a median blur filter was applied to reduce noise, followed by thresholding operations (“cv.threshold” and “cv.adaptiveThreshold”) to highlight key features under varying lighting conditions. Circular markers detected using the Hough Circles algorithm (“cv.HoughCircles”) served as fixed reference points for aligning the images from all channels. These pre-processing steps ensured the validity of spatial alignment across channels, as confirmed through visual inspection of overlays and precise IoU calculations (Figure 4).
The image acquisition process involved capturing grayscale images from each camera to streamline further processing. Pre-processing steps were applied to enhance the quality and uniformity of the images, including median filtering to reduce noise and adaptive thresholding to isolate key features such as the circular markers. The Hough Circle Transform method was employed to accurately detect and extract the coordinates of the markers, ensuring precise localization within each image.
Using the identified marker coordinates, the images were spatially aligned and integrated into a single four-channel composite image. This integration was achieved using a zero-matrix canvas, with each channel corresponding to one camera (RGB, NoIR, 720 nm, and 850 nm). The alignment process involved adjusting each image to match the shared spatial reference defined by the circular markers, ensuring uniformity in the combined dataset, as described in Algorithm 1.
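The zero-matrix-canvas alignment described above can be sketched as follows; the helper name, the (x, y) coordinate convention, and the toy marker positions are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def align_channels(images, centers, ref_center, canvas_shape):
    """Paste each channel into a zero-initialized canvas, shifted so
    that its detected marker center lands on the shared reference
    point ref_center (coordinates are (x, y))."""
    canvas = np.zeros(canvas_shape + (len(images),), dtype=images[0].dtype)
    rx, ry = ref_center
    for k, (img, (cx, cy)) in enumerate(zip(images, centers)):
        dy, dx = int(round(ry - cy)), int(round(rx - cx))  # required shift
        h, w = img.shape
        y0, x0 = max(dy, 0), max(dx, 0)                    # clipped window
        y1, x1 = min(dy + h, canvas_shape[0]), min(dx + w, canvas_shape[1])
        canvas[y0:y1, x0:x1, k] = img[y0 - dy:y1 - dy, x0 - dx:x1 - dx]
    return canvas

img0 = np.arange(25, dtype=np.uint8).reshape(5, 5)   # marker at (2, 2)
img1 = np.arange(25, dtype=np.uint8).reshape(5, 5)   # marker at (1, 1)
comp = align_channels([img0, img1], [(2, 2), (1, 1)], (2, 2), (5, 5))
print(comp[2, 2, 0], comp[2, 2, 1])  # marker pixels now coincide: 12 6
```

Pixels shifted outside the canvas are discarded, and uncovered canvas regions stay zero; those zero borders are what the subsequent IoU cropping step removes.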
Figure 4 shows the intersection over union of the four images per channel.
After constructing the composite image, the IoU was calculated to determine the overlapping area among all four channels. The IoU calculation provided a set of coordinates defining the shared region of interest across all cameras. Using these coordinates, each individual image (from cameras labeled 17, 67, 133, and 249) was cropped to focus on the overlapping area. This process resulted in a refined dataset where all images were consistently cropped, ensuring spatial correspondence across the four spectral channels.
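A minimal sketch of this IoU-and-crop step, assuming each channel contributes a boolean valid-pixel mask on the shared canvas (the helper names are hypothetical):

```python
import numpy as np

def iou_region(valid_masks):
    """IoU of the per-channel valid-pixel masks on the shared canvas,
    plus the bounding box (x0, y0, x1, y1) of the region covered by
    all channels, which is then used for cropping."""
    inter = np.logical_and.reduce(valid_masks)
    union = np.logical_or.reduce(valid_masks)
    iou = inter.sum() / union.sum()
    ys, xs = np.nonzero(inter)
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return iou, box

def crop(img, box):
    x0, y0, x1, y1 = box
    return img[y0:y1, x0:x1]

# One full 4x4 frame plus one frame shifted down-right by one pixel.
m0 = np.ones((4, 4), dtype=bool)
m1 = np.zeros((4, 4), dtype=bool)
m1[1:, 1:] = True
iou, box = iou_region([m0, m1])
print(iou, box)  # 9 overlapping pixels of 16 -> 0.5625 (1, 1, 4, 4)
```

Applying crop with the same box to all four channel images yields the consistently cropped dataset described above.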
The final output consists of four separate folders, each containing the cropped images for a specific camera channel. As depicted in
Figure 5, by standardizing the dimensions and alignment of the images, the dataset is optimized for further analysis, particularly for applications requiring pixel-level accuracy, such as segmentation and feature extraction. This approach ensures that the multispectral data are both reliable and consistent, facilitating precise detection of water hyacinth regions and enabling robust downstream processing.
Algorithm 1 Multispectral Image Alignment and Pre-Processing
1: Import libraries: OpenCV, PIL, NumPy
2: Load images: paths for the RGB, 720 nm, 850 nm, and NoIR cameras
3: Pre-processing:
4:   Apply median filtering to reduce noise
5:   Perform binary and adaptive thresholding
6: Circle detection:
7:   Detect circular markers using the Hough Circle Transform
8:   Extract center coordinates and radii
9: Image alignment:
10:   Create a composite matrix with four channels
11:   Align images using marker centers as reference
12: IoU calculation:
13:   Compute intersection and union across channels
14:   Calculate IoU for the aligned images
15: Crop images:
16:   Extract IoU region coordinates
17:   Crop and save aligned images by channel
18: Output results:
19:   Save IoU values and region coordinates
20:   Save and display processed images
3.2. Dataset
The multispectral imagery was constructed from our custom-built camera. The images were taken at Lake Ibanai, Shiga Prefecture, Japan, during a season when water hyacinth plants were beginning to decay due to the onset of winter. While this limits the dataset’s diversity in terms of representing different seasons and growth stages of water hyacinths, it introduces a challenging scenario for the detection task, as the plants are often sparse or partially decayed. This highlights the robustness of the proposed method in detecting water hyacinths under difficult conditions. Furthermore, the lake environment includes a mix of vegetation types, which closely resembles real-world scenarios where water hyacinths coexist with other plant species. This ensures that the dataset remains relevant for practical applications. For future studies, we plan to expand the dataset by including samples from different seasons and geographic locations, such as Ethiopia, to enhance diversity and ensure broader generalization of the model. The dataset includes images captured in four different spectral bands: RGB, NIR (720 nm and 850 nm), and NoIR. Each image has a resolution of pixels. To ensure spatial alignment, we processed the images to find overlapping regions and stitched them to construct a six-channel multispectral image (three RGB channels, two NIR channels, and one NoIR channel). Additionally, the IR 720 nm, IR 850 nm, and NoIR bands were combined to create a single IR dataset for analysis of the spectral contributions.
For our dataset, we captured images using a novel camera design with four synchronized cameras equipped with specific filters to capture the various spectral bands, as shown in
Figure 2. The dataset is divided into training, validation, and test subsets, as summarized in
Table 1. Each spectral band was equally distributed across the subsets to maintain consistency in training and evaluation.
Figure 6 illustrates the distribution of the dataset across the training, validation, and test subsets for each spectral band. Additionally, the pixel-wise class distribution in the ground truth masks is presented in
Figure 7, highlighting the proportions of each class: Background (4.08%), Non_veg (30.82%), Dried_veg (41.71%), and Water_hya (23.39%).
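The pixel-wise class proportions above can be computed from the ground-truth index masks as in the following sketch (the mask encoding 0-3 matches the four classes; the function name is illustrative):

```python
import numpy as np

CLASSES = ["Background", "Non_veg", "Dried_veg", "Water_hya"]

def class_distribution(masks):
    """Pixel-wise class proportions over a collection of ground-truth
    index masks with values 0-3."""
    counts = np.zeros(len(CLASSES), dtype=np.int64)
    for m in masks:
        counts += np.bincount(m.ravel(), minlength=len(CLASSES))
    return counts / counts.sum()

# Toy 2x2 mask containing one pixel of each class.
dist = class_distribution([np.array([[0, 1], [2, 3]])])
print(dict(zip(CLASSES, dist)))  # every class covers 25% of the pixels
```

Running this over the full mask set reproduces the kind of distribution reported above, where Dried_veg dominates and Background is rare.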
Despite the detailed distribution of the dataset shown in
Table 1 and
Figure 7, we acknowledge that the relatively small size of the dataset, particularly the 173 training images, poses a limitation to the generalizability of the model. This constraint was primarily due to the limited time available for aerial data collection during the sampling period at Lake Ibanai, Japan.
To address this limitation, we plan to expand the dataset in future work. Specifically, we aim to collect additional samples of water hyacinth from Japan and conduct a field trip to Ethiopia to capture images in different seasons and environments. This expansion will increase the diversity and robustness of the dataset, improving the model’s ability to generalize to broader contexts.
3.3. Model Architecture
Several models have been proposed for semantic segmentation, where regions of interest are predicted based on the features learned by the models. One such model is U-Net. The U-Net architecture is well suited to semantic segmentation due to its symmetric encoder–decoder design. The encoder progressively reduces the spatial dimensions of the input while capturing semantic features, whereas the decoder reconstructs the segmentation map by up-sampling these features. Skip connections between the encoder and decoder ensure the preservation of spatial details by directly passing low-level features from the contracting path to the expansive path. This design allows U-Net to effectively combine high-level contextual information with fine-grained details, producing accurate segmentation masks.
Compared to other semantic segmentation models such as Fully Convolutional Networks (FCNs) and SegNet, U-Net offers distinct advantages. FCNs, while foundational in segmentation tasks, lack skip connections, which are critical for retaining fine-grained spatial details necessary for precise vegetation mapping. SegNet, although efficient in memory usage, relies on less detailed decoder pathways, which can reduce segmentation accuracy at the pixel level. U-Net’s encoder–decoder structure, coupled with skip connections, provides an optimal balance between contextual understanding and spatial precision, making it particularly well-suited for the segmentation of water hyacinth regions in multispectral images [
60].
The selection of the U-Net model is based on its encoder–decoder structure and its reported performance. The encoder–decoder structure enables detailed observation of the feature extraction process for region prediction. It also allows for pixel-level analysis, making it possible to assess how surface vegetation reflectance for the water hyacinth plant contributes to the performance of the semantic segmentation process.
Table 2 presents the architectural configuration of the U-Net model used in this research: the four encoding layers, namely Encoder 1, Encoder 2, Encoder 3, and Encoder 4, down-sample the input using a
kernel and filters of sizes 64, 128, 256, and 512. A bottleneck layer with a filter size of 1024 forces the model to learn a compressed representation of the input so that only useful information is passed to the up-sampling process in the decoders. Similarly, four decoding layers, namely Decoder 4, Decoder 3, Decoder 2, and Decoder 1, up-sample the compressed data to obtain the predicted mask for the semantic segmentation task. Each encoding layer is composed of two convolutional layers with activation functions, followed by a max-pooling layer. Each decoding layer likewise consists of convolutional and activation layers. An illustration of the various layers and their connections through the skip connections is presented in
Figure 8.
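The shape flow through this four-level architecture can be traced with a small sketch; it assumes same-padded convolutions and 2x2 pooling/up-sampling (standard U-Net choices, since the exact kernel size is not restated here) and a 256x256 input for illustration.

```python
def unet_shapes(h, w, filters=(64, 128, 256, 512), bottleneck=1024):
    """Trace (height, width, channels) through the encoder, bottleneck,
    and decoder of the four-level U-Net described in Table 2."""
    enc = []
    for f in filters:
        enc.append((h, w, f))      # two same-padded convs keep h x w
        h, w = h // 2, w // 2      # 2x2 max-pooling halves each side
    mid = (h, w, bottleneck)       # bottleneck compresses the features
    dec = []
    for f in reversed(filters):
        h, w = h * 2, w * 2        # up-sampling doubles each side
        dec.append((h, w, f))      # skip connection concatenated here
    return enc, mid, dec

enc, mid, dec = unet_shapes(256, 256)
print(mid)      # (16, 16, 1024)
print(dec[-1])  # (256, 256, 64): matches Encoder 1, as the skips require
```

The matching encoder/decoder shapes are exactly what allows the skip connections to concatenate low-level features without resizing.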
In this study, we designed three versions of the U-Net architecture to accommodate the different input-channel configurations corresponding to the multispectral images, RGB images, and infrared images. These architectures were tailored to effectively leverage the spectral characteristics and improve the semantic segmentation task for water hyacinth detection.
3.4. Training Configuration
The training pipeline employs a supervised learning approach, optimizing the network parameters using the AdamW optimizer with an initial learning rate of . This optimizer was chosen for its ability to handle sparse gradients and adapt to different data distributions effectively. Gradient clipping with a maximum norm of 1.0 was applied to prevent exploding gradients during training, ensuring stability. The training was conducted over 20 epochs, with a batch size of 4, balancing computational resource constraints and gradient stability. The loss function alternated between cross-entropy loss and dice loss depending on the target data distribution. Dice loss was particularly beneficial for addressing class imbalance, a common issue in semantic segmentation tasks involving environmental datasets.
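The Dice loss mentioned above can be written compactly; the following NumPy sketch (our illustrative formulation, not the authors' training code) shows why it counters class imbalance: each class contributes equally to the mean regardless of how many pixels it covers.

```python
import numpy as np

def dice_loss(probs, target, num_classes=4, eps=1e-6):
    """Soft Dice loss for per-class probabilities probs of shape
    (C, H, W) against an integer mask target of shape (H, W).
    Averaging per-class Dice scores weights rare classes equally
    with frequent ones."""
    onehot = np.moveaxis(np.eye(num_classes)[target], -1, 0)  # (C, H, W)
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    return 1.0 - np.mean((2 * inter + eps) / (denom + eps))

target = np.array([[0, 1], [2, 3]])
perfect = np.moveaxis(np.eye(4)[target], -1, 0)  # exact one-hot prediction
print(round(dice_loss(perfect, target), 6))      # -> 0.0
```

A perfect prediction drives the loss to zero, while a prediction that misses a rare class entirely is penalized as heavily as one that misses a frequent class.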
This configuration aimed to account for diverse environmental conditions and improve model robustness. Validation was performed after each epoch using a separate dataset, and metrics such as intersection over union (IoU), pixel accuracy, precision, recall, and F1-score were computed to evaluate performance. Hyperparameters were adjusted iteratively during training: for instance, the learning rate was fine-tuned based on the performance metrics observed during validation, with learning-rate schedules and warm-up strategies used to ensure efficient convergence. Early stopping criteria were also considered to avoid overfitting and to ensure robustness and generalization to unseen data.
5. Results
In this section, we present the results of the feature analysis and experimentation with U-Net models, along with proposed improvements for the segmentation of water hyacinth using multispectral imagery.
Table 3 provides a comparative analysis of predicted masks generated by U-Net models across different data types: RGB, IR at 720 nm and 850 nm, NoIR, combined IR, and multispectral data. We assess how each data type influences segmentation performance by visually comparing predicted masks to ground truth annotations.
The ground truth and predicted masks use a color-coded system to represent different classes, with the following meanings:
(0, 0, 0): Class 0, representing the background (black).
(255, 0, 0): Class 1, representing non-vegetation regions (red).
(0, 255, 0): Class 2, representing dried vegetation (green).
(0, 0, 255): Class 3, representing water hyacinth (blue).
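The color coding above is a simple palette lookup; the sketch below (helper names are ours) maps class-index masks to the color-coded masks shown in the figures and back, which is useful when comparing predictions against ground truth.

```python
import numpy as np

# Class palette from the list above (RGB).
PALETTE = np.array([
    (0, 0, 0),       # 0: background
    (255, 0, 0),     # 1: non-vegetation
    (0, 255, 0),     # 2: dried vegetation
    (0, 0, 255),     # 3: water hyacinth
], dtype=np.uint8)

def colorize(mask):
    """(H, W) class-index mask -> (H, W, 3) color-coded RGB mask."""
    return PALETTE[mask]

def classify(rgb):
    """Inverse mapping: nearest palette color -> class index."""
    d = np.abs(rgb[..., None, :].astype(int) - PALETTE.astype(int)).sum(-1)
    return d.argmin(-1)

mask = np.array([[0, 3], [1, 2]])
rgb = colorize(mask)
print(rgb[0, 1])                      # water hyacinth pixel -> [  0   0 255]
print((classify(rgb) == mask).all())  # round-trip is exact: True
```

Using nearest-color matching in classify also tolerates small interpolation artifacts if the masks have been resized.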
From this comparison, it is evident that all U-Net models demonstrate reasonable capability in delineating regions of interest. However, qualitative inspection reveals notable misclassification of pixels, impacting the accuracy of region delineation and class segmentation.
Quantitative performance metrics, including mean intersection over union (mIoU), accuracy, precision, recall, and F1-score, are summarized in
Table 4. The RGB and multispectral images yield comparable results, with multispectral data slightly outperforming RGB across most metrics. Individual IR bands yield moderate performance, with accuracy values ranging from 55% to 75% and regional segmentation between 60% and 80%. Precision, recall, and F1-score are notably lower across IR bands, indicating challenges in achieving fine-grained segmentation.
A key observation is the diminished performance associated with higher wavelength IR data, particularly at 850 nm, which exhibits the lowest mIoU and accuracy. This suggests that higher IR wavelengths may not significantly enhance segmentation accuracy for water hyacinth, potentially due to spectral overlap or reduced contrast at these wavelengths.
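For reference, the mIoU metric underlying these comparisons can be computed per class as in the following sketch (standard definition; the function name is ours):

```python
import numpy as np

def mean_iou(pred, gt, num_classes=4):
    """Mean intersection-over-union across classes (the mIoU reported
    in Table 4); classes absent from both masks are skipped."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = (p | g).sum()
        if union == 0:
            continue  # class not present in either mask
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))

gt = np.array([[0, 0], [2, 3]])
pred = np.array([[0, 1], [2, 3]])
print(mean_iou(pred, gt))  # (0.5 + 0 + 1 + 1) / 4 = 0.625
```

Because each class is averaged with equal weight, a single poorly segmented class (here class 1) pulls the mIoU down even when overall pixel accuracy is high, which is why mIoU and accuracy can diverge in Table 4.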
6. Ablation Studies
To further analyze the contribution of spectral features to water hyacinth segmentation, we conducted ablation studies using the U-Net model. These experiments aimed to isolate and evaluate the impact of different spectral channels (RGB, IR at various wavelengths, and combined multispectral data) on model performance.
Given that the primary objective of this study is vegetation cover detection through multispectral imaging, it is crucial to understand the nature of the features extracted at different wavelengths. The goal is to assess which spectral bands provide the most significant information for accurate segmentation and to identify the activations triggered within the model layers by different spectral inputs. These features arise due to variations in reflectance across different spectral bands.
The results of these ablation studies are discussed in detail in
Section 5, where we present comparative visualizations and quantitative performance metrics for different spectral configurations. This analysis provides a deeper understanding of the contribution of each spectral band and informs potential improvements to the segmentation pipeline.
We conducted pixel-wise matching between the NDVI-derived masks (
Figure 9) and the predicted segmentation masks (
Figure 10) corresponding to the image presented in
Figure 11. To quantitatively assess segmentation performance, the regions occupied by water hyacinth were identified and counted. This process allowed for a precise evaluation of how accurately the regions were segmented by the model. The results, including the count of segmented regions and pixel-level accuracy, are summarized in
Table 5.
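The NDVI masks used as the reference here follow the standard definition NDVI = (NIR - Red) / (NIR + Red); a minimal sketch, where the 0.3 vegetation cutoff is a common rule-of-thumb assumed purely for illustration (the paper's exact threshold is not restated in this section):

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel on
    reflectance bands scaled to [0, 1]."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

def vegetation_mask(nir, red, thresh=0.3):
    """Binary vegetation mask via an assumed NDVI cutoff."""
    return ndvi(nir, red) > thresh

nir = np.array([[0.8, 0.2]])  # a vegetated pixel and a bare pixel
red = np.array([[0.1, 0.2]])
print(vegetation_mask(nir, red))  # [[ True False]]
```

Pixel-wise agreement between such NDVI-derived masks and the predicted segmentation masks is what the counts in Table 5 quantify.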
We also analyzed the U-Net model size and parameter count for each of the spectral bands considered in this research. We recorded the size, parameters, and FLOPs of each U-Net model to determine its efficiency with the various spectral bands and how these correlate with model performance. The results are presented in
Table 6. The results indicate that the number of spectral bands does not significantly affect the parameter count or the FLOPs of the U-Net model, so adding more spectral bands to the multispectral images does not substantially increase the computational cost. The U-Net models for RGB, IR 720 nm, IR 850 nm, NoIR, and combined IR have the same computational requirements, while the six-channel multispectral model incurs a negligible increase of 0.001 M FLOPs.
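This near-invariance follows from the architecture: only the first encoder convolution depends on the number of input channels. A small sketch makes the point (k = 3 is an assumed kernel size for illustration):

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * in_ch * out_ch + out_ch

# Moving from a 3-channel RGB input to a 6-channel multispectral input
# changes only this first layer; every deeper layer is unaffected.
extra = conv_params(6, 64) - conv_params(3, 64)
print(extra)  # 3 extra input channels x 3 x 3 x 64 = 1728 parameters
```

Against the tens of millions of parameters in the deeper encoder and decoder stages, these extra first-layer weights are negligible, consistent with the Table 6 results.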
7. Discussion
The use of the NDVI provides vegetation-specific spectral information that enhances the performance of standard segmentation metrics, leading to improved detection of water hyacinth. By leveraging NDVI-derived masks as a reference, we conducted a comparative analysis using intersection over union (IoU) to evaluate the precision of the predicted segmentation masks. The results, detailed in
Table 5 (
Section 6), demonstrate the degree of alignment between the NDVI-derived masks and the model predictions. A higher IoU indicates superior segmentation performance and a closer match to the actual water hyacinth regions.
The pixel-wise accuracy for RGB, IR 720 nm, IR 850 nm, combined IR, and multispectral images is summarized in
Table 5. Notably, the multispectral images achieved a 3% higher accuracy compared to RGB, 22% higher than IR 720 nm, 43% higher than IR 850 nm, and 25% higher than combined IR data. Among the tested wavelengths, the IR 850 nm channel exhibited the lowest accuracy. This result suggests that higher wavelength infrared data may be less effective for water hyacinth segmentation in UAV-based remote sensing applications. One potential explanation is that, at higher altitudes, longer wavelengths may not capture sufficient spectral variability, limiting the model’s ability to differentiate vegetation from the background. Another factor could be the interference caused by fluorescence at certain wavelengths, which reduces the quality of the spectral information acquired at altitude.
Additionally, we observed discrepancies between the predicted masks and the ground truth data. Certain vegetation details captured by the NDVI and multispectral imaging were absent from the manually annotated ground truth dataset. This indicates that the current ground truth may lack sufficient granularity, potentially limiting model performance. A promising avenue for improvement involves integrating NDVI-derived masks directly into the ground truth to enhance the training and evaluation pipelines. This approach could lead to more accurate segmentation by providing the model with richer, vegetation-specific annotations.
Another factor contributing to the diminished performance of the 850 nm band may be mechanical. During data collection, the 850 nm camera experienced an impact, possibly misaligning the focus and causing slightly blurred images. While not the primary cause, this likely contributed to the reduced accuracy. The poor performance of the 850 nm band can also be attributed to spectral overlap between vegetation and water at higher infrared wavelengths, which reduces contrast and differentiation capability.
The region covered by water hyacinths in this study has minimal interference from other plants, with surrounding areas consisting mostly of sparse vegetation or weeds, which our approach managed effectively. However, this represents a limitation, as the dataset does not capture ecosystems with higher vegetation diversity. For future studies, we aim to collect data in more diverse regions outside Japan, where water hyacinths are less controlled by climate and regulations. Misclassifications, especially at boundaries between dried vegetation and water hyacinths, can stem from spectral similarities; sparse background vegetation may also interfere with segmentation.
Future efforts will focus on several enhancements to address the current limitations. These include advanced data augmentation techniques, such as spectral transformations and synthetic data generation, to improve dataset diversity and account for more complex environmental conditions. Additionally, we plan to incorporate more robust cameras without infrared filters into the system, which, although more expensive, are expected to enhance image quality and segmentation performance. Finally, we propose integrating NDVI data as a primary source of ground truth in the model training process. By aligning the segmentation pipeline more closely with vegetation-specific spectral data, we aim to achieve further improvements in accuracy and robustness, particularly in complex environments where traditional RGB segmentation may fall short.
8. Conclusions
This study presents an effective approach for accurately detecting regions of water hyacinth using a custom-built multispectral camera and the U-Net architecture. By leveraging U-Net’s unique encoder–decoder structure with skip connections, we were able to effectively combine high-level contextual information with fine-grained spatial details, resulting in highly accurate segmentation masks.
The experimental results demonstrate several key advantages of the U-Net architecture in the context of water hyacinth detection:
Pixel-Level Accuracy: U-Net’s ability to perform pixel-wise segmentation enabled precise identification of water hyacinth regions, achieving a mean intersection over union (mIoU) of 97% for multispectral images.
Adaptability to Multispectral Data: The architecture was tailored to handle multispectral images with varying channel configurations (e.g., RGB, NIR, and RGB+NIR), showcasing its flexibility in incorporating different spectral bands to improve segmentation performance.
Performance Gains over Traditional Methods: Compared to RGB-only approaches, the integration of spectral information through U-Net yielded a 3% improvement in accuracy for multispectral configurations. This highlights the network’s ability to exploit spectral features for vegetation-specific segmentation tasks.
Robustness Across Spectral Bands: U-Net demonstrated consistent performance across different wavelengths, particularly excelling in lower infrared bands, which enhanced its capability to differentiate water hyacinth from other vegetation.
Additionally, this study emphasizes the importance of leveraging NDVI-derived masks as reference data to validate the segmentation results. By combining U-Net’s capabilities with vegetation-specific spectral information, we addressed limitations inherent in manually annotated datasets, paving the way for more accurate and reliable segmentation pipelines. Future work can build upon these findings by integrating NDVI-derived masks directly into the training pipeline to enrich ground truth data, further improving segmentation accuracy and robustness in complex environments. This research underscores the potential of U-Net as a powerful tool for semantic segmentation in remote sensing applications, particularly in addressing ecological challenges such as the management of invasive species like water hyacinth.
Our findings indicate that IR channels at lower wavelengths outperform those at higher wavelengths, with the multispectral configuration yielding the highest accuracy. This underscores the importance of selecting the appropriate spectral bands for vegetation segmentation tasks. The integration of multispectral imaging, combined with the strategic use of specific IR channels, demonstrated a measurable improvement in segmentation accuracy and precision over traditional RGB approaches.
Moreover, the analysis reveals that incorporating NDVI-derived masks as part of the training and evaluation pipeline holds potential for further enhancing model performance. By leveraging vegetation-specific spectral information, the accuracy of segmentation models can be significantly improved, addressing limitations observed in manually annotated ground truth datasets. The comparison summarized in
Table 7 highlights the advancements achieved in this study. Traditional machine learning methods such as Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (e.g., CART) have been widely applied to water hyacinth detection using datasets like Landsat 8 and Sentinel 2. However, these methods rely on handcrafted features and achieve variable accuracy depending on the dataset and environmental conditions. In contrast, the proposed approach leverages a custom-built multispectral camera and the U-Net architecture, achieving superior pixel-wise segmentation performance with a mean intersection over union (mIoU) of 97%.
The multispectral camera, specifically designed for this study, enables the capture of high-resolution images across six distinct channels, including RGB, NIR 720 nm, and NIR 850 nm. This allows for the integration of both spectral and spatial information, which is critical for detecting water hyacinth in complex environmental scenarios. When combined with the U-Net model’s ability to retain fine-grained spatial details through its skip connections, the proposed method demonstrates significant improvements in segmentation accuracy compared to traditional methods. This approach not only validates the effectiveness of the U-Net model but also showcases the innovation brought by the customized multispectral imaging system in addressing ecological challenges such as invasive species management.
For future research, we propose the integration of NDVI data into the label-generation process to refine the creation of ground truth masks. This approach can serve to enrich the training datasets and improve the effectiveness of deep learning models for semantic segmentation tasks focused on vegetation detection, with particular emphasis on invasive species such as water hyacinth.