OptimalNN: A Neural Network Architecture to Monitor Chemical Contamination in Cancer Alley

Abstract: The detrimental impact of toxic chemicals, gas, and oil spills in aquatic environments poses a severe threat to plants, animals, and human life. Regions such as Cancer Alley exemplify the profound consequences of inadequately controlled chemical spills, significantly affecting the local community. Given the far-reaching effects of these spills, it has become imperative to devise an efficient method for early monitoring, estimation, and cleanup, utilizing affordable and effective techniques. In this research, we explore the application of U-shaped neural network (UNET) and U-shaped neural network transformer (UNETR) models designed for the image segmentation of chemical and oil spills. Our models undergo training using the Commonwealth Scientific and Industrial Research Organization (CSIRO) dataset and the Oil Spill Detection dataset, employing a specialized filtering technique to enhance detection accuracy. We achieved training accuracies of 95.35% and 91% by applying UNET on the Oil Spill and the CSIRO datasets after 50 epochs of training, respectively. We also achieved a training accuracy of 75% by applying UNETR to the Oil Spill dataset. Additionally, we integrated mixed precision to expedite the model training process, thus maximizing data throughput. To further accelerate our implementation, we propose the utilization of the Field Programmable Gate Array (FPGA) architecture. The results obtained from our study demonstrate improvements in inference latency on FPGA.


Introduction
Cancer Alley spans 85 miles in southeastern Louisiana, stretching from New Orleans to Baton Rouge along the Mississippi River, with a population of approximately 45,000 [1]. This region hosts around 150 plastic plants, chemical facilities, and oil refineries, and this number continues to grow despite the evident environmental impact. The air in Cancer Alley is characterized by toxic emissions and ranks among the most polluted in the United States [1]. Approximately 50 toxic chemicals, including benzene, formaldehyde, ethylene oxide, and chloroprene, contribute to air pollution, with chloroprene being particularly concerning. The pollution in Cancer Alley has severe consequences for residents, many of whom eventually require nebulizers for survival. The recent coronavirus pandemic further exacerbated their plight because of their compromised health. Despite efforts by the Environmental Protection Agency (EPA) to regulate industries in the area and enhance the living standards of its residents, individuals in this region still face a 95% higher risk of cancer from air pollution compared with the rest of America [2,3]. The potential for catastrophic damage to land, marine, and coastal ecosystems underscores the importance of early detection and cleaning of oil and chemical spills in Cancer Alley to minimize environmental harm [4,5].
Chemical spill incidents exhibit a distinctive appearance in satellite images generated by Synthetic Aperture Radar (SAR) technology. This distinctiveness arises from the short gravity waves they induce, altering radar backscatter intensity and creating unique dark formations in SAR images [6][7][8]. Exploiting this characteristic enables the segmentation of the resulting SAR images and facilitates the training of a neural network model on the acquired data. The field of segmentation, as outlined in the literature [9], encompasses various types, including foreground segmentation, panoptic segmentation, semantic segmentation, and instance segmentation. Real-time segmentation primarily aims to predict masks over objects within an image frame with low latency.
Image segmentation is the process of dividing an image into distinct segments, thus enhancing the ease of analysis and comprehension [9]. This technique finds applications in critical fields such as healthcare, transportation, and pattern recognition. Various image segmentation algorithms fall into categories such as basic threshold-based, graph-based, morphological-based, edge-based, clustering-based, Bayesian-based, and neural network-based segmentation. Each of these algorithms comes with its own set of advantages and disadvantages, tailored to specific applications. Numerous studies have explored the topic, including the recent publication by A. Kirillov et al. [9], which introduces the "segment anything" framework by Meta AI. Their study implements a prompt-based segmentation tool trained on the most extensive segmentation dataset to date, utilizing 256 GPUs. The SA-1B dataset, created using Meta's custom data engine, comprises 1 billion masks and 11 million images collected from various countries and continents worldwide. Models trained on the SA-1B dataset demonstrate a capacity to generalize across a wide range of data. However, the model, a transformer model, has a notable drawback: it requires a substantial amount of energy for training. Additionally, the study reveals that training accuracy improves with larger datasets, indicating a need for more data to train more precise models [10]. In another study, Ronneberger et al. [11] developed the U-NET, a specialized neural network for image segmentation tasks. The training strategy in their study relies on data augmentation during training, enabling effective training on a minimal number of images compared with the previous study [10]. The U-NET achieves high accuracy of approximately 92 percent and 77.56 percent when trained on the PhC-U373 and DIC-HeLa datasets, respectively, for image segmentation tasks. Subsequently, Oktay et al. introduced the Attention U-Net [12], a more energy-efficient adaptation of the U-NET. This model efficiently learns to focus on target structures of varying shapes and sizes within the dataset, maintaining prediction accuracy without significant energy costs. In addition to these segmentation-focused studies, others, documented in reference [13], have developed neural network models specifically for monitoring oil spills, while [14,15] delve into the development of neural network models for healthcare-related applications.
In this research, we employed neural network models based on the U-NET architecture for image segmentation, specifically targeting chemical and oil spills. Our model was trained using the CSIRO dataset and the Oil Spill Detection dataset. Furthermore, we introduced mixed precision to streamline the model training process, optimizing data throughput on both the CPU and GPU platforms. As an additional acceleration strategy, we advocate for the adoption of FPGA architectures, leveraging frameworks like the FINN Xilinx framework [16,17] and HLS4ML [18] to synthesize bitstreams for machine learning models quickly. The structure of this manuscript unfolds as follows: Section 2 describes the segmentation approach employed in this study. Section 3 covers the neural networks used, while Section 4 explains the concept of mixed precision. Section 5 describes the methodology applied in training the neural network models under investigation. Section 6 presents the preliminary simulation results obtained on both CPU (central processing unit) and GPU (graphics processing unit). Following that, Section 7 outlines our FPGA optimizer architectures and presents the corresponding simulation results on the FPGA platform. Section 8 discusses the results and challenges of this study. Finally, Section 9 concludes this study.

Segmentation of Chemical Spills
Segmentation is a computer vision task that involves the classification of pixels within an image into classes. The image segmentation task involves extensive pixel-based processing, emphasizing the necessity of a thorough understanding of the data before selecting an appropriate model. Preceding the model training phase, we conducted a detailed examination of the images to enhance our comprehension of the datasets used in this study. Figure 1 illustrates the color distribution of a randomly selected open-source sample of an RGB image and a sample from the Oil Spill dataset across both the RGB and HSV color spaces. The diagram showcases the color distribution of our data sample compared to a typical RGB image. As shown in Figure 1, color spaces offer insight into the dispersion or concentration of content across the color channels of our images. Leveraging this understanding, we can determine the most suitable technique for segmenting different components of the image. In Figure 1, the distribution of samples from the Oil Spill dataset follows a linear pattern, differing from that of the golden fish. However, it displays varying color intensities, offering guidance on the optimal approach for crafting a segmentation model to capture the various segments of the images.
Semantic segmentation and instance segmentation stand out as the two predominant forms of segmentation used today. In instance segmentation [9], individual objects are identified and segmented within an image, with each instance assigned a unique label or color. Semantic segmentation, on the other hand, categorizes each pixel in an image into one of several predefined classes, where objects belonging to the same class share the same label or color. This contrasts with instance segmentation, which treats each object instance as a separate entity within the image. For this study, semantic segmentation is employed.
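The difference between the two forms of segmentation can be illustrated with a small sketch. It is not part of the pipeline used in this study: the class ids (0 for sea, 1 for oil), the toy label map, and the helper `instance_labels` are hypothetical. It derives an instance map from a semantic map by giving each 4-connected region of one class its own id.

```python
from collections import deque

def instance_labels(semantic, target_class):
    """Give each 4-connected region of `target_class` its own instance id (1, 2, ...)."""
    rows, cols = len(semantic), len(semantic[0])
    instances = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if semantic[r][c] != target_class or instances[r][c]:
                continue
            next_id += 1
            instances[r][c] = next_id
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and semantic[ny][nx] == target_class
                            and not instances[ny][nx]):
                        instances[ny][nx] = next_id
                        queue.append((ny, nx))
    return instances, next_id

# Semantic map: every oil pixel carries the same class label (hypothetical ids).
semantic_map = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# Instance map: the two separate oil patches receive different ids.
instance_map, n_instances = instance_labels(semantic_map, target_class=1)
```

In the semantic map both slicks share label 1, whereas the instance map separates them; the former is the representation this study trains on.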

Neural Network
In this research, we employed neural networks based on the U-NET architecture for the segmentation task. The U-NET architecture, as described in previous works [11,12], utilizes techniques such as data augmentation, convolution, pooling, upscaling, and downscaling to achieve its distinctive U-shaped network structure. Because of its ability to attain high training accuracy in a shorter time, the U-NET is well-suited for large-scale oil and chemical spill detection, offering a more power-efficient training process compared with transformer models.
Presently, various neural network models are utilized in segmentation tasks, including the robust SegmentAnything transformer model by Meta AI, the Vanilla U-NET model, the Attention U-NET model, and others. However, transformer models are less power-efficient for this specific task, requiring extensive training on large datasets to achieve comparable accuracy to the U-NET, which can achieve satisfactory results with minimal training. The U-NET implementation in this study is tailored to accommodate the distinct datasets, adapting to variations in image sizes across the datasets.

Architecture
The U-NET architecture adopted in this study closely resembles the configuration described in the previous work [11]. However, unlike the previous work [11], our study employs oil spill datasets for training the model. As illustrated in Figure 2, the U-NET consists of both a contracting and an expansive path. The contracting path iteratively employs convolution, followed by rectified linear unit (ReLU) and max-pooling operations. In our architecture, the convolution layers' feature channels extract features in the form of feature maps, which are subsequently propagated down the network.
The transformer variation of UNET is named UNETR [19]. In UNETR, the downscaling or encoding portion of the network is replaced with a transformer encoder, and the upscaling or decoding portion of the network maintains the U-shape, as shown in Figure 3. The UNETR transformer encoder is directly connected to the decoder via skip connections, instead of an attention layer, at different resolutions to compute the final three-dimensional (3D) semantic segmentation output. Skip connections, just like in the UNET model, help the network preserve information about features from the original input at each convolution level. Unlike convolutional neural networks (CNNs), which model features locally, transformers encode images as a sequence of 1D patch embeddings and utilize self-attention modules to learn the weighted sum of values that are calculated in the hidden layers.
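One step of the contracting path described above (convolution, ReLU, max-pooling, with the feature map retained for a skip connection) can be sketched in NumPy. This is an illustrative single-channel toy, not the trained model: the 3 × 3 unpadded convolution follows the original U-NET [11], while the input size, the random kernel, and the function names are assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Unpadded ('valid') 2D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: negative activations are set to zero."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (odd trailing rows/columns are dropped)."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def contracting_step(image, kernel):
    """Return (feature map kept for the skip connection, downsampled map)."""
    features = relu(conv2d_valid(image, kernel))
    return features, max_pool_2x2(features)

rng = np.random.default_rng(0)
image = rng.random((10, 10), dtype=np.float32)           # toy input
kernel = rng.standard_normal((3, 3)).astype(np.float32)  # stand-in for a learned filter
skip, pooled = contracting_step(image, kernel)           # shapes (8, 8) and (4, 4)
```

The `skip` output is what the expansive path would later concatenate with its upsampled features; `pooled` is what flows further down the network.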
The encoder shown in Figure 3 below and in Figure 1 [19] comprises a positional encoding layer, a stack of encoding layers that make up the encoder, and the causal attention and feed-forward layers. The encoder reads input signals and generates representations of the input data after it has learned the sequence representations of the input volume and effectively captured the global multi-scale information. The decoder, on the other hand, generates output word by word, based on the output signal representation, in the form of tokenized patches generated by the encoder. The vanilla decoder model comprises a stack of decoder layers and a positional encoder. The decoder layers contain global self-attention/cross-attention and feed-forward layers. The encoder and decoder are used to build the transformer model. In this work, we replace the decoder section with a U-NET decoder.
Transformers used for image recognition tasks are commonly called vision transformers. Therefore, our UNETR architecture in Figure 3 can be more precisely referred to as a vision UNETR model. Figure 4 shows how images are encoded by vision transformers [20] for image-related classification tasks. The input image is split into patches that constitute a linear sequence of tokens similar to words in the case of the Bidirectional Encoder Representations from Transformer (BERT) model [21]. The Multiheaded Self Attention (MSA) block involves computing self-attention on each head and finally concatenating the results as shown in (1) to (13). The computation on each head can be parallelized. By observing the data structure of the transformer, we came up with our design of a hardware accelerator.
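For reference, the relations numbered (1) to (13) can be written out as below. This is a sketch that follows the standard UNETR formulation in [19] and the symbol definitions given in this section. The assignment of numbers (6) to (8), and the symbols n (number of attention heads), Norm (layer normalization), and MLP (multilayer perceptron), are assumptions rather than a verbatim reproduction.

```latex
\begin{align}
X &\in \mathbb{R}^{H \times W \times D \times C} \tag{1}\\
X_v &\in \mathbb{R}^{N \times (P^3 \cdot C)} \tag{2}\\
N &= \frac{H \times W \times D}{P^3} \tag{3}\\
E_{pos} &\in \mathbb{R}^{N \times K} \tag{4}\\
E &\in \mathbb{R}^{(P^3 \cdot C) \times K} \tag{5}\\
Z_0 &= [X_v^1 E;\, X_v^2 E;\, \ldots;\, X_v^N E] + E_{pos} \tag{6}\\
Z'_i &= \mathrm{MSA}(\mathrm{Norm}(Z_{i-1})) + Z_{i-1} \tag{7}\\
Z_i &= \mathrm{MLP}(\mathrm{Norm}(Z'_i)) + Z'_i \tag{8}\\
A &= \mathrm{Softmax}\!\left(\frac{q k^{\top}}{\sqrt{K_h}}\right) \tag{9}\\
K_h &= \frac{K}{n} \tag{10}\\
\mathrm{SA}(z) &= A v \tag{11}\\
\mathrm{MSA}(z) &= [\mathrm{SA}_1(z);\, \mathrm{SA}_2(z);\, \ldots;\, \mathrm{SA}_n(z)]\, W_{msa} \tag{12}\\
W_{msa} &\in \mathbb{R}^{(n \cdot K_h) \times K} \tag{13}
\end{align}
```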
In the equations above, X in (1) represents the input, R denotes the real numbers, H, W, and D denote the height, width, and depth of our image frames, while C represents the number of input image channels. Xv in (2) represents the flattened sequence of uniform, non-overlapping patches of X. P in (2) denotes the resolution of each patch, and N is the length of the sequence, estimated using (3). Epos in (4) denotes the learnable positional embedding, while E in (5) represents the projected patch embedding. K in (4) and (5) denotes the size/dimension of the embedding space. Z in (6)-(8) represents the output sequence of the query (q) and the corresponding key (k) and value (v) pairs. A in (9) represents the attention weights/scores, and k represents the key. Kh in (10) is the scaling factor used to keep the number of parameters constant across different key values. In (11), v denotes the values of the input sequence and is used to calculate SA in sequence z. MSA represents multiheaded self-attention and is given by (12). Wmsa in (13) denotes the multiheaded trainable parameter weights.
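The quantities defined above can be made concrete with a short NumPy sketch that splits a volume into patches, projects them into the embedding space, and applies multiheaded self-attention. It is a toy under stated assumptions, not the model trained in this study: the sizes, the random weights, the fused query/key/value projection `w_qkv`, and all function names are hypothetical.

```python
import numpy as np

def patchify(x, p):
    """Flatten a volume of shape (H, W, D, C) into N non-overlapping patches of size p^3 * C."""
    h, w, d, c = x.shape
    assert h % p == 0 and w % p == 0 and d % p == 0, "dimensions must be divisible by p"
    x = x.reshape(h // p, p, w // p, p, d // p, p, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)       # put the three patch-grid axes first
    return x.reshape(-1, p * p * p * c)         # shape (N, P^3 * C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(z, w_qkv, w_msa, n_heads):
    """z: (N, K); w_qkv: (K, 3K); w_msa: (K, K). Returns an (N, K) array."""
    n, k = z.shape
    k_h = k // n_heads                           # per-head dimension K_h = K / n
    q, key, v = np.split(z @ w_qkv, 3, axis=-1)
    heads = []
    for h in range(n_heads):                     # each head is independent (parallelizable)
        s = slice(h * k_h, (h + 1) * k_h)
        a = softmax(q[:, s] @ key[:, s].T / np.sqrt(k_h))   # attention weights A
        heads.append(a @ v[:, s])                            # SA(z) = A v
    return np.concatenate(heads, axis=-1) @ w_msa            # concatenate heads, then project

rng = np.random.default_rng(0)
H, W, D, C, P, K, HEADS = 8, 8, 8, 1, 4, 16, 4   # hypothetical toy sizes
x = rng.standard_normal((H, W, D, C))
x_v = patchify(x, P)                              # N = (H * W * D) / P^3 = 8 patches
E = rng.standard_normal((P**3 * C, K)) * 0.1      # patch projection
E_pos = rng.standard_normal((x_v.shape[0], K)) * 0.1  # positional embedding
z0 = x_v @ E + E_pos                              # embedded patch sequence
out = multihead_self_attention(
    z0, rng.standard_normal((K, 3 * K)) * 0.1, rng.standard_normal((K, K)) * 0.1, HEADS)
```

Because each head only touches its own slice of q, k, and v, the loop over heads has no cross-iteration dependency, which is the property a hardware accelerator can exploit.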

Mixed Precision Architecture
The mixed precision architecture is an optimization technique harnessing the computational power of GPU cores, resulting in 2 to 4 times faster computation and a 50% reduction in memory usage. This approach creates a potent compute engine without necessitating alterations to the hardware architecture [22]. Specifically, Volta cores in NVIDIA GPUs, with a data throughput of 123 teraflops, experience significant benefits from this architecture [22]. By employing 16-bit precision instead of 32-bit precision, computing throughput in Volta cores can be enhanced by a factor of 8, memory throughput can be doubled, and the data unit input size can be halved [22].
In our implementation, we opted for mixed precision over a constant 16-bit precision to address potential imprecision in weight updates associated with FP16. This precision choice is crucial, as cumulative errors could significantly impact the final predictions. Mixed precision allows us to achieve nearly the same training and prediction accuracy as FP32 without altering hyperparameters. NVIDIA libraries, optimized for tensor cores, derive significant advantages from this architecture [22].
Major machine learning frameworks like PyTorch and TensorFlow have seamlessly integrated the mixed precision feature into their frameworks, facilitating the implementation of automatic mixed precision with just a few lines of code, as illustrated in Figures 5 and 6. For further customization, the mixed precision method can be manually added to different sections or lines of code.
In our framework, we utilized the APEX AMP (automatic mixed precision) PyTorch extension to implement mixed precision seamlessly with minimal code on an NVIDIA A100 GPU. Figure 6 illustrates the concept of mixed precision, where FP16 and FP32 values are cast to preserve accuracy. A scale factor of 128, commonly used for loss scaling, is employed to maintain values and accuracy, serving as a constant in our study.
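The two numerical effects that motivate mixed precision and loss scaling can be reproduced with a few lines of NumPy. This sketch does not use APEX or PyTorch; it only emulates FP16 rounding on the CPU. The example gradient and update values are hypothetical, and only the scale factor of 128 comes from the text.

```python
import numpy as np

LOSS_SCALE = np.float32(128.0)  # constant scale factor quoted in the text

def fp16_roundtrip(x):
    """Cast an FP32 value to FP16 and back, as a half-precision kernel would."""
    return np.float32(np.float16(x))

# 1) A small gradient underflows to zero in FP16 unless it is scaled first.
grad = np.float32(1e-8)                                   # hypothetical tiny gradient
unscaled = fp16_roundtrip(grad)                           # lost: becomes 0.0
scaled = fp16_roundtrip(grad * LOSS_SCALE) / LOSS_SCALE   # survives, then unscaled in FP32

# 2) A small update vanishes if the weights themselves are stored in FP16,
#    which is why mixed precision keeps an FP32 master copy of the weights.
w16 = np.float16(1.0) + np.float16(1e-4)                  # stays exactly 1.0
w32 = np.float32(1.0) + np.float32(1e-4)                  # update is retained
```

The recovered gradient is only approximately equal to the original because FP16 still rounds the scaled value; the point is that it is no longer zero.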

Training the Model
This phase of this project is the most resource-intensive. We conceptualized the U-NET model, conducted a profiling analysis on our model utilizing the PyTorch profiling library, implemented mixed precision, and scrutinized the resource utilization on both CPU and GPU.

Datasets
In this study, we utilized the CSIRO Sentinel dataset and the Oil Spill Detection dataset. The Oil Spill dataset [23], renowned for its non-commercial nature, has been widely adopted in numerous studies owing to its well-organized structure and ease of use in model training [24,25]. Conversely, the CSIRO Sentinel dataset [26] is expansive and open source but lacks pre-segmented ground truth labels, adding a layer of complexity to our task. The CSIRO dataset comprises 5630 binary images categorized into two classes, denoted as "0" and "1," where "0" represents images without any oil features (resembling clean seas), while "1" includes images featuring oil. The images in the CSIRO dataset have dimensions of 400 × 400 pixels, distinguishing them from those in the Oil Spill dataset.
The CSIRO dataset is generated via synthetic aperture radar (SAR) sensors [27]. These sensors are active microwave satellite instruments that operate day and night in any weather conditions, with wide swaths (>100 km) that can cover large areas of the ocean. They transmit repeated, regular short pulses of radio waves at a rate of about 100 microseconds and record the strength, phase, and travel time of the returning signal. Oil spill signatures in the generated image typically appear as dark patches because of the decreased radar backscatter compared with the much brighter surrounding seawater. These images can be used to assess the frequency and spatial distribution of oil spills. The information provided by SAR imagery used to create these data was found to be far superior to that from optical and thermal satellite imagery [27].

Preprocessing
Before initiating the model training, we generated ground truth labels for the CSIRO Sentinel dataset, marking a notable achievement due to the dataset's substantial size. As the dataset comprised unlabeled images, we employed LabKit software [28] to label the images manually. LabKit, an open-source tool, facilitates semi-manual image segmentation through selected samples and a random forest algorithm. The segmentation output from LabKit is in TIFF format, which prompted us to extract cropped images (masks) of the segmented samples. Subsequently, we resized these cropped images in paint software to obtain the final output, as illustrated in Figure 7.
J. Low Power Electron. Appl. 2024, 14, x FOR PEER REVIEW
While exploring alternative tools, we experimented with LabelStudio [29] and QuPath [30]. LabelStudio proved less suitable for this task, generating multiple images for various segments and making the output challenging to utilize.

Feature Extraction
The U-NET neural network leverages convolution and augmentation to extract features. To enhance its performance, we incorporate a Gaussian filter to reduce background noise effectively. This filtering process aids in better distinguishing between background pixels and surrounding pixels.
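The effect of Gaussian filtering on background noise can be sketched as follows. The kernel size, sigma, the synthetic image, and the additive noise model are all assumptions chosen for illustration; the filter parameters actually used in this study are not stated here, and real SAR noise (speckle) behaves differently from the additive noise simulated below.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2D Gaussian kernel of odd side length `size`."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def gaussian_filter(image, size=5, sigma=1.0):
    """Smooth a 2D image; borders are handled by reflecting the image."""
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(image, pad, mode="reflect")
    out = np.zeros_like(image, dtype=np.float64)
    # Accumulate shifted copies of the image weighted by the kernel entries.
    for dy in range(size):
        for dx in range(size):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out

rng = np.random.default_rng(0)
clean = np.zeros((32, 32))
clean[8:24, 8:24] = 1.0                             # bright square on a dark background
noisy = clean + rng.normal(0.0, 0.2, clean.shape)   # synthetic additive noise
smoothed = gaussian_filter(noisy, size=5, sigma=1.0)
```

After smoothing, pixel-to-pixel variation in the flat background region is much smaller than in the noisy input, at the cost of slightly blurred edges around the square.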

Classification
Given the similarity between the Oil Spill dataset and the CSIRO dataset, we adhered to the color labeling standards employed in various studies for SAR images [6,24,31], as depicted in Figure 8. Although there are emerging datasets [32,33], one of the objectives of this research is to establish a standardized color representation. This standardization aims to provide a consistent framework across diverse datasets, enabling researchers in the field to access a more extensive and uniformly labeled dataset for training purposes.

Simulation Results
After training our model for a total of 50 epochs using an NVIDIA A100 GPU on Google Colab, we obtained 91% and 93% training accuracy for the two datasets, as shown in Table 1 below. The results in Table 1 indicate a training accuracy of approximately 75% with the UNETR model, which is noticeably lower than the approximately 95% accuracy achieved with the UNET model. Vision transformer models typically demand a greater volume of data samples to match the accuracy levels of deep neural network models such as UNET. Therefore, employing a larger dataset might yield noticeable improvements in the training accuracy of the UNETR model. A limitation of the Oil Spill dataset that could have influenced the training and testing accuracies may lie in the composition of the available samples. The dataset prominently features images containing ocean, oil look-alike, and land samples. The emphasis on these features may have skewed the inference results towards more positive outcomes for these three classes.
In Figures 9 and 10, we show the plot of training and testing accuracies over time for the Oil Spill and the CSIRO datasets over 50 epochs.

FPGA Accelerator
Integrating an FPGA accelerator serves the purpose of power optimization, and improved streaming efficiency.The SegmentAnything model, for instance, employs around 250 GPUs during training, resulting in significant power consumption.In this study, we save our model in ONNX format for compatibility with FINN.The FINN framework [16] incorporates the Brevitas library, allowing the generation of FPGA accelerators using pretrained models.ONNX, as an open-source format, is employed to represent machine learning models.The FINN framework takes the ONNX file and generates an FPGA model for each layer of the network, establishing communication between layers through AXI streams.accuracy of the UNETR model.A limitation of the Oil Spill dataset that could have influenced the training and testing accuracies may lie in the composition of the available samples.The dataset prominently features images containing ocean, oil look-alike, and land samples.The emphasis on these features may have skewed the inference results towards more positive outcomes for these three classes.
The Brevitas framework [34], which works with the FINN builder, is used in the development of an FPGA accelerator for our model. Brevitas is a PyTorch library for neural network quantization, with support for both post-training quantization (PTQ) and quantization-aware training (QAT) [34]. It offers quantized implementations of the most common PyTorch layers used in deep neural networks (DNNs) under brevitas.nn. These include QuantConv1d, QuantConv2d, QuantConvTranspose1d, QuantConvTranspose2d, QuantMultiheadAttention, QuantRNN, QuantLSTM, and several others. For each of these layers, the quantization of different tensors (input, weight, bias, outputs, and other factors) can be individually tuned according to a wide range of quantization settings [34]. Brevitas enables fine-grained quantization-aware training [16].
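As an illustration of what quantizing a weight tensor involves, the sketch below applies symmetric uniform quantization to a short list of weights at several bit widths. It is a plain-Python illustration of the general technique under simple assumptions (one scale per tensor, signed integer grid); it does not use or reproduce the Brevitas API, and the function name and example weights are hypothetical.

```python
def quantize_symmetric(values, num_bits):
    """Map floats to a signed integer grid and back to floats.

    Returns the integer codes, the dequantized values, and the scale.
    """
    if num_bits < 2:
        raise ValueError("this sketch covers signed grids of 2 bits or more")
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [code * scale for code in codes]
    return codes, dequantized, scale


weights = [0.82, -0.41, 0.05, -1.27, 0.33]  # illustrative values only
for bits in (8, 4, 2):
    codes, dequantized, scale = quantize_symmetric(weights, bits)
    max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
    print(f"{bits}-bit codes: {codes}, max abs error: {max_error:.4f}")
```

The rounding error grows as the bit width shrinks, which is the trade-off that quantization-aware training tries to compensate for by exposing the network to the quantized values during training.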
Another tool used to generate our accelerator is ONNXRuntime [35], which is used for integration with standard ONNX-based toolchains. Open Neural Network eXchange (ONNX) is an open standard format for representing machine learning models. ONNX export is built into PyTorch as the torch.onnx module. This module captures the computation graph from a native PyTorch torch.nn.Module model and converts it into an ONNX graph, which can be exported and consumed by several runtimes that support ONNX. The Hugging Face transformers.onnx package similarly converts transformer models to ONNX-format models. The ONNX standard supports down to 8-bit quantization, but another version named Quantized ONNX (QONNX) supports expressing down to 1-bit quantization for both weights and activations [16].
Finally, the FINN framework [16] is a quantization-aware framework used for the generation of custom FPGA dataflow accelerators or register-transfer level (RTL) models. It is designed to work with ONNX models. FINN uses ONNX as an intermediate representation for neural networks; as such, almost every FINN component uses ONNX and its Python API. FINN supports two specialized variants of ONNX, namely, QONNX and FINN-ONNX. FINN also provides a ModelWrapper class, a thin wrapper around the ONNX model that makes it easier to analyze and manipulate ONNX graphs. This wrapper provides many helper functions while still giving full access to the ONNX protobuf representation.
FINN supports three types of mem_mode attributes for the node MatrixVectorActivation [16]. This mode controls how the weight values are accessed during the execution phase. The mode setting has a direct influence on the resulting circuit. The three settings for the mem_mode supported in FINN are "const", "decoupled", and "external". Each comes with its own advantages and disadvantages. Figure 11 shows the design flow employed in the design of our accelerator.
A significant challenge encountered during our model development was ensuring the proper functioning of the software stack. We attempted to utilize HLS4ML [18], which was primarily designed for Keras, as an alternative to FINN, but faced similar compatibility issues. Both frameworks exhibited instability during accelerator development. However, given their ongoing development, they hold promise for significantly enhancing the speed of FPGA bitstream development for machine learning models in the future. To overcome the hurdles associated with developing and verifying bitstreams using FINN and HLS4ML, we opted to design accelerators for the UNET and UNETR models from scratch using High-Level Synthesis (HLS). The resulting accelerators are depicted in Figures 12 and 13.

FPGA Inference Results
We generated an FPGA design for our model via HLS and verified the design on the Pynq Z1 board. Currently, resource usage by our FPGA design suggests low power consumption by the Pynq Z1 board [36]. Table 2 shows the resource usage profile of our UNET and UNETR models, as well as the inference latency achieved.

Discussion
The semantic segmentation technique utilized in this study assigns a class label to each pixel in the image samples from our dataset. Figure 8 illustrates the names of the various classes present in our dataset, while Figure 14 reveals that the ocean (background) constitutes most of the samples. This distribution suggests that our models are more likely to predict class 0 (ocean) accurately because of its predominance among the samples compared with the other classes.
In Figure 15, we evaluate our classification model's performance using a confusion matrix, specifically focusing on the UNET model. The results indicate that all classes perform well except for class 3 (ship). The UNET model struggles to distinguish between the ocean (class 0) and ships (class 3), frequently misclassifying ships as the ocean. This issue can be attributed to the fact that class 3 has the fewest samples (22,981) in the dataset, as shown in Figure 14, which may be insufficient for the model to generalize effectively with only 50 training epochs. Figure 16 illustrates the differences between the test mask and the predicted mask after training the model for 50 epochs. Finally, we performed inference on FPGA and displayed the results in Figure 17. In the real world, the impacts of chemical spills and contamination are prevalent not only in Cancer Alley but also in other less-developed parts of the world. The results of this study will have far-reaching implications and reduce the cost of monitoring contamination and effectively detecting chemical spills.
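The effect of class imbalance on a confusion matrix, as discussed for the ocean and ship classes, can be reproduced with a small sketch. The label counts below are invented for illustration and are not taken from Figure 14 or Figure 15; only the class indices (0 = ocean, 3 = ship) follow the text, and the helper names are hypothetical.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, pred in zip(true_labels, pred_labels):
        matrix[true][pred] += 1
    return matrix


def per_class_recall(matrix):
    """Recall per class; None for classes with no true samples."""
    recalls = []
    for cls, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[cls] / total if total else None)
    return recalls


# Invented, flattened pixel labels: 90 ocean pixels and 10 ship pixels.
true_labels = [0] * 90 + [3] * 10
# The hypothetical model labels 8 of the 10 ship pixels as ocean.
pred_labels = [0] * 90 + [0] * 8 + [3] * 2

matrix = confusion_matrix(true_labels, pred_labels, num_classes=4)
recalls = per_class_recall(matrix)
overall = sum(matrix[c][c] for c in range(4)) / len(true_labels)

print("overall accuracy:", overall)
print("recall for ocean (class 0):", recalls[0])
print("recall for ship (class 3):", recalls[3])
```

Overall accuracy stays high even though most ship pixels are misclassified, which is why the per-class view in a confusion matrix is more informative than a single accuracy figure for imbalanced data.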
To compare the performance of our model, we looked at related studies that applied neural network models for image segmentation of an oil spill dataset or a related dataset, as shown in Table 3. C. Li et al. [37] perform image segmentation using dual stream U-NET (DS-UNET) on two datasets, namely, the PALSAR and Sentinel datasets. Their study further measures model performance according to three metrics, namely, the Dice similarity coefficient (DSC), the average Hausdorff distance (HD), and the F1 score. Another study by A.V., Maria Anto et al. [38] uses a convolutional neural network (CNN) for oil spill detection. They achieved 85% testing accuracy. The study by J. Fan and C. Liu [39] addresses two problems: the scarcity of sufficient oil spill data and the difficulty of detecting oil spills in an environment where there are oil spill look-alikes. Their study [39] uses multitask generative adversarial networks (MTGANs) to detect and semantically segment oil spill data. They applied their model to three datasets, namely, the Sentinel-1, ERS-1/2, and GF-3 satellite datasets. In [40], the study by X. Kang et al. uses a self-supervised spectral-spatial transformer network (SSTNet) for feature extraction using custom hyperspectral oil spill database (HOSD) data. The training technique applied in this method involved a large number of training epochs to achieve a model that can generalize with high accuracy. Another study by J. Fan et al. [41] built a framework using a multi-feature semantic complementation network (MFSCNet) for oil spill localization and segmentation of SAR images obtained via Sentinel-1 satellite data. The study by Mahmoud, A.S. et al. [42] applies a novel deep learning UNET model based on the Dual Attention Model (DAM). This model, named DAM-UNet, integrates a dual attention model to selectively highlight the relevant and discriminative global and local characteristics of oil spills in SAR images. It does this using a channel attention map and a position attention map. Finally, Dong et al. [43] propose the application of three deep learning-based marine oil spill detection methods, namely, a direct detection method based on transformer and UNet, a detection method based on Fast and Flexible CNN (FFDNet) and TransUNet with denoising before detection, and a detection method based on integrated multi-model learning. The performance benefits of the proposed method are then verified by comparing them with semantic segmentation models such as UNet, SegNet, and DeepLabV3+. When compared with our work, these approaches mostly require more training to obtain better accuracy, as shown in Table 3.
Apart from FPGAs, other alternative hardware used to perform inference includes various Application Specific Integrated Circuits (ASICs) and neuromorphic hardware for event-based datasets. Since this study focuses on CPUs, GPUs, and FPGAs, Table 4 compares results from related FPGA implementations for image segmentation using UNET or other related networks. When performing machine learning inference on images using specialized non-reconfigurable hardware, latency and throughput can present significant challenges. FPGAs address these issues effectively because of their ability to be reconfigured and programmed with different architectures, thereby enhancing inference performance without requiring new hardware purchases. Additionally, FPGAs consume less power compared with GPUs and CPUs. These benefits make FPGAs the preferred choice for resource-intensive tasks like image segmentation, as demonstrated in this study.

Conclusions
In this study, we utilized the UNET and UNETR neural network architectures to perform semantic segmentation of chemical spills, leveraging two distinct datasets. This study has practical application in the real world, as it can be challenging to detect and separate oil spill look-alikes from actual oil spills in the field. A notable aspect of our work is the development of reusable labeled ground truth images specifically tailored for the CSIRO dataset, a task previously unexplored. Our implementation integrates mixed precision techniques to enhance computational efficiency across both CPU and GPU platforms. Furthermore, we engineered an FPGA accelerator for the neural networks using High-Level Synthesis (HLS). Despite initial setbacks with tools like FINN and HLS4ML, we successfully devised a custom FPGA implementation using Vivado HLS. Our findings reveal a significant discrepancy in resource utilization between the UNETR and UNET models, primarily because of their divergent sizes. Consequently, the implementation of UNETR necessitates targeting alternative Pynq (software)-compatible FPGA boards with ample LUT and DSP resources, such as the ZCU102 and Alveo boards. Ultimately, our experiments demonstrate that the UNET model surpasses the UNETR model in terms of prediction accuracy on both CPU and GPU platforms. Moreover, owing to its more efficient resource utilization, the UNET model emerges as the preferred choice for this task. Finally, the results obtained from our study demonstrate improvements in inference latency on FPGA, with ~94% prediction accuracy using UNET and ~77% prediction accuracy using UNETR.

Figure 1. RGB (Red, Green, Blue) image and dataset sample across the RGB and HSV (Hue, Saturation, Value) color spaces.
other hand, generates output word by word, based on the output signal representation, in the form of tokenized patches generated by the encoder.The vanilla decoder model comprises a stack of decoder layers and a positional encoder.The decoder layers contain global self-attention/cross-attention and feed-forward layers.The encoder and decoder are used to build the transformer model.In this work, we replace the decoder section with a U-NET decoder.

Figure 4. Architecture of the vision transformer.

Figure 5. Diagram showing the implementation of mixed precision using the PyTorch apex amp framework.

Figure 6. Diagram showing the concept of the mixed precision architecture.

Figure 8. Labeling convention used to create labels for the dataset.

Figure 9. Plot of training vs. testing accuracy using the UNET model on the Oil Spill (left) and CSIRO datasets (right).

Figure 10. Plot of training vs. testing accuracy using the UNETR model on the Oil Spill (left) and CSIRO datasets (right).

Figure 11. Diagram showing the generated FPGA architecture using Vivado HLS.

Figure 12. Architectures to handle the transformer component of our UNETR model.

Figure 13. Architecture that handles the CNN components of our UNETR model.

Figure 14. Distribution of data classes within the Oil Spill dataset.

Figure 15. Confusion matrix showing the performance of our UNET model.

Figure 16. Showing the test mask and the predicted mask.

Table 2. FPGA resource usage and power consumption.
"N/A" denotes that the amount of memory used varies.