A Novel Approach to Maritime Image Dehazing Based on a Large Kernel Encoder–Decoder Network with Multihead Pyramids

Abstract: With the continuous increase in human-robot integration, battlefield formation is experiencing a revolutionary change. Unmanned aerial vehicles, unmanned surface vessels, combat robots, and other new intelligent weapons and equipment will play an essential role on future battlefields by performing various tasks, including situational reconnaissance, monitoring, attack, and communication relay. Real-time monitoring of maritime scenes is the basis of battle-situation and threat estimation in naval battlegrounds. However, images of maritime scenes are usually accompanied by haze, clouds, and other disturbances, which blur the images and diminish the validity of their contents. This has a severe adverse impact on many downstream tasks. A novel large kernel encoder-decoder network with multihead pyramids (LKEDN-MHP) is proposed to address some maritime image dehazing-related issues. The LKEDN-MHP adopts a multihead pyramid approach to form a hybrid representation space comprising reflection, shading, and semanteme. Unlike standard convolutional neural networks (CNNs), the LKEDN-MHP uses many kernels with a 7 × 7 or larger scale to extract features. To reduce the computational burden, depthwise (DW) convolution combined with re-parameterization is adopted to form a hybrid model stacked from a large number of different receptive fields, further enhancing the hybrid receptive fields. To restore natural hazy maritime scenes as faithfully as possible, we apply digital twin technology to build a simulation system in virtual space. The final experimental results, based on the evaluation metrics of the peak signal-to-noise ratio, structural similarity index measure, Jaccard index, and Dice coefficient, show that our LKEDN-MHP significantly enhances dehazing and real-time performance compared with state-of-the-art approaches based on vision transformers (ViTs) and generative adversarial networks (GANs).


Introduction
Real-time monitoring of maritime situations underpins threat appraisal on the naval battlefield. The core task of real-time monitoring of offshore ship targets is to obtain real-time high-definition (HD) images of the sea surface, which have promising applications in fishery management, maritime rescue, offshore traffic monitoring, naval battlefield situational awareness, and other fields. However, aerial images carry large amounts of data and are susceptible to various factors, such as complex sea surface conditions, extreme weather, poor lighting conditions, and loading errors from imaging detectors. Some disturbances in natural environments, such as haze, sea clutter, and clouds, negatively influence decisions based on visual features in a real-time system. In this work, we limit our discussion to interference from a hazy environment at low and ultra-low altitudes.

Mathematical Model of Atmospheric Scattering
There are two root causes for haze in sea surface images: (1) the light reflected by the target is absorbed and scattered by suspended particles during transmission, resulting in energy attenuation as well as reductions in image brightness and contrast; and (2) sunlight, skyglow, and other sources of light are scattered by particles to form stray light, causing blurred images and unnatural colors. The atmospheric scattering model commonly used in computer vision and computer graphics is as follows [1]:

I(x) = J(x)t(x) + A[1 − t(x)],  (1)

where I(x) is the observed image intensity at pixel x, and J(x) is the scene radiance. Generally, we can regard haze near the ground as homogeneous. The transmission t(x), which represents the fraction of light that survives its interaction with the atmosphere, can be expressed as:

t(x) = e^(−β(λ)d),  (2)

It describes the part of the light that is not scattered and reaches the camera; β(λ) denotes the scattering coefficient. According to Rayleigh's law [2] of atmospheric scattering, the scattering coefficient β is related to the wavelength λ by:

β(λ) ∝ λ^(−γ),  (3)

where 0 ≤ γ ≤ 4, and γ depends on the size of the particles suspended in the gas. On clear days, γ = 4: molecular scattering is selective, and blue wavelengths are scattered more than the other visible wavelengths. Fog and haze, on the other hand, scatter all visible wavelengths roughly equally, so γ ≈ 0 [3]. In this paper, we assume the scattering coefficient β(λ) = β to be constant, since the camera usually has narrow spectral bands. Here, d is the distance between the object and the imaging system; we can infer from (2) that light reflected from the object surface is attenuated exponentially with the scene depth d. A is the global atmospheric light and can be regarded as a constant if images are taken under the same weather conditions. The first term J(x)t(x) on the right-hand side of (1) is called direct attenuation, which describes the scene radiance and its attenuation in the atmosphere. The second term A[1 − t(x)] is called airlight [4], which adds a white atmospheric veil.
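The model in (1) and (2) can be sketched numerically. The following NumPy snippet (the depth values and coefficients are illustrative, not taken from the paper) shows how direct attenuation and airlight combine as scene depth grows:

```python
import numpy as np

def transmission(depth, beta=1.0):
    """t(x) = exp(-beta * d): fraction of light reaching the camera."""
    return np.exp(-beta * depth)

def hazy_image(J, depth, A=1.0, beta=1.0):
    """I(x) = J(x) t(x) + A [1 - t(x)]: direct attenuation plus airlight."""
    t = transmission(depth, beta)
    return J * t + A * (1.0 - t)

# A toy 1-D "scene": radiance 0.8 everywhere, depth growing left to right.
J = np.full(4, 0.8)
d = np.array([0.0, 1.0, 2.0, 10.0])
I = hazy_image(J, d, A=1.0, beta=0.5)
# At depth 0 the pixel is unchanged; at large depth it tends to airlight A.
```

Because A is brighter than the scene radiance here, the observed intensity increases monotonically with depth, which is exactly the "white veil" described above.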

Contributions
A large kernel encoder-decoder network with multihead pyramids (LKEDN-MHP), combining convolutional neural networks (CNNs) and vision transformers (ViTs), can powerfully narrow the performance gap between CNNs and ViTs. Notably, CNNs possess translational invariance and a local receptive field, whereas ViTs have a global receptive field and a self-attention mechanism. First, we introduce a guidance map, combined with the hazy images, to estimate the thickness of haze directly and generate a realistic haze-free image in the data input stage. Then, the preprocessed images are fed into the global aggregation and local convolution paths. Multiple large kernel convolution layers are used to obtain a larger effective receptive field (ERF), forcing the model to focus on capturing the local region's fine-grained and deep semantic information in the local convolution path. In the global aggregation path, by contrast, global average pooling (GAP) is applied to each channel, followed by an upsampling process. The global features are merged with the local features generated by the two paths to obtain a rich semantic representation space, allowing for superior performance in downstream tasks. Shading and texture perception are essential visual cues for recognizing objects and interpreting scenes. Therefore, the proposed LKEDN-MHP jointly integrates the principles of the haze-free approach from the global and local paths for input hazy maritime images. We design a multihead pyramid for predictions of reflectance, shading, and semanteme in the local path. Since the reflectance component contains the factual color information of the scene, it encourages the reflectance prediction subnetwork to learn more intermediate features, which helps restore proper color information in the degraded images. Furthermore, the intermediate features can serve as complementary features for the semantic subnetwork and thus improve the dehazing results with high color contrast. Similarly, the shading prediction subnetwork can provide rich complementary features beneficial to texture enhancement, thus improving the dehazing results with fine details.
The main contributions of this paper include the following:
• We propose a novel CNN architecture for the maritime image dehazing task, namely, the LKEDN-MHP, which combines global and local information. Predictions of reflection, shading, and semanteme are connected to provide rich complementary information for the dehazing task, so that the high-quality haze-free images generated by the LKEDN-MHP have natural colors and fine details to the human eye.

• We propose an improved pure CNN paradigm and demonstrate that a few large convolutional kernels can be more powerful than a stack of small kernels. The LKEDN-MHP uses multiple large kernels to obtain a larger ERF, similar to the multihead self-attention (MHSA) mechanism of ViTs. By combining the advantages of CNNs and ViTs, the CNN paradigm is further enhanced, improving dehazing performance on hazy images.

• We establish a 3D digital twin system to verify the performance of the LKEDN-MHP. We build 3D models of oceanic scenarios, moving ocean-going vessels, inshore ships, and maritime reconnaissance aircraft. Furthermore, the real offshore reconnaissance scene is restored, including a haze rendering method to simulate the impact of haze in the real world. A description of research and analysis of modern naval warfare, monitoring, and warning is given. This approach creates considerable numbers of datasets related to naval warfare situations, which will be helpful for the ViT-related development of guidance and precision strike weapons technology.
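The large-kernel claim in the second contribution can be checked with simple receptive-field and parameter arithmetic. The helper functions below are our own illustrative names, not from the paper:

```python
def stack_erf(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each K x K layer adds K - 1 pixels to the receptive field."""
    erf = 1
    for k in kernel_sizes:
        erf += k - 1
    return erf

def dw_conv_params(channels, k):
    """Parameter count of a depthwise K x K convolution over C channels."""
    return channels * k * k

def dense_conv_params(channels, k):
    """Parameter count of a standard K x K convolution (C in, C out)."""
    return channels * channels * k * k

# One 31 x 31 kernel covers the same receptive field as fifteen 3 x 3 layers.
assert stack_erf([31]) == stack_erf([3] * 15) == 31

# Depthwise convolution keeps the large-kernel cost linear in C:
ratio = dense_conv_params(128, 7) / dw_conv_params(128, 7)  # factor of C = 128
```

This is why the paper pairs large kernels with DW convolution: the single large kernel reaches a large ERF in one layer, while the depthwise form avoids the C-fold parameter blow-up of a dense kernel.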
The remainder of the paper is organized as follows: Section 2 briefly reviews related work, focusing on image dehazing methods based on CNNs and ViTs. The proposed method, the LKEDN-MHP, is described in detail in Section 3. Section 4 reports our experiments and the analysis of the results. Finally, Section 5 summarizes our main conclusions and highlights future work.

Related Work
Deep learning (DL) algorithms are extensively implemented in multiple fields and have achieved remarkable results, for example in facial expression recognition and human pose estimation [5,6]. The state-of-the-art dehazing algorithms are wholly based on data-driven learning [7], including haze-free image generation methods based on CNNs and ViTs. ViTs have been appreciated by many researchers due to their powerful attention mechanism and flexibility in time-series modeling, but they have still not achieved supremacy in the field of image processing [8][9][10].

Image Dehazing Approaches Based on CNNs
DL approaches can fit training data well with GPU hardware acceleration, and haze-free techniques based on CNNs have been ubiquitously studied. State-of-the-art dehazing networks based on DL implement an end-to-end framework and achieve better performance than traditional methods, including the Enhanced Pix2pix Dehazing Network (EPDN) [11], the dense attentive dehazing network (DADN) [12], and the multiscale boosted dehazing network (MSBDN) [13]. These algorithms are trained end-to-end on haze-free images directly. With recent advancements, generative adversarial networks (GANs) for image style transfer, such as Pix2pix and CycleGAN [14,15], have become increasingly popular. Image dehazing can also be considered a style transformation: an image is transferred from a hazy domain to a haze-free domain. Engin et al. attempted to combine cycle consistency and perceptual losses in the CycleGAN framework [16].
In addition, approaches derived from image segmentation, such as the feature pyramid network (FPN), have been demonstrated to be effective in image dehazing applications. Image segmentation networks usually use an encoder-decoder to learn an embedded representation of features after the input data are mapped to higher dimensions. Chaurasia and Culurciello proposed an efficient semantic segmentation architecture based on a fully convolutional encoder-decoder framework [17]. Their encoder uses a ResNet18 model for feature encoding and avoids spatial information loss by reintroducing residuals from each encoder to the output of its corresponding decoder [18].

Image Dehazing Approaches Based on ViTs
ViTs have recently become active in visual processing due to their powerful attention mechanism and flexibility in time-series modeling, with applications including object detection, image classification, and semantic segmentation [19,20]. The MHSA mechanism plays a vital role in ViTs and has been well documented: MHSA is flexible [21], has powerful performance (less inductive bias) [22], is robust to distortion [23,24], and is able to model long-term dependencies [25,26]. However, some works have questioned the necessity of MHSA and attributed the high performance of ViTs to proper building blocks or sparse dynamic weights [27][28][29][30].
To the best of our knowledge, the computational complexity of ViTs grows quadratically with the number of image pixels, which restrains their application in dehazing tasks. Many variants of ViTs have been proposed to alleviate the problem of high computing costs. For example, the pyramid vision transformer (PVT) applies a transformer to lower-resolution features, significantly reducing computational costs [31]. Another solution, proposed in the Swin Transformer [32], is locally grouped self-attention, where the input features are divided into a grid of nonoverlapping windows and the visual transformer then works only within each window. LocalViT [33] applies a local mechanism to ViTs by introducing a CNN into the feed-forward network. CvT [34] applies convolutional token embedding to relate local context and a convolutional projection layer to provide efficiency benefits. In addition, it has been shown that the fusion of global and local information is significant for upstream visual tasks. However, ViTs do not focus on local textures [8] and cut images into sequences as input, causing them to lack translational invariance when used in image processing tasks. Approaches based on ViTs also have many limitations in interpretability [7] and are not strong candidates for image dehazing. These limitations prompted us to investigate a more powerful approach for addressing the issue of maritime image dehazing.

Proposed Approach
In ViTs, MHSA is commonly designed as a fusion of global features [8,31,35] or as local attention over large windows [9,32,36], so that each output from a single MHSA layer can gather information from a large region. However, large kernels are not popular in CNNs (except in the first layer [18]). Instead, a stack of many small spatial convolutions [18,24,37,38] (e.g., 3 × 3) is typically used to enlarge the ERF in state-of-the-art CNNs. ViTs and CNNs do not have to be applied independently. One aim of this paper is to analyze the respective advantages of the two approaches and improve standard image dehazing performance. Therefore, based on the above principles, the LKEDN-MHP is proposed in this paper to address the issues of extensive computation and global-local ERFs.

Architecture Specification
GANs, such as the EPDN and FD-GAN algorithms, are extensively implemented in single-image dehazing tasks.The LKEDN-MHP architecture proposed in this paper is shown in Figure 1.
According to Figure 1, the LKEDN-MHP has five components: (1) Block 1, (2) Block 2, (3) a multihead pyramid, (4) self-attention, and (5) a global aggregation path. The input is a 512 × 512 × 3 RGB image. Notably, the guidance map is combined with the hazy image as the input of the local convolution path to estimate the haze thickness and the haze-free image directly.
• Block 1 refers to the beginning layers. Since we target a high-performance backbone for downstream dense-prediction tasks, whose input data sizes are commonly large, we aim to capture more details by implementing several convolutional layers in the initial stage. The number of channels C 1 in Block 1 is 64. After the first 3 × 3 layer with 2× downsampling, we apply a 7 × 7 DW layer to capture low-level patterns, followed by a 1 × 1 conv and another 7 × 7 DW layer for downsampling; the process of DW convolution is shown in Figure 2.
• In the self-attention block, the attention score is generated by the GAP layer and two fully connected layers, and this score is then added to the fused MHP output to force the model to focus on critical information. We introduce the self-attention block to effectively integrate complementary features: it adaptively boosts the weights of the appropriate complementary channels while suppressing unrelated channels, thereby highlighting essential information, cutting redundant information, and optimizing feature-fusion performance.

• The global aggregation path provides the global feature to the local convolution path; the local-global hybrid feature is obtained by elementwise aggregation of the features carrying the attention score with the features obtained through GAP, a fully connected layer, and upsampling. The global aggregation path improves efficiency while still aggregating global information. The hybrid output of the local and global paths is converted back to the original size by 3 × 3 convolution and upsampling to obtain the generated HD haze-free image.
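The global aggregation path can be sketched in a few lines of NumPy. This is a minimal illustration of the GAP-then-upsample-then-fuse idea only: nearest-neighbour broadcasting stands in for the learned fully connected layer and upsampling, which is an assumption on our part.

```python
import numpy as np

def global_aggregation(local_feat):
    """Sketch of the global path: per-channel global average pooling,
    nearest-neighbour 'upsampling' back to the spatial size, and
    elementwise fusion with the local features."""
    # local_feat: (C, H, W)
    c, h, w = local_feat.shape
    gap = local_feat.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1)
    upsampled = np.broadcast_to(gap, (c, h, w))          # (C, H, W)
    return local_feat + upsampled                        # local-global hybrid

feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
hybrid = global_aggregation(feat)
```

Every spatial position of the hybrid feature now carries its channel's global mean, so even a small local kernel downstream "sees" image-wide statistics.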
The discriminator takes a generated haze-free image or the GT as its input, and the output is obtained through the encoder and fully connected layers. The multihead pyramid encoder is employed so that the discriminator has the same capacity to extract and analyze sophisticated features as the generator, causing the two networks to compete and thereby improve each other's performance.
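The DW convolutions used throughout Blocks 1 and 2 can be illustrated with a naive NumPy loop. This is a sketch for clarity, not an efficient implementation; real frameworks express the same operation via grouped convolution.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Depthwise convolution: channel c is filtered only by kernels[c]
    ('valid' cross-correlation, no mixing across channels)."""
    c, h, w = x.shape
    kh, kw = kernels.shape[1:]
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((c, oh, ow))
    for ch in range(c):
        for i in range(oh):
            for j in range(ow):
                out[ch, i, j] = np.sum(x[ch, i:i + kh, j:j + kw] * kernels[ch])
    return out

x = np.ones((5, 6, 6))   # 5 input channels, as in the Figure 2 example
k = np.ones((5, 3, 3))   # one 3 x 3 kernel per channel
y = depthwise_conv(x, k)
# Each output value sums a 3 x 3 window of ones.
```

Because each kernel touches only its own channel, a following 1 × 1 (pointwise) convolution is needed to mix information across channels, which is exactly the DW + PW pattern Block 2 uses.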

Loss Function
Since adversarial loss has achieved significant success in the image restoration field, for the sake of both pixel quality and human perception, we adopt a combination of adversarial loss, MSE loss, and perceptual loss [14,42,43] in the proposed scheme. The adversarial loss is defined as [14]:

L_A = (1/B) ∑_{n=1}^{B} log[1 − D(G(z_n))],

where B is the number of samples in the mini-batch, D(·) is the output of the discriminator, G(·) is the generated image, and z is the input hazy image. The MSE loss function is [42]:

L_MSE = (1/K) ∑_{k=1}^{K} [G(z)_k − R_k]^2,

where K is the number of pixels in the generated image and R is the GT. The perceptual loss, used to measure perceptual similarity in feature space, is [43]:

L_P = (1/P) ∑_{p=1}^{P} [φ(G(z))_p − φ(R)_p]^2,

where P is the number of elements in the feature map φ of layer conv3-3 of the VGG-16 model. Combining the related loss functions, the integral loss optimizing the generator can be formulated as [10,42,43]:

L_G = α_1 L_MSE + α_2 L_P + α_3 L_A.

The critical loss function of the discriminator is [44]:

L_D = −α_4 (1/B) ∑_{n=1}^{B} {log D(R_n) + log[1 − D(G(z_n))]}.

For our present scheme, the weights of the integral and critical loss functions are set as α_1 = 500, α_2 = 500, α_3 = 500, α_4 = 1.
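The loss terms can be sketched in NumPy as follows. How the weights are distributed across the terms is only partially specified in the text, so the split used in `generator_loss` is an illustrative assumption, and the function names are ours:

```python
import numpy as np

def mse_loss(gen, gt):
    """Pixelwise MSE between the generated image G(z) and the GT R."""
    return np.mean((gen - gt) ** 2)

def perceptual_loss(feat_gen, feat_gt):
    """MSE in a feature space (e.g. VGG-16 conv3-3 activations)."""
    return np.mean((feat_gen - feat_gt) ** 2)

def adversarial_loss(d_scores):
    """Generator adversarial term: -log D(G(z)), averaged over the batch."""
    return -np.mean(np.log(d_scores))

def generator_loss(gen, gt, feat_gen, feat_gt, d_scores,
                   a1=500.0, a2=500.0, a3=500.0):
    # Assumed split: a1 weights MSE, a2 perceptual, a3 adversarial.
    return (a1 * mse_loss(gen, gt)
            + a2 * perceptual_loss(feat_gen, feat_gt)
            + a3 * adversarial_loss(d_scores))

# A perfect generation with a near-fooled discriminator leaves only a
# small adversarial residual.
gen = gt = np.ones((4, 4))
feat = np.zeros(8)
loss_val = generator_loss(gen, gt, feat, feat, np.array([0.99]))
```

In practice the feature maps would come from a fixed VGG-16 and the discriminator scores from the trained critic; here they are placeholders.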

Datasets and Experimental Setup
With the constant incorporation of novel technologies such as 3D simulation into research, the limited perception of previous vision approaches has been overcome, and a new world of 3D interaction has arisen. Realistic 3D virtual dynamic displays and immersive experiences have gradually become more common. We perform 3D simulation with hazy and haze-free environments to display the targets captured in the offshore situation, as shown in Figure 5.
A total of 18,540 images with different angles, diverse target scales, and various lighting situations were collected, each with a size of 1920 × 1080, including 9648 hazy images and 8892 GT images. The training, test, and validation sets comprise 7000 pairs of hazy and haze-free images, 756 hazy images, and 1892 pairs of hazy and haze-free images, respectively. To improve the performance of the proposed LKEDN-MHP, data augmentation is adopted. First, each hazy image is randomly clipped into five patches with the same width-to-height ratio as the original. Then, horizontal flipping is applied to double the number of training samples and create new geometric textures in the training dataset. To verify the universality of the LKEDN-MHP, we add a public outdoor haze dataset (O-Haze) [45].
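The two augmentation steps can be sketched as follows; the patch scale, image sizes, and function names are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def random_crop_same_ratio(img, scale, rng):
    """Crop a patch whose width-to-height ratio matches the original image."""
    h, w = img.shape[:2]
    ch, cw = int(h * scale), int(w * scale)   # same ratio: both scaled alike
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return img[top:top + ch, left:left + cw]

def hflip(img):
    """Horizontal flip: doubles the training set with mirrored textures."""
    return img[:, ::-1]

rng = np.random.default_rng(0)
img = np.arange(6 * 8 * 3, dtype=float).reshape(6, 8, 3)  # toy H x W x C image
patch = random_crop_same_ratio(img, 0.5, rng)
flipped = hflip(img)
```

For paired training data, the same crop coordinates and flip decision would of course be applied to the hazy image and its GT together.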

Experimental Settings
The experiments were performed on a GeForce GTX TITAN X graphics card with a network input size of 512 × 512 × 3 and 1000 epochs. In the first 800 epochs, the learning rate is fixed at 10^(−4), and over the final 200 epochs, we linearly decay the learning rate to zero. The peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) were used to measure the dehazing performance of the LKEDN-MHP. The PSNR is chosen as our primary performance measure because it reflects the pixel quality of the restored image, which benefits other computer vision tasks such as object detection. The network is trained with a batch size of one sample, which empirically improves the validation results in image restoration tasks [46]. In the MHP, each stage has three architectural hyperparameters: the number of LK blocks N, the channel dimension C 3, and the kernel size K. The model with the baseline parameters is called LKEDN-MHP-B (B for Base), and the more comprehensive model is called LKEDN-MHP-L (L for Large). Table 1 shows the settings of the architectural hyperparameters for each stage in the experiments. The performance of the proposed LKEDN-MHP is compared with that of other state-of-the-art algorithms on the self-built dataset and the O-Haze dataset [45]. To evaluate the superiority of the proposed method, state-of-the-art algorithms based on CNNs and ViTs were compared qualitatively. We believe that the high performance of the LKEDN-MHP mainly stems from the large ERFs built via large kernels, as compared in Figure 7. In many classical computer vision tasks, the Dice score and Jaccard index are used to evaluate segmentation performance [47]. The results compared with the state-of-the-art methods are shown in Figure 8 and Table 2.
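The evaluation metrics named above are standard and can be computed directly. A compact NumPy sketch (SSIM is omitted here because its windowed formulation is considerably longer):

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the GT."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def jaccard(a, b):
    """Jaccard index (IoU) of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

def dice(a, b):
    """Dice coefficient; related to Jaccard by D = 2J / (1 + J)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

ref = np.zeros((4, 4))
noisy = ref + 0.1                              # uniform 0.1 error
m1 = np.array([1, 1, 0, 0], dtype=bool)
m2 = np.array([1, 0, 1, 0], dtype=bool)
```

A uniform per-pixel error of 0.1 on a [0, 1] image corresponds to a PSNR of 20 dB, which gives a feel for the 35-39 dB scores reported below.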
Figure 8 shows that our scheme achieves the best dehazing results on both the self-built dataset and the O-Haze dataset. As shown in Table 2, in the experiment performed on the self-built maritime hazy image dataset obtained from the digital-twin-based simulation environment, the PSNR of LKEDN-MHP-B reached 35.63 dB. The PSNR of LKEDN-MHP-L is further improved by 2.94 dB, and its SSIM reaches the optimal value of 0.9896, indicating the best performance in the maritime dehazing task. To thoroughly verify the reliability and universality of the scheme, the proposed approach was also tested on the public O-Haze dataset, where it again performed best, with the large model's PSNR reaching 39.16 dB.
Compared with the state-of-the-art algorithms, the PSNR and SSIM of our large model reached 39.16 dB and 0.9896 on the O-Haze dataset, respectively. In addition, the large model yielded PSNR and SSIM values of 35.63 dB and 0.9146 on the self-built dataset, respectively, which were better than those of the state-of-the-art algorithms. Compared with the public dataset, the self-built maritime hazy image dataset tested in this paper was more challenging. It provides rich data for object detection and other downstream tasks under extreme weather conditions, which is of great significance to the study of maritime situational reconnaissance, monitoring, and detection tasks. The experimental results show that the proposed LKEDN-MHP algorithm yields promising results on both the self-built and natural outdoor hazy datasets.

Ablation Study
We conducted several ablation studies on the self-built dataset to prove the effectiveness of our method and the rationality of the selected parameters.

• The number of LK blocks N. In Table 3, we compare performance with different numbers of LK blocks N; the remaining hyperparameters follow the baseline.
• The large conv kernel size K and 5 × 5 re-parameterization in the LK block. In Table 4, we compare performance with different large conv kernel sizes K, with and without 5 × 5 re-parameterization in the LK block; the remaining hyperparameters follow the baseline. We introduced a 5 × 5 kernel for re-parameterization in each LK block, making the very large kernel capable of capturing small-scale patterns and hence improving the performance of the model. Table 4 shows that directly increasing the kernel size from 13 to 31 reduces the accuracy, while re-parameterization addresses the issue.
• The channel number C 3. In Table 5, we compare performance with different numbers of channels C 3 in each stage; the remaining hyperparameters follow the baseline.
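The re-parameterization used in the LK blocks rests on the linearity of convolution: after training, a parallel small-kernel branch can be folded into the large kernel by zero-padding the small kernel to the large kernel's size and adding the two. A NumPy sketch (single channel, illustrative sizes):

```python
import numpy as np

def conv2d_same(x, k):
    """Minimal 'same'-padded 2-D cross-correlation (single channel)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def merge_kernels(big, small):
    """Re-parameterization: zero-pad the small kernel to the big kernel's
    size and add, folding the parallel branch into one kernel."""
    pad = (big.shape[0] - small.shape[0]) // 2
    return big + np.pad(small, pad)

rng = np.random.default_rng(1)
x = rng.standard_normal((20, 20))
k13 = rng.standard_normal((13, 13))
k5 = rng.standard_normal((5, 5))

two_branch = conv2d_same(x, k13) + conv2d_same(x, k5)
merged = conv2d_same(x, merge_kernels(k13, k5))
# The merged kernel reproduces the two-branch output (up to FP error).
```

This is why the small-kernel branch costs nothing at inference time: the two branches collapse into a single large-kernel convolution.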

Conclusions
This paper proposes a novel large kernel encoder-decoder network with multihead pyramids (LKEDN-MHP) for maritime image dehazing tasks. The proposed LKEDN-MHP scheme utilizes a transmission map, extracted by using a guidance map, as additional input to the network to achieve improved performance. The architecture is inspired by U-Net and includes several improvements, including a multihead spatial pyramid cell and Swish activation, to achieve optimal dehazing performance. To restore the actual hazy offshore situation, 3D simulation was used for ocean scene model construction, haze rendering, and offshore target animation. High-definition images with different angles, illumination conditions, locations, and other parameters were collected from our simulation of the natural scene. To improve performance, data augmentation methods such as random clipping and flipping were used during preprocessing. The real public outdoor haze dataset (O-Haze [49]) was added to verify model performance. Experiments on these datasets showed that the proposed LKEDN-MHP algorithm was superior to the state-of-the-art algorithms in terms of the PSNR and SSIM indicators.
Future Work. The database and the improved algorithm are two indispensable parts of the maritime image dehazing task. Our future work will focus on the following two aspects.

• Prospective studies based on digital twin technology will be of great significance in effectively promoting the transformation and reconstruction of research techniques for future war situations. We will focus on refining the rendering of targets under hazy conditions to generate a more realistic virtual world.

• Our work has shown that large kernels help to obtain a large ERF. We will focus on improving the architecture to take full advantage of the effective information in the maritime image dehazing task.

Figure 1. The architecture of the LKEDN-MHP. The guidance map is concatenated with the original input in the local path. The predictions of reflectance, shading, and semanteme are generated by the multihead pyramid. We use self-attention to mark notable information and then merge the local and global features. The generated haze-free image and the GT are the inputs of the discriminator.

Figure 2. The process of DW convolution. Take an input with 5 channels and a 3 × 3 kernel as an example: each channel is convoluted by its own filter. As shown in Figure 2, in contrast to the conventional convolution operation, a DW convolution kernel is responsible for one channel, i.e., each channel is convoluted by only one kernel. Large-kernel convolutions are generally considered computationally expensive because the kernel size quadratically increases the number of parameters and floating point operations (FLOPs); this drawback can be significantly reduced by applying DW convolutions [39].
• Block 2 uses the identity shortcut and DW large kernel convolution. After DW convolution, we use a 1 × 1 (pointwise, PW) convolution layer as a standard component to increase depth and provide more nonlinearities and information communication across channels. Beyond the large convolution layers, which provide a sufficient receptive field and aggregate spatial information, the model's representational capacity is closely related to depth. The number of channels C 2 in Block 2 is 128. Inspired by the feed-forward network (FFN) widely used in transformers and MLPs [40], we use a similar CNN-style block consisting of a shortcut, batch normalization (BN) [41], two 1 × 1 layers, and Swish activation functions. Compared to the classic FFN, which uses layer normalization [38] before the fully connected layers, BN has the advantage that it can be fused into the convolution for efficient inference.
• The multihead pyramid contains three heads: reflection prediction, shading prediction, and semantic prediction. The reflection and shading heads provide rich complementary shade and texture information for image dehazing tasks, enabling the network to generate high-quality haze-free images with natural colors and fine details. The architectures of Head 1 and Head 2 are shown in Figure 3, and the architecture of Head 3 is shown in Figure 4. According to Figures 3 and 4, Head 1 and Head 2 contain six stage blocks, and the encoder and decoder each contain three stage blocks. Each stage contains several LK blocks, and the DW convolution in each LK block uses a 5 × 5 kernel for re-parameterization. A ConvFFN block is placed after each LK block, while eight stage modules are included in Head 3, and elementwise fusion is performed with the outputs of Head 1 and Head 2.

Figure 3. The architecture of Head 1 and Head 2. Each stage contains several LK blocks, and the DW convolution in each LK block uses a 5 × 5 kernel for re-parameterization.

Figure 4. The architecture of Head 3. An elementwise fusion is performed with the outputs of Head 1 and Head 2.

Figure 5. Samples from the datasets. Left: hazy input. Right: ground truth. The datasets contain multiple targets and a variety of situations so as to be as realistic as possible.

Figure 6. The flow of information over different layers of the LKEDN-MHP.

Figure 7. The ERFs of EDN-GTM, DehazeFormer, and the LKEDN-MHP, respectively. A more widely distributed bright area indicates a larger ERF. We measure the effective receptive field of different layers as the absolute value of the gradient of the center location of the feature map with respect to the input. Results are averaged across all channels in each map for 32 randomly selected images.

Figure 8. The results obtained with the state-of-the-art algorithms on the self-built dataset and the O-Haze dataset. The top four rows show the dehazing results on the self-built dataset, and the bottom two rows show the dehazing results on the O-Haze public dataset.

Table 1. The settings of the architectural hyperparameters.

Table 2. Quantitative dehazing results on the self-built dataset and the O-Haze dataset.

Table 3. Ablation experiments for the number of LK blocks N on the self-built dataset for the maritime image dehazing task.

Table 4 .
Ablation experiments for the large conv kernel size K and 5 × 5 re-parameterization in LK blocks.

Table 5. Ablation experiments for the channel number C 3 in each stage on the self-built dataset for the maritime image dehazing task.