Article

Lightweight Multi-Scales Feature Diffusion for Image Inpainting Towards Underwater Fish Monitoring

1 School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
2 School of Humanities and Science, Stanford University, Stanford, CA 94305, USA
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(12), 2178; https://doi.org/10.3390/jmse12122178
Submission received: 30 October 2024 / Revised: 21 November 2024 / Accepted: 22 November 2024 / Published: 28 November 2024
(This article belongs to the Section Marine Aquaculture)

Abstract

As aquaculture is gradually upgraded into an information-driven and intelligent industry, images of underwater fish must routinely be collected. In practice, the quality of underwater images is often degraded by water turbidity and light refraction, so most fish images do not show the entire fish body. Image inpainting infers the occluded parts of a fish image from the known regions, which supports better identification and analysis of fish populations. Existing image inpainting methods produce unsatisfactory results on underwater fish images because only small datasets are available for training. Lightweight Multi-scales Feature Diffusion (LMF-Diffusion) is proposed to produce inpainting results closer to real images when training on small datasets. LMF-Diffusion is based on guided diffusion; it flexibly extracts features from images at different scales, effectively captures long-range dependencies among pixels, and is more lightweight, making it more suitable for practical deployment. Experimental results show that our architecture uses only 48.7% of the parameters of the guided diffusion model, enables the Repaint method to perform better in underwater fish image inpainting, and produces inpainting results on our dataset that are closer to real images than those of current popular image inpainting methods.

1. Introduction

Aquaculture is one of the important sources of the global marine and freshwater food supply. By farming various aquatic organisms, such as fish, shellfish, and crustaceans, it meets people’s demand for protein and nutrient-rich foods, ensuring food security and stability of supply. As aquaculture is gradually upgraded into an information-driven and intelligent industry, images of underwater fish must routinely be collected. Researchers use these images to observe underwater fish and to analyze the density and distribution of fish schools [1,2]. Considering factors such as cost, portability, and underwater corrosion protection, the resolution of most underwater cameras used in practical production is relatively low. The underwater environment is usually dimly lit, and the shooting process is easily affected by refraction and scattering in the water. In natural environments, substances such as sediment and algae cause turbidity, leading to dark images, color distortion, and excessive noise in underwater photography [3,4,5]. Therefore, the quality of underwater fish images in natural environments depends heavily on these conditions. When the lighting is suitable and the water is clear, the image quality is relatively high, making it easier to analyze fish school density and distribution. However, water quality can vary significantly with seasonal changes, precipitation, and surrounding human activities, leading to substantial differences in the quality of underwater images captured at different times. As shown in Figure 1, fish occlude each other when the school is dense, and especially when the water is turbid, the fish body in the underwater image cannot be fully displayed.
Image inpainting redraws unknown areas based on information from neighboring regions and the overall structure of the image to recover the missing information. When existing image inpainting methods are applied to underwater fish images, the results are not satisfactory. Existing image inpainting methods are typically trained on large-scale datasets of over 10,000 images [6,7,8,9], using random masks on images as input for model training [10,11,12,13,14,15]. Some of these algorithms use free-form mask generation [16], and some are trained on datasets with irregular masks [15]. However, because only a limited number of underwater fish images are available for training, these methods are not appropriate for our application. If the above methods were used for underwater fish image inpainting, the masks in the training set would need to cover part of the fish body, a requirement that random masks are unlikely to satisfy. We tried the recent methods Diverse Structure Inpainting (DSI) [17], Aggregated Contextual Transformations (AOT) [18], and Repaint [19] for underwater fish image inpainting. DSI and AOT are unable to inpaint the complete shape of the fish. Compared with these two methods, Repaint yields inpainting results closer to real images. Since it inpaints through a resampling process, no masks need to be supplied during pretraining, which simplifies model training. However, it still has shortcomings, such as unnatural transitions between fish bodies and the underwater background, leading to cases in which a fish has two heads in the inpainted image. Moreover, the models used by the Repaint method have a large number of parameters, which is impractical for real-world deployment. To solve these problems, we present Lightweight Multi-scales Feature Diffusion (LMF-Diffusion), an architecture based on the guided diffusion model and built upon an Attention-based UNet [20]. To enhance the model’s learning capability on small datasets, we want it to capture long-range dependencies among pixels and extract features from images at different scales. Therefore, the Multi-scale Cross-axis Attention (MCA) [21] module is introduced into the architecture. Depthwise separable convolutions [22] are also used in LMF-Diffusion, which reduce the number of parameters and make the model more suitable for real-world deployment. In summary, our main contributions can be concluded as follows:
  • LMF-Diffusion is proposed. Employing LMF-Diffusion models for image inpainting tasks has shown promising results in underwater fish image inpainting, producing results that closely resemble real images.
  • Within LMF-Diffusion, MCA is integrated to allow the model to effectively capture distant dependencies among pixels and extract features from images at various scales and orientations.
  • To reduce parameters, depthwise separable convolutions are used. Models trained using LMF-Diffusion possess only 48.7 % of the parameters compared to the guided diffusion model.
The remainder of this paper is organized as follows. Section 2 provides an overview of the relevant research. Section 3 introduces the collection of the image dataset, followed by the proposed LMF-Diffusion. Section 4 describes the experimental setup. Section 5 is the results part, where the results of different experiments are presented. Section 6 analyzes and discusses the experimental results. Finally, the last section provides a summary of the entire paper.

2. Related Work

2.1. Image Inpainting

Image inpainting is the process of filling in missing areas of an image based on neighboring information and the overall structure of the image, following certain rules to repaint the unknown areas and recover the missing information [23]. The current scope of image inpainting tasks is broad, including rectangular mask inpainting, irregular mask inpainting, object removal, watermark removal, old photo colorization, and so on [6,24].
Traditional image inpainting algorithms are mainly classified into structure-based algorithms [25,26,27] and texture-based algorithms [28,29,30,31]. Structure-based methods use geometric techniques to restore the missing parts of an image, assuming that the damaged and known parts of the image have similar content. They find the best-matching feature blocks in the known areas of the image and then replicate information at the pixel level to fill in the missing areas. Texture-based algorithms gradually propagate pixel information around the damaged pixels and synthesize new textures to fill in the missing parts; they are more suitable for restoring backgrounds with structured textures and for removing small objects from images. However, the inpainting quality depends on the size of the area to be mended. Traditional algorithms perform well on simple, regular texture structures or small-area completion tasks, but they give poor results on images with complex textures and larger missing areas.
Deep learning-based image inpainting methods, such as generative networks [32], VAEs, GANs, and the denoising diffusion probabilistic model (DDPM), can effectively learn and model the distribution of real data from training data. Training of VAE-based image inpainting methods [33] is usually stable, but the results tend to be blurry. GAN-based image inpainting methods [34,35,36,37,38,39] can inpaint higher-quality images. Increasing the width and depth of the network fits the data features better, but an excessively deep network can cause gradient explosion. Therefore, current image inpainting networks use thin and deep network structures to control the number of parameters and employ multi-scale features or skip-connection residual structures to address gradient vanishing. The recent Repaint method [19] uses a DDPM prior and repeated sampling, leveraging a pretrained unconditional DDPM as a generative prior to sample the unknown areas conditioned on the given image information and altering the reverse diffusion iterations to complete the image inpainting task.

2.2. Denoising Diffusion Probabilistic Models

First and foremost, Sohl-Dickstein et al. put forth the concept of the diffusion probabilistic model [41], which operates as a Markov chain known for its ability to convert a known distribution into a data distribution. Building on this foundation, Ho et al. introduced the DDPM in 2020 [40], establishing a connection between the diffusion probabilistic model and denoising score matching with Langevin dynamics. Building upon the DDPM, Dhariwal and Nichol increased the number of attention heads and used BigGAN residual blocks for both upsampling and downsampling the activations to propose the guided diffusion model [42]. Notably, guided DDPMs outperformed GANs in unconditional image synthesis tasks.
Diffusion models are deep generative models consisting of two processes: the forward process and the reverse process. Diffusion models have been widely applied to various artificial intelligence tasks, such as image generation [43,44,45], image super-resolution [46,47], image inpainting [19,48], and target recognition [49]. The forward process is a series of iterative steps during which small amounts of noise are added to each initial input image $x_0$. At each step, Gaussian noise of varying intensity is injected via $q(x_t \mid x_{t-1})$, progressively altering the input data until it converges to pure Gaussian noise $x_T$. The reverse process is the inverse of the forward process: it follows a similar iterative pattern, gradually removing noise through $p(x_{t-1} \mid x_t)$ to reconstruct the original image. This entire process is illustrated in Figure 2.
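To make the forward process concrete, the following sketch draws $x_t \sim q(x_t \mid x_0)$ in closed form under an assumed linear beta schedule; the timestep count and tensor shapes are illustrative rather than the exact configuration used in this work.

```python
import torch

T = 250                                           # illustrative number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)             # assumed linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product: alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Example: noise a batch of 256 x 256 RGB images to random timesteps.
x0 = torch.rand(4, 3, 256, 256)
t = torch.randint(0, T, (4,))
x_t = q_sample(x0, t)
```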

2.3. Attention Mechanisms

In the course of the development of deep learning, the attention mechanism has given rise to many variants. Self-attention shows good performance, but its use sharply increases model parameters and computational complexity. Ho et al. proposed axial attention [24], which applies self-attention separately along the vertical and horizontal directions of images, significantly reducing computational complexity. Vaswani et al. introduced the multi-head attention mechanism, which uses multiple attention heads in parallel, each learning and producing an output independently, and then combines these outputs to enhance the model’s learning ability. Additionally, further improved attention mechanisms have been proposed. For example, Zhu et al. designed Bi-Level Routing Attention [11], a dynamic, query-aware sparse attention mechanism that filters out most irrelevant key-value pairs at a coarse regional level, retaining fine-grained details and reducing computational load.
In the task of image inpainting, attention mechanisms have been widely utilized. To meet the requirements of edge consistency and semantic integrity in image inpainting, Yu et al. [14] employed contextual attention to optimize the inpainting of blurry results. Li et al. [50] proposed GAP-Net, which leverages self-attention mechanisms to guide feature mapping in the transformation process from latent feature space to image information, resulting in more realistic image inpainting. The guided diffusion mentioned earlier [42] adjusted the use of the multi-head self-attention mechanism based on the original DDPM, significantly enhancing the model’s learning ability.
Although DDPMs can achieve better sample quality than GANs in many image generation tasks, DDPM training has high computational complexity and a large number of parameters. Moreover, previous image inpainting approaches based on the guided diffusion model are difficult to adapt to fish image inpainting tasks; their limited content awareness means that the boundaries of the fish body in inpainted results do not transition naturally. The rich colors and textures of koi make image inpainting even more challenging. Unlike the existing image inpainting methods mentioned above, we propose LMF-Diffusion to address these issues and conduct ablation studies of our method.

3. Materials and Methods

3.1. Koi Dataset

To evaluate the performance of image inpainting methods on underwater images in practical applications, an underwater koi image dataset was constructed. This dataset was collected from an outdoor fish pond in Guangdong, China, where most of the fish have diverse colors and patterns. As shown in Figure 3, a regular underwater camera was submerged to capture video footage of the pond during the daytime. Multiple video recordings were made in the same water area during summer and autumn, with the total duration of the footage exceeding one hour. From this footage, frames displaying the complete shapes of fish were extracted, resized to 256 × 256 pixels, and compiled to form the dataset. The camera used is one of the most popular consumer models currently available online and is typically used to help anglers explore underwater conditions, which benefits the practical application and promotion of this research. Because the water where the data were collected is open and flowing, the number of fish present cannot be estimated. In the gathered images, the koi vary in size and are mixed throughout the water; the same fish occupies different proportions of the frame depending on its distance from the camera.
In building the dataset, 1500 images for the training set and around 100 images for the test set were selected. It was strictly required that the shape of the fish in each image included in the dataset must be fully visible, ensuring the representativeness and effectiveness of our dataset.
It is worth noting that our dataset encompasses a wide range of diversity, including various colors, patterns, angles, and poses of koi images. Such diversity will aid in evaluating the performance of image inpainting methods across different scenarios and provide ample challenges and opportunities for the generalization ability of the models. During the data collection process, efforts were made to ensure the richness and diversity of the data, aiming for the dataset to play a more representative and reliable role in the evaluation and application of underwater image inpainting methods.

3.2. LMF-Diffusion

3.2.1. LMF-Diffusion Model Architecture

Underwater lighting conditions are complex, such as light attenuation and color shifts, which pose challenges for image inpainting tasks. Unlike images of common objects on land, in underwater fish images, fish postures vary and fish heads at similar angles in different images may correspond to completely different fishtail postures, making it more challenging for models to capture image features. At the same time, limited by factors such as lighting, seasons, underwater camera equipment, and so on, capturing clear and complete underwater fish body images suitable for model training is extremely difficult. Specifically, out of around 70,000 underwater fish images captured, only 1600 were selected to compose the dataset. This poses higher demands on the learning capability of the model, requiring more accurate and reliable inpainting results based on a limited number of real images.
When using Repaint for underwater fish image inpainting, an unconditional DDPM must first be pretrained using the guided diffusion architecture [42]. Using the prior model obtained with the guided diffusion architecture results in relatively rigid fish body boundaries in the inpainted images. As shown in Figure 4, issues arise such as significant discrepancies between the inpainted koi patterns and real images, or fish body regions being incorrectly inpainted as fish heads. Clearly, this method has difficulty inpainting the natural form of the fish. Moreover, the model pretrained with the guided diffusion architecture has a large number of parameters and therefore occupies a significant amount of storage space, which is inconvenient for practical applications. LMF-Diffusion is proposed to address these issues.
The structure of LMF-Diffusion is shown in Figure 5. LMF-Diffusion is based on the UNet architecture. UNet is a fully convolutional network with an encoder–decoder structure; the encoder and decoder are symmetric, with the same number of decoder stages as encoder stages. The encoder performs feature extraction and compression by downsampling the input image, while the decoder restores the image resolution and combines the features extracted by the encoder with the corresponding decoder layers. To enhance the model’s feature extraction capability and address the challenge of training image inpainting models on small datasets, MCA is incorporated as the attention module in UNet. MCA helps the model effectively capture long-range dependencies among pixels and handle features at different scales and orientations. In the subsequent experimental section, it is demonstrated that with the MCA module, the model inpaints the details and edge information of fish shapes more accurately. To facilitate practical deployment, depthwise separable convolutions are used to reduce the model’s parameter size and computational cost.
Algorithm 1 displays the complete training procedure of LMF-Diffusion. Referring to the training method of DDPM, the loss function is derived as a simplification of the one presented by Ho et al. [40]. $\bar{\alpha}_t$ represents the total noise variance, $\beta_t$ is the variance of the normal distribution, and the noise generated at step $t$ is sampled from $\mathcal{N}(0, \beta_t)$. In the specific training process, $\beta_t$ is determined by the chosen beta schedule. For example, when the beta schedule is linear, the variance is sampled evenly from the minimum to the maximum value across the steps.
Algorithm 1 Training the LMF-Diffusion model
Input: input image $x_0$
Output: prediction $\epsilon_{pred}$
1: repeat
2:     $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
3:     $\epsilon \sim \mathcal{N}(0, I)$
4:     $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$
5:     $\epsilon_{pred} = \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right)$
6:     $Loss = \mathbb{E}_{t, x_0, \epsilon} \left\| \epsilon - \epsilon_{pred} \right\|^2$
7:     Take a gradient descent step on $\nabla_\theta \left\| \epsilon - \epsilon_{pred} \right\|^2$
8:     Update the weights of LMF-Diffusion
9: until converged
return $\epsilon_{pred}$
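As an illustration, the sketch below implements one gradient step of Algorithm 1 in PyTorch; the network interface model(x_t, t), the optimizer, and the data handling are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alphas_bar):
    """One epsilon-prediction training step (lines 2-8 of Algorithm 1)."""
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)    # t ~ Uniform({1,...,T})
    eps = torch.randn_like(x0)                                    # eps ~ N(0, I)
    a_bar = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)         # alpha_bar_t per sample
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps          # noised input
    eps_pred = model(x_t, t)                                      # assumed interface: model(x_t, t)
    loss = F.mse_loss(eps_pred, eps)                              # ||eps - eps_pred||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```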

3.2.2. Multi-Scale Cross-Axis Attention (MCA)

Underwater images suffer from blurriness at different scales and from color distortion, so the model needs strong modeling capability to learn the structure and texture of fish bodies flexibly and to inpaint lost details accurately. Moreover, with the limited amount of training data available for underwater fish image inpainting, capturing long-range dependencies among pixels is challenging. In the original guided diffusion model, multi-head attention was used as the attention module; when this model served as the pretraining prior for the Repaint method, the resulting images were unsatisfactory, with errors such as two fish heads on one fish and stiff transitions around the fish edges. Therefore, to enhance the inpainting capability for underwater fish images, a more flexible attention module is introduced.
MCA was originally proposed for image segmentation in the medical field, and the MCA-based MCA-Net achieved results superior to state-of-the-art methods on several typical medical image segmentation tasks. MCA not only captures multi-scale information effectively, but also improves long-distance interactions in segmentation tasks with relatively small datasets. The MCA module is therefore introduced into our architecture. MCA helps the model establish long-range dependencies among pixels, fully considering the overall structure and continuity of underwater fish images. By introducing multi-scale information and cross-axis correlations, the model can more accurately inpaint the details and edge information of fish bodies.
MCA operates along two directions. Given the feature map $F \in \mathbb{R}^{H \times W \times C_1}$, it uses three parallel 1D convolutions to encode $F$. The outputs of these convolutions are fused by summation and fed into a $1 \times 1$ convolution, formulated as follows:

$$F_x = \mathrm{Conv}_{1\times1}\left( \sum_{i=0}^{2} \mathrm{Conv1D}^{x}_{i}\big(\mathrm{Norm}(F)\big) \right)$$

$$F_y = \mathrm{Conv}_{1\times1}\left( \sum_{i=0}^{2} \mathrm{Conv1D}^{y}_{i}\big(\mathrm{Norm}(F)\big) \right)$$

where $\mathrm{Conv1D}^{x}_{i}(\cdot)$ and $\mathrm{Conv1D}^{y}_{i}(\cdot)$ denote one-dimensional convolutions along the corresponding axis, and $\mathrm{Norm}(\cdot)$ denotes layer normalization. $F_x$ and $F_y$ are the outputs for the two directions, respectively.

Cross-attention is then computed between $F_x$ and $F_y$ along each axis: using $F_x$ as the key and value matrices and $F_y$ as the query matrix yields $F_T$, and, symmetrically, using $F_y$ as the key and value matrices and $F_x$ as the query matrix yields $F_B$:

$$F_T = \mathrm{MHCA}_y(F_y, F_x, F_x)$$

$$F_B = \mathrm{MHCA}_x(F_x, F_y, F_y)$$

where $\mathrm{MHCA}_y$ and $\mathrm{MHCA}_x$ denote multi-head cross-attention along the y-axis and x-axis, respectively. The final output of MCA is:

$$F_o = \mathrm{Conv}_{1\times1}(F_T) + \mathrm{Conv}_{1\times1}(F_B) + F$$
In subsequent experiments, the same number of attention heads is maintained in the MCA as in the multi-head attention module of the guided diffusion architecture.
Compared with MCA, ordinary multi-head attention is more limited in handling multi-scale features and cross-scale information; for more complex long-range dependencies, it may fail to capture all the necessary contextual information.
In practical application, MCA not only improves the inpainting effect of underwater fish images, but also reduces the number of parameters. MCA integrates multi-scale features and uses attention mechanisms to extract global context information. After integrating MCA, LMF-Diffusion shows a significant performance improvement in fish image inpainting tasks, enabling more accurate inpainting of fish colors and textures.
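As a rough illustration of the formulation above, the sketch below implements a simplified cross-axis attention block in PyTorch; the kernel sizes of the three parallel 1D convolutions, the head count, and the normalization choice are assumptions and may differ from the published MCA.

```python
import torch
import torch.nn as nn

class MCASketch(nn.Module):
    """Simplified sketch of Multi-scale Cross-axis Attention (see the equations above)."""
    def __init__(self, channels, heads=4, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)  # layer-norm-like normalization over channels
        # Three parallel 1D convolutions along x and along y, realized as (1,k)/(k,1) 2D convs.
        self.convs_x = nn.ModuleList(
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2)) for k in kernel_sizes)
        self.convs_y = nn.ModuleList(
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0)) for k in kernel_sizes)
        self.proj_x = nn.Conv2d(channels, channels, 1)
        self.proj_y = nn.Conv2d(channels, channels, 1)
        self.attn_y = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.attn_x = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.out_t = nn.Conv2d(channels, channels, 1)
        self.out_b = nn.Conv2d(channels, channels, 1)

    def forward(self, f):
        b, c, h, w = f.shape
        fn = self.norm(f)
        fx = self.proj_x(sum(conv(fn) for conv in self.convs_x))  # F_x
        fy = self.proj_y(sum(conv(fn) for conv in self.convs_y))  # F_y
        # Row-wise cross-attention: F_T = MHCA_y(F_y, F_x, F_x)
        q = fy.permute(0, 2, 3, 1).reshape(b * h, w, c)
        kv = fx.permute(0, 2, 3, 1).reshape(b * h, w, c)
        ft, _ = self.attn_y(q, kv, kv)
        ft = ft.reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Column-wise cross-attention: F_B = MHCA_x(F_x, F_y, F_y)
        q = fx.permute(0, 3, 2, 1).reshape(b * w, h, c)
        kv = fy.permute(0, 3, 2, 1).reshape(b * w, h, c)
        fb, _ = self.attn_x(q, kv, kv)
        fb = fb.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return self.out_t(ft) + self.out_b(fb) + f                # F_o

# Example: a 64-channel feature map (channels must be divisible by the head count).
out = MCASketch(64)(torch.rand(1, 64, 32, 32))
```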

3.2.3. Depthwise Separable Convolution

When using deep learning methods for image inpainting tasks on underwater fish images, the model parameters pretrained through the guided diffusion architecture have a relatively large quantity, leading to high memory consumption, which poses certain difficulties in practical deployment and application. Constrained by conditions such as computational resources, the aim is to reduce the complexity of the model, decrease the number of parameters and computational costs, and improve the model’s operational speed while maintaining performance. Therefore, depthwise separable convolutions are introduced in LMF-Diffusion.
The depthwise separable convolution decomposes standard convolution into two processes, consisting of depthwise convolution and pointwise convolution. The depthwise convolution performs convolution operations on each input channel, resulting in an output feature map with the same number of channels as the input feature map. Since each channel only needs to be convolved once instead of being repeated over all output channels, this reduces a lot of redundant calculations. The number of output feature maps in depthwise convolution equals the number of input feature maps, which makes effective dimension expansion impossible. Therefore, pointwise convolution is introduced, where pointwise convolution is a normal 1 × 1 convolution that can change the number of channels and aggregate several feature maps obtained from depthwise convolution.
Referring to Figure 6, assume the standard convolution input is $D_F \times D_F \times M$ and the kernel size is $D_K \times D_K \times M \times N$, where $D_F$ and $D_K$ represent the spatial dimensions of the input and the kernel, $M$ is the number of input channels, and $N$ is the number of output channels. Depthwise separable convolution first performs a depthwise convolution, applying a $D_K \times D_K$ kernel to each channel of the input to generate the same number of channels, which are stacked along the channel dimension. Pointwise convolution then applies a $1 \times 1 \times M \times N$ convolution to the feature maps obtained in the previous step. Therefore, the parameter count of standard convolution is $D_K \cdot D_K \cdot M \cdot N$, while that of depthwise separable convolution is $D_K \cdot D_K \cdot M + M \cdot N$. The reduction in parameter count is as follows:

$$\frac{D_K \cdot D_K \cdot M + M \cdot N}{D_K \cdot D_K \cdot M \cdot N} = \frac{D_K \cdot D_K + N}{D_K \cdot D_K \cdot N} = \frac{1}{N} + \frac{1}{D_K^2}$$
Therefore, when $D_K$ and $N$ satisfy certain conditions, depthwise separable convolution can significantly reduce model computation compared with standard convolution, but it may also degrade performance. For this reason, not all convolutions in the network are replaced with depthwise separable convolutions; based on empirical results, only some of them are converted.
In the final architecture, depthwise separable convolutions are used in the upsample block, the downsample block, and the input layer of the residual block.
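For reference, the sketch below gives a minimal depthwise separable convolution in PyTorch together with a numeric check of the parameter-count ratio derived above; the kernel size and channel counts are illustrative.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter-count check against the formula above (ignoring biases):
# standard: D_K * D_K * M * N, separable: D_K * D_K * M + M * N.
M, N, K = 256, 256, 3
standard = K * K * M * N          # 589,824
separable = K * K * M + M * N     # 67,840
print(separable / standard)       # = 1/N + 1/K^2, approximately 0.115 for these values
```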

4. Experimental Setup

4.1. Experimental Platform and Parameters

All experiments in this study were conducted on the same software and hardware platform, shown in Table 1. The study was implemented on Ubuntu using Conda, which allows the environment to be migrated across operating systems. The Repaint method with a pretrained guided diffusion model is chosen as the baseline. For the Repaint method, both guided diffusion and LMF-Diffusion models are pretrained for 250,000 iterations using the AdamW optimizer. With $T = 250$ timesteps, Repaint is applied with $r = 10$ resamplings and a jump size of $j = 10$.
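As an illustration of how such a resampling schedule can be constructed, the sketch below generates a RePaint-style timestep sequence for T = 250, j = 10, and r = 10, following the jump-schedule idea of RePaint; the exact bookkeeping in the authors' implementation may differ.

```python
def repaint_schedule(T=250, jump_len=10, n_resample=10):
    """Build a RePaint-style schedule of reverse steps with periodic forward jumps."""
    jumps = {t: n_resample - 1 for t in range(0, T - jump_len, jump_len)}
    t, ts = T, []
    while t >= 1:
        t -= 1
        ts.append(t)                      # one reverse (denoising) step
        if jumps.get(t, 0) > 0:
            jumps[t] -= 1
            for _ in range(jump_len):     # jump forward again, then re-denoise the same span
                t += 1
                ts.append(t)
    return ts

schedule = repaint_schedule()
print(len(schedule), schedule[:25])       # the resampled schedule is much longer than T steps
```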

4.2. Evaluation Metrics

The number of parameters (Params) is used to evaluate the lightweighting effect of the architecture, while LPIPS and PSNR are used to assess the final inpainting quality. PSNR evaluates image quality by comparing the result with the original image, with higher values indicating better quality. LPIPS, proposed in [51], measures the perceptual similarity between a generated image and the ground truth using deep features; a smaller LPIPS value indicates a better inpainting result.
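A minimal sketch of how these two image-quality metrics can be computed is shown below, using a direct PSNR implementation and the publicly available lpips package; the choice of the AlexNet backbone for LPIPS is an assumption.

```python
import torch
import lpips  # pip install lpips

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

lpips_fn = lpips.LPIPS(net='alex')   # learned perceptual similarity metric

pred = torch.rand(1, 3, 256, 256)
target = torch.rand(1, 3, 256, 256)
print(psnr(pred, target))
print(float(lpips_fn(pred * 2 - 1, target * 2 - 1)))   # lpips expects inputs in [-1, 1]
```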

4.3. Mask Settings

The ablation study is evaluated under four different mask settings. To compare the effectiveness of methods in underwater image inpainting, fish masks are designed for the test set to assess whether methods can inpaint the entire fish image based on just half of the fish body. Additionally, thin, thick, and half masks are included in the test set to evaluate the performance of the model.

5. Results

In this section, the ablation study results of LMF-Diffusion on the dataset are reported, comparing them with other existing image inpainting methods and attempting to inpaint obscured fish images. It is noted that model pretraining was conducted on LMF-Diffusion, and the Repaint method was used for underwater fish image inpainting based on this model to evaluate its performance.

5.1. Hyperparameter Optimization

First, the optimal hyperparameter settings for the model were selected. The Repaint method was employed for underwater image inpainting under the fish mask setting, and the models were evaluated using the LPIPS and PSNR metrics.
Initially, comparative experiments were conducted with different learning rates. Under the same batch size, LMF-Diffusion was trained with varying learning rates, and the experimental results are presented in Table 2. The results indicate that the model achieved the best underwater fish image inpainting performance when the learning rate was set to $1 \times 10^{-5}$.
Furthermore, models were trained under the same learning rate but with different batch sizes. The experimental results, as shown in Table 3, indicate that the best performance was achieved with a batch size of 128.
Therefore, in the subsequent experiments, our architecture was configured with a batch size of 128 and a learning rate of $1 \times 10^{-5}$.

5.2. The Results of Ablation Studies

Influence of convolution. Depthwise separable convolutions were incorporated to replace a portion of the convolutions in the guided diffusion architecture. This adjustment leads to a notable reduction in the number of parameters while causing only a minor impact on inpainting performance. As illustrated in Table 4, after introducing depthwise separable convolutions, our architecture has only 48.7% of the parameters of the guided diffusion model.
Influence of attention block. MCA was chosen as the attention module in LMF-Diffusion. The MCA block improves the efficacy of image inpainting without inflating the number of parameters. Referring to Table 4, the architecture integrated with MCA shows a reduction in LPIPS and an increase in PSNR, indicating improved perceptual similarity between the inpainted images and the ground truth.
Analysis. According to the results in Table 4, in the fish mask setting, the architecture improved with MCA achieved better image inpainting results, with a clear improvement in image quality and no increase in the number of parameters. The parameter count decreased significantly with the depthwise separable convolutions, at the cost of a slight decrease in image quality, as indicated by the LPIPS and PSNR values. Among the other three mask settings, our architecture achieved better inpainting results under the thin and thick mask settings. In the half-mask setting, where half of the image was masked, MCA did not bring a significant improvement, but the results were comparable to the guided diffusion architecture. Considering all of these cases, LMF-Diffusion combines the two improvements, reducing the number of model parameters while enhancing the inpainting results.
In the fish mask setting, the results of our LMF-Diffusion and the original guided diffusion architecture for underwater fish image inpainting are shown in Figure 7. From a human perspective, the fish boundaries in the images inpainted by our LMF-Diffusion model are clearly smoother, and the pose and structure of the fish are more similar to the input image. With guided diffusion, there are cases where the inpainted image shows two heads on one fish, because the original guided diffusion architecture cannot accurately capture underwater fish features. Furthermore, the lines of the fish’s tail are not rendered naturally in the guided diffusion results, leading to an illogical enlargement of the tail. Our LMF-Diffusion addresses these issues. MCA better captures multi-scale features in the input data, reduces information redundancy, and enhances the model’s generalization capability. By relying on MCA to extract global context information, LMF-Diffusion produces no cases of one fish body corresponding to two fish heads in the inpainting results. As shown in the sixth column of Figure 7, LMF-Diffusion matches the color and patterns of the fish head and body better than guided diffusion, and as indicated in the first column of Figure 7, it also depicts the tail fin more realistically and naturally. Figure 8 shows the underwater image inpainting results of guided diffusion and LMF-Diffusion under the other mask settings. Both yield visually pleasing results with thin and thick masks; combined with the scores in Table 4, LMF-Diffusion demonstrates slightly better inpainting performance than guided diffusion. With half masks, however, guided diffusion exhibits issues such as mismatches between the fish head and body or half of the fish body disappearing, while LMF-Diffusion produces better results.

5.3. Comparison with Other Methods

In this section, a comparison is made between the proposed method and other image inpainting methods. The test set was covered with fish masks and inpainted using different methods.
The experimental results are presented in Figure 9 and Table 5. With the DSI method [17], the unknown areas in the resulting images are filled with colors similar to the known background; however, due to the small training set, the transition between unknown and known areas is abrupt, the shape of the mask remains clearly visible, and the fish bodies are difficult to distinguish. According to the data in Table 5, AOT [18] achieves the best LPIPS and PSNR values, but its visual results differ. The AOT method can inpaint the rough outline of the fish body with smoother color transitions, which causes the color of the repaired fish tail to dissolve into the water. As a result, its LPIPS and PSNR values are closer to the original image, but the transition with the water is too blurred to depict the shape of the fish tail, and the inpainting does not meet our expectations. The Repaint method produces inpainting results that more closely resemble real images, but it is not without drawbacks: one notable limitation is the unnatural blending between fish bodies and the underwater environment, resulting in anomalies such as a fish appearing to have two heads. Our model alleviates these problems. Not only do the inpainted images more closely resemble real images, but the matching between fish head and body is significantly improved, with no cases of one fish body corresponding to two fish heads.

5.4. Inpainting Obscured Fish Images

To simulate the practical application of image inpainting techniques in underwater fish images, instance segmentation technology was used instead of manually adding masks to demonstrate the feasibility of the method.
The YOLOv8 method was adopted for instance segmentation to inpaint the complete form of the target fish obscured by other fish. Through this method, masks specific to the occluded fish were successfully generated and then used along with the original images as inputs for our image inpainting method. The resulting images, as shown in Figure 10, present clear and complete underwater fish images. From the inpainting results, it is evident that the model reasonably infers the swimming posture of the occluded fish beneath the masks, while the transition of fish body color and patterns appears natural, and the inpainting portions seamlessly blend with the underwater background, resulting in a fairly good visual effect.
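A minimal sketch of this mask-generation step is shown below, assuming the ultralytics YOLOv8 segmentation API and a simple rule for selecting the occluding instance; the weights file, input image name, and selection criterion are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from ultralytics import YOLO

seg_model = YOLO("yolov8n-seg.pt")            # assumed segmentation checkpoint
results = seg_model("underwater_frame.jpg")   # run instance segmentation on one frame

r = results[0]
if r.masks is not None:
    masks = r.masks.data.cpu().numpy()        # (num_instances, H, W) binary masks
    # Example rule: take the largest detected fish as the occluding instance and
    # use its mask as the unknown region for Repaint with LMF-Diffusion as the prior.
    areas = masks.reshape(masks.shape[0], -1).sum(axis=1)
    occluder = masks[int(np.argmax(areas))]
    inpaint_mask = (occluder > 0.5).astype(np.uint8) * 255   # 255 marks the region to inpaint
```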

6. Discussion

Our experimental section demonstrated the effectiveness of LMF-Diffusion in underwater image inpainting tasks and its real-life applications, proving its practical feasibility. However, there are still limitations to LMF-Diffusion. As mentioned in the ablation studies section, when the input mask covers a large portion of the image, distortion persists in the inpainting result.
Furthermore, the proposed architecture only enhances underwater image inpainting results from pretrained models, without improving the inference process (i.e., the Repaint method). In the “Inpainting obscured fish images” part of the experiments, after generating masks based on instance segmentation, we use Repaint for image inpainting with LMF-Diffusion as a prior. This method requires time for mask generation and inference, making it unsuitable for video inpainting. This presents an area for future research.

7. Conclusions

In this paper, LMF-Diffusion is proposed for model pretraining in underwater fish image inpainting. In LMF-Diffusion, the MCA module helps the model establish long-range dependencies among pixels and extract features from images at different scales and orientations. Considering practical deployment, depthwise separable convolutions are utilized to reduce the model’s parameters and storage footprint. Experimental results show that the proposed LMF-Diffusion enables the Repaint method to perform better in underwater fish image inpainting. Compared with the original guided diffusion model, it not only reduces the parameter count to about 48.7% of the original, but also produces smoother and more realistic results. Additionally, underwater fish image inpainting results obtained with our LMF-Diffusion model outperform those produced by current popular image inpainting methods.

Author Contributions

Conceptualization, Z.W., X.J., C.C. and Y.L.; methodology, Z.W., X.J. and C.C.; software, X.J.; investigation, Z.W., X.J. and C.C.; data curation, Z.W. and X.J.; writing—original draft preparation, Z.W. and X.J.; writing—review and editing, C.C. and Y.L.; supervision, Z.W. and Y.L.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored in part by the Guangdong Provincial Marine Electronic Information Special Project (GDNRC[2024]19).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors would like to express their gratitude to the editors and the anonymous reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, G.; Chen, Q.; Yoshida, T.; Teravama, K.; Mizukami, Y.; Li, Q.; Kitazawa, D. Detection of Bluefin Tuna by Cascade Classifier and Deep Learning for Monitoring Fish Resources. In Global Oceans 2020: Singapore—U.S. Gulf Coast; IEEE: Piscataway, NJ, USA, 2020; pp. 1–4. [Google Scholar]
  2. Han, F.; Zhu, J.; Liu, B.; Zhang, B.; Xie, F. Fish Shoals Behavior Detection Based on Convolutional Neural Network and Spatiotemporal Information. IEEE Access 2020, 8, 126907–126926. [Google Scholar] [CrossRef]
  3. Jiang, Q.; Gu, Y.Q.; Li, C.; Cong, R.; Shao, F. Underwater Image Enhancement Quality Evaluation: Benchmark Dataset and Objective Metric. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5959–5974. [Google Scholar] [CrossRef]
  4. Zhuang, P.; Wu, J.; Porikli, F.M.; Li, C. Underwater Image Enhancement With Hyper-Laplacian Reflectance Priors. IEEE Trans. Image Process. 2022, 31, 5442–5455. [Google Scholar] [CrossRef] [PubMed]
  5. Zhou, J.; Zhang, D.; Zhang, W. Underwater image enhancement method via multi-feature prior fusion. Appl. Intell. 2022, 52, 16435–16457. [Google Scholar] [CrossRef]
  6. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image Inpainting for Irregular Holes Using Partial Convolutions. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  7. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 Million Image Database for Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  8. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar] [CrossRef]
  9. Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; Efros, A.A. What makes Paris look like Paris? Commun. ACM 2015, 58, 103–110. [Google Scholar] [CrossRef]
  10. Li, J.; Wang, N.; Zhang, L.; Du, B.; Tao, D. Recurrent Feature Reasoning for Image Inpainting. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 7757–7765. [Google Scholar] [CrossRef]
  11. Ma, Y.; Liu, X.; Bai, S.; Wang, L.; Liu, A.; Tao, D.; Hancock, E.R. Regionwise Generative Adversarial Image Inpainting for Large Missing Areas. IEEE Trans. Cybern. 2023, 53, 5226–5239. [Google Scholar] [CrossRef]
  12. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.Z.; Ebrahimi, M. EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  13. Yi, Z.; Tang, Q.; Azizi, S.; Jang, D.; Xu, Z. Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 7505–7514. [Google Scholar] [CrossRef]
  14. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative Image Inpainting with Contextual Attention. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514. [Google Scholar] [CrossRef]
  15. Zeng, Y.; Lin, Z.L.; Yang, J.; Zhang, J.; Shechtman, E.; Lu, H. High-Resolution Image Inpainting with Iterative Confidence Feedback and Guided Upsampling. arXiv 2020, arXiv:2005.11742. [Google Scholar]
  16. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T. Free-Form Image Inpainting With Gated Convolution. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4470–4479. [Google Scholar] [CrossRef]
  17. Peng, J.; Liu, D.; Xu, S.; Li, H. Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10770–10779. [Google Scholar] [CrossRef]
  18. Zeng, Y.; Fu, J.; Chao, H.; Guo, B. Aggregated Contextual Transformations for High-Resolution Image Inpainting. IEEE Trans. Vis. Comput. Graph. 2023, 29, 3266–3280. [Google Scholar] [CrossRef]
  19. Lugmayr, A.; Danelljan, M.; Romero, A.; Yu, F.; Timofte, R.; Gool, L.V. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11451–11461. [Google Scholar]
  20. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.J.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  21. Hu, K.; Zhang, E.; Xia, M.; Weng, L.; Lin, H. MCANet: A Multi-Branch Network for Cloud/Snow Segmentation in High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 1055. [Google Scholar] [CrossRef]
  22. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  23. Wang, W.; Yao, L.; Chen, L.; Cai, D.; He, X.; Liu, W. CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention. arXiv 2021, arXiv:2108.00154. [Google Scholar]
  24. Pathak, D.; Krähenbühl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context Encoders: Feature Learning by Inpainting. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar] [CrossRef]
  25. Shen, J.; Chan, T.F. Mathematical Models for Local Nontexture Inpaintings. SIAM J. Appl. Math. 2002, 62, 1019–1043. [Google Scholar] [CrossRef]
  26. Bai, X.; Yan, C.; Yang, H.; Bai, L.; Zhou, J.; Hancock, E.R. Adaptive hash retrieval with kernel based similarity. Pattern Recognit. J. Pattern Recognit. Soc. 2018, 75, 136–148. [Google Scholar] [CrossRef]
  27. Ding, D.; Ram, S.; Rodríguez, J.J. Image Inpainting Using Nonlocal Texture Matching and Nonlinear Filtering. IEEE Trans. Image Process. 2019, 28, 1705–1719. [Google Scholar] [CrossRef] [PubMed]
  28. Li, K.; Wei, Y.; Yang, Z.; Wei, W. Image inpainting algorithm based on TV model and evolutionary algorithm. Soft Comput. 2014, 20, 885–893. [Google Scholar] [CrossRef]
  29. Sridevi, G.; Kumar, S.S. Image Inpainting Based on Fractional-Order Nonlinear Diffusion for Image Reconstruction. Circuits Syst. Signal Process. Cssp 2019, 38, 3802–3817. [Google Scholar] [CrossRef]
  30. Jin, X.; Su, Y.; Zou, L.; Wang, Y.; Jing, P.; Wang, Z.J. Sparsity-Based Image Inpainting Detection via Canonical Correlation Analysis With Low-Rank Constraints. IEEE Access 2018, 6, 49967–49978. [Google Scholar] [CrossRef]
  31. Mo, J.; Zhou, Y. The research of image inpainting algorithm using self-adaptive group structure and sparse representation. Clust. Comput. 2019, 22, 7593–7601. [Google Scholar] [CrossRef]
  32. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  33. Zheng, C.; Cham, T.J.; Cai, J. Pluralistic Image Completion. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1438–1447. [Google Scholar] [CrossRef]
  34. Zhao, S.; Cui, J.; Sheng, Y.; Dong, Y.; Xu, Y. Large Scale Image Completion via Co-Modulated Generative Adversarial Networks. arXiv 2021, arXiv:2103.10428. [Google Scholar]
  35. Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; Cohen-Or, D. Encoding in Style: A StyleGAN Encoder for Image-to-Image Translation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2287–2296. [Google Scholar] [CrossRef]
  36. Jiang, H.; Shen, L.; Wang, H.; Yao, Y.; Zhao, G. Finger vein image inpainting using neighbor binary-wasserstein generative adversarial networks (NB-WGAN). Appl. Intell. Int. J. Artif. Intell. Neural Netw. Complex Probl.-Solving Technol. 2022, 52, 9996–10007. [Google Scholar] [CrossRef]
  37. Jiang, J.; Dong, X.; Li, T.; Zhang, F.; Qian, H.; Chen, G. Parallel adaptive guidance network for image inpainting. Appl. Intell. Int. J. Artif. Intell. Neural Netw. Complex Probl.-Solving Technol. 2023, 53, 1162–1179. [Google Scholar] [CrossRef]
  38. Chen, Y.; Zhang, H.; Liu, L.; Chen, X.; Zhang, Q.; Yang, K.; Xia, R.; Xie, J. Research on image Inpainting algorithm of improved GAN based on two-discriminations networks. Appl. Intell. Int. J. Artif. Intell. Neural Netw. Complex Probl.-Solving Technol. 2021, 51, 3460–3474. [Google Scholar] [CrossRef]
  39. Chen, G.; Zhang, G.; Yang, Z.; Liu, W. Multi-scale patch-GAN with edge detection for image inpainting. Appl. Intell. Int. J. Artif. Intell. Neural Netw. Complex Probl.-Solving Technol. 2023, 53, 3917–3932. [Google Scholar] [CrossRef]
  40. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. arXiv 2020, arXiv:2006.11239. [Google Scholar]
  41. Sohl-Dickstein, J.N.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv 2015, arXiv:1503.03585. [Google Scholar]
  42. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. arXiv 2021, arXiv:2105.05233. [Google Scholar]
  43. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
  44. Dockhorn, T.; Vahdat, A.; Kreis, K. Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. arXiv 2021, arXiv:2112.07068. [Google Scholar]
  45. Tian, J.; Wu, J.; Chen, H.; Ma, M. MapGen-Diff: An End-to-End Remote Sensing Image to Map Generator via Denoising Diffusion Bridge Model. Remote Sens. 2024, 16, 3716. [Google Scholar] [CrossRef]
  46. Kawar, B.; Elad, M.; Ermon, S.; Song, J. Denoising Diffusion Restoration Models. arXiv 2022, arXiv:2201.11793. [Google Scholar]
  47. Wang, S.; Li, X.; Zhu, X.; Li, J.; Guo, S. Spatial Downscaling of Sea Surface Temperature Using Diffusion Model. Remote Sens. 2024, 16, 3843. [Google Scholar] [CrossRef]
  48. Esser, P.; Rombach, R.; Blattmann, A.; Ommer, B. ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 3518–3532. [Google Scholar]
  49. Wang, J.; Sun, H.; Tang, T.; Sun, Y.; He, Q.; Lei, L.; Ji, K. Leveraging Visual Language Model and Generative Diffusion Model for Zero-Shot SAR Target Recognition. Remote Sens. 2024, 16, 2927. [Google Scholar] [CrossRef]
  50. Li, H.A.; Wang, G.; Gao, K.; Li, H. A Gated Convolution and Self-Attention-Based Pyramid Image Inpainting Network. J. Circuits Syst. Comput. 2022, 31, 2250208. [Google Scholar] [CrossRef]
  51. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar] [CrossRef]
Figure 1. Example of underwater fish image collection.
Figure 2. The forward process and the reverse process of DDPM.
Figure 3. Underwater fish video shooting schematic.
Figure 4. Results of underwater fish image inpainting using the guided diffusion model as a prior in the Repaint method.
Figure 5. An example of 256 × 256 images input through the proposed LMF-Diffusion.
Figure 6. Comparison of standard convolution and depthwise separable convolution.
Figure 7. Comparison of the guided diffusion architecture [42] and LMF-Diffusion in underwater image inpainting (fish mask setting).
Figure 8. Comparison of the guided diffusion architecture [42] and LMF-Diffusion in underwater image inpainting (other mask settings).
Figure 9. Results of inpainting underwater fish images: comparison against other inpainting methods (DSI [17], AOT [18], Repaint with guided diffusion [42]) under the fish mask setting.
Figure 10. Results of inpainting obscured fish images.
Table 1. Experimental hardware and software platform.

Name                Configuration
CPU                 Intel(R) Core(TM) i9-10850K @ 3.60 GHz (Intel Corporation, Santa Clara, CA, USA)
GPU                 Nvidia GeForce RTX 3090 (Nvidia Corporation, Santa Clara, CA, USA)
Operating system    Ubuntu 20.04.6 LTS
CUDA                10.1
CUDNN               8.0.5
Python              3.7.6
Table 2. Comparison of experimental results with different learning rate settings under the same batch size.

Learning rate    1 × 10⁻³    1 × 10⁻⁴    1 × 10⁻⁵    1 × 10⁻⁶
PSNR             14.962      24.505      28.812      26.641
LPIPS            0.425       0.165       0.116       0.148
Table 3. Comparison of experimental results with different batch size settings under the same learning rate.

Batch size    16        32        64        128       256       512
PSNR          24.861    26.966    25.977    28.812    25.782    27.661
LPIPS         0.150     0.135     0.142     0.116     0.159     0.132
Table 4. Ablation study results of our LMF-Diffusion on our dataset.

Model                     Params (M)    Fish               Thin               Thick              Half
                                        PSNR↑    LPIPS↓    PSNR↑    LPIPS↓    PSNR↑    LPIPS↓    PSNR↑    LPIPS↓
Guided diffusion model    932.08        28.461   0.121     36.137   0.070     33.879   0.087     24.775   0.192
+ DSC                     486.86        28.617   0.119     36.138   0.071     33.832   0.089     24.571   0.199
+ MCA                     899.74        29.342   0.112     36.109   0.069     33.984   0.084     24.573   0.194
Our architecture          454.53        28.812   0.116     36.154   0.068     33.965   0.084     24.761   0.192
Table 5. Comparison with other methods.

          DSI [17]    AOT [18]    Repaint (Guided Diffusion [42])    Repaint (LMF-Diffusion)
PSNR↑     21.497      30.084      28.461                             28.812
LPIPS↓    0.241       0.082       0.121                              0.116