Object Detection for Underwater Cultural Artifacts Based on Deep Aggregation Network with Deformation Convolution

Abstract: Cultural artifacts found underwater are located in complex environments with poor imaging conditions. In addition, the artifacts themselves present challenges for automated object detection owing to variations in their shape and texture caused by breakage, stacking, and burial. To solve these problems, this paper proposes an underwater cultural object detection algorithm based on the deformable deep aggregation network model for autonomous underwater vehicle (AUV) exploration. To fully extract the feature information of underwater objects in complex environments, this paper designs a multi-scale deep aggregation network with deformable convolutional layers. The approach also incorporates a bottleneck attention module (BAM) for feature optimization, which enhances the potential feature information of the object while weakening the background interference. Finally, the object prediction is achieved through feature fusion at different scales. The proposed algorithm has been extensively validated and analyzed on the collected underwater artifact datasets, and the precision, recall, and mAP of the algorithm have reached 93.1%, 91.4%, and 92.8%, respectively. In addition, our method has been practically deployed on an AUV. In field testing over a shipwreck site, the artifact detection frame rate reached up to 18 fps, which satisfies the real-time object detection requirement.


Introduction
During the long history of navigation, maritime losses have occurred from time to time, and a large number of shipwrecks and the objects and cargoes they contain have accumulated on the seabed [1]. Reasons for these losses are varied and include the limitations of navigation technology, the influence of extreme weather, human errors, and wars. These seabed artifacts contain rich historical, cultural, and technological information, which is of great help to the in-depth exploration of human civilization.
Detection of underwater artifacts is essential for underwater archaeological research and heritage management. Understanding the location, condition, and contents of underwater archaeological sites facilitates both research and effective heritage management. Underwater heritage management is particularly important given threats such as illegal salvage or looting and the increasing expansion of offshore and seabed industries. Large-scale cataloging of shipwreck archaeological sites has typically been done by manually identifying sites from seafloor mapping data generated from marine geophysical data such as MBES (Multibeam Echo Sounders) or SSS (Side-Scan Sonar) [2,3], although there have been some recent attempts to increase automation [4,5]. Given the increasing number of seafloor surveys implemented for scientific purposes, it can be argued that there is a need for increasing automation in underwater cultural artifact detection.
The increasing use of autonomous underwater vehicles (AUVs) for seafloor exploration offers great potential for realizing this goal. In seafloor exploration operations, AUVs have the advantages of a wide operating range, high detection efficiency, and flexible operation [6]. AUVs usually carry side-scan sonars and underwater cameras [7]. The side-scan sonar can be used for rapid searching of underwater sites over large areas. The underwater camera can obtain optical images of artifacts containing rich information such as object shape, color, and texture, which is suitable for close-range fine detection [8]. If the AUV can autonomously recognize objects in the captured video images during a detection operation, it can re-plan its navigation path according to the location of a discovered object and carry out further detections around the object of interest, which facilitates the subsequent analysis and judgment of the seafloor artifacts [9].
Over the past two decades, researchers have made significant strides in the application of machine vision technology for underwater site detection [10,11]. Jaklic et al. [12] modeled ancient Roman shipwreck cargo sites using 3D point cloud technology. The mapping of underwater sites has been achieved through three-dimensional image processing techniques by Menna et al. [13]. Character et al. [14] employed a deep-learning algorithm model for sonar-based shipwreck image detection. However, there is a noticeable scarcity of object detection algorithms designed for optical images of underwater artifacts in the current literature, particularly in conjunction with AUVs. Therefore, the study of vision-based object detection methods for underwater artifacts holds great significance.
Vision-based object detection algorithms can be categorized into two groups: traditional object detection and deep learning-based object detection [15]. In traditional object detection algorithms, the first step involves selecting a region of interest through a sliding window approach [16]. Subsequently, various feature extraction techniques, such as Scale-Invariant Feature Transform (SIFT) [17] and Histogram of Oriented Gradients (HOG) [18], are applied to extract features from the selected region. Finally, these extracted features are used for object recognition through trained classifiers like Support Vector Machine (SVM). Cutter et al. [19] employed Haar-like features and multiple cascaded classifiers to detect fish objects, while Rizzini et al. [20] identified underwater objects based on the uniformity of underwater image color and sharpness information from contours. Qiu et al. [21] proposed an algorithm based on surface feature ripples for detecting underwater moving objects in photopolarimetric imaging mode, which has become a notable example of traditional algorithms in underwater object detection. However, traditional detection methods require the design of various feature extraction models and rely on machine learning techniques for classification. This limits their applicability in real underwater scenarios. Moreover, manually designed feature extraction models primarily capture low- and mid-level image features, making it challenging to extract representative semantic information.
With the dramatic improvement of graphic computing hardware such as powerful GPUs and the rapid development of deep neural network models in recent years, object detection algorithms based on deep learning have achieved promising detection performance. Many researchers have applied these methods to underwater object detection scenarios. Chen et al. [22] introduced a novel sample-weighted super network (SWIPENET) to address the blurring problem in underwater images amidst significant noise interference. Lei et al. [23] incorporated the Swin Transformer into the backbone network of YOLOv5, enhancing feature extraction for underwater objects and enabling the network to detect objects in low-quality underwater images. Yan et al. [24] integrated the CBAM attention mechanism into a one-stage object detection model to enable the network to focus more on object feature information, thereby improving detection accuracy. However, the aforementioned methods still face challenges in fully utilizing the characteristics of objects in complex underwater environments. Their detection accuracy drops when underwater objects at different scales occlude or overlap one another, and they are prone to missed and false detections. Song et al. [25] proposed a two-stage underwater object detection algorithm with Boosting R-CNN, which enhances occluded object detection by modeling uncertainty and mining challenging samples. Zeng et al. [26] introduced a Faster R-CNN-AON network based on generative adversarial networks, effectively improving overall detection performance while preventing overfitting. Despite these advancements, these models involve a large number of parameters, which may prevent them from meeting the real-time requirements of AUVs.
Currently, object detection applied to underwater cultural artifacts still faces the following challenges:
(1) Poor quality of underwater images. Due to the differences in water absorption of light of different wavelengths and the scattering of underwater light, underwater images suffer from color deviation and low visibility [27]. In addition, the imaging quality of underwater images is low due to the insufficient underwater illumination conditions of AUVs and limited CMOS imaging levels [28].
(2) Object identification failures in complex underwater environments. Underwater artifacts have different morphologies and tend to accumulate, which makes them easily missed or incorrectly detected [29]. In addition, due to their age, the artifacts are often covered with sediments, encrusted by marine organisms, or broken. This makes it difficult to extract discriminative features during visual inspection and seriously interferes with artifact detection.
(3) Difficulty in acquiring samples of underwater artifacts. Unlike atmospheric optical images, it is difficult to obtain enough samples with relevant features in the preliminary research of object detection algorithms due to the influence of the complex underwater environment and the limitation of imaging equipment [30].
In order to solve the above problems, we propose an object detection algorithm specifically designed for archaeological artifacts located underwater based on a deformable deep aggregation network model. The main contributions are summarized as follows:
(1) We design a feature extraction network specifically for underwater artifact detection, which enhances the ability of the network to extract features from artifact objects in complex scenarios through a deep aggregation structure with deformable convolutional layers and additional multi-hop connections.
(2) We introduce the bottleneck attention module (BAM) to enhance the features of underwater artifacts and weaken redundant background information through feature optimization, which improves the model's anti-interference ability while limiting redundant parameters and computational complexity.
(3) We build a database of underwater archaeological artifacts. By collecting a large number of underwater object images, the underwater cultural artifacts (UCAs) dataset is established. The accuracy and real-time performance of the underwater object detection algorithm are verified on this dataset.
The rest of the paper is organized as follows: Section 2 briefly describes our underwater vision inspection system. Section 3 describes the materials and proposed methodology in detail. Section 4 presents the experimental details and system analysis. Section 5 summarizes the entire paper as well as future research directions.

AUV Visual Detection System
We have constructed an efficient underwater visual inspection system for AUVs, the main goal of which is to collect data related to underwater artifacts and verify the effectiveness of our proposed algorithms in real-world environments. As shown in Figure 1, in our preliminary work, we employed a robot equipped with an underwater camera to capture images at the shipwreck site and create a custom dataset for our algorithmic study.
In the subsequent phase, we integrated the algorithmic model proposed in this paper into an edge computing platform and deployed it on an autonomous underwater vehicle. During real-world testing, the system utilizes the images captured by the underwater camera, processes them using our detection algorithm for autonomous object recognition and analysis, and ultimately produces the detection results. Therefore, the focus of this study was to develop an artifact object detection algorithm for AUVs for shipwreck sites and to test the performance of the algorithm in real underwater environments.

Materials and Methods
Our underwater cultural artifacts object detection network (UCA-Net) combines a deformable convolution module and an attention mechanism to improve the performance of artifact object detection in complex underwater environments. As shown in Figure 2, UCA-Net consists of three parts: a feature extraction network, a feature optimization network, and a feature fusion network. First, the feature extraction network adopts a deep aggregation structure that incorporates deformable convolutional layers and multi-hop connections. The deformable convolutional layer enables the network to better adapt to the complex spatial features of the broken artifacts, and the multi-hop connection helps to capture the multi-scale semantic information of the artifact objects. Second, the feature optimization network enhances the key features of underwater artifacts by introducing the BAM attention mechanism, augmenting them in both the spatial and channel dimensions while attenuating invalid background information. Finally, the feature fusion network fuses features from different scales to further enhance the algorithm's representation of the object. With the above design, the UCA-Net algorithm proposed in this paper effectively improves the accuracy and robustness of underwater artifact object detection.



Feature Extraction Network
The process of object detection for underwater archaeological artifacts is made difficult by the diversity of object types, shapes, and textures. In traditional deep learning models, the convolution operation has a fixed structure which limits the network receptive field, and the network can only capture local information during feature extraction.
However, because underwater artifacts are often broken or buried, they present irregular features. In this case, the traditional convolutional operation makes it difficult to fully extract the features of underwater artifacts, leading to detection failure. To enhance the detection ability of convolutional neural networks for underwater artifacts, the long-range spatial relationships can be better captured by expanding the receptive field of the network and constructing an implicit spatial model [31]. In complex underwater environments, traditional standard convolution can only perform fixed-size sampling. In contrast, deformable convolution can better learn the features of an object by introducing a learnable offset in the convolution operation, which enables it to dynamically adjust the sampling position and better adapt to the shape of broken or buried objects [32]. As shown in Figure 3, the deformable convolution module adds a two-dimensional offset $\{\Delta p_n \mid n = 1, \ldots, N\}$, $N = |R|$, to each sampling point in the convolution kernel of the traditional standard convolution, mathematically defined in Equation (1).

$$Y(p_0) = \sum_{p_n \in R} w(p_n) \cdot X(p_0 + p_n + \Delta p_n) \tag{1}$$

where $X$ is the input feature map; $R$ is the sampling grid of the 3 × 3 convolution kernel; $p_n$ is the $n$th point in the convolution kernel; $w(p_n)$ is the weight corresponding to the point $p_n$; $p_0$ is the corresponding location on the input and output feature maps; $\Delta p_n$ is the two-dimensional offset of the deformable convolutional sampling point; and $Y$ is the output feature map.
The deep layer aggregation (DLA) network has been widely used as a compact and efficient feature extraction backbone in computer vision tasks such as object detection and semantic segmentation [33]. A DLA network merges the layered feature maps in an iterative manner, which achieves a more accurate representation of the object features while keeping the number of parameters low. To adapt to the diverse object sizes and shapes of archaeological artifacts found in underwater environments, we designed the DLA network structure so that it outputs feature maps at four different scales. On this basis, we introduced deformable convolution to replace the traditional convolution operation to enhance the feature extraction capability of the network for irregular objects. We named the proposed feature extraction network multi-scale deep layer aggregation with a deformable convolution network (MDLA-DCN). The network performs well in complex underwater environments and significantly enhances the extraction of features for underwater artifact objects with complex morphology.
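The deformable sampling rule of Equation (1), Y(p0) = Σ w(pn) · X(p0 + pn + Δpn), can be illustrated with a minimal pure-Python sketch for a single output location. This is not the authors' implementation: the offsets are fixed by hand here, whereas in a deformable convolution layer they are predicted by an additional convolution and learned during training. The helper names `bilinear` and `deform_conv_at`, the 4 × 4 toy feature map, and the averaging weights are all assumptions made for illustration.

```python
import math

# Sampling grid R of a 3x3 kernel: positions p_n relative to the centre p_0.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(x, r, c):
    """Sample feature map x at a fractional (row, col); zero outside the map."""
    h, w = len(x), len(x[0])
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value


def deform_conv_at(x, weights, offsets, p0):
    """Y(p0) = sum over n of w(p_n) * X(p0 + p_n + delta_p_n)."""
    total = 0.0
    for (dy, dx), w_n, (oy, ox) in zip(GRID, weights, offsets):
        total += w_n * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total


# Toy 4x4 feature map with X[r][c] = 4r + c (hypothetical data).
x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
weights = [1.0 / 9] * 9            # hypothetical averaging kernel
zero = [(0.0, 0.0)] * 9            # zero offsets: standard convolution
shifted = [(0.0, 0.5)] * 9         # hand-picked half-pixel shift to the right

y_standard = deform_conv_at(x, weights, zero, (1, 1))
y_shifted = deform_conv_at(x, weights, shifted, (1, 1))
```

With zero offsets the function reduces to an ordinary 3 × 3 convolution at that location. Non-zero offsets move each sampling point off the regular grid, which is why bilinear interpolation is needed to read the feature map at fractional positions.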
The MDLA-DCN network structure is shown in Figure 4, with four parallel sub-networks with different resolutions. Each sub-network consists of a series of deformable convolutional modules. Within a sub-network, the feature map resolution does not change with the depth of the network, while from one parallel sub-network to the next the feature map resolution is halved and the number of channels is doubled. Information exchange across the parallel sub-networks is implemented within the MDLA-DCN network via upsampling so that each sub-network repeatedly receives information from the other parallel sub-networks. Multi-hop connections in the network aggregate features of different resolutions to yield enhanced underwater artifact features, which are more accurate in terms of spatial and semantic information. In this paper, the 4-, 8-, 16-, and 32-fold downsampled feature maps generated by the parallel sub-networks are used as outputs in order to fully utilize the multi-scale feature information.
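As a rough illustration of the four output scales described above, the following sketch computes the (channels, height, width) of each parallel branch. The 512 × 512 input size and the base width of 64 channels are assumptions chosen for the example and are not stated in the text; only the 4-, 8-, 16-, and 32-fold downsampling and the channel doubling come from the description.

```python
def branch_shapes(height, width, base_channels=64, strides=(4, 8, 16, 32)):
    """Return (channels, height, width) for each parallel sub-network output.

    Resolution is halved from one branch to the next and the channel
    count doubles; base_channels is a hypothetical value.
    """
    shapes = []
    for i, stride in enumerate(strides):
        shapes.append((base_channels * 2 ** i, height // stride, width // stride))
    return shapes


# Hypothetical 512x512 input image.
shapes = branch_shapes(512, 512)
for channels, h, w in shapes:
    print(f"{channels:4d} channels at {h}x{w}")
```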


Feature Optimization Network
The feature extraction network generates feature maps at four different resolutions, which contain valid features of the object as well as a large number of invalid background features, and the four feature maps differ in their contributions to the final detection results. Therefore, to suppress the invalid features and enhance the object features, as well as to enable the network to autonomously learn the correlation and importance between feature maps of different resolutions, we introduced BAM attention for feature optimization. Different from the separate channel attention [34] and spatial attention [35], BAM attention enhances features in both the spatial and channel dimensions through different branches, the structure of which is shown in Figure 5.
The channel attention branch enables the network to focus on the channel features of interest by modeling the correlation between channels. First, the input feature $F \in \mathbb{R}^{C \times H \times W}$ undergoes global average pooling to encode the global information of each channel and generate a one-dimensional channel vector; then, the channel vector is processed by a multilayer perceptron (MLP) to estimate the inter-channel attention. Finally, the output feature scale is adjusted by a batch normalization (BN) layer to obtain the channel attention mapping $M_C(F) \in \mathbb{R}^{C}$. The specific description is shown in Equation (2).

$$M_C(F) = \mathrm{BN}(\mathrm{MLP}(\mathrm{AvgPool}(F))) \tag{2}$$

where AvgPool denotes global average pooling, MLP denotes the multilayer perceptron, and BN denotes the batch normalization.
The spatial attention branch captures the spatial location information of features and makes the network pay more attention to the location of the object. First, the channel dimension of the input $F \in \mathbb{R}^{C \times H \times W}$ is compressed by a 1 × 1 convolution; then, two 3 × 3 dilated convolutions are used to aggregate context information with a larger receptive field. Finally, a 1 × 1 convolution maps the feature map to $\mathbb{R}^{1 \times H \times W}$, and a batch normalization layer is used for scale adjustment to obtain the spatial attention mapping $M_S(F) \in \mathbb{R}^{H \times W}$. The specific description is shown in Equation (3).

$$M_S(F) = \mathrm{BN}\left(f_3^{1 \times 1}\left(f_2^{3 \times 3}\left(f_1^{3 \times 3}\left(f_0^{1 \times 1}(F)\right)\right)\right)\right) \tag{3}$$

where $f$ denotes a convolution operation, the superscripts denote the convolution kernel sizes, and the subscripts denote the order of the convolution operations.
The complete computation of the BAM refinement of the input feature $F \in \mathbb{R}^{C \times H \times W}$ is shown in Equation (4).

$$F' = F + F \otimes \sigma\left(M_C(F) + M_S(F)\right) \tag{4}$$

where $F'$ is the refined feature, $\otimes$ denotes element-wise multiplication, $\sigma$ is a sigmoid activation function, and $M_C(F)$ and $M_S(F)$ are the channel attention mapping and spatial attention mapping, respectively, which are resized to $\mathbb{R}^{C \times H \times W}$ before being added together.
Networks usually apply the attention mechanism serially, i.e., an attention module is added after most of the convolutional layers. Because of the particular structure of our feature extraction network, the BAM module is instead added only at the final output of each parallel sub-network. This enhances the output features of the sub-networks in the spatial and channel dimensions, filters the invalid background features, and strengthens the effective object features, improving the quality of the output features without a large increase in the number of network parameters.
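The final BAM refinement step can be sketched in a few lines, assuming the standard BAM formulation in which the two attention maps are broadcast to C × H × W, summed, passed through a sigmoid, and applied to the input with a residual connection, F' = F + F ⊗ σ(M_C(F) + M_S(F)). The sketch takes the two attention maps as given rather than computing them with an MLP and dilated convolutions. The function name `bam_refine` and the toy values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bam_refine(f, m_c, m_s):
    """Refine f (C x H x W) with channel map m_c (C) and spatial map m_s (H x W).

    Broadcasting m_c over space and m_s over channels corresponds to
    resizing both maps to C x H x W before they are added.
    """
    out = []
    for c, plane in enumerate(f):
        out.append([
            [v + v * sigmoid(m_c[c] + m_s[i][j]) for j, v in enumerate(row)]
            for i, row in enumerate(plane)
        ])
    return out


# Toy feature with C=2, H=2, W=2 (hypothetical values).
f = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.0], [0.5, 2.0]]]
zeros_s = [[0.0, 0.0], [0.0, 0.0]]

neutral = bam_refine(f, [0.0, 0.0], zeros_s)         # sigmoid(0) = 0.5
suppressed = bam_refine(f, [-20.0, -20.0], zeros_s)  # attention close to 0
```

Because of the residual term, strongly negative attention values leave the input almost unchanged rather than zeroing it, while positive values amplify it by up to a factor of two.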

Feature Fusion Network
After processing with the feature optimization network, feature maps at different scales were obtained, which were used to effectively represent the key features of underwater artifact objects. To realize multi-level feature extraction and fusion for underwater artifacts, we designed a fusion network for combining deep and shallow features.
The feature fusion process is shown in Figure 6. First, channel dimensionality reduction is performed on each source feature map using 3 × 3 convolution to keep the number of channels consistent while reducing the amount of computation within the network. Afterwards, the low-resolution features are up-sampled using a transposed convolution (deconvolution) layer so that their resolution matches that of the high-resolution feature maps. Commonly used up-sampling methods include transposed convolution [36] and bilinear interpolation. Since transposed convolution provides the network with learnable parameters and can improve its performance, we chose it for up-sampling. Finally, the four adjusted feature maps are fused by concatenation for the final prediction. Multi-scale feature fusion reduces the loss of small-scale object features and alleviates the under-utilization, in deep networks, of the spatial location information carried by shallow features, thus improving the robustness and reliability of the features of underwater artifacts at different scales.
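The shape bookkeeping of this fusion step can be sketched as follows. The sketch only tracks tensor shapes; it does not perform the convolutions. It assumes a transposed convolution with kernel size 4, stride 2, and padding 1, which doubles the spatial size according to the usual output-size formula (without dilation or output padding), and a common reduced width of 64 channels. These hyperparameters, the input shapes, and the function names are assumptions, not values given in the text.

```python
def transposed_conv_size(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def fuse_shapes(shapes, reduced_channels=64):
    """Reduce channels, up-sample every map to the finest scale, concatenate.

    shapes: list of (channels, height, width), finest resolution first.
    Returns the (channels, height, width) of the concatenated feature map.
    """
    target_h, target_w = shapes[0][1], shapes[0][2]
    fused_channels = 0
    for _, h, w in shapes:
        # Apply 2x transposed convolutions until the finest scale is reached.
        while h < target_h:
            h = transposed_conv_size(h)
            w = transposed_conv_size(w)
        if (h, w) != (target_h, target_w):
            raise ValueError("scales must differ by powers of two")
        fused_channels += reduced_channels  # concatenation along channels
    return (fused_channels, target_h, target_w)


# Hypothetical branch outputs at 4-, 8-, 16-, and 32-fold downsampling.
shapes = [(64, 128, 128), (128, 64, 64), (256, 32, 32), (512, 16, 16)]
fused = fuse_shapes(shapes)
```

Reducing every branch to the same channel count before concatenation keeps the fused map at four times that width, regardless of how wide the deepest branch is.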


Experiments
In order to verify the performance of the algorithm proposed in this paper, a database of underwater archaeological artifacts was built and used for training and testing the detection model. In addition, the algorithm was compared with other state-of-the-art detection algorithms to verify the detection performance in complex underwater environments.

Underwater Object Dataset
The images of underwater artifacts were captured from two different underwater archaeological sites. Both sites are located in the sea off Guangdong Province, China, where the water depth ranges from 23 to 30 m. One site is a Southern Song Dynasty (12th century A.D.) shipwreck and the other is a Ming Dynasty (16th century A.D.) shipwreck. Both sites have large cargoes of porcelain artifacts which are scattered over the seabed. All photographs in this dataset were taken with an underwater camera carried by an AUV. Given the complexity of the underwater environment, the dataset covers a wide range of scenarios, including low light, object stacking, object burial, and object breakage. The underwater cultural artifacts (UCAs) dataset was constructed after manual screening, de-duplication, and quality assessment. The dataset consists of 10,714 images and fully covers five types of objects, namely porcelain plates, bowls, jars, incense burners, and tiles, which are commonly found in Chinese maritime trade shipwrecks from the 11th to the 17th centuries A.D. We divided the UCA dataset into training, validation, and test sets according to the ratio of 6:2:2. Examples of representative images are shown in Figure 7.
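A 6:2:2 split of 10,714 images could be produced as in the sketch below. The text does not state how images were assigned to the subsets or the exact subset sizes, so the seeded random shuffle, the rounding rule (remainder goes to the test set), and the function name `split_dataset` are assumptions for illustration only.

```python
import random


def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items reproducibly and split them into train/val/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, absorbs rounding
    return train, val, test


# Integer ids stand in for the 10,714 image files.
image_ids = range(10714)
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

Under this rounding rule the subsets contain 6428, 2142, and 2144 images. A different rule would shift a few images between subsets.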

Experimental Setups

Experimental Environment and Training Parameters
The hardware environment of our experimental platform was a high-performance server configured as follows: an Intel Xeon processor (Intel, Santa Clara, CA, USA) with a clock frequency of 2.1 GHz; 64 GB of RAM; and four Nvidia Tesla V100 graphics cards (Nvidia, Beijing, China) with 32 GB of video memory. The software environment was Ubuntu 18.04, Python 3.7, and CUDA 11.0.
The training parameters were as follows: the optimizer used to update the convolutional kernel parameters was Adam; the optimizer momentum was 0.937; the learning rate was updated with a step schedule (STEP); the maximum learning rate was 0.001; the batch size was 16; the weight decay coefficient was 0.0005; and training ran for 300 epochs.
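The reported hyperparameters can be collected in a small configuration, together with a sketch of a step learning-rate schedule. The values in `TRAIN_CONFIG` come from the text above; the step size (100 epochs) and decay factor (0.1) in `step_lr` are hypothetical, since the paper states only that a STEP update mode was used.

```python
# Hyperparameters as reported in the text.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "momentum": 0.937,
    "lr_schedule": "step",
    "max_lr": 0.001,
    "batch_size": 16,
    "weight_decay": 0.0005,
    "epochs": 300,
}


def step_lr(epoch, max_lr, step_size=100, gamma=0.1):
    """Step decay: multiply the rate by gamma every step_size epochs.

    step_size and gamma are assumed values, not taken from the paper.
    """
    return max_lr * gamma ** (epoch // step_size)


for epoch in (0, 99, 100, 299):
    print(epoch, step_lr(epoch, TRAIN_CONFIG["max_lr"]))
```

The learning rate stays at its maximum for the first block of epochs and drops by a constant factor at each step boundary.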

Model Evaluation Metrics
We used four main metrics to test the performance of the model. Precision (P) denotes the proportion of samples predicted as positive by the model that are actually positive and is computed as in Equation (5). Recall (R) denotes the proportion of all actual positive samples that the model correctly predicts as positive and is computed as in Equation (6). F1 is the harmonic mean of precision and recall; it is used as a summary of the model's performance and is calculated as in Equation (7). Average precision (AP) is the area under the precision-recall curve obtained by varying the confidence threshold for each class; the larger the value, the better the recognition accuracy for that class. It is calculated as in Equation (8). The mean average precision (mAP) denotes the average AP over all classes; the larger the value, the better the accuracy of the model in recognizing objects. It is calculated as in Equation (9).
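$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$
$$AP = \int_{0}^{1} P(R)\, dR \tag{8}$$
$$mAP = \frac{1}{N} \sum_{n=1}^{N} AP_n \tag{9}$$
where TP denotes the number of positive samples correctly predicted by the model; FP denotes the number of samples predicted as positive by the model that are actually negative; and FN denotes the number of positive samples predicted by the model to be negative. N denotes the number of categories, and AP_n denotes the average precision of the n-th category.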
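These metrics can be sketched in a few lines of Python. The function names are illustrative, and the AP routine uses all-point interpolation of the precision-recall curve, which is one common way to approximate the area under the curve; the paper does not state which approximation the authors used.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Area under the PR curve with all-point interpolation.

    recalls must be sorted in ascending order, with matching precisions.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# F1 computed from the precision and recall reported in the paper.
print(round(f1_score(0.931, 0.914), 3))  # 0.922
```

The final line checks that the reported precision (93.1%) and recall (91.4%) are consistent with the reported F1 of 92.2%.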

Ablation Studies
To demonstrate each module's contribution to overall effectiveness, performance tests were conducted by successively adding or modifying the modules. Four common metrics, precision, recall, F1, and mAP, were used to quantitatively evaluate the performance of the algorithms. The initial test used the original DLA network, which was then replaced with the MDLA network and successively enhanced by adding, first, the DCN and, second, the BAM module. Results for each of the model variants are shown in Table 1.
Note: Bolded text shows the optimal results for each column. MDLA is a feature extraction network using standard convolution.
The results in Table 1 show that, compared with the original DLA network, the multi-scale deep aggregation network (MDLA) designed in this paper improves the mAP by 1.7% and the precision by 1.2%, which effectively enhances the ability to detect objects at different scales. Replacing ordinary convolution with deformable convolution (DCN) enhances the feature extraction ability of the MDLA network for irregular objects, and the mAP improves by a further 0.9%. Because DCN expands the receptive field of the detection network, the aggregation stages capture more comprehensive feature information about the object. With the introduction of the BAM attention module, F1 and mAP increase by 1.3% and 1.4%, respectively, because the attention module enhances the potential information of the object and attenuates the influence of redundant information, which further improves each metric and gives the algorithm higher detection accuracy. These experiments indicate that adding deformable convolution and the attention module is reasonable for underwater artifact detection in complex environments and effectively improves the adaptability and accuracy of the algorithm.

Comparison with Mainstream Methods
To verify the effectiveness of the proposed underwater object detection algorithm, we compared it with mainstream object detection algorithms, namely Faster-RCNN [37], SSD [38], YOLOv5-l [39], and YOLOv7 [40]. All comparisons were carried out using the aforementioned UCA dataset.
To ensure comparability, we used the published code of each comparison algorithm with its original parameter settings. All comparison algorithms were trained with the same procedure for a total of 300 epochs, and the models were analyzed qualitatively and quantitatively to evaluate the performance of each algorithm.
We qualitatively analyzed the performance of the algorithms through the detection results of the different models. The detection results of Faster-RCNN, SSD, YOLOv5-l, YOLOv7, and the proposed algorithm are shown in Figure 7. The figure shows that the SSD algorithm has the worst detection performance because its ability to represent shallow features is not strong enough, which results in more false detections and missed detections. The Faster-RCNN and YOLOv5 algorithms have comparable detection results, and the YOLOv7 algorithm performs better, but these methods still miss objects that are buried or stacked. Compared with the above methods, the proposed algorithm achieves better detection results, thanks to the optimization of the feature extraction network and the introduction of the BAM attention mechanism, which enable effective extraction of object feature information in complex environments and improve the algorithm's overall robustness.
To further verify the advantages of the proposed algorithm, the UCA test set was used for a quantitative comparison with the above algorithms. The performance of each algorithm is shown in Table 2.
Note: Bolded text shows the optimal results for each column.
Comparing the metrics of the different algorithms in the table, our algorithm outperforms the others in all metrics. Compared with SSD, which predicts directly from multi-scale feature maps, the mAP of our method is improved by 10%. The results of Faster-RCNN and YOLOv7 are close to each other, and the mAP of our method is higher than theirs by 3.4% and 2.9%, respectively. For underwater artifacts, our method therefore shows better detection performance. With the proposed network structure, the inherent features of underwater artifacts are retained in the deep layers of the network, which enhances the network's ability to represent artifact features in complex environments and thus improves detection performance.
To fully evaluate the effectiveness of the visual detection system in this paper, we embedded it into an AUV and tested its performance in a real underwater environment. The AUV and edge computing device used in the experiment are shown in Figure 8, and the main parameters of the AUV are shown in Table 3.

Parameters | Value
Maximum operating depth | 1000 m

High-power and high-load computing platforms are difficult to install in underwater vehicles due to space and power constraints. Considering these practical requirements, the Nvidia Jetson TX2 edge computing device was selected as the embedded computing platform for the AUV. The reasons are as follows: (1) the embedded platform is small (50 × 87 mm) and has low power consumption (7.5 W under regular load); (2) the CPU is an ARM Cortex-A57 and the GPU is an Nvidia Pascal GPU with 256 CUDA cores, which meets the requirements of the detection algorithm.

Performance Comparison Test
We integrated the visual detection algorithm into the Nvidia Jetson TX2 and deployed it on the AUV for performance evaluation on images collected at an underwater archaeological site. The field experiments were conducted at a Yuan Dynasty (13th century A.D.) shipwreck site located in the southeastern waters of Fujian Province, China, at a depth of 30 m. The shipwreck was chosen as the test object for this experiment because it contains a range of artifact types similar to those in the UCA dataset. The length and width of the shipwreck are 13.07 m and 3.7 m, respectively. We surveyed an area of 48 square meters, which covered the cargo hold portion of the shipwreck. The site contained a range of artifacts, including porcelain plates, bowls, and incense burners, which were the main objects of this test. The results are shown in Figure 9.
To evaluate the real-time performance of the proposed object detection algorithm, we selected classical lightweight detection algorithms for comparative analysis. Two additional performance metrics, frames per second (FPS) and the number of model parameters (Params), were introduced for quantitative evaluation. The system performance metrics are shown in Table 4. In real tests, the proposed algorithm achieved a better frame rate and parameter count than the SSD [38], YOLOv5-l [39], and YOLOv7 [40] algorithms. The reasons are as follows: (1) the proposed algorithm uses MDLA as the basic feature extraction network, which effectively fuses features of different levels by means of deep aggregation at different scales, thus improving the utilization efficiency of the features. MDLA guarantees detection accuracy while decreasing the number of parameters in the model. (2) The designed attention feature optimization module enhances the object feature information without increasing the number of model parameters. The proposed algorithm achieves a detection speed of 19 frames per second on images with a resolution of 640 × 640, which basically meets the requirements of real-time detection (see Figure 10). Because YOLOv5-s and YOLOv7-tiny reduce the depth of the network more aggressively, the frame rate of the proposed algorithm is slightly lower than theirs, but its mAP is higher, which compensates for the lower speed.
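The two deployment metrics can be illustrated with a short sketch. The paper does not describe its measurement procedure, so the warm-up count, the timing loop, and the `dummy_detector` stand-in are assumptions; a real measurement would call the deployed model on camera frames. The parameter count shown is the standard formula for a single 2D convolution layer.

```python
import time


def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in one 2D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def measure_fps(detect_fn, frames, warmup=2):
    """Average frames per second of detect_fn over the given frames."""
    for frame in frames[:warmup]:
        detect_fn(frame)  # warm-up runs are excluded from timing
    start = time.perf_counter()
    for frame in frames:
        detect_fn(frame)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return len(frames) / elapsed


def dummy_detector(frame):
    """Hypothetical stand-in for the detection model."""
    return sum(frame)


frames = [[0.5] * 10000 for _ in range(20)]  # placeholder "images"
print(conv2d_params(3, 64, 3))  # 1792
print(measure_fps(dummy_detector, frames) > 0)  # True
```

Summing `conv2d_params` over all layers gives a model's Params figure, and FPS is simply the number of processed frames divided by the elapsed wall-clock time.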

Discussion
Underwater artifacts are affected by the complex environment in which they are located, as well as by changes in their own shape and texture, and these problems hinder the effective detection of underwater cultural artifacts. We designed the components of the proposed algorithm to improve the detection of underwater artifacts under these conditions.
In this paper, we designed a deformable convolution-based multi-scale deep aggregation network for feature extraction of underwater cultural relics, which can identify and localize objects in complex environments by fusing semantic and spatial information. The deformable convolution expands the receptive field of the detection network to effectively extract the features of broken and irregular artifacts, and the multi-scale deep aggregation network reduces the loss of contextual information in the object features and better captures the global information of the artifacts. The BAM attention module is introduced for feature optimization; it suppresses redundant background information and makes the network focus on the object feature information. Finally, progressive feature fusion across different network layers is realized by the multi-scale feature fusion module.
The experimental results show that the proposed algorithm achieves better detection results than the compared methods. However, the seabed environment in which the underwater artifacts are located is complex, and the algorithm may fail to detect them in special cases, such as when marine organisms are attached to the object.

Conclusions
In this work, we propose an underwater cultural artifact detection algorithm based on the deformable deep aggregation network model for AUV exploration. To fully capture the feature information, we designed an MDLA-DCN feature extraction network in which deformable convolution is embedded to ensure efficient utilization of the feature information of underwater objects in complex scenes. Furthermore, we introduce the BAM attention module for feature optimization to enhance the potential features of the object while attenuating background interference. Finally, we obtain object predictions at different scales using multi-scale feature fusion. The algorithm is lightweight and suitable for deployment on edge computing devices. To verify the effectiveness of the proposed algorithm, we built a UCA dataset. The experimental results show that the algorithm achieves 93.1%, 91.4%, 92.2%, and 92.8% on the precision, recall, F1, and mAP metrics, respectively. The algorithm has also been deployed on the AUV, where it achieves a detection rate of 18 frames per second in real scene tests, which meets the real-time detection requirements.
The proposed algorithm has high detection accuracy and computational efficiency, which can meet the task requirements of detecting artifacts in underwater environments. The ideas behind the algorithm can also be applied to other underwater object detection tasks. Although the algorithm achieves good detection results, some shortcomings remain. In future research, we will focus on solving the problem of detection failure when marine organisms are attached to the object and on further improving the generalization ability of the model.

Figure 2. The framework of the underwater object detection network (UCA-Net).

where p is a point on the input and output feature maps; Δp_n is the two-dimensional offset of the deformable convolutional sampling point; and Y is the output feature map.

channel attention mapping and spatial attention mapping, respectively, which are resized to C × H × W before being added together.

Figure 6. Illustration of the feature fusion network.


Figure 8. Comparison of the detection results among various algorithms. Different colored squares represent different objects. Red squares represent bowls; green squares represent porcelain items; light blue squares represent plates; and deep blue squares represent high-foot bowls. (a) Low-light scene; (b) stacked scene; (c) burial scene.


Figure 10. Typical results of cultural artifact detection: (a,d) contain a large number of objects, and the visual detection system achieves a detection rate of 18 frames per second; (b,c) contain fewer objects, and the visual detection system achieves a detection rate of 19 frames per second.


Author Contributions:
Conceptualization, Y.Y. and W.L.; methodology, D.Z. and Y.Z.; software, D.Z.; validation, D.Z.; formal analysis, Y.Z.; investigation, G.X.; resources, G.X.; writing (original draft preparation), D.Z. and Y.Y.; writing (review and editing), D.Z. and Y.Z.; visualization, W.L.; supervision, W.L.; project administration, W.L. and G.X.; funding acquisition, Y.Z. and G.X. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant number 62273332; the Youth Innovation Promotion Association of the Chinese Academy of Sciences, grant numbers 2023386 and 2022201; the National Key Research and Development Program of China, grant number 2020YFC1521704; and the Guangdong Basic and Applied Basic Research Foundation, grant number 2023A1515011363.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Table 2. Performance comparison of different algorithms.

Table 3. Main parameters of the AUV.


Table 4. Comparison of inference performance between our method and state-of-the-art methods.
Note: Bolded text shows the optimal results for each column.
