Auxiliary Equipment Detection in Marine Engine Rooms Based on Deep Learning Model

Abstract: In the intelligent perception of the marine engine room, visual identification of auxiliary equipment is the prerequisite for defect recognition and anomaly detection. To improve the detection accuracy, this study presents an auxiliary equipment detector in the cabin based on a deep learning model. Owing to the compact layout of pipeline networks and the large disparity in the equipment scales, we first adopted RetinaNet as the basic framework and introduced the single channel plain architecture RepVGG as the feature extraction network to reduce model complexity and improve realtime detection. Secondly, the Neighbor Erasing and Transferring Mechanism (NETM) was applied in the feature pyramid to deal with more complicated scale variations. Then, the complete IoU (CIoU) regression loss function was used instead of smooth L1, and the DIoU Soft-NMS mechanism was proposed to alleviate misdetection in congested cabins. Further, comparison experiments and ablation experiments were performed on the auxiliary equipment in a marine engine room (AEMER) dataset to validate the efficacy of these strategies in boosting model performance. Specifically, our model can correctly detect 93.44% of coolers, 100.00% of diesel engines, 60.26% of meters, 95.30% of pumps, 55.01% of reservoirs, 97.68% of oil separators, and 74.37% of valves in a practical cabin.


Introduction
The intelligent monitoring and alarm system in a marine engine room can perform realtime monitoring of the various running statuses of the power system to guarantee the safety of the ship's operation. Whenever a failure happens, the system sends signals such as sound and light alarms and simultaneously backs up the relevant operating and system status data, so that the engineers can find the cause of the failure and repair it promptly. Even if no one is on duty, the engineer can respond to the signals through the extended alarm system, which relays them from the cabin monitoring system to all corners of the ship. However, these alarm systems overlook defect recognition and anomaly detection based on the appearance of the equipment, because their sensors cannot capture such information. For example, screws may loosen or pipelines may leak when no engineer is watching the engine room, and the consequences could be deadly.
If the liquid level of the bilge well is abnormally high, the bilge would overflow due to an unsupervised alarm; seawater pipelines could corrode and penetrate; the pipelines connected with submarine gates or other discharge overboard valves could break up; the bulkheads of a measuring tube could be missing. If any of the faults is not troubleshot immediately, it will cause electrical equipment to trip and will endanger the safety of the ship. If visual sensors are applied to identify the appearance information of the equipment in the engine room automatically, and the monitoring information is integrated into the centralized monitoring and alarm system, this can predict some faults or defects early to help engineers deal with the hidden dangers and minimize the odds of failure in advance.
However, visual perception technologies for a ship's intelligent engine room remain little studied. Exploiting computer vision, this paper proposes a detection model for auxiliary equipment in a marine engine room, which is intended to replace the engineer's eyes and recognize the equipment autonomously when the room is unattended. At the same time, it might provide a guide for subsequent visual inspection to find appearance defects in the cabin equipment. Although convolutional neural networks have achieved considerable progress in both detection robustness and accuracy, auxiliary equipment detection in a practical cabin still poses difficulties for deep frameworks, which can be summarized as follows:

• No public datasets are available for the marine engine room. Moreover, valves and meters account for a large proportion of the equipment in an engine room, while other equipment accounts for only a small proportion;
• The auxiliary equipment is multiscale, ranging in size from tiny valves to giant diesel engines;
• The engine room is congested and irregularly laid out; the pipelines and their associated equipment are densely distributed in the cabin, so a large amount of equipment is occluded or obscured.
Considering the challenges mentioned above, this study proposes a realtime detection model for auxiliary equipment in a marine engine room. To recapitulate briefly, the contributions of this paper can be shown below:

• In consideration of the currently unavailable public datasets, we filtered the original image resources and expanded the samples of the equipment in small proportion, relying on our 3D virtual engine room team. Moreover, we built the auxiliary equipment in a marine engine room (AEMER) dataset, whose equipment classes included diesel engine, oil separator, cooler, reservoir, pump, valve, and meter;
• To facilitate the deployment of the detector in the cabin monitoring and alarm system, we replaced the backbone in RetinaNet with RepVGG, which combines the plain architecture of VGG and the residual branch of ResNet. Furthermore, to reduce the misdetection of small-scale equipment in the cabin, we adopted the Neighbor Erasing and Transferring Mechanism (NETM) with FPN to filter out the redundant features of large-scale objects in the shallow feature pyramid layers and transfer them to the deeper layers;
• Because of the characteristics of the cabin layout, we applied the DIoU Soft-NMS to reduce the harmful impact of missed detections, which helps maintain precision and recall in the cabin. At the same time, we replaced the smooth L1 regression loss with the CIoU loss, which not only makes prediction boxes fit the targets better but also accelerates convergence and improves regression accuracy during training.
The remainder of our paper is organized as follows. Section 2 discusses the related works on computer vision. Section 3 introduces the proposed novel auxiliary equipment detection model in a marine engine room based on RetinaNet. Section 4 analyzes the ablation study and comparison experiments based on the AEMER dataset. Finally, we summarize the full text and identify the future work in Section 5.

Related Work
Visual identification is a prerequisite for the inspection task in a marine engine room. Whether the information can be detected comprehensively and accurately will greatly affect the reliability of subsequent equipment prediction and evaluation. With the development of deep learning, the convolutional neural network made major breakthroughs in accuracy and speed compared with traditional object detection methods [1][2][3]. In general, there are two main schools among the detection models: the two-stage algorithm represented by the R-CNN [4][5][6] and the one-stage algorithm represented by YOLO [7][8][9] and SSD [10]. Specifically, the two-stage algorithm firstly generates candidate regions on the image, then classifies and regresses them individually. Conversely, the one-stage algorithm directly locates and classifies all targets on the entire image, which bypasses the step of generating candidate regions. Both the two-stage and one-stage have their own advantages, generally speaking, the former is more accurate and the latter is faster. For the current object detection task, no matter which genre of algorithm is adopted, one must face the challenge of multiscale, that is, the size of the target to be detected differs greatly from the proportion of the entire images and between different images, even within the same image. In Figure 1a-c, we can see the scale of the diesel engine, pump, and valve are from large to small while Figure 1d contains multiscale targets. The challenges caused by the scale variations severely limit the overall performance of the existing detectors. Therefore, how to better achieve multiscale object detection has always been a central issue in scholarship.
J. Mar. Sci. Eng. 2021, 9, 1006
Figure 1. Examples of the multiscale objects in a marine engine room: (a) large object; (b) medium objects; (c) small object; (d) multiscale objects. In (a), the engine occupies almost the whole image. In (b), the pixels of every pump account for about 10% of the image. In (c), the pixels of the tiny valve are less than 1% of the whole image. In (d), there are some small valves and a medium-size valve.
Object detection includes the two subtasks of regression and classification. The root of the scale problem is that, as the convolutional neural network deepens, its ability to express abstract features strengthens. However, shallow spatial and semantic information is gradually lost during downsampling, which results in the inability of deep features to provide fine-grained spatial information, so it cannot accurately locate the target. Therefore, a generic strategy to solve the scale variations is to construct multiscale feature expression. At present, the commonly used methods for constructing multiscale features include: (1) The use of the feature pyramid network (FPN) [11] to sequentially perform object detection on different resolutions [12,13]. (2) In the neural network, the connection of feature maps of different depths to reconstruct a feature pyramid for object detection [14,15].
(3) The design of parallel branches in the internal neural network to build a spatial pyramid for object detection [16,17]. In addition to constructing multiscale feature expressions, some scholars have studied strategies to reduce the accuracy gap of different scales from a more detailed level in the algorithm process, such as bounding box regression loss function [18][19][20][21] and anchor mechanism [22,23]. Among many strategies, the typical single-shot detector is SSD [10], which combines both the main point of YOLO [7] and the anchor mechanism of Faster R-CNN [6] to ensure that the feature maps of different receptive fields can adapt to different scale targets. However, the representative ability of shallow features is much weaker than the deep, which leads to poor performance in detecting small objects. DSSD [24] proposed a complex feature pyramid network on the basis of SSD, promoting feature fusion between different levels and achieving better accuracy at the price of calculating efficiency. FSSD [25] inserted a fusion mechanism into the original SSD, making full use of local detailed features and global semantic features. ASSD [26] added an attention module [27] to each feature layer and achieved the accuracy of RetinaNet [28]. Considering the single-shot detectors are prone to scale-confusion during feature fusion, Li et al. [29] proposed the Neighbor Erasing and Transferring Mechanism (NETM) that erases salient large-scale features in the shallow feature maps by the Neighbor Erasing Module (NEM), and transfers them to the deep by the Neighbor Transferring Module (NTM) to ensure that small-scale features can be better perceived by the network. Their experimental results were significantly better than other single shot detectors. To this end, the RepVGG-RetinaNet equipped with NETM was used to detect auxiliary equipment of various scales in marine engine room.

Methodologies
In this section, we first give an overview of our proposed framework for auxiliary equipment detection in a marine engine room. Next, we present the simple but powerful plain convolutional architecture RepVGG, which is used as the feature extraction network in RetinaNet, and describe how to convert a trained block into the inference layer. Then, we discuss the Neighbor Erasing and Transferring Mechanism (NETM), which is adopted in the feature pyramid to deal with more complicated scale variations. Finally, the DIoU Soft-NMS postprocessing mechanism and the CIoU loss function are introduced in detail.

Overview of the Proposed RetinaNet
The basic framework in this paper is RetinaNet, which is mainly comprised of a backbone, feature pyramid network (FPN), and subnet. The backbone is designed to pick up low-level general features, such as shape and texture. FPN is a U-shaped network structure, the pyramid generated by feature fusion can effectively combine the semantic representations of different depths and dimensions, which can individually explore multiscale features in different layers. The subnet module includes classification and regression branches.
In the RetinaNet shown in Figure 2, the low-level features are first extracted through RepVGG [30], and feature maps of different resolutions are output through five stages, labeled C1, C2, C3, C4, and C5 according to the output sequence. The maps are then fused by the FPN to generate the pyramid layers P3, P4, and P5, which have the same resolutions as C3, C4, and C5. The pyramid layer P6 is obtained through a 3 × 3 convolution with stride 2 on P5, and P7 is obtained through a 3 × 3 convolution with stride 2 on P6. In the FPN, pyramid features are fused with a top-down strategy, which may introduce large-scale object features into shallow maps and negatively affect the detection of tiny objects. Therefore, the feature maps except for P7 are aggregated by NETM [29], and the salient large-object features in P3 and P4 erased by NEM are transferred to P5 and P6 via NTM. Finally, the five feature maps are sent to the subnet for classification and regression. Both the classification and the regression subnets adopt the FCN [31] structure. After a series of plain convolution operations, the former outputs the confidence score that each anchor contains a ground-truth object, and the latter outputs a set of location offsets from each anchor to the ground truth. In this way, both the object class and the precise coordinates are obtained in the subnet.
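The resolutions of the levels described above can be traced with a few lines of Python. This is only a sketch under stated assumptions: each of the five backbone stages is taken to halve the resolution, and every stride-2 3 × 3 convolution is assumed to use a padding of 1. The function names `conv_out` and `pyramid_shapes` are hypothetical and do not come from the paper.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after a convolution (floor rounding, padding assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


def pyramid_shapes(height, width):
    """Return {level: (h, w)} for the backbone stages C1..C5 and P3..P7."""
    shapes = {}
    h, w = height, width
    # Assumption: each of the five backbone stages halves the resolution.
    for i in range(1, 6):
        h, w = conv_out(h), conv_out(w)
        shapes[f"C{i}"] = (h, w)
    # P3..P5 have the same resolution as C3..C5.
    for i in (3, 4, 5):
        shapes[f"P{i}"] = shapes[f"C{i}"]
    # P6 and P7 come from 3x3 stride-2 convolutions on P5 and P6.
    for i in (6, 7):
        ph, pw = shapes[f"P{i - 1}"]
        shapes[f"P{i}"] = (conv_out(ph), conv_out(pw))
    return shapes


if __name__ == "__main__":
    for level, shape in pyramid_shapes(512, 512).items():
        print(level, shape)
```

With a 512 × 512 input, P3 is 64 × 64 and P7 is 4 × 4 under these assumptions, which illustrates why the small valves have to be resolved in the shallow levels.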

Backbone Feature Extraction Network
For current computer vision tasks, ResNet [32] and MobileNet [33][34][35] appear frequently. A large body of experimental analyses indicates that ResNet can extract robust feature representations, and users can flexibly choose ResNet-50 or ResNet-101 according to their requirements; MobileNet is suitable for embedded devices with low computing power, as it balances detection speed and accuracy. RepVGG improves on the classical VGG [36]; its main idea is to add the essence of ResNet to the VGG block, namely, the identity branch and the residual branch. The sketch of the RepVGG architecture is shown in Figure 3. Figure 3a represents the classical ResNet, which contains the residual structure of identity and 1 × 1 convolution; this structure alleviates the vanishing gradient problem in deep layers and makes the model easier to converge. Figure 3b represents RepVGG at training time, which is inspired by ResNet but differs in that the identity and 1 × 1 branches can be removed by structural reparameterization; this not only allows the deep network to obtain robust feature representations but also alleviates the vanishing gradient problem. Figure 3c represents RepVGG at inference time: the identity and 1 × 1 branches are transformed with simple algebra so that the block becomes a stack of 3 × 3 convolutions, which accelerates network deployment.

Figure 4 describes how to convert a trained RepVGG block into a single plain 3 × 3 conv layer for RepVGG inference. Firstly, the convolutional layer in the residual block is fused; then, the fused convolution layer is transformed into a 3 × 3 convolution; and finally, the 3 × 3 convolutions in the residual branches are merged, that is, the weights and offsets of all branches are added together.
We used $W^{(3)} \in \mathbb{R}^{C_2 \times C_1 \times 3 \times 3}$ and $W^{(1)} \in \mathbb{R}^{C_2 \times C_1}$ to denote the kernels of the 3 × 3 and 1 × 1 convolutional layers, respectively, and used $\mu$, $\sigma$, $\gamma$, $\beta$ as the accumulated mean, standard deviation, learned scaling factor, and bias of the BN layer in the identity branch or a convolutional branch, respectively. Let $M^{(1)} \in \mathbb{R}^{N \times C_1 \times H_1 \times W_1}$ and $M^{(2)} \in \mathbb{R}^{N \times C_2 \times H_2 \times W_2}$ be the input and output, respectively, and $*$ be the convolution operator. The output is then given by Equation (1).
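The conversion steps described for Figure 4 can be checked numerically with a small pure-Python sketch. It assumes the usual inference-time batch normalization $\gamma (x - \mu)/\sigma + \beta$, equal input and output channel counts (so that the identity branch exists), stride 1, and a single image stored as nested lists `[C][H][W]`. All function names are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
def conv2d(x, w, bias, pad):
    """Stride-1 convolution. x: [C1][H][W], w: [C2][C1][k][k], bias: [C2]."""
    c1, h, wd, k = len(x), len(x[0]), len(x[0][0]), len(w[0][0])

    def px(c, r, q):
        # Zero-padded read of the input.
        r, q = r - pad, q - pad
        return x[c][r][q] if 0 <= r < h and 0 <= q < wd else 0.0

    ho, wo = h + 2 * pad - k + 1, wd + 2 * pad - k + 1
    return [[[bias[o] + sum(w[o][c][i][j] * px(c, r + i, q + j)
                            for c in range(c1)
                            for i in range(k) for j in range(k))
              for q in range(wo)] for r in range(ho)]
            for o in range(len(w))]


def bn(x, mu, sigma, gamma, beta):
    """Inference-time batch norm with per-channel statistics (sigma = std)."""
    return [[[(v - mu[c]) / sigma[c] * gamma[c] + beta[c] for v in row]
             for row in ch] for c, ch in enumerate(x)]


def fuse_conv_bn(w, mu, sigma, gamma, beta):
    """Fold a BN layer into the preceding bias-free convolution."""
    kernel = [[[[v * gamma[o] / sigma[o] for v in row] for row in kc]
               for kc in w[o]] for o in range(len(w))]
    bias = [beta[o] - mu[o] * gamma[o] / sigma[o] for o in range(len(w))]
    return kernel, bias


def as_3x3(center):
    """3x3 kernel that is zero except for `center` in the middle."""
    return [[0.0, 0.0, 0.0], [0.0, center, 0.0], [0.0, 0.0, 0.0]]


def reparameterize(w3, bn3, w1, bn1, bn0):
    """Merge the 3x3, 1x1 and identity branches into one 3x3 kernel and bias."""
    n = len(w3)  # assumes C1 == C2 so that the identity branch exists
    w1_3 = [[as_3x3(w1[o][c]) for c in range(n)] for o in range(n)]
    wid = [[as_3x3(1.0 if o == c else 0.0) for c in range(n)] for o in range(n)]
    parts = [fuse_conv_bn(w3, *bn3), fuse_conv_bn(w1_3, *bn1),
             fuse_conv_bn(wid, *bn0)]
    kernel = [[[[sum(p[0][o][c][i][j] for p in parts) for j in range(3)]
                for i in range(3)] for c in range(n)] for o in range(n)]
    bias = [sum(p[1][o] for p in parts) for o in range(n)]
    return kernel, bias
```

Each `bn*` argument is a tuple `(mu, sigma, gamma, beta)` of per-channel lists. Applying `conv2d(x, kernel, bias, 1)` with the merged parameters should reproduce the sum of the three branch outputs up to floating-point error, which is the property that lets the multi-branch training block be deployed as a plain 3 × 3 layer.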

FPN with NETM
As shown in Figure 5a, the NETM contains the neighbor erasing module (NEM) and the neighbor transferring module (NTM). The former was proposed to erase the redundant salient features of large objects and highlight the small objects in shallow feature maps. The latter was designed to receive these erased features from NEM, and transfer them to enhance the deep features.
To ease the feature scale-confusion, the NEM was designed to take out the superfluous features. As shown in Figure 5b, the $s$-th and $(s+1)$-th layers are two adjacent pyramid layers, and $p_s \in \mathbb{R}^{c_s \times h_s \times w_s}$ has more semantic information about object $x_s$ than $p_{s+1} \in \mathbb{R}^{c_{s+1} \times h_{s+1} \times w_{s+1}}$. Based on the distribution of features, the refined feature $\bar{p}_s$ for object $s$ can be generated from the original $p_s$ by filtering out the features $p_{es}$ of objects in $[s+1, S]$, as in Equation (2), and the feature $p_{es}$ is extracted from $p_s$ by Equation (3).

As formulated in Equation (2), $p_{es}$ helps extract the refined information of large objects. In Figure 5c, we transferred it and obtained a specific pyramid layer $\bar{p}_{s+1}$ as in Equation (4), where $D(p_{es})$ represents a downsampling operation that matches the feature resolution of $p_{es}$ to that of $p_{s+1}$, and $c_{1 \times 1}$ represents a 1 × 1 convolutional layer with learnable weights $W_s^{s+1} \in \mathbb{R}^{1 \times 1 \times c_s \times c_{s+1}}$ that matches the corresponding channel number.
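The data flow of the two modules can be sketched in pure Python as follows. Several parts are assumptions made only for illustration, because this text does not specify them: the gate that selects large-object features is a placeholder (a sigmoid of the channel-mean of the upsampled deeper map), upsampling is nearest-neighbour, downsampling is 2 × 2 average pooling, and the transferred features are combined with $p_{s+1}$ by element-wise addition. Feature maps are nested lists `[C][H][W]`, and all function names are hypothetical.

```python
import math


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a [C][H][W] map."""
    return [[[v for v in row for _ in (0, 1)]
             for row in ch for _ in (0, 1)] for ch in x]


def downsample2x(x):
    """2x2 average pooling of a [C][H][W] map (H and W must be even)."""
    return [[[(ch[r][q] + ch[r][q + 1] + ch[r + 1][q] + ch[r + 1][q + 1]) / 4
              for q in range(0, len(ch[0]), 2)]
             for r in range(0, len(ch), 2)] for ch in x]


def neighbor_erase(p_s, p_next):
    """NEM sketch: split p_s into erased features p_es and the remainder."""
    up = upsample2x(p_next)
    h, w = len(p_s[0]), len(p_s[0][0])
    # Placeholder gate (assumption): sigmoid of the channel-mean of `up`.
    gate = [[1 / (1 + math.exp(-sum(ch[r][q] for ch in up) / len(up)))
             for q in range(w)] for r in range(h)]
    p_es = [[[ch[r][q] * gate[r][q] for q in range(w)] for r in range(h)]
            for ch in p_s]
    p_ref = [[[ch[r][q] - e[r][q] for q in range(w)] for r in range(h)]
             for ch, e in zip(p_s, p_es)]
    return p_ref, p_es


def neighbor_transfer(p_es, p_next, weight):
    """NTM sketch: downsample p_es, mix channels with a 1x1 conv, add to p_next.

    weight[o][c] maps the c_s channels of p_es to the c_{s+1} channels of p_next.
    """
    d = downsample2x(p_es)
    return [[[p_next[o][r][q]
              + sum(weight[o][c] * d[c][r][q] for c in range(len(d)))
              for q in range(len(p_next[0][0]))]
             for r in range(len(p_next[0]))]
            for o in range(len(p_next))]
```

In this sketch `p_ref` plays the role of the refined shallow layer and the return value of `neighbor_transfer` plays the role of the enhanced deeper layer; by construction, `p_ref` and `p_es` sum back to the original `p_s`.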

DIoU Soft-NMS
How to better achieve object detection in dense scenarios has always been a research hotspot. With our observation in the AEMER dataset, the equipment obstruction in marine engine room can be roughly divided into the following conditions: the same class of equipment obstruction, different classes of equipment obstruction, and nonequipment obstruction. If we take the first case as an example in traditional NMS, all of the bounding boxes must first be sorted by the confidence score in descending order, and then the highest score box is selected. In addition, the rest of boxes might be suppressed if there is an obvious overlap with the selected box. However, what if the suppressed boxes have better location information than the selected box? As shown in Figure 6, when the separator (S2) overlaps with separator (S1), the detectors are readily confused as a result of the similar physical features of the separators. Therefore, the bounding box2 (B2) that should regress to S2 may be misguided to S1 or suppressed by the bounding box1 (B1) that should regress to S1, resulting in inaccurate positioning.
Figure 6. An example of auxiliary equipment detection in a congested marine engine room. B1 and B2 are the prediction boxes for the S1 and S2 target separators. In traditional NMS, B2 may be misguided to S1 or suppressed by B1.
Considering that the model used in this paper might generate multiple prediction boxes, we adopted the Soft-NMS [37] postprocessing mechanism to ensure that each target was detected, and replaced the IoU metric with DIoU [20]. In other words, prediction boxes that would be mistakenly deleted by NMS because of excessive overlap are not abandoned but retained with lowered confidence scores. The pseudo code of DIoU Soft-NMS is shown in Figure 7.
To improve the recall and eliminate the redundancy in a marine engine room, the postprocessing mechanism takes into account both the distance and the overlap relationship between multiple boxes. Here, the IoU metric in NMS was replaced by the DIoU of Equation (5), where $\rho$ is the Euclidean distance between the center points of two boxes, $c$ is the diagonal length of the smallest enclosing rectangle covering the two boxes, and $b_{cand}$ and $b$ represent the candidate box and the highest-score box, respectively. In addition, the penalty parameter $\beta$ is generally set to 1: when it approaches zero, nearly all prediction boxes whose center points do not coincide with the center point of the highest-score box are preserved; when it approaches infinity, the DIoU degenerates to the IoU, that is, the effectiveness of DIoU Soft-NMS becomes comparable to that of greedy-NMS [38].
If the DIoU between a candidate box and the highest-score prediction box is greater than or equal to $\theta$, the confidence score is penalized by a Gaussian function rather than being harshly set to zero. The final confidence score function is shown in Equation (6).
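A minimal Python sketch of the postprocessing step is given below. It assumes that the DIoU takes the form $\mathrm{IoU} - (\rho^2 / c^2)^{\beta}$, which matches the limiting behavior of $\beta$ described above, and that the Gaussian penalty multiplies the score by $\exp(-\mathrm{DIoU}^2 / \sigma)$ as in Gaussian Soft-NMS. The parameter `sigma`, the score floor `score_min`, and the default value of `theta` are hypothetical choices, not values reported in this paper.

```python
import math


def iou_and_diou(a, b, beta=1.0):
    """Boxes are (x1, y1, x2, y2) with positive area. Returns (IoU, DIoU)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # Squared distance between the two box centers.
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4
    # Squared diagonal of the smallest rectangle enclosing both boxes.
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2
          + (max(a[3], b[3]) - min(a[1], b[1])) ** 2)
    return iou, iou - (rho2 / c2) ** beta


def diou_soft_nms(boxes, scores, theta=0.5, sigma=0.5, beta=1.0,
                  score_min=0.001):
    """Return (kept indices, decayed scores), highest score first."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_min:
            break  # every remaining box is below the floor as well
        keep.append(best)
        for i in remaining:
            _, diou = iou_and_diou(boxes[best], boxes[i], beta)
            if diou >= theta:
                # Gaussian decay instead of setting the score to zero.
                scores[i] *= math.exp(-diou ** 2 / sigma)
    return keep, [scores[i] for i in keep]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
    print(diou_soft_nms(boxes, [0.9, 0.8, 0.7]))
```

In the usage example, the second box overlaps the first heavily, so its score is decayed but the box is retained, while the distant third box keeps its original score.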

Loss Function
In the self-made dataset, we found that valves and meters made up a large majority of the auxiliary equipment in a marine engine room, compared with diesel engines or oil separators. To address this class imbalance, we first expanded the samples of the equipment in small proportion to minimize the negative impact of the dataset. Since the focal loss mitigates the weakness of the traditional cross-entropy loss on class-imbalanced and hard-to-classify samples, the original classification function of RetinaNet, the focal loss, was retained and defined by Equation (7):
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (7)$$
where $p_t$ is the predicted probability of the ground-truth class, $\gamma$ represents the focusing parameter, and $\alpha_t$ is the weighting factor. In our experiments, we set them as suggested in Ref. [28].
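The effect of the focusing parameter can be seen with a short Python sketch of the binary focal loss from RetinaNet, $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$. The function name `focal_loss` is hypothetical, and the default values for `alpha` and `gamma` are placeholders for illustration only.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, 0 < p < 1.
    y: ground-truth label, 1 (positive) or 0 (negative).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # A well-classified sample contributes far less than a hard one.
    print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

With `gamma = 0` the expression reduces to a weighted cross-entropy; larger values of `gamma` shrink the contribution of samples that are already classified confidently, so the numerous easy valves and meters dominate the loss less.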
Regarding the regression loss function of the bounding boxes, Girshick et al. [5] proposed the smooth L1 loss, which combines the characteristics of L1 and L2. Considering that these functions cannot directly reflect the similarity of boxes, Yu et al. [18] proposed the IoU loss function, which treats the rectangular box as a whole. However, if there is no overlap between boxes, the IoU will always be zero and the model cannot learn. Therefore, Rezatofighi et al. [19] proposed a Generalized IoU (GIoU) loss function, which added a penalty term on the basis of the IoU. Zhang et al. [20] argued that the GIoU loss will degenerate into an IoU loss if the prediction box is wholly surrounded by the ground truth box, which might cause the model to fail to distinguish the relative position relationship. Accordingly, they proposed Distance IoU (DIoU) and Complete IoU (CIoU) loss, the former adds a penalty term for the center distance between boxes on the basis of IoU loss, and the latter adds a penalty term for the similarity of the aspect ratio on the basis of DIoU. By comprehensive comparison, we used the CIoU loss defined by Equation (8) as the regression loss.
$$L_{CIoU} = 1 - \mathrm{DIoU} + \alpha v \qquad (8)$$
where $\alpha$ and $v$ denote a positive tradeoff parameter and the consistency of the aspect ratio, respectively, which are defined in Equation (9).
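Equation (8) can be evaluated with the following Python sketch. It assumes the definitions from the original CIoU formulation, namely $\mathrm{DIoU} = \mathrm{IoU} - \rho^2 / c^2$, $v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2$, and $\alpha = v / ((1 - \mathrm{IoU}) + v)$. The function name `ciou_loss` and the `(x1, y1, x2, y2)` box format are choices made for this illustration.

```python
import math


def ciou_loss(pred, gt):
    """CIoU loss for two boxes in (x1, y1, x2, y2) format with positive area."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (wp * hp + wg * hg - inter)
    # Squared center distance and squared diagonal of the enclosing box.
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4
    c2 = ((max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2
          + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2)
    diou = iou - rho2 / c2
    # Aspect-ratio consistency term and its tradeoff weight.
    v = 4 / math.pi ** 2 * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - diou + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes, same shape
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
```

Unlike the plain IoU loss, the value for the disjoint pair still depends on the center distance, so the regression receives a useful signal even when the boxes do not overlap.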

Experiments
The process of auxiliary equipment detection in marine engine room is presented in Figure 8. Raw images of auxiliary equipment were collected by the cabin acquisition device. Some of the images were processed by data augmentation, and then the auxiliary equipment in marine engine room (AEMER) dataset was built completely. We trained the RepVGG-RetinaNet detector on the AEMER, and the equipment was detected through the trained model eventually.

AEMER Dataset
With the resources of our 3D virtual engine room team, we quickly collected various images in the Very Large Container Ship (VLCS), Very Large Ore Carrier (VLOC), and Very Large Crude Carrier (VLCC) cabin scenes. Most of them were taken by our team with Canon digital cameras, and the others were captured by cabin monitoring. Because the angular variation of the image acquisition devices might cause inconsistent light intensity, we preprocessed the original images. Furthermore, to ease the negative influence of class imbalance, the data augmentations we used included adding Gaussian noise, mirroring, rotating, shifting, color translation, and cutout. Specifically, we combined several augmentations to enhance the images of auxiliary equipment that accounted for a small proportion. Then, we built the AEMER dataset with 7375 images, which contained Cooler, Engine, Meter, Pump, Reservoir, Separator, and Valve. Figure 9 displays some raw image samples in the AEMER. In this paper, we randomly selected 70% of the data in AEMER as the training set, 20% as the validation set, and 10% as the test set.
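The 70/20/10 random split can be reproduced with a few lines of Python. This sketch assumes that fractional counts are rounded down for the training and validation sets and that the remainder goes to the test set; the paper does not state the rounding rule or the random seed, so both are hypothetical, as is the function name `split_dataset`.

```python
import random


def split_dataset(items, train_pct=70, val_pct=20, seed=0):
    """Shuffle `items` and split them into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(7375))
    print(len(train), len(val), len(test))
```

For the 7375 images of AEMER, this rule yields 5162 training, 1475 validation, and 738 test images.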

Implementation
Our experiments were implemented according to the configuration in Table 1. For training, we selected RepVGG-B1g4 as the backbone of our proposed RetinaNet and trained it on the PASCAL VOC 07++12 trainval dataset (see Section 4.4.1). Next, we started from the trained weights of Section 4.4.1 to conduct comparison experiments on the AEMER dataset. During AEMER training, we set the number of training epochs and the initial learning rate to 100 and 1 × 10⁻⁴, respectively. If the total loss did not decrease noticeably for four consecutive epochs, the learning rate was reduced to 50% of its previous value. Adam was used to update the weights and accelerate model convergence. Furthermore, we treated a bounding box as a positive sample if its IoU was greater than 0.5, and as a negative sample if its IoU was less than 0.4. The weighting factor and focusing parameter of the focal loss were set to 2.5 and 0.25, respectively.
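Two of the training rules above lend themselves to a short sketch: the sample assignment by IoU threshold and the learning-rate decay on a loss plateau. The code below is a framework-free illustration, not the authors' training script. The names `label_anchor` and `PlateauHalver` are hypothetical, the `min_delta` value standing in for "noticeably" is an assumption, and treating boxes with an IoU between 0.4 and 0.5 as ignored follows the original RetinaNet convention rather than anything stated here.

```python
def label_anchor(max_iou, pos_thr=0.5, neg_thr=0.4):
    """Return 1 (positive), 0 (negative), or -1 (assumed ignored in training)."""
    if max_iou > pos_thr:
        return 1
    if max_iou < neg_thr:
        return 0
    return -1


class PlateauHalver:
    """Halve the learning rate after `patience` epochs without a clear loss drop."""

    def __init__(self, lr=1e-4, patience=4, factor=0.5, min_delta=1e-3):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.min_delta = min_delta  # assumed meaning of "noticeably"
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, total_loss):
        """Update the schedule with this epoch's loss and return the new rate."""
        if total_loss < self.best - self.min_delta:
            self.best = total_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr
```

In a deep learning framework, the same behavior is usually obtained from a built-in plateau scheduler wrapped around the Adam optimizer; the explicit class here only exposes the logic.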

Evaluation Criteria
In the object detection task, the image information generally consists of background and foreground (targets). When a foreground target was correctly detected by the detector, we denoted the prediction box as a true positive (TP). When the background was misdetected as foreground, or a foreground target was misdetected as another foreground class, we denoted the prediction box as a false positive (FP). When a foreground target was missed, that is, misdetected as background, we denoted it as a false negative (FN). Otherwise, we denoted the result as a true negative (TN). On the basis of these four situations, precision defined by Equation (10) and recall defined by Equation (11) were introduced to evaluate the detection accuracy. Every class generates a P-R curve from its precision and recall, and the area enclosed by the curve and the coordinate axes over the range (0, 1) is the average precision (AP). The mean average precision (mAP), the mean of the AP over all classes, has been widely used in target detection and evaluation. In addition to detection accuracy, another important evaluation metric for detection is speed, since only a high speed can achieve realtime detection. Generally, FPS, the number of images that can be processed per second, is used to evaluate the speed of object detection. In this paper, the evaluation criteria we used were precision, recall, AP, mAP, and FPS.
The comparison results are listed in Table 2. With a comparable detection speed, our proposed RepVGG-RetinaNet achieved an appreciable improvement in detection accuracy over the other detectors. Specifically, the mAP of our detector on the PASCAL VOC 2007 test set was 3.3%, 0.2%, 6.0%, 2.5%, 2.0%, 1.1%, 1.2%, 0.9%, 0.2%, 0.2%, and 0.4% better than Faster R-CNN [6], R-FCN [39], YOLOv2 [8], SSD [10], DSOD [40], DSSD [23], RSSD [41], FSSD [25], ASSD [26], RefineDet [42], and RetinaNet, respectively.
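The precision, recall, AP, and mAP criteria described in this section can be sketched as follows. The sketch uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), and computes AP as the area under the P-R curve with all-point interpolation. The interpolation scheme is an assumption, since the text does not state which variant was used, and the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_hits, num_gt):
    """Area under the P-R curve for one class.

    scored_hits: list of (confidence, is_true_positive), one per prediction box.
    num_gt: number of ground-truth boxes of this class.
    """
    ranked = sorted(scored_hits, key=lambda s: s[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing from right to left (interpolated envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

Because mAP averages over classes, a detector can reach a high mAP while individual classes keep a low AP, which is worth keeping in mind when reading per-class results.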

Ablation Study on AEMER
To validate the contribution of our strategies, we performed an ablation study on the AEMER dataset to explore the effects of the backbone, NETM, DIoU Soft-NMS, and CIoU loss on detection accuracy and speed. In this experiment, we reconstructed the feature pyramid network with NETM and replaced the bounding box regression loss function of the original RetinaNet. At the same time, we applied the DIoU Soft-NMS postprocessing mechanism during training. The comparison results are shown in Table 3, where the largest difference between M1-M4 and M5-M8 lay in the backbone. In terms of mAP and Time, respectively, M5 was 0.47% and 16% better than M1, M6 was 1.12% and 34% better than M2, M7 was 0.91% and 16% better than M3, and M8 was 2.07% and 34% better than M4. With the support of NETM, DIoU Soft-NMS, and CIoU, each strategy yielded a notable mAP improvement together with a faster inference time.
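The exact formulation of the DIoU Soft-NMS mechanism is not given in this section. As a rough illustration only, the sketch below shows one plausible combination: Gaussian Soft-NMS in which the overlap between boxes is measured by DIoU instead of IoU. The function names, the Gaussian decay, `sigma=0.5`, and the score threshold of 0.25 are all assumptions and may differ from the authors' method.

```python
import math


def diou(a, b):
    """DIoU of two (x1, y1, x2, y2) boxes: IoU minus normalized center distance."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union if union > 0 else 0.0
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - (rho2 / c2 if c2 > 0 else 0.0)


def diou_soft_nms(boxes, scores, sigma=0.5, score_thr=0.25):
    """Keep boxes greedily, decaying neighbor scores by a Gaussian of DIoU."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        if best_score < score_thr:
            break  # every remaining candidate scores even lower
        kept.append((best_box, best_score))
        decayed = []
        for box, score in candidates:
            # Negative DIoU (distant boxes) is clamped so the score is unchanged.
            overlap = max(0.0, diou(best_box, box))
            decayed.append((box, score * math.exp(-(overlap ** 2) / sigma)))
        candidates = decayed
    return kept
```

Compared with hard NMS, a neighboring box is down-weighted rather than deleted, and two boxes with the same IoU but more distant centers are penalized less, which is the property that matters for densely packed equipment.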

Visualization
We randomly selected several images from the AEMER dataset to visually compare the detection results of the original RetinaNet (upper) and the RepVGG-RetinaNet (lower) in Figure 10. We set the IoU threshold and confidence threshold to 0.5 and 0.25, respectively. There were nine whole valves in (a); both RetinaNet and RepVGG-RetinaNet detected all of them, but the latter had higher confidence scores. In (b), RetinaNet mistakenly identified one meter as a valve and completely missed the four small valves in the lower left corner. By comparison, RepVGG-RetinaNet detected one more valve than RetinaNet. There were some multi-scale objects in (c), including two meters, one reservoir, and one valve. Both RetinaNet and RepVGG-RetinaNet detected the large reservoir, but the latter correctly detected one more tiny valve.
Furthermore, in the congested scenarios (d-f), RepVGG-RetinaNet had higher confidence scores and smaller position deviations in the practical cabin. From the perspective of detection error, our detector alleviated the problems of missed detection and false detection, but the accuracy on valves and meters was not perfect. In summary, the overall performance of our RepVGG-RetinaNet was superior to that of the original RetinaNet, and it showed a robust ability to adaptively filter feature information.

Conclusions and Discussion
Considering the key technology of intelligent perception in a marine engine room, we built the AEMER dataset and proposed a RepVGG-RetinaNet detector for auxiliary equipment in a congested cabin. According to the analysis of the experimental data, the following main conclusions can be reached: (1) Compared with ResNet, the RepVGG backbone in RetinaNet has better detection performance in practical cabin scenes. (2) The FPN with the feature scale unmixing NETM gives the detector an adaptive filtering function, which enhances the expression ability of small-scale features and alleviates the misdetection and false positive problems. (3) By improving RetinaNet with the CIoU regression loss and the DIoU Soft-NMS postprocessing mechanism, we have further advanced the detection accuracy of auxiliary equipment in the cabin. (4) The proposed RepVGG-RetinaNet has comparable detection speed and accuracy, which meets the elementary demands of the inspection tasks in the marine engine room and provides technical support for the defect recognition and anomaly detection of the appearance of equipment in the centralized monitoring and alarm system.

As for the low detection accuracy of Reservoir and Meter, we tried to modify the framework and parameters, but the results were unsatisfactory. Therefore, we shifted our focus to improving the overall mAP and temporarily set aside the per-class AP. In future work, we will fully consider the AP of each single class and try to further improve the detection accuracy of small-scale targets in the cabin. Moreover, the AEMER dataset constructed for the inspection task is not yet ideal, and it will be supplemented and expanded in the future. Meanwhile, new semantic information will be added to the dataset, and unknown object detection will be carried out by modifying the model and combining zero-shot classifiers, so that the detector might gain the ability of self-learning and self-updating, which will further realize fully intelligent visual perception.

Data Availability Statement:
The processed data cannot be shared at this time as the data also forms part of an ongoing study.