Object Detection and Classification Based on YOLO-V5 with Improved Maritime Dataset

Abstract: The SMD (Singapore Maritime Dataset) is a public dataset with annotated videos, and it is almost the only one available for training deep neural networks (DNN) to recognize maritime objects. However, there are noisy labels and imprecisely located bounding boxes in the ground truth of the SMD. In this paper, for the benchmarking of DNN algorithms, we correct the annotations of the SMD and present an improved version, which we coined SMD-Plus. We also propose augmentation techniques designed especially for the SMD-Plus. More specifically, an online transformation of training images via Copy & Paste is applied to solve the class-imbalance problem in the training dataset. Furthermore, the mix-up technique is adopted in addition to the basic augmentation techniques of YOLO-V5. Experimental results show that the detection and classification performance of the modified YOLO-V5 with the SMD-Plus is improved in comparison to the original YOLO-V5. The ground truth of the SMD-Plus and our experimental results are available for download.


Introduction
Public image datasets such as COCO [1] and Pascal visual object classes (VOC) [2] have made a great contribution to the development of deep neural networks (DNN) for computer vision problems [3][4][5][6][7][8]. These datasets include many different categories of objects. On the other hand, a domain-specific dataset usually contains only a relatively small number of sub-categories under a parent category. For domain-specific applications, obtaining a sufficient number of annotated images is a difficult task. Moreover, most domain-specific datasets suffer from the class-imbalance problem and noisy labels. Thus, to overcome the overfitting caused by these inherent problems, a DNN model pre-trained on the public image datasets mentioned above is usually adopted and then fine-tuned on the domain-specific dataset.
The application areas that make use of domain-specific datasets have been expanding and now include road condition recognition [9,10], face detection [11,12], and food recognition [13,14], among others. Object recognition [15,16] in maritime environments is another important domain-specific problem for various security and safety purposes. For example, an autonomous ship equipped with an Automatic Identification System (AIS) requires safe navigation, which is achieved by the detection of surrounding objects [17]. This is a difficult problem simply because the objects at sea change dynamically due to environmental factors such as illumination, fog, rain, wind, and light reflection. In addition, depending on the viewpoint, the same ship can appear with quite different shapes. Since the ocean usually has a wide-open view, the ships on the sea can be seen with a variety of sizes and occlusions. That is, large intra-class variances in the size and shape of maritime objects make the recognition problem very challenging. To tackle these difficulties, we rely on the recent advancements in DNN. However, the immediate problem of the DNN-based approach is the lack of annotated training data in maritime environments.
Maritime video datasets with annotated bounding boxes and object labels are hardly available. There exist few published datasets collected especially for object detection in maritime environments [18][19][20]. Among them, only the Singapore Maritime Dataset (SMD), introduced by Prasad et al. [20], provides sufficiently large video data with labeled bounding boxes for 10 maritime object classes. The SMD consists of onboard and onshore video shots captured by Visual-Optical (VIS) and Near Infrared (NIR) sensors, which can be used for tracking as well as detecting ships on the sea. Although the SMD can be used for the training and testing of DNNs, it is hard to find completely reproducible results published with the SMD for comparative studies. This is due to the fact that the SMD has the following problems. First, there are bounding boxes in the ground truth of the SMD with inaccurate object boundaries. Some of the bounding boxes are so loose that they include background in addition to the whole object, while others are so tight that they cover only a part of the object. Since the maritime images are usually taken from a wide-open view, a faraway object can appear as a tiny one. In this case, a small difference at the border of the bounding box can make a big difference in testing the accuracy of object detection. Second, there are incorrectly labeled classes in the ground truth of the SMD. These noisy labels may not be a big problem for distinguishing the foreground object from the background, but they certainly affect the training and testing of the DNN for the object classification problem. Third, there exists a serious class imbalance in the SMD. The class imbalance can cause the biased training of the DNN in favor of the majority classes and deteriorate the generalization ability of the model. Fourth, there is no proper train/test split in the original SMD.
Note that in [15], the authors split the SMD into training, validation, and testing subsets. Using these splits, they also provided benchmark results for object detection via the Mask R-CNN model. However, their benchmark results concerned object detection only, with no further classification of each detected object. In fact, most of the previous research that used the dataset dealt only with object detection [15,21,22]. However, for applications in maritime security such as the use of Unmanned Surface Vehicles (USV), we also need to identify the type of the detected object [23]. Since the original SMD includes the class labels of the objects as well as their bounding box information, we may use the SMD for both object detection and classification problems.
Although the SMD provides the class label for each object with a bounding box, as already mentioned, there are still noisy labels. Furthermore, the split dataset provided by [15] suffers from the class-imbalance problem (e.g., no data assigned for some of the object classes such as Kayak and Swimming Person in the training subset). In this paper, by using the SMD as a benchmark dataset for both detection and classification tasks, we fix its imprecisely determined bounding boxes and noisy labels. To alleviate the class-imbalance problem, we discard rare classes such as 'swimming person' and 'flying bird and plane'. In addition, we merge the 'boat' and 'speed boat' labels and thus propose a modified SMD (coined SMD-Plus) with seven maritime object classes.
Hence, with the SMD-Plus dataset, we are able to provide benchmark results for the detection and classification (detection-then-classification) problem. That is, based on the YOLO-V5 model [24], we modify its augmentation techniques in consideration of maritime environments. More specifically, an Online Copy & Paste is applied to alleviate the imbalance problem in the training process. Likewise, the original augmentation techniques of YOLO-V5, such as geometric transformation, mosaic, and mix-up, are adjusted especially for the SMD-Plus.
The contributions of this paper can be summarized as follows: (i) We have improved the existing SMD dataset by removing noisy labels and fixing the bounding boxes. It is expected that the improved dataset of the SMD-Plus will be used as a benchmark dataset for the detection and classification of objects in maritime environments.

Maritime Dataset
In domain-specific DNN applications, it is of vital importance to obtain a proper dataset for training. However, for some domain-specific problems, it is quite difficult to obtain publicly available datasets. Depending on the target domain, it is often expensive to collect images for specific classes and annotate them. Moreover, security and proprietary rights often prevent the owners from opening their datasets. One such domain-specific dataset is the maritime dataset. Maritime datasets can be classified into three groups [25]: (i) datasets for object detection [19], (ii) datasets for object classification [26], and (iii) datasets for both object detection and classification [20]. The dataset for object detection provides the location information of the objects in the image with their bounding boxes, while no class label is given for each object. On the other hand, in the dataset for both object detection and classification, each image includes multiple objects with their bounding boxes and class labels. Finally, there is only a single maritime object in an image from the dataset for object classification.
Although the SMD [20] provides the ground truth of video objects and their class labels for both object detection and classification, there are no benchmark results reported for the SMD. This is due to the fact that the original SMD is not quite ready for training DNN models. Moosbauer et al. [15] analyzed the SMD and proposed the split sub-datasets of 'train, validation, and test'. After applying Mask R-CNN on their split sub-datasets, they then reported the foreground object detection results. However, for both object detection and classification tasks, their split sub-datasets of train, validation, and test may not be appropriate for training the DNNs. Note that there certainly exist noisy labels in the SMD, which cause no problems in detection but negatively affect the DNN training for classification. Additionally, due to the class-imbalance problem of the SMD, some of the split sub-datasets in [15] have only a few or even no data in certain classes of the test dataset. The SMD has been combined with other existing maritime datasets to resolve these limitations. For example, to expand the SMD dataset, Shin et al. [22] exploited public datasets for classification such as MARVEL [18] by pasting copies of the objects in MARVEL into the SMD dataset. Furthermore, in Nalamati et al. [23], the SMD was combined with the SeaShips [19] dataset. However, these combined datasets were only used for detection. Moreover, due to the lack of dataset-combining details, it is hard to reproduce and compare the results. The Maritime Detection Classification and Tracking benchmark (MarDCT) [27] provided maritime datasets for detection, classification, and tracking separately. Therefore, they are inappropriate for the classification of detected objects with bounding boxes.

Object Detection Models
Although improved versions of R-CNN [3], such as Faster R-CNN [4] and cascade R-CNN [28], were proposed to speed up the inference, the two-stage architectures of the R-CNN family inherently limit the processing speed. This has motivated researchers to develop one-stage DNNs such as YOLO [29], SSD [8], and RetinaNet [7] for object detection. Unlike the R-CNN, YOLO performs classification and bounding box regression at the same time, thus reducing the processing time. To further improve the accuracy and speed, the first YOLO has been refined to YOLO-V3 [6], YOLO-V4 [30], and YOLO-V5 [24]. The SSD [8] is another one-stage object detector. In place of the anchor boxes of YOLO, the SSD uses predefined default boxes and achieves scale invariance by using a number of feature maps obtained from intermediate layers of the backbone. RetinaNet [7] also adopts the one-stage framework, with a focal loss that assigns small weights to easily detectable objects but large weights to objects that are difficult to handle.
The detectors based on anchor boxes have the disadvantage of being sensitive to hyper-parameters. To solve this problem, anchor-free methods such as FCOS [31] have been proposed. However, since FCOS [31] performs pixel-wise bounding box prediction, it takes more time to execute the detection-then-classification task. Since the real-time requirement is essential for autonomous surveillance, we focus on using the fast one-stage method of YOLO-V5 [24] as the baseline object detection model.

Improved SMD: SMD-Plus
The SMD provides high-quality videos with ground truth for 10 types of objects in marine environments. Since the ground truth of the SMD was created by non-expert volunteers, it includes some label errors and imprecise bounding boxes. These ambiguous and incorrect class labels in the ground truth make it difficult to use the SMD as a benchmark dataset for maritime object classification. Therefore, most research making use of the SMD deals only with object detection, with no classification of the detected objects. To make use of the SMD for the detection-then-classification purpose, our first task was to revise and improve its imprecise annotations.
To train a DNN for object detection, we need the location and size information of the bounding boxes. Note that unlike datasets with general objects, the background regions of sea and sky in maritime datasets such as the SMD usually take up much larger areas in the image than the target ships. Therefore, precise bounding box annotations for small maritime objects are of importance, and even a small mislocation of the bounding box for a small object can make a huge difference in the training and testing of the DNNs. The ground truth annotation of the SMD for each maritime object provides one of ten class labels as well as its bounding box location and size. However, there are quite a few noisy labels in the SMD. In addition, there are indistinguishable classes that need to be merged. For example, as shown in Figure 2, two ships from an apparently identical class are assigned the different labels of 'Speed boat' and 'Boat'. Therefore, in our improved version, the SMD-Plus, we merge the two classes of 'Speed boat' and 'Boat' into the single class 'Boat'. Another motivation to combine these two classes is that the number of images for each of the two classes is not sufficient for training and testing. Next, we point out the problem of the 'Other' class in the SMD. We noticed that the SMD included a clearly identifiable 'Person' in the 'Other' class, as seen in Figure 4a, as well as blurred unidentifiable objects, as seen in Figure 4b. This makes the definition of the label 'Other' rather fuzzy. Therefore, we assigned the 'Other' label only to unidentifiable objects, excluding rare objects such as the 'Person' from the class. Finally, since there exist no actual labeled objects for the 'Flying bird and plane' and 'Swimming person' classes in the SMD, we discarded these two classes.
Therefore, putting all the above modifications together, we can summarize the criteria for our SMD revisions as follows: (i) 'Swimming person' class is empty and is deleted; (ii) Non-ship 'Flying bird and plane' class is deleted; (iii) Visually similar classes of 'Speed boat' and 'Boat' are merged; (iv) Bounding boxes of the original SMD are tightened; (v) Some of the missing bounding boxes in 'Kayak' are added; (vi) According to our redefinitions for the 'Ferry' and 'Other' classes, some of the misclassified objects in them are corrected.
Our final version of the SMD, coined as SMD-Plus, is quantitatively compared with the original SMD in Table 1. We needed to split the SMD-Plus into training and testing subsets for the DNNs. Note that the separation of the SMD into train, validation, and test subsets proposed by [15] is good for detection, but not for detection-then-classification. Furthermore, some of the classes in the test subset of the original SMD were empty. Hence, we carefully re-separated the SMD video clips such that they were distributed evenly for all classes in both the train and test subsets as much as possible (see Table 2).

Data Augmentation for YOLO-V5
In this section, we address our detection-then-classification method based on YOLO-V5 with the SMD-Plus dataset. We focus mainly on image augmentation techniques designed especially for the maritime dataset of the SMD-Plus.
Considering the relatively small size and class imbalance problems in the SMD-Plus, data augmentation plays an important role in alleviating the overfitting problem when training the DNNs. As shown in Figure 5, in addition to the basic YOLO-V5 augmentation techniques such as mosaic and geometric transformation, we employ the Online Copy & Paste and Mix-up techniques. That is, to a set of four training images, {I 1 , I 2 , I 3 , I 4 }, we first apply color jittering by randomly altering the brightness, hue, and saturation components of the images. Then, the Copy & Paste is performed by inserting the copied objects from other training images into the input images. Next, adding another set of four training images, {J 1 , J 2 , J 3 , J 4 }, a random mosaic is applied to both sets of {I 1 , I 2 , I 3 , I 4 } and {J 1 , J 2 , J 3 , J 4 }. Then, the two mosaic images are geometrically transformed by translation, horizontal flip, rotation, and scaling. Finally, after the geometric transformations, the two images are fused by the Mix-up process. Among the augmentations mentioned previously, the Copy & Paste and the Mix-up are the newly adopted techniques for the basic YOLO-V5 augmentations. Now, we will elaborate on these two techniques in the following subsections.
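The ordering of these stages can be sketched as follows. This is a minimal toy sketch of the pipeline only: an "image" is just a dict recording which operations touched it, and all helper names are illustrative, not the actual YOLO-V5 code.

```python
import random

# Toy stand-ins for the real image operations; only the pipeline ORDER is shown.
def color_jitter(img, rng): img["ops"].append("jitter"); return img
def copy_paste(img, bank, rng): img["ops"].append("paste"); return img
def mosaic4(four, rng): return {"ops": sum((i["ops"] for i in four), []) + ["mosaic"]}
def geometric(img, rng): img["ops"].append("geom"); return img
def mix_up(a, b, lam): return {"ops": a["ops"] + b["ops"] + ["mixup"]}

def augment(batch_i, batch_j, bank, rng):
    """Order described above: color jitter and Copy & Paste on {I1..I4},
    a mosaic per 4-image set, geometric transforms on both mosaics,
    then Mix-up of the two transformed mosaics."""
    batch_i = [copy_paste(color_jitter(im, rng), bank, rng) for im in batch_i]
    mos_i = geometric(mosaic4(batch_i, rng), rng)
    mos_j = geometric(mosaic4(batch_j, rng), rng)
    return mix_up(mos_i, mos_j, lam=rng.uniform(0.0, 1.0))

rng = random.Random(42)
out = augment([{"ops": []} for _ in range(4)],
              [{"ops": []} for _ in range(4)], bank=[], rng=rng)
```

Whether the second set {J1..J4} also receives jitter and Copy & Paste before its mosaic is not specified in the text; the sketch applies them to the first set only.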

Copy & Paste Augmentation
Copy & Paste augmentation is an effective means of increasing the number of objects for the minority classes, thus alleviating the class-imbalance problem. Here, to enhance the recognition performance for small objects, we choose smaller objects to be copied as much as possible. To this end, we first divide the objects in the training images into three groups: small (s), medium (m), and large (l). The criterion for the division is given by the size of the rectangular area of the bounding box (see Table 3). Moreover, from Table 1, we can choose more objects from the minority classes for the Copy & Paste to mitigate the class-imbalance problem. Consequently, we first choose the class k ∈ {1, 2, · · · , K} out of the K object classes with the following probability, P_class(k):

P_class(k) = w_c(k) / Σ_{k'=1}^{K} w_c(k'), (1)

where w_c(k) = N_min/N_k, N_min = min{N_1, · · · , N_K}, and N_k is the number of objects in class k. By choosing the object to be copied by (1), the minority classes have higher chances of being selected. Once the class k is chosen by (1), we need to select the final object to be copied from one of the three groups of small (s), medium (m), and large (l), determined according to Table 3. The probability P_size(j) of choosing the size group j for class k is given by the following equation:

P_size(j) = w_s(j) / Σ_{j'∈{s,m,l}} w_s(j'), (2)

where w_s(j) = min{N_k(s), N_k(m), N_k(l)}/N_k(j), and N_k(j) is the number of objects of size j ∈ {s, m, l} in the object class k. Note that P_size(j) in (2) also gives a higher probability to the minority group among small (s), medium (m), and large (l). Since the small-sized (s) group of each class usually has the smallest number of objects in the SMD-Plus, the objects in the small-sized group s have more chances of being selected than those in the other groups of m and l. In the previous methods, Copy & Paste was executed before training as an offline pre-processing technique.
As a consequence, the images pre-processed by the Copy & Paste were used over and over again for every epoch of the training process. To provide more diversified images in training the DNN, for this paper, we apply the Copy & Paste in an on-the-fly manner in order to have an Online Copy & Paste scheme. Now, this Online Copy & Paste creates differently pasted objects for every training epoch, which allows the DNN to be trained with maritime objects of many different sizes and locations.
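The two sampling distributions (1) and (2) can be sketched in a few lines. The counts below are toy values, not the actual SMD-Plus statistics; they only illustrate the bias toward minority classes and small objects.

```python
def class_probs(counts):
    """P_class(k) ∝ w_c(k) = N_min / N_k, normalized over all K classes."""
    n_min = min(counts.values())
    w = {k: n_min / n for k, n in counts.items()}
    total = sum(w.values())
    return {k: wk / total for k, wk in w.items()}

def size_probs(size_counts):
    """P_size(j) ∝ w_s(j) = min_j' N_k(j') / N_k(j), for j in {s, m, l}."""
    n_min = min(size_counts.values())
    w = {j: n_min / n for j, n in size_counts.items()}
    total = sum(w.values())
    return {j: wj / total for j, wj in w.items()}

# Toy per-class object counts and per-size counts within one class.
p_cls = class_probs({"vessel": 1000, "boat": 400, "kayak": 100})
p_sz = size_probs({"s": 10, "m": 50, "l": 40})
```

By construction, the rarest class and the rarest size group receive the highest sampling probabilities, which is exactly the bias the Online Copy & Paste relies on.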
Next, we need to locate the position in the training image where the copied object is to be pasted, avoiding any overlap between the copied object and the existing ones. This can be performed by calculating the Intersection over Union (IoU) between the candidate position for the paste and the location of the original bounding box. That is, with the equation below, we can check whether the IoU for the paste is equal to zero. In object detection, the IoU measures the overlapping area between the to-be-pasted bounding box B_p and the existing bounding box B_gt in the ground truth, divided by the area of their union:

IoU(B_p, B_gt) = |B_p ∩ B_gt| / |B_p ∪ B_gt|. (3)
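In code, this overlap test for a candidate paste position can be sketched as follows (a minimal sketch assuming (x1, y1, x2, y2) box coordinates; the function names are illustrative):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def can_paste(candidate, existing_boxes):
    """Accept the paste position only if it overlaps no ground-truth box."""
    return all(iou(candidate, b) == 0.0 for b in existing_boxes)
```

A candidate position is simply re-drawn until `can_paste` returns True, so pasted objects never occlude annotated ones.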

Mix-up Augmentation
The Mix-up technique [32] is a means of generating a new image by the weighted linear interpolation of two images and their labels. It is known to be effective for mislabeled data because the labels of the two images are mixed, just as their images are. More specifically, for given input images and their label pairs (x_i, y_i) and (x_j, y_j) from the training data, the Mix-up can be implemented as follows:

x̄ = λ x_i + (1 − λ) x_j,  ȳ = λ y_i + (1 − λ) y_j, (4)

where (x̄, ȳ) is the Mix-up output and λ ∈ [0, 1] is the mixing ratio.
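A direct implementation of the interpolation in (4) is shown below. Images and labels are represented as flat float lists for simplicity; drawing λ from a Beta(α, α) distribution is a common convention (the specific α used with YOLO-V5 is not fixed by the text).

```python
import random

def mix_up(x_i, y_i, x_j, y_j, lam):
    """Element-wise x̄ = λ·x_i + (1-λ)·x_j and ȳ = λ·y_i + (1-λ)·y_j."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y

# λ drawn from a symmetric Beta distribution (α = 8.0 is an assumption here).
lam = random.Random(7).betavariate(8.0, 8.0)
x_bar, y_bar = mix_up([1.0, 2.0], [1.0, 0.0], [3.0, 4.0], [0.0, 1.0], lam)
```

Because the labels are blended with the same λ as the pixels, a mislabeled sample contributes only a λ-weighted fraction of its (wrong) label, which is what makes Mix-up robust to label noise.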

Basic Augmentations from YOLO-V5
We also use the basic geometric transformations of YOLO-V5 such as flipping, rotation, translation, and scaling. Another basic augmentation adopted from YOLO-V5 is the mosaic augmentation, which was first introduced in [30]. The mosaic augmentation mixes four training images into a single training image so that it contains four different contexts. According to [30], the mosaic augmentation allows the model to learn how to identify objects on a smaller-than-usual scale, and it is useful for training as it greatly reduces the need for large mini-batch sizes.
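A simplified sketch of the mosaic operation is shown below, assuming equally sized square inputs; the real YOLO-V5 implementation also rescales the sources and remaps their box labels onto the new canvas, which is omitted here.

```python
import random

def mosaic(imgs, size, rng):
    """Combine four size x size images into one size x size canvas around a
    randomly chosen centre; each source contributes one quadrant."""
    cx = rng.randint(size // 4, 3 * size // 4)   # random mosaic centre x
    cy = rng.randint(size // 4, 3 * size // 4)   # random mosaic centre y
    canvas = [[0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            # 0: top-left, 1: top-right, 2: bottom-left, 3: bottom-right
            idx = (0 if y < cy else 2) + (0 if x < cx else 1)
            canvas[y][x] = imgs[idx][y][x]
    return canvas

# Four constant toy "images" with pixel values 1..4.
rng = random.Random(0)
imgs = [[[v] * 8 for _ in range(8)] for v in (1, 2, 3, 4)]
canvas = mosaic(imgs, 8, rng)
```

Restricting the centre to the middle half of the canvas guarantees that every quadrant, and hence every source image, is represented in the output.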

Experiment Results
As explained in the previous section, we revised the SMD in order to obtain the SMD-Plus. As a tool for modifying the ground truth of the SMD, we used the MATLAB ImageLabeler tool, which provides an application interface for easily loading video clips and attaching annotations to each object.
Our experiments were conducted on an Intel i7-9900 processor with 32 GB of main memory and an NVIDIA GeForce RTX 2080Ti. Based on YOLO-V5, we trained the model with the SMD-Plus. The hyper-parameters for the YOLO-V5 training are as follows: the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a learning rate of 0.01, and a batch size of 8. We also used the following values for the augmentation parameters:
• For color jittering: hue ranges from 0 to 0.015; saturation, from 0 to 0.7; and brightness, from 0 to 0.4;
• The probability of generating a mosaic is 0.5;
• Translate shifts range from 0 to 0.1;
• The probability of a horizontal flip is 0.5;
• Random rotation within angles from −10 to +10 degrees;
• Random scaling in the range of 0.5× to 1.5×.
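For reference, the training and augmentation settings listed above can be collected in a single configuration mapping. The key names below are illustrative shorthand, not the exact keys of the YOLO-V5 hyper-parameter file.

```python
# Training and augmentation settings quoted in the text (key names assumed).
hyp = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "lr0": 0.01,       # initial learning rate
    "batch_size": 8,
    "hsv_h": 0.015,    # hue jitter range [0, 0.015]
    "hsv_s": 0.7,      # saturation jitter range [0, 0.7]
    "hsv_v": 0.4,      # brightness jitter range [0, 0.4]
    "mosaic": 0.5,     # probability of applying mosaic
    "translate": 0.1,  # translate shift range [0, 0.1]
    "fliplr": 0.5,     # probability of horizontal flip
    "degrees": 10,     # random rotation in [-10, +10] degrees
    "scale": 0.5,      # random scaling in [0.5x, 1.5x]
}
```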
Using the same augmentation parameters listed above, for the sake of comparison, we conducted additional experiments with YOLO-V4 [30]. Table 4 compares the detection performance on the SMD and the SMD-Plus. As shown in Table 4, the detection performance on the SMD-Plus increased by more than 10% over the SMD for both YOLO-V4 and all versions of YOLO-V5. Here, as in the previous benchmarks [15,21,22], only foreground/background detection was performed. Note that detecting only the foreground and background can be used to evaluate the accuracy of the bounding box detection, but not the recognition accuracy for the class label. Therefore, we can use the results of Table 4 to verify the bounding box accuracy of the SMD-Plus. Table 5 shows the results of the detection-then-classification task for the train/test split of the SMD, as suggested by [15]. In this train/test split, however, there exist classes with no test data. Therefore, the corresponding classes of columns c1, c5, c7, and c10 are blank. The non-empty classes for the test set in [15] include 'Speed boat', 'Vessel/ship', 'Ferry', 'Buoy', 'Others', and 'Flying bird and Plane'. Fixing the IoU threshold at 0.5, the mAPs for the six non-empty classes are 0.186 for YOLO-V4, 0.22 for YOLO-V5-S, 0.182 for YOLO-V5-M, and 0.304 for YOLO-V5-L.
Next, Table 6 shows the results of the detection-then-classification task for the SMD-Plus. In the table, we can evaluate the performance of the Copy & Paste scheme. More specifically, the detection-then-classification results for 'No Copy&Paste', 'Online Copy&Paste', and 'Offline Copy&Paste' are compared in Table 6. As one can see in the table, our proposed 'Online Copy&Paste' outperformed the 'No Copy&Paste' and 'Offline Copy&Paste' methods for YOLO-V4 and all versions of YOLO-V5. Furthermore, the proposed 'Online Copy&Paste' has proven to be quite effective for minority classes such as 'Kayak'.
Table 6. Detection-then-classification results for the SMD-Plus dataset: c1: Ferry, c2: Buoy, c3: Vessel_ship, c4: Boat, c5: Kayak, c6: Sail_boat, c7: Others. Columns P and R represent the precision and the recall performance, respectively, for IoU = 0.5.

Conclusions
In this paper, we provided an improved SMD-Plus dataset for future research works on maritime environments. We also adjusted the augmentation techniques of the original YOLO-V5 for the SMD-Plus. In particular, the proposed 'Online Copy & Paste' method was proven to be effective in alleviating the class-imbalance problem. Our SMD-Plus dataset and the modified YOLO-V5 are open to the public for future research. We hope that our detection-then-classification model of YOLO-V5 based on the SMD-Plus serves as a benchmark for future research and development initiatives for automated surveillance in maritime environments.