1. Introduction
Synthetic aperture radar (SAR), as an active sensor, is capable of all-day, all-weather imaging. Owing to the long wavelength of the emitted electromagnetic waves, SAR can effectively identify camouflage and penetrate concealing cover. Hence, with increasing resolution, SAR can acquire high-quality images and is widely used in target surveillance and reconnaissance. The appearance features of a target vary and are influenced by its surrounding environment, so detecting, i.e., finding and locating, targets against complex backgrounds has become an important research direction for SAR image applications. Vehicles, as a common type of land transportation, have increasingly become a focus for researchers.
The most commonly used traditional SAR target detection method is the Constant False Alarm Rate (CFAR) algorithm, a typical detection operator based on statistical characteristics. Nevertheless, this method has a drawback in that it requires the manual specification of the background clutter distribution and the guard area, so its detection results are affected by complex scenes and human factors.
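For illustration, the following is a minimal sketch of a cell-averaging CFAR detector; it assumes exponentially distributed clutter intensity, and the window half-widths and false-alarm rate below are hypothetical example values, not parameters from this paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ca_cfar(intensity, guard=4, train=8, pfa=1e-3):
    """Minimal cell-averaging CFAR sketch over a 2-D intensity image.

    guard, train: half-widths (in pixels) of the guard and training windows.
    pfa: desired probability of false alarm.
    Assumes exponentially distributed clutter (square-law detector output).
    """
    intensity = np.asarray(intensity, dtype=float)
    outer = 2 * (guard + train) + 1      # width of guard + training window
    inner = 2 * guard + 1                # width of guard window alone
    # Local sums via box filters; the training ring is outer minus inner.
    sum_outer = uniform_filter(intensity, outer, mode="reflect") * outer ** 2
    sum_inner = uniform_filter(intensity, inner, mode="reflect") * inner ** 2
    n = outer ** 2 - inner ** 2          # number of training cells
    clutter_mean = (sum_outer - sum_inner) / n
    alpha = n * (pfa ** (-1.0 / n) - 1.0)   # CA-CFAR scaling for the mean
    return intensity > alpha * clutter_mean  # boolean detection mask
```

The sketch makes the drawback above concrete: both the clutter model (exponential, baked into `alpha`) and the guard/training window sizes must be chosen by hand.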
In contrast, deep networks have a remarkable feature abstraction ability and can automatically perform target detection in complex backgrounds. Owing to their powerful performance, deep networks are now widely used for SAR object detection, and many researchers have designed new detection networks or improved existing ones. For SAR ship detection, Jiao et al. [1] proposed a multi-scale neural network based on Faster R-CNN with densely connected convolutional layers to solve the multi-scale, multi-scene SAR ship detection problem. Cui et al. [2] proposed a ship detection method based on a Dense Attention Pyramid Network, which also adds attention modules to the convolutional layers and connects them for multi-scale ship detection. Zhang et al. [3] proposed a novel balance learning network (BL-Net) to solve four imbalance problems in SAR ship detection; they also carried out work on high-speed SAR ship detection [4,5,6]. In addition, researchers have used rotatable bounding-box-based detection networks [7,8,9,10] to cope with dense scenes, further improving localization accuracy. For SAR airplane detection, He et al. [11] proposed a component-based multi-layer parallel network to solve the sparsity and diversity problems arising from the SAR scattering mechanism. Wang et al. [12] proposed a new fast detection framework, the Efficient Weighted Feature Fusion and Attention Network (EWFAN), which performs automatic and rapid aircraft detection with high accuracy. Many other researchers [13,14,15] have also worked on SAR aircraft detection. Other types of SAR object detection include oil-tank detection [16,17] and video SAR shadow detection [18], to name but a few.
Since deep networks are data-driven, a considerable number of samples is required as training data to achieve good performance, so corresponding datasets must be created for network training. Over the past few years, SAR ship detection has developed more rapidly largely owing to the abundance of public SAR ship datasets [19,20,21,22,23]. However, the only public dataset containing vehicle targets is Moving and Stationary Target Acquisition and Recognition (MSTAR) [24], which comprises a series of military vehicle chips and clutter images and was originally intended for classification studies. In recent years, some researchers have carried out detection work using MSTAR. For example, Long et al. [25] proposed a simple rotated detection model on the dataset, embedding the vehicles into the clutter images for detection experiments. Zhang et al. [26] selected eight types of vehicle chips and integrated them into background images to construct the SAR_OD dataset, then utilized data augmentation to improve detection accuracy. Sun et al. [27] constructed a small dataset called LGSVOD by manually labeling three target classes and proposed an improved YOLOv5 network for detection. All of these studies embedded vehicle targets into clutter images for detection; in this case, although the MSTAR dataset contains many vehicle categories, the background is homogeneous, and the distribution of vehicles is idealized and not sufficiently dense. Therefore, for civilian vehicle detection in urban scenes, some researchers have used MiniSAR images [28] or FARAD images [28], which are publicly available from Sandia National Laboratories (USA). Wang et al. [29] took MiniSAR images as the basis, added MSTAR data as an expansion to form a detection dataset, and used transfer learning based on the Single Shot MultiBox Detector (SSD). Zou et al. [30] proposed the SCEDet network on one band of FARAD images, with the produced dataset containing 1661 vehicles for experimental validation. Tang et al. [31] employed a CFAR-guided SSD algorithm using five MiniSAR images for experimental validation. In these studies, however, only a portion of the available images was selected for experiments; the corresponding data have not been fully explored and lack a unified evaluation benchmark.
In this paper, we construct a SAR Image dataset for VEhicle Detection based on rotatable bounding boxes, named SIVED, which collects data in the X, Ku, and Ka bands from MSTAR, MiniSAR, and FARAD. The imaging areas of MiniSAR and FARAD are concentrated mainly in urban areas with complex backgrounds composed of trees, buildings, roads, and clutter. SIVED contains more than 270 dense-scene chips. Rotatable bounding boxes are adopted to avoid information redundancy and reduce interference from the background and adjacent targets in dense scenes; they are also convenient for orientation estimation and aspect-ratio calculation. The creation of SIVED consists of three main steps. The first is data preprocessing, including the removal of scenes without targets. Then, an automatic annotation algorithm based on CFAR and YOLOv5 is proposed to enable semi-automatic annotation. Finally, chips and annotation files are automatically organized to build the dataset, as outlined in the sketch below. After construction, a complete analysis of the dataset characteristics is performed, eight rotated detection algorithms are selected to verify the stability and challenge of the dataset, and a corresponding baseline is built for the reference of relevant researchers.
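As a minimal outline of these three steps, consider the following skeleton; the callables (`has_targets`, `cfar_detect`, `refine`, `verify`, `export`) are hypothetical stand-ins for the CFAR detector, the YOLOv5 refinement, human verification, and file export, and this is an illustrative sketch rather than the authors' implementation.

```python
from typing import Callable, Iterable

def build_dataset(scenes: Iterable, has_targets: Callable, cfar_detect: Callable,
                  refine: Callable, verify: Callable, export: Callable,
                  chip_size: int = 512) -> None:
    """Illustrative skeleton of the three-step dataset construction pipeline."""
    for scene in scenes:
        # Step 1: preprocessing -- drop scenes that contain no targets.
        if not has_targets(scene):
            continue
        # Step 2: semi-automatic annotation -- CFAR proposes candidate
        # regions, a YOLOv5 model refines them into rotated boxes, and a
        # human annotator verifies the result.
        boxes = verify(scene, refine(scene, cfar_detect(scene)))
        # Step 3: chips and annotation files are organized automatically.
        export(scene, boxes, chip_size)
```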
The main contributions of this paper are as follows.
Using publicly available high-resolution SAR data that include vehicle targets, we construct the first SAR image dataset spanning three bands for vehicle detection. Rotatable bounding box annotation is adopted to reduce redundant background clutter and accurately position targets in dense scenes. This dataset can advance the development of vehicle detection and facilitate vehicle monitoring in complex terrestrial environments.
An annotation algorithm combining CFAR with a detection network is proposed for the MSTAR data to increase annotation efficiency. The annotation files contain enriched information, expanding the potential for various applications.
Experiments are conducted using eight state-of-the-art rotated detection algorithms, establishing a baseline for evaluating the dataset. The experimental results confirm the dataset's stability and the adaptability of current algorithms to vehicle targets.
The rest of this paper is organized as follows. Section 2 presents the basic information about SIVED. Section 3 describes the construction of SIVED. Section 4 analyzes the characteristics of SIVED. Section 5 introduces the architectures of the eight selected rotated detection algorithms. Section 6 presents experiments conducted on SIVED to establish a baseline and analyze the dataset characteristics. Section 7 provides a detailed discussion of the dataset based on the experimental results of the eight algorithms. Finally, conclusions and outlook are given in Section 8.
2. Basic Information about SIVED
The previously mentioned open data, MSTAR (X band), MiniSAR (Ku band) [28], and FARAD (Ka and X bands) [28], are used to construct the dataset; their basic information is shown in Table 1. For MSTAR, 5168 vehicle chips at 17° and 15° depression angles are selected. SIVED consists of a training, test, and validation set, and the chip size is set to 512 × 512. Statistics on the numbers of chips and targets are given in Table 2. Chips of vehicles located in urban areas are selected for display in Figure 1a; the scenes contain parking lots, buildings, tree cover, roadsides, etc. Since the original MSTAR chip size is 128 × 128, the chips are stitched in groups of 16 (4 × 4) to form slices of the corresponding size, as shown in Figure 1b.
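As a minimal illustration of this stitching step, assuming the 128 × 128 chips are available as NumPy arrays (the function below is illustrative, not the authors' code):

```python
import numpy as np

def stitch_chips(chips):
    """Stitch sixteen 128x128 MSTAR chips into one 512x512 slice (4x4 grid)."""
    assert len(chips) == 16 and all(c.shape == (128, 128) for c in chips)
    rows = [np.hstack(chips[i * 4:(i + 1) * 4]) for i in range(4)]
    return np.vstack(rows)  # resulting array has shape (512, 512)
```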
SIVED is annotated with rotatable bounding boxes; an example of an annotated chip is shown in Figure 2a. The row direction of the image matrix is defined as the y-axis, the column direction as the x-axis, and coordinates are recorded in the form (x, y). Two annotation file formats are provided. The first is derived from the DOTA [32] annotation format and uses TXT files to record the annotation information of each chip; it is concise and clear and can be directly applied to most rotated detection networks. The corresponding file format is shown in Figure 2b, where the corner coordinates are arranged clockwise. It should be mentioned that the difficulty field indicates whether the labeled target is a difficult detection instance (0 means not difficult, 1 means difficult). The second, derived from the PASCAL VOC [33] annotation format, uses XML files to record the detailed information of each chip, including the source data, band, resolution, polarization mode, target azimuth, etc., which is convenient for researchers to exploit in further research; the annotation file is shown in Figure 3.
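To make the TXT format concrete, the following sketch parses one DOTA-style annotation line into corner points, a category, and a difficulty flag; the field layout (eight coordinates, category, difficulty) follows the DOTA convention, and the sample line is hypothetical.

```python
def parse_dota_line(line: str):
    """Parse one DOTA-style line: x1 y1 x2 y2 x3 y3 x4 y4 category difficulty."""
    fields = line.split()
    # Four (x, y) corner points, arranged clockwise.
    corners = [(float(fields[i]), float(fields[i + 1])) for i in range(0, 8, 2)]
    category, difficulty = fields[8], int(fields[9])  # 0 = not difficult, 1 = difficult
    return corners, category, difficulty

# Hypothetical example line for a vehicle annotation:
corners, category, difficulty = parse_dota_line(
    "120.0 80.0 150.0 80.0 150.0 95.0 120.0 95.0 vehicle 0")
```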
7. Discussion
As discussed in the previous section, currently available datasets for SAR image target detection are focused primarily on ship targets. However, as ships are much larger than vehicles in physical scale, our dataset addresses the detection of small targets in the image-pixel sense. Thus, SIVED serves as a valuable complement toward constructing a multi-target dataset spanning various scales. Compared to the typical SAR ship dataset SSDD [21], SIVED offers rich associated information in its annotation files. Although SSDD employs rotatable bounding boxes, its annotation file only includes the position and angle of the target. In contrast, SIVED records the source of each slice and the basic information of the corresponding sensor, such as band, polarization mode, and resolution. This additional information will be useful in small-target detection research based on SAR imaging mechanisms. SSDD contains 1160 chips and 2587 ship targets, an average of about two targets per chip. By comparison, SIVED's urban-area portion comprises 721 chips and 6845 vehicle targets, an average of about nine targets per chip. This indicates that SIVED's target distribution is denser and its target capacity larger, which poses a greater challenge to the positioning accuracy of detection algorithms. Furthermore, rich land features and clutter constitute a complex background.
The characteristics of vehicles in SAR imaging are closely related to the wavelength of the band used. Typically, longer wavelengths provide deeper penetration but a weaker ability to characterize target details, whereas shorter wavelengths provide weaker penetration but a better ability to characterize target details. In this paper, we constructed a dataset of the Ka, Ku, and X bands, in order of increasing wavelength. Figure 20 shows the imaging results of different bands in FARAD; it is evident that vehicles in Ka-band images exhibit more texture features than those in X-band images. The inclusion of images from different bands enriches the features, enabling the network to learn more and improving its generalization ability. In addition, owing to SAR's side-looking imaging and microwave penetration, mixed areas of vehicles and trees can form, as shown in Figure 7, where the pixel information combines returns from trees and vehicles. The vehicle targets in such scenes are also labeled, which further enriches the features of the dataset while making full use of SAR imaging characteristics.
As a high recall indicates a high rate of target detection, the experimental results in Table 4 suggest that the dataset is relatively stable. However, the precision does not reach the same level as the recall, resulting in more false alarms, i.e., background wrongly identified as targets. This finding verifies the complexity of the dataset's background and provides further evidence that the dataset is challenging while remaining stable. Generally, one-stage networks have lower detection accuracy than two-stage networks, since a two-stage network distinguishes the background from the target in the RPN, whereas a one-stage network performs regression directly, which results in an imbalance between the background and target categories. Nonetheless, Oriented RepPoints utilizes the Focal Loss from RetinaNet, which applies appropriate weighting to the cross-entropy loss and focuses the loss calculation on target categories, mitigating the influence of the background-target imbalance. Combined with its adaptive improvements for rotatable bounding box regression, it thus achieves the highest performance.
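For reference, the focal loss down-weights easy (mostly background) examples so that training focuses on hard targets. A minimal binary-classification sketch follows; the defaults γ = 2 and α = 0.25 are the values commonly used in RetinaNet, shown here for illustration only.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing weight
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) examples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```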
The visual detection results presented in Figure 18 reveal that missed detections and false alarms occur in the results of the different algorithms. This finding is in line with the actual scenario: there is greater interference and difficulty in distinguishing dense targets in urban environments, which ultimately highlights the challenge of the dataset.
When constructing SIVED, multiple data sources are used, the SAR imaging mechanism is fully exploited, and sensor information is recorded during annotation, which endows the dataset with richness. Compared with typical ship datasets, SIVED focuses mainly on small targets and contains denser scenes and more complex backgrounds, which make the dataset challenging. The targets are distributed at various angles, and the different algorithms maintain high recall values in the experiments, which demonstrates the stability of the dataset. In summary, SIVED exhibits three properties: richness, stability, and challenge.