Mix MSTAR: A Synthetic Benchmark Dataset for Multi-Class Rotation Vehicle Detection in Large-Scale SAR Images

Abstract: Because of the counterintuitive imaging mechanism and the resulting interpretation difficulty of Synthetic Aperture Radar (SAR) images, the application of deep learning to SAR target detection has been largely limited to large objects in simple backgrounds, such as ships and airplanes, and much less attention has been paid to SAR vehicle detection. The complexities of SAR imaging make it difficult to distinguish small vehicles from background clutter, creating a barrier to data interpretation and to the development of Automatic Target Recognition (ATR) for SAR vehicles. The scarcity of datasets has further inhibited progress in SAR vehicle detection in the data-driven era. To address this, we introduce a new synthetic dataset called Mix MSTAR, which mixes target chips and clutter backgrounds with the original radar data at the pixel level. Mix MSTAR contains 5392 objects of 20 fine-grained categories in 100 high-resolution images, predominantly 1478 × 1784 pixels. The dataset covers various landscapes, such as woods, grasslands, urban buildings, lakes, and tightly arranged vehicles, each object labeled with an Oriented Bounding Box (OBB). Notably, Mix MSTAR presents fine-grained object detection challenges by using the Extended Operating Condition (EOC) as a basis for dividing the dataset. Furthermore, we evaluate nine benchmark rotated detectors on Mix MSTAR and demonstrate the fidelity and effectiveness of the synthetic dataset. To the best of our knowledge, Mix MSTAR is the first public multi-class SAR vehicle dataset designed for rotated object detection in large-scale scenes with complex backgrounds.


Introduction
Thanks to its unique advantages, such as all-time, all-weather, high-resolution, and long-range detection, SAR has been widely used in various fields, such as land analysis and target detection. Vehicle detection in SAR-ATR is of great significance in urban traffic, hotspot target focusing, and other aspects.
In recent years, with the development of artificial intelligence, deep learning-based object detection algorithms [1,2] have dominated the field with their powerful capabilities in automatic feature extraction. Deep learning is data-hungry, and historical experience has shown that big data is an important driver of its flourishing development across various fields. With the rapid development of aerospace and sensor technology, an increasing number of high-resolution remote sensing images can be obtained. In the field of remote sensing, visible-light object detection has developed vigorously since the release of DOTA [3]. As the first publicly available SAR ship dataset, SSDD [4] directly promoted the application of deep learning to SAR object detection and led to the emergence of more SAR ship datasets [5–9]; these remain standard detection benchmarks to this day.
However, because the imaging mechanism of SAR differs from that of visible light, SAR images are unintuitive for the human eye to interpret. Ground clutter and scattering from object corner points can seriously interfere with human interpretation. As a result, the detection objects of current SAR datasets are mainly large targets, such as ships and planes, in relatively clean backgrounds. In contrast, SAR datasets for vehicles are very rare. The community has long relied on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset [10], released by the Sandia National Laboratory in the last century. However, the vehicle images in MSTAR are separated from large-scene images and appear in the form of small patches. Because it lacks complex backgrounds, it is only suitable for classification tasks, on which accuracy has already exceeded 99%. To date, in the SAR-ATR field, MSTAR has been used more widely in few-shot learning and semi-supervised learning [11,12]. Meanwhile, the number of SAR datasets containing vehicle images with large scenes is quite small. The reason is that the small size of vehicles demands higher resolution for SAR-ATR than aircraft and ships, which raises data acquisition costs. Moreover, vehicles exist in more complex clutter backgrounds, which increases the difficulty of manual interpretation and reduces the accuracy of target annotation. Table 1 provides detailed information on existing public SAR vehicle datasets with large scenes. Unfortunately, none of these datasets provides official localization annotations, so manual annotation is required. Due to strong noise interference, the FARAD X BAND [13] and FARAD KA BAND [14] images make it too difficult for humans to identify vehicle positions, so the annotations cannot meet the accuracy requirements. The Spotlight SAR [15] contains only a very small number of vehicles, and pairs of pictures were taken at different times at the same location. The Mini SAR [16] includes more vehicles, but it contains only 20 pictures and shares the Spotlight SAR's problem of duplicate scenes. Subsequent experiments showed that the small size of the Mini SAR caused a large standard error in the results. These issues make the above datasets difficult to use as reliable benchmarks for SAR-ATR algorithms. In addition, there is GOTCHA [17], which contains vehicles and large scenes, but it is a fully polarized circular SAR dataset that differs significantly from the commonly used single-polarized linear SAR. It contains only one scene and is mainly used for the classification of calibrated vehicles in the SAR-ATR field. The size of GOTCHA makes it unsuitable for object detection, so it is not included in the table for comparison. In view of the scarcity of vehicle datasets in the SAR-ATR field, a series of data generation efforts have been conducted around MSTAR, which can be divided into three main methods. The first method is based on generative adversarial nets (GANs) [18]. The generator network transforms input noise into generative images that can deceive the discriminator network by fitting the distribution of real images. In theory, GANs [19–21] can generate an unlimited number of images (see Figure 1a), thereby alleviating the scarcity of real samples. However, unlike optical images, SAR imaging is strictly based on radar scattering mechanisms, and the black-box nature of neural networks cannot guarantee that generated samples comply with SAR imaging mechanisms. Moreover, due to the limitations of real samples, it is difficult to generate large-scene images. The second method is based on computer-aided design (CAD) 3D modeling and electromagnetic simulation [22–25]. Among these is the SAMPLE [25] dataset released by Lewis et al. from the same institution as MSTAR, which has advantages in terms of model errors, as shown in Figure 1b. The advantage of this method is that the imaging of synthetic samples is based on physical mechanisms, and imaging under different conditions can easily be obtained by changing the simulation environment parameters. Compared with the original images, the simulated images can also remove the correlation between targets and background by setting random background noise, which prevents overfitting of the detection model. However, both of these methods have background limitations, and it is currently difficult to simulate vehicles located in complex, large-scale backgrounds. The third method is background transfer [26–28]. Chen et al. [26] argue that since the acquisition conditions of the chip images (Chips for short) and the clutter images (Clutters for short) in MSTAR are similar, Chips can be embedded in Clutters to generate vehicle images with large scenes, as shown in Figure 1c. Like the first method, the synthetic images cannot strictly comply with SAR imaging mechanisms, and current uses of this method directly paste Chips together with their backgrounds onto Clutters, which looks quite abrupt visually while maintaining the association between target and background.
Remote Sens. 2023, 15, 4558

To generate large-scale SAR images with complex backgrounds, we constructed Mix MSTAR using an improved background transfer method. Unlike previous works, we overcame the abrupt visual appearance of synthetic images and demonstrated the fidelity and effectiveness of the synthetic data. Our key contributions are as follows:

•
We improved the method of background transfer and generated realistic synthetic data by linearly fusing vehicle masks from Chips into Clutters, resulting in the fusion of 20 types of vehicles (5392 in total) into 100 large background images. The dataset adopts rotated bounding box annotation and includes one Standard Operating Condition (SOC) and two EOC partitioning strategies, making it a challenging and diverse dataset;

•
Based on the Mix MSTAR, we evaluated nine benchmark models for general remote sensing object detection and analyzed their strengths and weaknesses for SAR-ATR;

•
To address potential artificial traces and data variance in synthetic images, we designed two experiments to verify the fidelity and effectiveness of Mix MSTAR in terms of SAR image features, demonstrating that Mix MSTAR can serve as a benchmark dataset for evaluating deep learning-based SAR-ATR algorithms.

The remainder of this article is organized into four sections. Section 2 presents the detailed methodology employed to construct the synthetic dataset, as well as an extensive analysis of the dataset itself. In Section 3, we introduce and evaluate nine rotated object detectors using the synthetic dataset as the benchmark, followed by a comprehensive analysis of the results. Section 4 focuses on the analysis and validation of two vital problems related to the dataset, namely, artificial traces and data variance; moreover, we provide an outlook on the potential future of the synthetic dataset. Section 5 concludes our work.

Preliminary Feasibility Assessment
We first evaluated the feasibility of merging Clutters and Chips. Since the sensor's depression angle was 15° when collecting Clutters, we chose Chips with the same depression angle as the target images. As shown in Table 2, both Clutters and Chips were acquired with the same airborne STARLOS sensor and are consistent in radar center frequency, bandwidth, polarization, and depression angle. Although the radar modes differ, the final imaging resolution and pixel spacing are the same. Therefore, we assume that if the working parameters of Clutters were used to image the vehicles, the visual effect would be approximately the same as that of Chips. It is thus feasible to transfer the vehicles in Chips to the Clutters' backgrounds, and the final effect is in line with the human observation mechanism. Of course, we must acknowledge that, due to the differences in operating modes, the two have significant differences in the raw radar data (especially phase). This means that synthetic data generated by background transfer cannot strictly conform to the scattering mechanism of the radar. However, what we pursue is consistency between synthetic data and real data in terms of 8-bit image features, which is what matters for current deep learning models based on image feature extraction in the computer vision field. Unlike previous attempts that involved crude background transfers, Mix MSTAR aims to be a visually realistic synthetic dataset. To achieve this goal, we conducted extensive research into domain transfer and imaging algorithms to harmoniously blend the two radar datasets and developed a paradigm for creating synthetic datasets, as shown in Figure 2. Next, we describe the process of constructing the dataset in detail.

Mask Extraction
In order to make vehicles fit seamlessly into the Clutters' backgrounds, we used labelme [29] to mask the outlines of the vehicles in Chips. Since the shadow in the radar blind area is also an important feature of SAR targets, the vehicle's shadow was included in the mask. We also labeled the OBBs of the vehicles in Chips to be used as the labels of the final synthetic dataset. The four points of each OBB are labeled in clockwise order, and the first point is on the left side of the vehicle's front. It is worth noting that, according to the principle of electromagnetic wave scattering, at any aspect angle some part of the vehicle lies in the shadow area with weak but non-negligible reflected signals. This ambiguity can interfere with manual annotation. Therefore, to unify the standard, the OBB annotation strategy is based on human visual perception: we label only the salient areas that attract human attention, rather than the vehicle's entire actual footprint inferred from prior knowledge of object resolution and vehicle size, as shown in Figure 3b. This conforms to the annotation rules of computer vision and ensures that models trained on this dataset focus on features in line with human perception.
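As a concrete illustration of this annotation convention, the sketch below (a hypothetical helper, not the authors' tooling) converts a four-point OBB stored in the stated order (clockwise vertices, first point on the left side of the vehicle's front) into a center/size/angle representation:

```python
import numpy as np

def obb_to_cx_cy_w_h_angle(pts):
    """Convert a 4-point OBB (clockwise vertices, first point on the
    left side of the vehicle's front) to (cx, cy, w, h, angle_rad).
    `pts` is a sequence of four (x, y) pixel coordinates."""
    pts = np.asarray(pts, dtype=float)
    cx, cy = pts.mean(axis=0)            # center = centroid of the vertices
    w = np.linalg.norm(pts[1] - pts[0])  # length of the front edge
    h = np.linalg.norm(pts[2] - pts[1])  # length of the side edge
    dx, dy = pts[1] - pts[0]
    angle = np.arctan2(dy, dx)           # orientation of the front edge
    return cx, cy, w, h, angle

# axis-aligned toy box: center (20, 15), 20 x 10 px, angle 0
print(obb_to_cx_cy_w_h_angle([(10, 10), (30, 10), (30, 20), (10, 20)]))
```

The same four vertices, after coordinate transformation into the large image, are what the final DOTA-style labels store.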

Data Harmonization
In fact, after extracting the masks of the vehicles in Chips, we could already embed the masks in Clutters as the foreground. However, prior to this step, it is necessary to harmonize the two kinds of data for the visual harmony of the synthetic image. In the field of image composition, image harmonization aims to adjust the foreground to make it compatible with the background in the composite image [30]. In visible-light image harmonization, traditional methods [31–33] and deep learning-based methods [30,34–36] can already combine the foreground and background almost perfectly in visual terms. However, under the strict imaging mechanism of SAR, pixel brightness corresponds to the intensity of radar echoes, which requires synthetic images not only to look visually harmonious but also to conform to the physical mechanism. Therefore, we propose a domain transfer method that uses the same type of ground objects as prior information to harmonize the synthetic images, conforming to the SAR imaging mechanism as much as possible. Notably, in the following two steps, we apply data harmonization to the raw radar data at high bit depth to obtain more accurate results.

Domain Transfer
Since Chips and Clutters are two different types of data, their distributions and value ranges differ, so it is necessary to unify them reasonably based on their relationship before merging. Since the background is the main body of the synthetic images, we chose to transfer the masks from the Chips domain to the Clutters domain. Based on the satellite map and the information in the source files, we noticed that the background of Chips is dry grassland and that Clutters also contains a large amount of grassland. Both were collected in Huntsville, less than 26 km apart, and in the autumn season, so it can be assumed that the grassland vegetation at the two sites is similar. To validate this assumption, we annotated the grassland in nine Clutters and conducted a data analysis against the grassland in seven kinds of Chips whose collection dates were close to those of the Clutters, using these areas as regions of interest (RoIs). As shown in Table 3, the coefficient of variation calculated with Formula (1) is around 0.6 for both, indicating similar levels of data dispersion. Based on the above analysis, and given the similar distribution of both datasets after being made dimensionless, we linearly mapped the data of Chips into the data space of Clutters. According to Formula (2), we multiplied the Chip data by the ratio coefficient K (K = 1371.8) of the mean values of the grassland in the two RoIs and then rounded the result. Following the pipeline shown in Figure 4a, we computed the histograms of the grassland in the transformed Chips and in Clutters and calculated their cosine similarity (CSIM) according to Formula (3). From Figure 4b, it can be seen that the two data distributions are very similar. In Table 3, the CSIM values for the two grasslands are all above 0.99. Therefore, K can be used as the mapping coefficient from the Chips domain to the Clutters domain, and all Chip data can be harmonized by multiplying by K.
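The three formulas referenced above can be sketched as follows. This assumes the standard definitions implied by the text (coefficient of variation = standard deviation over mean; K = ratio of RoI means; CSIM = cosine similarity of amplitude histograms); the function names and the toy gamma-distributed "grassland" data are ours, for illustration only:

```python
import numpy as np

def coeff_of_variation(x):
    """Formula (1): dispersion of an RoI as std / mean."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()

def ratio_coefficient(clutter_roi, chip_roi):
    """Formula (2): K maps Chip amplitudes into the Clutter data space."""
    return np.mean(clutter_roi) / np.mean(chip_roi)

def cosine_similarity(h1, h2):
    """Formula (3): CSIM between two amplitude histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))

# toy grassland RoIs (real data would be the raw radar amplitudes)
rng = np.random.default_rng(0)
chip_roi = rng.gamma(2.0, 1.0, 10_000)
clutter_roi = chip_roi * 1371.8          # constructed so K matches the paper
K = ratio_coefficient(clutter_roi, chip_roi)
hist_a, _ = np.histogram(chip_roi * K, bins=64, range=(0, clutter_roi.max()))
hist_b, _ = np.histogram(clutter_roi, bins=64, range=(0, clutter_roi.max()))
print(round(K, 1), round(cosine_similarity(hist_a, hist_b), 3))
```

On real data, a CSIM above 0.99 between the transformed Chip grassland and the Clutter grassland is what justifies using K as the domain-mapping coefficient.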

Brightness Uniformity
Schumacher et al. pointed out that the backgrounds and targets of Chips are highly correlated [37,38]. Geng et al. showed experimentally that SAR-ATR models recognize vehicles by treating the brightness of the background as an important feature [39]. For instance, the background of BRDM2 is brighter than that of other vehicle types, which leads the neural network to learn from the training data that "the brighter ones are more likely to be BRDM2" [39]. Thus, the SAR-ATR model cheats by recognizing the associated background to classify the vehicles. We discovered that this phenomenon is due to the nonlinear mapping of the official imaging algorithm, shown in the left column of Table 4. ScaleAdj, in the 11th step of the original algorithm, is determined by the values of the most and least frequent pixels in each image. We found that the mean ScaleAdj of BRDM2 is higher than that of other vehicles. Additionally, the non-uniform ScaleAdj results in a different gray-level transformation for each category of vehicle, and even for each image. Furthermore, for Clutters, the original algorithm produces very dark images. The reason lies in the high dynamic range of the Clutters radar data, with most values being low, and the maximum-minimum stretching in the 3rd step, which assigns low gray values to most of the data.

(Excerpt from the original imaging algorithm in Table 4, steps 9-16:)

    9:  if minPixelCountBin > maxPixelCountBin then
    10:     thresh ← minPixelCountBin - maxPixelCountBin
    11:     scaleAdj ← 255 / thresh
    12:     img ← img * scaleAdj
    13: else
    14:     img ← img * 3
    15: img ← uint8(img)
    16: return img

Therefore, we believe that applying a uniform brightness transformation in the imaging algorithm is an effective way to avoid the two problems above, as shown in the right column of Table 4. The improved imaging algorithm maps the radar amplitude values linearly to image gray values by setting a threshold and a linear transformation. Too high a threshold pools the low-value signals, while too low a threshold loses information from the high-value signals. Therefore, to preserve most of the image details while minimizing the loss of high-value signals, we set the threshold to 511, as 99.8% of the radar amplitudes in Clutters and 95.5% of those in the vehicle masks of Chips are below this value. This approach images the low-value signals linearly and preserves most of the image details without significant loss of the high-value signals.
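The uniform linear mapping with a threshold of 511 can be sketched as follows; the function name and array handling are ours, since the paper specifies only the threshold and the linear transformation:

```python
import numpy as np

def linear_image(amplitude, thresh=511):
    """Improved (right-column) imaging step: clip radar amplitudes at a
    fixed threshold, then map [0, thresh] linearly onto the 8-bit gray
    range. The same transformation is applied to every image, so no
    per-image ScaleAdj brightness cue is introduced."""
    amp = np.clip(np.asarray(amplitude, dtype=float), 0, thresh)
    return np.uint8(amp * 255.0 / thresh)

amps = np.array([0.0, 255.5, 511.0, 4096.0])
print(linear_image(amps))  # low values map linearly; >= 511 saturates at 255
```

Because the gray-level transformation is identical for every image and every category, background brightness can no longer act as a class-specific shortcut feature.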

Embedded Synthesis
Based on OpenCV, our laboratory developed interactive software that can conveniently embed vehicle masks of a specified category or azimuth angle at designated positions in the Clutters background. We follow the basic logic of radar scattering when selecting embedding positions. First, we prevent the overlap of vehicle masks through logical checks at the code level. Second, we avoid placing vehicles on top of tall objects (such as trees or buildings) or in their shadow areas. To achieve a seamless transition between the mask and the background at the edges, a 5 × 5 Gaussian operator is applied as a smoothing filter on the inner and outer rings of the mask edges. To investigate the impact of background objects and corner reflectors on SAR-ATR, we mark the recognition difficulty of vehicles near objects with strong reflection echoes, such as trees or buildings, as 1. Additionally, we embed corner reflectors with a 15° depression angle in Clutters and set the recognition difficulty of vehicles near them to 2. For all other vehicle positions, the recognition difficulty defaults to 0. As shown in Equation (4), the final label format follows the DOTA format [3], with each ground truth including the positions of the four vertices of the rectangle, the category, and the difficulty. The vertex positions of each rectangle are obtained from the rotated bounding box (shown in Figure 3) after coordinate transformation.
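The embedding-with-edge-smoothing step can be sketched as below. This is a dependency-free illustration, not the laboratory's OpenCV tool: a box filter stands in for the 5 × 5 Gaussian operator (cv2.GaussianBlur in practice), and all names are hypothetical.

```python
import numpy as np

def _dilate3(m):
    """Binary dilation of a 0/1 mask with a 3 x 3 structuring element."""
    p = np.pad(m, 1)
    return np.max([p[i:i + m.shape[0], j:j + m.shape[1]]
                   for i in range(3) for j in range(3)], axis=0)

def _erode3(m):
    """Binary erosion with a 3 x 3 structuring element."""
    p = np.pad(m, 1, constant_values=1)
    return np.min([p[i:i + m.shape[0], j:j + m.shape[1]]
                   for i in range(3) for j in range(3)], axis=0)

def _box_blur(img, k=5):
    """Stand-in for the 5 x 5 Gaussian operator (a box filter keeps the
    sketch dependency-free; cv2.GaussianBlur would be used in practice)."""
    p = np.pad(img.astype(float), k // 2, mode='edge')
    out = np.zeros(img.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            out += p[i:i + img.shape[0], j:j + img.shape[1]]
    return out / (k * k)

def embed_with_smooth_edges(background, chip, mask, top_left, k=5):
    """Paste the vehicle mask into the Clutter background, then smooth a
    thin band straddling the mask contour so the transition is seamless."""
    y, x = top_left
    h, w = mask.shape
    out = background.astype(float)            # working copy of the Clutter
    roi = out[y:y + h, x:x + w]
    roi[mask > 0] = chip[mask > 0]            # paste the foreground
    band = _dilate3(mask) - _erode3(mask)     # inner + outer edge rings
    roi[band > 0] = _box_blur(roi, k)[band > 0]
    return np.uint8(np.clip(out, 0, 255))
```

Only the one-pixel rings on either side of the mask contour are filtered, so the vehicle signature and the surrounding clutter are left untouched.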

Analysis of the Dataset
In order to meet the requirements of large-scene detection tasks, we selected 34 of the 100 Clutters that can be stitched together as the test set; the remaining 66 Clutters serve as the training set. For the Chips partition, to create a challenging dataset, we combined one SOC and two EOC division strategies. As shown in Table 5, the first EOC strategy is based on version variants, using BMP2sn-9563 as the training set and BMP2sn-9566 and BMP2sn-c21 as the test sets. The second EOC strategy is based on configuration variants, using a 7:3 fine-grained partitioning of the 11 subtypes of T72. The remaining eight vehicle categories are partitioned with a 7:3 SOC strategy. Similarly, as described in Section 2.4, corner reflectors are embedded at a 7:3 ratio but are not used as detection objects. Finally, the ratio of total vehicles in the training set to the test set also remains 66:34, matching the partition of the Clutters. After partitioning the dataset, we fused Chips and Clutters according to the method described in Figure 2, resulting in 100 images. To simulate a realistic remote sensing application scenario, we stitched the geographically contiguous images in the test set into four large images.
In summary, Mix MSTAR consists of 100 large images with 5392 vehicles in 20 fine-grained categories. The geographically contiguous test set can be stitched into four large images, as shown in Figure 5. The arrangement of vehicles is diverse, with both tight and sparse groupings, and the scenes vary across urban areas, highways, grassland, and forest.
As shown in the data analysis in Figure 6, the vehicle orientations are fairly uniformly distributed over [0, 2π), and the vehicle areas fluctuate with azimuth angle, with different vehicles having different sizes. The aspect ratios of the vehicles range from 1 to over 3. According to the definition of object sizes in the COCO regulation [40], over 98% of the vehicles are small objects, which requires detection algorithms to have good small-object detection capabilities. The number of vehicles per Clutter is also uneven, ranging from 1 to over 90, so detection algorithms need to be robust to uneven sample distribution.
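For reference, the COCO size convention mentioned above can be applied to an OBB as in the illustrative helper below (COCO formally uses segmentation area; here we take the rotated-rectangle area w × h, with the usual 32² and 96² pixel thresholds):

```python
import numpy as np

def coco_size_class(obb_pts):
    """Classify an OBB by the COCO size convention using the
    rotated-rectangle area: small < 32**2, medium < 96**2, else large."""
    pts = np.asarray(obb_pts, dtype=float)
    w = np.linalg.norm(pts[1] - pts[0])   # front-edge length in pixels
    h = np.linalg.norm(pts[2] - pts[1])   # side-edge length in pixels
    area = w * h
    if area < 32 ** 2:
        return 'small'
    if area < 96 ** 2:
        return 'medium'
    return 'large'

# a 30 x 12 px vehicle footprint -> area 360 px^2, a small object
print(coco_size_class([(0, 0), (30, 0), (30, 12), (0, 12)]))  # -> small
```

Under this rule, nearly all Mix MSTAR vehicles fall below the 32² threshold, which is why small-object capability dominates detector performance on this dataset.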

Results
After constructing Mix MSTAR, nine benchmark models are selected in this section to evaluate the performance of mainstream rotated object detection algorithms on the dataset.

Models Selected
In the field of deep learning, the types of detectors can be roughly divided into single-stage, refinement-stage, two-stage, and anchor-free algorithms.
The single-stage algorithm directly predicts the class and bounding box coordinates of objects from the feature maps. It tends to be computationally more efficient, albeit at the potential cost of less precise localization.
The refinement-stage algorithm adds a supplementary step to the detection pipeline to enhance the precision of the detected bounding box coordinates. It refines the spatial dimensions of bounding boxes via a series of regressors that learn to make small iterative corrections towards the ground-truth box, thereby improving object localization.
The two-stage algorithm operates on the principle of separating object localization from classification. First, it generates region proposals through its region proposal network (RPN) based on the input images. Then, these proposals are passed to the second stage, where the actual detection takes place, discerning the object class and refining the bounding boxes. Due to this two-step process, these algorithms tend to be more accurate but slower.
Unlike traditional algorithms, which leverage anchor boxes as prior knowledge for object detection, anchor-free algorithms directly predict the object's bounding box without relying on predetermined anchors. They circumvent drawbacks such as choosing the optimal scale, ratio, and number of anchor boxes for different datasets and tasks. Furthermore, they simplify the object detection pipeline and have been successful in certain contexts on both efficiency and accuracy fronts.
To make the evaluation results more convincing, the nine algorithms cover the four kinds of algorithms mentioned above.

Rotated Retinanet
Retinanet [41] argues that the core reason single-stage detectors underperform two-stage models is the extreme foreground-background imbalance during training. To address this, the Focal Loss was proposed, which adds two weights to the binary cross-entropy loss to balance the importance of positive and negative samples. It also reduces the emphasis on easy samples so that training focuses on hard negatives. Retinanet is the first single-stage model whose accuracy surpasses that of two-stage models. Based on it, Rotated Retinanet predicts an additional angle in the regression branch (x, y, w, h, θ) without other modifications. The network architecture can be seen in Figure 7.
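The two weights described above can be sketched in a few lines. The following is a minimal, illustrative implementation of the binary Focal Loss (function name and defaults are ours; α = 0.25 and γ = 2 are the values commonly reported for Retinanet):

```python
import math

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal Loss for a single prediction.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground-truth label, 1 for foreground, 0 for background
    alpha balances positive vs. negative samples, and the modulating
    factor (1 - p_t)**gamma down-weights easy samples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified background sample contributes far less loss
# than a hard negative, which keeps training focused on hard samples.
easy = binary_focal_loss(0.01, 0)   # p_t = 0.99, nearly zero loss
hard = binary_focal_loss(0.90, 0)   # p_t = 0.10, large loss
```

With γ = 0 and α = 0.5 the expression reduces to a scaled binary cross-entropy, which shows that Focal Loss is a strict generalization of it.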

S 2 A-Net
S 2 A-Net [42] is a refinement-stage model that proposes the Feature Alignment Module (FAM) on the basis of improved deformable convolution (DCN) [43]. The network architecture is shown in Figure 8. In the refinement stage, the horizontal anchor is refined to a rotated anchor by the Anchor Refinement Network (ARN), a learnable offset-field module that is directly supervised by box annotations. Next, the feature map within the anchor is aligned and then convolved with the Alignment Convolution. This method eliminates low-quality, heuristically defined anchors and addresses the misalignment between anchor boxes and the axis-aligned features they induce.

R 3 Det
R 3 Det [44] is a refinement-stage model that proposes the Feature Refinement Module (FRM) for reconstructing the feature map according to the refined bounding box. Each point in the reconstructed feature map is obtained by adding five feature vectors taken, after interpolation, at five points of the refined bounding box (the four corner points and the center point). FRM can alleviate the feature misalignment problems that exist in refined single-stage detectors and can be added multiple times for better performance. Additionally, an approximate SkewIoU loss is proposed, which better reflects the real SkewIoU loss while maintaining differentiability. The network architecture is depicted in Figure 9.

Oriented RCNN
Oriented RCNN [46] is built upon Faster RCNN [2] and proposes an efficient oriented RPN. The overall framework is shown in Figure 11. The oriented RPN uses a novel six-parameter mid-point offset representation to encode the offsets of the rotated ground truth relative to the horizontal anchor box and generate a quadrilateral proposal. Compared with RRPN [47], it avoids the huge computational cost of presetting a large number of rotated anchor boxes. Compared to ROI Transformer, it converts horizontal anchor boxes into oriented proposals in a single step, greatly reducing the parameter count of the RPN. This efficient, high-quality oriented proposal network makes Oriented RCNN both accurate and fast.
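The mid-point offset representation above can be sketched as a small decoding function. The sliding directions and parameter names (da, db) are our assumptions for illustration; they follow the scheme described in the text, not the authors' released code:

```python
def decode_midpoint_offset(x, y, w, h, da, db):
    """Decode the six-parameter mid-point offset representation into the
    four vertices of a quadrilateral proposal.

    (x, y, w, h) is the horizontal (axis-aligned) box; da and db slide
    the mid-points of its top and right sides, and the other two
    vertices are placed symmetrically about the box centre, so only two
    extra parameters are needed on top of the horizontal box.
    """
    v1 = (x + da, y - h / 2)  # slid mid-point of the top side
    v2 = (x + w / 2, y + db)  # slid mid-point of the right side
    v3 = (x - da, y + h / 2)  # bottom side, symmetric to v1
    v4 = (x - w / 2, y - db)  # left side, symmetric to v2
    return [v1, v2, v3, v4]

# With zero offsets the quadrilateral degenerates into the diamond whose
# vertices are the side mid-points of the horizontal box; its centroid
# always stays at (x, y) by construction.
quad = decode_midpoint_offset(0.0, 0.0, 4.0, 2.0, 0.0, 0.0)
```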

Gliding Vertex
Gliding Vertex [48] introduces a robust OBB representation that addresses the limitations of predicting vertices and angles directly. Specifically, on the regression branch of RCNN, four extra length-ratio parameters slide the corresponding vertex along each side of the horizontal bounding box. This avoids the order-confusion problem of directly predicting the positions of the four vertices and mitigates the high sensitivity caused by predicting the angle. Additionally, in a divide-and-conquer spirit, an area-ratio parameter r predicts the obliquity of the bounding box; it guides the regression to use either the horizontal bounding box or the OBB representation, resolving the ambiguity of nearly horizontal objects. The network architecture is shown in Figure 12.
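The gliding operation can be sketched as follows. The function name and the assumed sliding directions (clockwise from the top-left corner) are ours for illustration and may differ from the authors' convention:

```python
def glide_vertices(xmin, ymin, xmax, ymax, a1, a2, a3, a4):
    """Turn a horizontal bounding box plus four length ratios into an
    oriented quadrilateral, in the spirit of Gliding Vertex.

    Each ratio in [0, 1] slides one vertex along one side of the
    horizontal box: a1 along the top (from the top-left corner), a2 down
    the right side, a3 along the bottom (from the bottom-right corner),
    and a4 up the left side.
    """
    w, h = xmax - xmin, ymax - ymin
    v1 = (xmin + a1 * w, ymin)  # slides right along the top side
    v2 = (xmax, ymin + a2 * h)  # slides down the right side
    v3 = (xmax - a3 * w, ymax)  # slides left along the bottom side
    v4 = (xmin, ymax - a4 * h)  # slides up the left side
    return [v1, v2, v3, v4]

# All-zero ratios recover the horizontal box itself, which is exactly
# why the extra obliquity parameter r is needed: a near-horizontal
# object and a horizontal one would otherwise be indistinguishable.
rect = glide_vertices(0, 0, 4, 2, 0, 0, 0, 0)
tilted = glide_vertices(0, 0, 4, 2, 0.5, 0.5, 0.5, 0.5)
```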


ReDet
ReDet [49] argues that regular CNNs are not equivariant to rotation and that rotated data augmentation, or RRoI Align, can only approximate rotation invariance. To address this, ReDet uses e2cnn theory [50] to design a new rotation-equivariant backbone, ReResNet, based on ResNet [1]. The new backbone features a higher degree of rotation weight sharing, allowing it to extract rotation-equivariant features. Additionally, the paper proposes Rotation-Invariant RoI Align, which warps on the spatial dimension and then circularly switches channels to interpolate and align on the orientation dimension, producing completely rotation-invariant features. The overall network architecture can be seen in Figure 13.

Rotated FCOS
FCOS [51] is an anchor-free, one-stage detector with a fully convolutional design. Unlike traditional detectors, FCOS eliminates the need for preset anchors, thereby avoiding complex anchor operations, sensitive heuristic hyperparameter settings, and the large number of parameters and calculations associated with anchors. FCOS uses the four distances (l, r, t, and b) between a feature point and the four sides of the bounding box as the prediction format. The distance between the feature point and the box center is used to measure the bounding box's center-ness, which is then multiplied by the classification score to obtain the final confidence. Multi-level prediction based on FPN [52] alleviates the influence of overlapping ambiguous samples. Rotated FCOS is a re-implementation of FCOS for rotated object detection that adds an angle branch parallel to the regression branch. The network architecture is shown in Figure 14.
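The center-ness described above has a closed form in the FCOS paper, which can be sketched directly from the four predicted distances:

```python
import math

def centerness(l, r, t, b):
    """FCOS centre-ness of a feature point, given its distances to the
    left/right (l, r) and top/bottom (t, b) sides of the axis-aligned
    box: 1.0 at the exact centre, approaching 0 near an edge."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

# A centred point scores 1.0; a point near a corner scores far less,
# which suppresses low-quality boxes when the score is multiplied into
# the classification confidence.
c_center = centerness(10, 10, 10, 10)
c_corner = centerness(1, 19, 1, 19)
final_confidence = 0.9 * c_corner  # cls score x centre-ness
```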

Oriented RepPoints
Based on RepPoints [53], Oriented RepPoints [54] summarizes three ways of converting a point set into an OBB, making it suitable for detecting aerial objects. Inherited from RepPoints, Oriented RepPoints combines DCN [43] with anchor-free keypoint detection, enabling the model to extract non-axis-aligned features from an aerial perspective. To constrain the spatial distribution of point sets, the proposed spatially constrained loss confines vulnerable outliers within their instance owner and uses GIoU [55] to quantify localization loss. Additionally, the proposed Adaptive Points Assessment and Assignment adopts four metrics to evaluate the quality of learned point sets and uses them to select positive samples. The network architecture is shown in Figure 15.

Evaluation Metrics
In rotated object detection, both the ground-truth position of an object and the bounding box predicted by the model are oriented bounding boxes. As in generic object detection, rotated object detection uses Intersection over Union (IoU) to measure the quality of a predicted bounding box:

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)

In the classification stage, comparing the predicted bounding boxes with the ground truth yields four outcomes: True Positives (TP), True Negatives (TN), False Negatives (FN), and False Positives (FP). Whether a predicted bounding box is a TP is judged by its IoU with the ground truth: if the IoU exceeds the threshold, it is a TP; otherwise, it is an FP. In remote sensing object detection, the IoU threshold is generally set to 0.5, as in this paper. Precision and recall are formulated as follows:

precision = TP / (TP + FP)

recall = TP / (TP + FN) (7)

Based on precision and recall, AP is defined as the area under the precision-recall (P-R) curve, while Mean Average Precision (mAP) is defined as the mean of AP values across all classes. The F1 score is the harmonic mean of precision and recall, which is defined as:

F1 = 2 × precision × recall / (precision + recall)
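The AP computation described above (area under the P-R curve) can be sketched as follows. This is an illustrative all-point-interpolation version, not the exact VOC/DOTA evaluation code; the IoU matching against ground truth is assumed to be done beforehand and encoded as a TP/FP flag per detection:

```python
def average_precision(preds, num_gt):
    """AP as the area under the precision-recall curve for one class.

    preds: list of (confidence, is_tp) for every predicted box, where
    is_tp already encodes the IoU > 0.5 match against the ground truth.
    num_gt: number of ground-truth boxes of this class (TP + FN).
    """
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for _, is_tp in preds:
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # make the precision envelope monotonically decreasing, then
    # integrate precision over recall
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))

# Two detections (one TP, one FP) against two ground-truth boxes:
# recall tops out at 0.5, so AP cannot exceed 0.5.
ap = average_precision([(0.9, True), (0.8, False)], num_gt=2)
```

mAP is then simply the mean of `average_precision` over the 20 classes, and the F1 score is computed from the aggregate TP/FP/FN counts.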
All models in this article are implemented with the MMRotate [56] framework. For fair comparison, the backbone of each detector is ResNet50 [1], pretrained on ImageNet [57] by default, and the neck is FPN [52]. Each image in the train set was cropped into 4 pieces of 1024 × 1024, and the four large-scene images in the test set were split into a series of 1024 × 1024 patches with a stride of 824, following the setting of DOTA [3]. To present the performance of each detector on Mix MSTAR as fairly as possible, we simply follow these settings without additional embellishments: data augmentation used a random flip with a probability of 0.25 on the horizontal, vertical, or diagonal axis. Each model was trained for 180 epochs with 2 images per batch. The optimizer was SGD with an initial learning rate of 0.005, momentum of 0.9, and weight decay of 1 × 10^−4. The L2 norm was adopted for gradient clipping, with the maximum gradient set to 35. The learning rate decayed by a factor of 10 at the 162nd and 177th epochs. Linear warmup was used for the first 500 iterations, with the initial warmup learning rate set to 1/3 of the initial learning rate. The mAP and its standard error for all models were obtained by training the network with three different random seeds. The final result is obtained by mapping the predictions on each patch back to the large image and applying NMS. More details can be found in our log files.
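The sliding-window split described above can be sketched per axis. This is a minimal illustration of the DOTA-style tiling (function name is ours), with the last window shifted back so it ends flush with the image border:

```python
def patch_origins(size, patch=1024, stride=824):
    """Top-left offsets of the sliding windows used to split one axis of
    a large image: 1024-pixel windows with a stride of 824 (200-pixel
    overlap); the final window is clamped to end at the image border."""
    if size <= patch:
        return [0]
    origins = list(range(0, size - patch, stride))
    origins.append(size - patch)  # last window flush with the edge
    return origins

# For a typical 1478 x 1784 Mix MSTAR image this yields a 2 x 2 grid of
# overlapping patches; detections on each patch are mapped back to the
# large image and merged with NMS.
cols = patch_origins(1784)
rows = patch_origins(1478)
```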

Result Analysis on Mix MSTAR
The evaluation results of the nine models on Mix MSTAR are shown in Table 6, and the class-wise AP results in Table 7. Note that in Table 6, precision, recall, and F1-score are calculated from the statistics of TP, FP, and FN over all categories. Combining the results with the previous analysis of the models and the dataset, we draw the following conclusions:
1. In terms of the mAP metric, Oriented RepPoints achieved the best accuracy, which we attribute to its unique proposal approach based on sampling points. This approach successfully combines deformable convolution with non-axis-aligned feature extraction. Additionally, being a two-stage model, its feature extraction is more accurate. Compared to other refinement-stage models, it has more sampling points, up to 9, which makes the extracted features more comprehensive. However, the heavy use of deformable convolution makes its training slow. The two-stage models perform better than the single-stage networks thanks to the initial screening of the RPN. However, the performance of Gliding Vertex is average, which may be due to its not using oriented proposals in the first stage, resulting in inaccurate feature extraction. ReDet performs poorly, possibly because its rotation-invariant network is not suitable for SAR images with a low depression angle. Mix MSTAR is simulated at a depression angle of 15°, and the shadow areas are quite large, leading to significant imaging differences for the same object under different azimuth angles. For example, rotating a vehicle image captured at azimuth θ by α degrees produces an image significantly different from the image of the same vehicle captured at (θ + α) degrees, which may cause ReResNet to extract incorrect rotation-invariant features. Compared to single-stage models, refinement-stage models show a significant performance improvement, suggesting that they extract the non-axis-aligned features of rotated objects more accurately, narrowing the gap to two-stage models. The performance of R 3 Det is slightly inferior and similar to ReDet; the reason may lie in the sampling points of its refinement stage, which are fixed at the four corners and the center point. In low-depression-angle SAR images, the vertex far from the radar sensor is necessarily shaded, so the features sampled at that point interfere with the overall feature expression. S 2 A-Net uses deformable convolution with learnable sampling positions; although there is still a probability of sampling the shaded vertices, its nine sampling points dilute the influence of their features;
2. In terms of speed, Rotated FCOS performs best, benefiting from its anchor-free design and fully convolutional structure. Its parameter count and computation are both lower than those of Rotated Retinanet. In contrast, the other models use deformable convolution, the non-conventional Alignment Convolution, or non-fully-convolutional structures, making them relatively slow. Due to its special rotation-equivariant convolution, ReDet has the slowest inference speed, even though its parameter count and computation are the lowest. In terms of parameters, the two anchor-free models and the single-stage model have fewer parameters than the others. The RPN of the ROI Transformer requires two stages to extract the rotated ROI, so it has the most parameters. In terms of computation, the multi-head detection head of the single-stage model is cumbersome, so its computation is not significantly lower than that of the two-stage models. However, Mix MSTAR is a small-target dataset, with most ground-truth widths below 32 pixels; after five downsamplings, the localization information is lost. A better balance might be obtained by optimizing the regression subnetworks of the layers whose downsampling factor exceeds 32;
3. In terms of precision and recall, all networks tend to maintain high recall. Since using inter-class NMS limits the recall integration range of mAP, inter-class NMS is disabled, as in DOTA, but this results in lower precision. Among the models, ROI Transformer achieved a balance between precision and recall and obtained the highest F1 score;
4. From the results in Table 7, it is evident that the fine-grained classification of the T72 tank is poor and affects all detectors considerably. Figure 16a further illustrates this point: the confusion matrix of Oriented RepPoints shows a considerable number of FPs assigned to wrong subtypes of the T72 tank, and similar cross-category confusion appears between BTR70 and BTR60, 2S1 and T62, and T72 and T62. Another notable observation is the poor detection of BMP2 under EOC: many BMP2 subtypes that did not appear in the train set were mistaken for other vehicles in testing. Figure 16b depicts the P-R curves of all detectors;
5. Figure 17 presents the detection results of three detectors on the same image. The localization of the vehicles is accurate, but the recognition accuracy is not high, with a small number of false positives and misses. Additionally, we discovered two unknown vehicles in the scene, which were hidden in the Clutters and did not belong to the Chips. One was recognized as T62 by all three models, while the other was classified as background, possibly because its area is significantly larger than that of the vehicles in Mix MSTAR. This indicates that a model trained on Mix MSTAR has the ability to recognize real vehicles.

Discussion
For a synthetic dataset that aims to become a detection benchmark, both fidelity and effectiveness are essential. However, producing Mix MSTAR requires manually extracting vehicles from Chips and fusing radar data collected in different modes before generating the final image. Thus, two potential problems in this process may affect the visual effectiveness of the synthetic images:

• Artificial traces: The manually extracted vehicle masks can alter the contour features of the vehicles and leave artificial traces in the synthetic images. Even though Gaussian smoothing was applied to reduce this effect on the vehicle edges, theoretically these traces could still be exploited by overfitting models to identify targets;
• Data variance: The vehicle and background data in Mix MSTAR were collected under different operating modes. Although we harmonized the data amplitude based on reasonable assumptions, Chips were collected in spotlight mode, while Clutters used strip mode. The two scanning modes can cause variances in the image style (particularly the spatial distribution) of the foreground and background in the synthetic images. This could lead detection models to find cheating shortcuts due to the non-realistic effects of the synthetic images, failing to extract common image features.
To address these concerns, we designed two separate experiments to demonstrate the reliability of the synthetic dataset.
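The pixel-level mixing with Gaussian-smoothed mask edges mentioned above can be sketched in one dimension for intuition. This is an illustrative stand-in (kernel size, σ, and function names are our assumptions), not the actual dataset-generation code:

```python
import math

def gaussian_kernel(radius=2, sigma=1.0):
    """1-D Gaussian kernel, normalised to sum to 1."""
    k = [math.exp(-(i * i) / (2 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth_row(mask_row, kernel):
    """Convolve one row of a binary vehicle mask with the kernel
    (border pixels clamped), yielding a soft alpha in [0, 1]."""
    r = len(kernel) // 2
    n = len(mask_row)
    return [sum(kernel[j + r] * mask_row[min(max(i + j, 0), n - 1)]
                for j in range(-r, r + 1)) for i in range(n)]

def blend_row(chip_row, clutter_row, alpha_row):
    """Pixel-level mix of target chip and clutter background."""
    return [a * c + (1 - a) * b
            for a, c, b in zip(alpha_row, chip_row, clutter_row)]

# A hard 0/1 mask edge becomes a soft transition, so the pasted vehicle
# fades into the clutter instead of leaving a sharp artificial contour.
kernel = gaussian_kernel()
alpha = smooth_row([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], kernel)
mixed = blend_row([200.0] * 10, [40.0] * 10, alpha)
```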


The Artificial Traces Problem
To address the potential problem of artificial traces and to prove the fidelity of the synthetic dataset, we used a model trained on Mix MSTAR to detect intact vehicle images. We randomly selected 25 images from the Chips and padded them to 204 × 204 to preserve their original scale. These images were then stitched into a single 1024 × 1024 image, which was fed to the ROI Transformer trained on Mix MSTAR. As shown in Figure 18a, all of these intact vehicles were accurately localized, with a classification accuracy of 80%.
However, 80% accuracy is not an ideal result, given that the background in Chips is quite simple; moreover, the five misidentified vehicles were all subtypes of T72. As a comparison experiment, we trained and tested ResNet18 as a classification model on the 20 classes of MSTAR Chips, following the same partition strategy as Mix MSTAR, and the classifier easily achieved 92.22% accuracy. However, class activation maps [58] revealed that, because each type of vehicle in MSTAR was captured at different angles but at the same location, the high correlation among the Chip backgrounds causes the classifier to focus more on the terrain than on the vehicles themselves. As shown in Figure 19, two subtypes of T72 were identified based on their tracks and unusual vegetation, with recognition rates of 98.77% and 100%, respectively, whereas the two T72 subtypes that did not benefit from background correlation reached only 73.17% and 66.67%. This phenomenon also appeared in other vehicle types, indicating that training results obtained with background-correlated Chips are actually unreliable.
Through the detection of intact vehicles in real images, we have shown that the artificial traces generated during mask extraction did not affect the models. On the contrary, benefiting from mask extraction and background transfer, Mix MSTAR eliminates background correlation, allowing models trained on the high-fidelity synthetic images to focus on vehicle features such as shadows and bright spots, as shown in Figure 18b.
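The chip-stitching step used in this experiment can be sketched as follows. This is an illustrative NumPy reconstruction; the center-padding and row-major layout are our assumptions, not details specified in the paper.

```python
import numpy as np

def stitch_chips(chips, chip_size=204, grid=5, canvas=1024):
    """Center-pad each chip to chip_size x chip_size and tile the chips
    row-major into a grid x grid mosaic on a canvas x canvas image."""
    out = np.zeros((canvas, canvas), dtype=np.float32)
    for i, chip in enumerate(chips[:grid * grid]):
        h, w = chip.shape
        padded = np.zeros((chip_size, chip_size), dtype=np.float32)
        t, l = (chip_size - h) // 2, (chip_size - w) // 2
        padded[t:t + h, l:l + w] = chip
        r, c = divmod(i, grid)
        out[r * chip_size:(r + 1) * chip_size,
            c * chip_size:(c + 1) * chip_size] = padded
    return out

# 25 toy 128 x 128 chips stitched into one 1024 x 1024 mosaic.
chips = [np.ones((128, 128), dtype=np.float32) for _ in range(25)]
mosaic = stitch_chips(chips)
```

Padding each chip to 204 × 204 rather than resizing it keeps every vehicle at its native resolution, so the detector sees targets at the same scale as in Mix MSTAR.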

The Data Variance Problem
To address the potential data variance problem and demonstrate the authentic detection capability of models trained on Mix MSTAR, we designed the following experiment to prove the effectiveness of Mix MSTAR. The real dataset, Mini SAR, was used to train and evaluate models with and without pretraining on Mix MSTAR. For the pretrained models, we froze the weights of the first stage of the backbone, forcing the network to extract features in the same way as it does with synthetic images. The non-pretrained models were initialized with ImageNet weights as a regular setting. We selected nine images containing vehicles as the dataset: seven for training and two for validation. The images were divided into 1024 × 1024 tiles with a stride of 824. Since the dataset is very small, the training process for each network was unstable. We therefore extended training to 240 epochs, recorded the mAP on the validation set after each epoch, and reduced the learning rate by a factor of 10 at the 160th and 220th epochs, with all other settings consistent with those in the Mix MSTAR experiments. It is worth noting that no single unified training setting fits all detectors, due to their different feature extraction capabilities and their propensity for overfitting on the small dataset. Thus, we record the best validation results during training in Table 8.
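The backbone-freezing setup described above can be illustrated with a short PyTorch sketch. The three-stage `nn.Sequential` below is a toy stand-in for the detectors' actual ResNet backbone; only the idea of freezing the first stage and optimizing the rest is taken from the paper.

```python
import torch
import torch.nn as nn

# Toy backbone standing in for a ResNet: stage 1 is frozen after pretraining
# on Mix MSTAR, while later stages remain trainable on Mini SAR.
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU()),   # stage 1 (frozen)
    nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),  # stage 2
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()),  # stage 3
)

# Freeze the first stage so it keeps the low-level feature extractor
# learned from the synthetic data.
for p in backbone[0].parameters():
    p.requires_grad = False

# Only unfrozen parameters are handed to the optimizer.
trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.005, momentum=0.9)
```

Freezing the first stage forces the fine-tuned network to reuse the general SAR features learned during pretraining, which is what makes the Table 8 comparison a test of transferability rather than of re-learning from scratch.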
Firstly, as shown in Table 8, all models improved after being pretrained on Mix MSTAR. Since the weights of the first stage are frozen after pretraining, this indicates that the models effectively learn how to extract general low-level features from SAR images. Secondly, since the validation set contains only two images, the results of the non-pretrained models were very unstable, but the standard errors of all models were significantly reduced after pretraining on Mix MSTAR. Additionally, as shown in Figure 20, the pretrained models showed very rapid loss reduction during training. After a few epochs, their accuracy on the validation set increased significantly and ultimately reached a relatively stable result (see Figure 21). In contrast, the loss and mAP of the non-pretrained models remained unstable.
We noticed that Rotated RetinaNet and Rotated FCOS are very sensitive to the random seed initialization, making them prone to training failure. This may be due to the weak feature extraction ability of single-stage detectors, which makes it difficult for them to learn effective feature extraction from a small quantity of data. Therefore, we conducted a comparison experiment in which we added the Mix MSTAR train set to the Mini SAR train set to increase the data size when training the non-pretrained models. As shown in Table 9, both single-stage models obtained significant improvements after mixed training on the two datasets. As seen in Figure 22, pretraining on Mix MSTAR or mixed training with Mix MSTAR both increased the recall and precision of the models, yielding more accurate bounding box regression.
Based on the above comparison experiments on real data, we have demonstrated the effectiveness of Mix MSTAR, indicating that synthetic data can also help networks learn how to extract features from real SAR images, thereby proving the effectiveness and transferability of Mix MSTAR. In addition, the experiment shows that the unstable Mini SAR is not suitable as a benchmark dataset for algorithm comparison, especially for single-stage models, and verifies that Mix MSTAR is effective in addressing the problem of insufficient real data for SAR vehicle detection.
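The 1024 × 1024 tiling with stride 824 used to prepare the Mini SAR images can be sketched as follows. This is a minimal reconstruction; the paper does not specify how border windows are handled, so shifting the last window back to the image edge is our assumption.

```python
def tile_origins(height, width, tile=1024, stride=824):
    """Top-left corners of tile x tile crops covering an image.
    A final window is shifted back to the border so no pixels are missed."""
    def starts(size):
        positions = list(range(0, max(size - tile, 0) + 1, stride))
        if positions[-1] + tile < size:  # ensure full coverage at the border
            positions.append(size - tile)
        return positions
    return [(y, x) for y in starts(height) for x in starts(width)]

# Example: a 2048 x 1024 image yields three vertically overlapping tiles.
origins = tile_origins(2048, 1024)
```

With a stride of 824 and a tile of 1024, consecutive windows overlap by 200 pixels, so vehicles cut by one tile boundary appear whole in the neighboring tile.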

Potential Application
As more and more creative work leverages synthetic data to advance human understanding of the real world, Mix MSTAR, as the first public multi-class SAR vehicle detection dataset, has many potential applications. Here, we envision two potential use cases:
• SAR image generation: While mutual conversion between optical and SAR imagery is no longer a groundbreaking achievement, current style transfer methods between visible light and SAR are primarily used for low-resolution terrain classification [59]. Given the scarcity of high-resolution SAR images and the abundance of high-resolution labeled visible-light images, a promising avenue is to combine the two to generate more synthetic SAR images, addressing the lack of labeled SAR data and ultimately improving real SAR object detection. Although synthetic images obtained in this way cannot be used for model evaluation, they can help a detection model obtain stronger localization ability on real SAR objects through pretraining or mixed training. Figure 23 demonstrates an example of using CycleGAN [60] to transfer vehicle images from the DOTA domain to the Mix MSTAR domain;
• Out-of-distribution detection: Out-of-distribution (OOD) detection aims to detect test samples drawn from a distribution different from the training distribution [61]. Using a model trained on synthetic images to classify real images was regarded as a challenging problem in SAMPLE [25]. Unlike visible-light imagery, SAR imaging is heavily influenced by sensor operating parameters, resulting in significant stylistic differences between images captured under different conditions. Our experiments found that current models generalize poorly across different SAR datasets. If re-annotation and retraining were required for every new dataset, the cost would increase significantly, exacerbating the scarcity of SAR imagery and limiting the application scenarios of SAR-ATR. Therefore, using the limited labeled datasets to detect more unlabeled data is an important research direction. We used the ReDet model trained on Mix MSTAR to detect real vehicles in an image from FARAD KA BAND. Due to resolution differences, three vehicles were detected after applying multi-scale test techniques, as shown in Figure 24.
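The multi-scale test mentioned above can be sketched as follows. This is a simplified illustration: `detect_fn` is a hypothetical stand-in for a trained detector returning axis-aligned boxes, and a real pipeline would also merge overlapping detections (e.g., with rotated NMS) after collecting boxes from all scales.

```python
import numpy as np

def simple_resize(img, new_h, new_w):
    """Nearest-neighbor resize, enough to illustrate the scaling step."""
    ys = np.arange(new_h) * img.shape[0] // new_h
    xs = np.arange(new_w) * img.shape[1] // new_w
    return img[np.ix_(ys, xs)]

def multi_scale_detect(image, detect_fn, scales=(0.5, 1.0, 2.0)):
    """Run detect_fn (a hypothetical callable returning (N, 5) arrays of
    [x1, y1, x2, y2, score]) at several scales and map the boxes back
    to the original resolution."""
    h, w = image.shape[:2]
    collected = []
    for s in scales:
        resized = simple_resize(image, int(round(h * s)), int(round(w * s)))
        boxes = np.asarray(detect_fn(resized), dtype=np.float64)
        if boxes.size:
            boxes = boxes.copy()
            boxes[:, :4] /= s  # undo the resize on the coordinates
            collected.append(boxes)
    return np.vstack(collected) if collected else np.zeros((0, 5))

# Toy detector: reports one box covering the whole (resized) image.
def detect_fn(img):
    return np.array([[0.0, 0.0, img.shape[1], img.shape[0], 0.9]])

dets = multi_scale_detect(np.zeros((16, 16)), detect_fn)
```

Rescaling the test image lets a detector trained at one ground resolution see targets from a differently sampled sensor at a familiar size, which is why multi-scale testing helps bridge the resolution gap to FARAD KA BAND imagery.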

Conclusions
This research released a large-scale synthetic SAR image dataset for multi-class rotated vehicle detection and proposed a paradigm for realistically fusing SAR data from different domains. Upon evaluating nine benchmark detectors, we found that fine-grained classification makes Mix MSTAR highly challenging, with considerable room for improving object detection performance. Additionally, to address concerns over potential artificial traces and data variance in synthetic data, we conducted two experiments demonstrating the fidelity and effectiveness of Mix MSTAR. Finally, we summarized two potential applications of Mix MSTAR and call on the community to enhance communication and cooperation in SAR data sharing to alleviate the scarcity of data and promote the development of SAR.


Figure 1 .
Figure 1. Three data generation methods around MSTAR. (a) Some sample pictures based on GANs; (b) some sample pictures from SAMPLE [25] based on CAD 3D modeling and electromagnetic calculation simulation; (c) a sample picture based on background transfer.

Figure 2 .
Figure 2. The pipeline for constructing the synthetic dataset.

Figure 3 .
Figure 3. Vehicle segmentation label, containing a mask of the vehicle and its shadow and a rotated bounding box of its visually salient part. (a) The label of the vehicle when the boundary is relatively clear; (b) the label of the vehicle when the boundary is blurred.

Figure 4 .
Figure 4. (a) The pipeline for extracting grass and calculating the cosine similarity; (b) the histogram of the grass in Chips and Clutters.

Figure 6 .
Figure 6. Data statistics for Mix MSTAR. (a) The area distribution of different categories of vehicles; (b) histogram of the number of annotated instances per image; (c) the number of vehicles in different azimuths; (d) the length-width distribution and aspect ratio distribution of vehicles.

Figure 7 .
Figure 7. The architecture of the Rotated RetinaNet.

Figure 8 .
Figure 8. The architecture of S²A-Net.

3.1.3. R³Det
R³Det [44] is a refinement-stage model that proposes the Feature Refinement Module (FRM) for reconstructing the feature map according to the refined bounding box. Each point in the reconstructed feature map is obtained by adding five interpolated feature vectors taken at five points (the four corner points and the center point of the refined bounding box). FRM can alleviate the feature misalignment problem that exists in refined single-stage detectors.

Figure 9 .
Figure 9. The architecture of R³Det.

3.1.4. ROI Transformer
ROI Transformer [45] is a two-stage model that adds a learnable module transforming horizontal RoIs (HRoIs) into rotated RoIs (RRoIs). It generates HRoIs from a small number of horizontal anchors and proposes RRoIs via the offset of the rotated ground truth relative to the HRoI. This operation eliminates the need to preset a large number of rotated anchors at different angles for directly generating RRoIs. In the next step, the proposed Rotated Position-Sensitive RoI Align extracts rotation-invariant features from the feature map and RRoIs to enhance subsequent classification and regression. The study also examines the benefit of retaining appropriate context in RRoIs for enhancing the detector's performance. The network architecture of ROI Transformer is shown in Figure 10.

Figure 12 .
Figure 12. The architecture of the Gliding Vertex.

Figure 16 .
Figure 16. (a) Confusion matrix of Oriented RepPoints on Mix MSTAR; (b) P-R curves of models on Mix MSTAR.

Figure 17 .
Figure 17. Some detection results of different models on Mix MSTAR. (a) Ground truth; (b) result of S²A-Net; (c) result of ROI Transformer; (d) result of Oriented RepPoints.

Figure 18 .
Figure 18. (a) The result of the ROI Transformer on concatenated Chips; (b) class activation map of concatenated Chips.

Figure 23 . Figure 24 .
Figure 23. The style transfer between optical and SAR images using CycleGAN. (a) An optical car image with a label from the DOTA domain; (b) the transferred image in the Mix MSTAR domain.

Table 1 .
Detailed information on existing public SAR vehicle datasets with large scenes.

Table 2 .
Basic radar parameters of Chips and Clutters in MSTAR.

Table 3 .
Analysis of grassland data from Chips and Clutters in the same period.

Table 4 .
Original imaging algorithm and improved imaging algorithm.

Table 5 .
The division of Mix MSTAR.

Table 6 .
Performance evaluation of models on Mix MSTAR.
¹ The bold format represents the best indicator; the following tables are the same.

Table 7 .
AP50 of each category on Mix MSTAR.

Table 8 .
Best mAP of pretrained/unpretrained models on the Mini SAR validation set.

Table 9 .
mAP of pretrained/unpretrained/mixed trained models on Mini SAR.