A Hierarchical Convolution Neural Network (CNN)-Based Ship Target Detection Method in Spaceborne SAR Imagery

The ghost phenomenon in synthetic aperture radar (SAR) imaging is primarily caused by azimuth or range ambiguities, which complicate SAR target detection. To mitigate this influence, we propose a ship target detection method for spaceborne SAR imagery using a hierarchical convolutional neural network (H-CNN). Based on the nature of ghost replicas and typical target classes, a two-stage CNN model is built to detect ship targets against sea clutter and ghosts. First, regions of interest (ROIs) are extracted from a large imaged scene during the coarse-detection stage. Because unwanted ghost replicas are the major residual interference source in the ROIs, a second CNN is applied during the fine-detection stage. Finally, comparative experiments and analyses using Sentinel-1 SAR data and various assessment criteria were conducted to validate H-CNN. The results show that the proposed method can outperform the conventional constant false-alarm rate technique and CNN-based models.


Introduction
Synthetic aperture radar (SAR) is an active microwave sensor whose resolution, in both range and azimuth, can be improved via the pulse compression technique and the synthetic aperture principle to obtain high-resolution remote sensing images. Another advantage of SAR imaging is its ability to operate on an all-weather, all-day-and-night basis [1]. Its application has been of interest in a variety of fields [2,3]; e.g., SAR-based ocean remote sensing is widely used for environmental monitoring, search and rescue, target recognition, etc. [4]. Spaceborne SAR can also be operated over long periods for wide-area and real-time observations. In this context, it has become a fundamental system for ship target recognition [5][6][7][8].
Typical ship target recognition using SAR imagery involves land and sea segmentation, target detection, target recognition, etc. In a large SAR image, target detection can be based on the feature differences between targets and backgrounds. In this process, a minimum region containing the whole target can be confirmed in each target chip [9], and the remaining part is considered background. Obvious feature differences normally exist between target and background regions, e.g., in grayscale, multi-resolution, polarization, and phase, which form the basis for the design of many target detection methods. Hu et al. analyzed multidimensional SAR information using a linear time-frequency (TF) decomposition approach [10]. Yuan et al. extracted the gradient ratio pattern for each pixel based on Weber's law, and used the local gradient ratio pattern histogram (LGRPH) for SAR target recognition [11]. In addition, the conventional constant false alarm rate (CFAR) technique is a typical detection method based on the grayscale feature. However, complicated and cluttered backgrounds severely affect CFAR detection performance [12].
In recent years, ship target detection based on deep learning (DL) has been widely studied [13,14], typically using the convolutional neural network (CNN) model [15]. Liu et al. [16] presented a ship detection method, namely the sea-land segmentation-based convolutional neural network (SLS-CNN), which combines an SLS-CNN detector, saliency computation, and corner features. Furthermore, Zhao et al. [17] proposed a spaceborne SAR ship detection algorithm based on a low-complexity CNN. Other well-known CNN-based target detection methods include the faster region-CNN (Faster R-CNN) and the you only look once (YOLO) family of models. For example, Li et al. [18,19] improved detection performance using Faster R-CNN and provided a densely connected multi-scale neural network [19]. This method is used to solve multi-scale and multi-scene problems in SAR ship detection. Rather than relying on information from a single feature map, it fuses feature maps by densely connecting different feature map layers in a top-down manner. The R-CNN method has also been used for target recognition in large-scene SAR images [20]. Furthermore, Hamza and Cai used YOLOv2 for ship detection [21], which introduced a multitude of enhancements into the original YOLO model.
However, these methods may no longer be effective when ghost replicas exist in an imaged scene. The ghost phenomenon is an intrinsic effect of SAR ambiguity in both azimuth and range [22,23]. Range ambiguity occurs when different backscattered echoes (one related to a transmitted pulse and the other due to a previous transmission) temporally overlap during the receiving operation [24]. Azimuth ambiguity, on the other hand, is caused by the aliasing of each target's Doppler phase history: Doppler frequencies higher than the pulse repetition frequency (PRF) may lead to azimuth ambiguity [25]. This phenomenon is particularly relevant for high-reflectivity targets, which appear in SAR images as ghosts in low-reflectivity areas [26]. Moreover, according to the ghost generating principle, a ghost is similar to its real target, which renders discrimination difficult. Azimuth ambiguity is prominent in spaceborne SAR because of the fast platform velocity and large azimuth Doppler bandwidth.
Based on the ghost generating principle and characteristics, we propose a hierarchical CNN-based ship target detection method for spaceborne SAR imagery, i.e., H-CNN. Hierarchical processing includes two stages: coarse detection and fine detection. First, regions of interest (ROIs) are extracted from a large imaged scene in the coarse-detection stage. Although most land- and sea-background clutter is removed, ghost replicas remain in the ROIs. Therefore, the fine-detection stage is introduced to further refine target detection against ghost replicas. In the experiments, H-CNN was trained and tested using Sentinel-1 SAR data [27]. In the following sections, we first discuss the H-CNN parameter configuration for optimal detection results. Then, the feature extraction quality is analyzed; detailed texture and abstract semantic information are extracted using different convolutional layer operations. Finally, we conduct detection experiments to validate H-CNN and compare it with the conventional CFAR technique and CNN models.

Ghost Phenomenon in Spaceborne SAR
Spaceborne SAR is SAR operated from a space platform. It differs from airborne SAR in several characteristics [28][29][30][31]; e.g., a spaceborne image normally has a large data size because of the large antenna beam footprint.
A ghost is the image-domain manifestation of SAR ambiguity in the range or azimuth direction. When the PRF is too high, echoes of successive pulses may overlap within one pulse period [32]. The distance between the target and its range ambiguity ghost can be calculated by Equation (1) [33,34], where $n$ is the index of azimuth ambiguities, indicating the spatial location of ghost replicas in the azimuth direction, $\lambda$ is the radar wavelength, $f_{\mathrm{PRF}}$ is the PRF, $f_{\mathrm{DR}}$ is the Doppler rate, and $f_{\mathrm{DC}}$ is the Doppler centroid.
If the PRF is excessively low, the part of the Doppler spectrum above the PRF is folded back into the azimuth spectrum, resulting in azimuth ambiguity. Figure 1 illustrates azimuth ambiguity formation with the azimuth antenna pattern and PRF. $B_D$ is the Doppler bandwidth and $B_D \approx 2V/L_a$, where $V$ is the SAR platform velocity and $L_a$ is the antenna size in the azimuth direction. When $B_D$ is greater than the PRF, as shown in Figure 1, undersampling causes aliasing in the azimuth spectrum. Blue and red dashed curves denote the first left and right replicas due to the sampling, respectively.
The distance between an azimuth ambiguity ghost and its target can be calculated by Equation (2) [33,34], where $R'$ is the slant range and $V$ is the SAR platform velocity. Moreover, when ships move on a smooth sea surface, bright targets appear against a dark background in the SAR image. In such cases, ghosts are clearly visible and may impose severe difficulties during ship target detection.
According to the spaceborne SAR parameters, the theoretical range and azimuth ambiguity distances can be estimated by Equations (1) and (2), respectively. Taking Sentinel-1 SAR data as an example, we analyze the azimuth ambiguity in some SAR images. The imaging geometry is shown in Figure 2a. Although Sentinel-1 has four imaging modes, we only show the interferometric wide (IW) swath mode. The Sentinel-1 SAR system parameters play a significant role in the imaging; they include the platform speed, the altitude of the satellite above the Earth's surface $R$, the elevation angle $\beta$, the PRF, etc. Table 1a,b show the parameters of the Sentinel-1 satellite SAR system and of an example ship, respectively. Three different PRFs exist in one group of Sentinel-1 data. Furthermore, the slant range of spaceborne SAR is influenced by the Earth's curvature and the distance from ground to satellite; their relationship is shown in Figure 2b. The slant range can therefore be calculated using the satellite's altitude above the Earth's surface, the radius of the Earth $R_{\mathrm{earth}}$, the elevation angle, and the incidence angle $\theta$.
The theoretical azimuth ambiguity distance can be obtained using Equation (2). When $n = 1$, the results for the three PRFs are ~5031.4 m, 4254.9 m, and 4940.6 m, respectively. The right graph of Figure 3 depicts the SAR image of the example ship and its corresponding ghost replicas. We then extracted the azimuth sequence in one fixed range cell and expressed it in decibels to decrease the dynamic range of the amplitude. The resulting azimuth sequence is shown in the left graph of Figure 3. The distances between the two ghosts and their target are estimated to be ~4630 m and 4970 m, respectively, which are close to the theoretical values mentioned above.
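As a rough illustration of how such theoretical distances can be estimated, the sketch below uses a commonly quoted first-order expression for the azimuth displacement of the n-th ambiguity, n·λ·f_PRF·R′/(2V). This form is an assumption here and is not a transcription of Equation (2). The slant-range helper assumes a spherical Earth and works in the triangle formed by the Earth's center, the satellite, and the target. The function names and all numerical values other than the 693 km orbit altitude are hypothetical placeholders, not the Table 1 entries.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical Earth is an assumption


def slant_range(altitude_m, incidence_deg, earth_radius_m=EARTH_RADIUS_M):
    """Slant range from orbit altitude and local incidence angle (spherical Earth)."""
    theta = math.radians(incidence_deg)
    r_sat = earth_radius_m + altitude_m
    # Law of sines: sin(look) / R_earth = sin(incidence) / (R_earth + altitude)
    look = math.asin(earth_radius_m * math.sin(theta) / r_sat)
    # Earth-center angle between the sub-satellite point and the target
    gamma = theta - look
    # Law of cosines gives the side opposite the Earth-center angle
    return math.sqrt(earth_radius_m ** 2 + r_sat ** 2
                     - 2.0 * earth_radius_m * r_sat * math.cos(gamma))


def azimuth_ambiguity_distance(n, wavelength_m, prf_hz, slant_range_m, velocity_m_s):
    """First-order azimuth displacement of the n-th ambiguity (assumed form)."""
    return n * wavelength_m * prf_hz * slant_range_m / (2.0 * velocity_m_s)


if __name__ == "__main__":
    # Incidence angle, wavelength, velocity and PRFs below are placeholders.
    r = slant_range(altitude_m=693e3, incidence_deg=39.0)
    for prf in (1400.0, 1700.0):
        d = azimuth_ambiguity_distance(1, 0.0555, prf, r, 7500.0)
        print(f"PRF {prf:.0f} Hz: first ghost at about {d:.0f} m from the target")
```

Under this assumed form the displacement is proportional to both n and the PRF, so each of the three PRFs in a Sentinel-1 data group yields its own ghost distance.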
Discrimination is difficult because some traditional characteristics of a target and its corresponding ghost are similar, e.g., length-width ratio, area, and shape complexity [35][36][37]. We therefore need a dedicated discrimination step between target and ghost to eliminate the negative effects of ghosts on detection performance.

[Figure 2. Sentinel-1 imaging geometry: (a) interferometric wide (IW) swath mode; (b) relationship between altitude and slant range.]

Property Analyses of Ship Target and Ghost Replica
Some traditional characteristics are similar between a target and its corresponding ghost, e.g., length-width ratio, area, and shape complexity. It is therefore necessary to analyze their differences. The proposed method was designed based on amplitude information in the spatial dimension. Thus, we discuss the amplitude statistics of target chips and their ghost replicas. The difference between the amplitude distributions of target and ghost indicates how distinguishable they are: a more pronounced difference makes the discrimination easier. First, 100 one-to-one pairs of target chips and ghost replicas were collected from Sentinel-1 SAR data. Amplitudes were normalized for ease of comparison. For ghost replicas and target chips, the ratio of the number of pixels in each amplitude range to the overall pixel number was calculated, as shown in Figure 4. Moreover, we enlarged the distribution in the normalized amplitude range from 0 to 0.02, which shows that the amplitudes of most pixels lie in this region. The two distributions have similar shapes, in that they first increase and then decline. When the normalized amplitude is higher than 0.02, the difference between the two distributions decreases and all values are close to zero.
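The amplitude statistics described above can be sketched as follows. This is a minimal illustration, assuming each chip is normalized by its own maximum amplitude (the normalization actually used is not stated) and that chips are given as nested lists; the function name and the `upper` zoom parameter are hypothetical.

```python
def normalized_amplitude_histogram(chips, num_bins=50, upper=1.0):
    """Fraction of pixels per normalized-amplitude bin, pooled over all chips.

    chips: iterable of 2-D lists of non-negative amplitudes.
    upper: right edge of the histogram; pixels above it are counted in the
           total only, which mimics zooming into a range such as [0, 0.02].
    """
    counts = [0] * num_bins
    total = 0
    for chip in chips:
        peak = max(max(row) for row in chip)
        if peak <= 0:
            continue  # skip empty chips to avoid division by zero
        for row in chip:
            for value in row:
                a = value / peak  # per-chip normalization (assumption)
                total += 1
                if a > upper:
                    continue
                # Clamp so that a == upper falls into the last bin
                index = min(int(a / upper * num_bins), num_bins - 1)
                counts[index] += 1
    return [c / total for c in counts] if total else counts


if __name__ == "__main__":
    toy_chip = [[0, 1], [2, 4]]
    print(normalized_amplitude_histogram([toy_chip], num_bins=4))
```

Calling the function once on the target chips and once on the ghost replicas, with `upper=0.02`, would give the two zoomed distributions that are compared in the text.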

Architecture of the H-CNN Model
Traditional CNN consists of convolutional, pooling, and fully connected layers. The convolutional layer is used for feature extraction. Every convolutional layer contains many convolutional kernels; each kernel element corresponds to one weight, and each kernel has one bias. Each neuron in the convolutional layer is connected to a neighboring region of the previous layer, whose size is determined by the kernel size. In the convolutional operation, kernels slide regularly over the whole feature map and feature extraction is realized as:
$$Z^{l+1}_{i,j} = f\Big(\sum_{m}\sum_{n} w^{l}_{m,n}\, Z^{l}_{i+m,\,j+n} + b\Big)$$
where $Z^{l}_{i,j}$ and $Z^{l+1}_{i,j}$ are the input and output at pixel $(i, j)$ of the $l$th convolutional layer, respectively; both are referred to as feature maps. The sum runs over the kernel elements $(m, n)$, $w^{l}$ and $b$ are the weight and bias of the convolutional kernel in convolutional layer $l$, respectively, and $f(\cdot)$ is an activation function, usually the sigmoid, the rectified linear unit (ReLU) [38], etc. In this paper, ReLU is selected and is defined by:
$$f(x) = \max(0, x)$$
After feature extraction in the convolutional layer, feature maps are passed to the next pooling layer. The pooling operation selects a few points to represent the whole feature map. Classic pooling methods include max pooling, which we applied in this paper, mean pooling, etc.
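The three operations just described (kernel sliding, ReLU activation, and max pooling) can be illustrated with a small dependency-free sketch. It handles a single channel and a single kernel, uses "valid" borders and non-overlapping pooling, and is meant only to show the arithmetic; the function names are hypothetical and do not reflect the authors' implementation.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return x if x > 0 else 0.0


def conv2d_valid(feature_map, kernel, bias=0.0):
    """Slide one kernel over a 2-D feature map (no padding), then apply ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = bias
            # Weighted sum over the kernel-sized neighborhood of pixel (i, j)
            for m in range(kh):
                for n in range(kw):
                    acc += kernel[m][n] * feature_map[i + m][j + n]
            row.append(relu(acc))
        output.append(row)
    return output


def max_pool(feature_map, size):
    """Non-overlapping size x size max pooling; incomplete borders are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + m][j * size + n]
                for m in range(size) for n in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


if __name__ == "__main__":
    image = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [0, 1, 2, 3]]
    features = conv2d_valid(image, [[1, 0], [0, -1]])
    print(features)
    print(max_pool(image, 2))
```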
Finally, feature maps are fully connected in the last layer, which is similar to the hidden layer of a traditional feedforward neural network. In this layer, the multi-dimensional feature maps are reshaped into a vector.
Traditional CNN is a supervised network. It is usually optimized by the well-known stochastic gradient descent (SGD) algorithm [39,40], which is an improved version of the batch gradient descent (BGD) method. In BGD, all samples are used to compute the gradient in every iteration, which makes each update slow. To solve this slow-update problem, SGD stochastically selects a group of samples and uses it to determine the gradient direction in one iteration. In the next iteration, a new group of stochastically selected samples is used for the parameter update. When the loss function reaches its minimum and remains stable, all parameters, i.e., weights and biases, are fixed.
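The update scheme can be illustrated on a deliberately simple stand-in problem. The sketch below fits a one-dimensional linear model by mini-batch SGD instead of training a CNN; the model, learning rate, batch size, and function name are illustrative assumptions, chosen only to show how a randomly drawn group of samples drives each parameter update.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.1, batch_size=4, epochs=300, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random grouping of samples every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-mean squared error with respect to w and b
            grad_w = sum(2.0 * (w * xs[k] + b - ys[k]) * xs[k] for k in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[k] + b - ys[k]) for k in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    xs = [i / 10 for i in range(20)]
    ys = [2.0 * x + 1.0 for x in xs]
    print(sgd_linear_fit(xs, ys))
```

Setting `batch_size=len(xs)` turns the same loop into batch gradient descent, which makes the difference between the two schemes easy to compare.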
In this paper, we propose the H-CNN method for ship target detection in spaceborne SAR imagery, with a hierarchical training pattern. The coarse-detection stage of H-CNN discriminates between ROIs and background. The fine-detection stage then separates ship targets from interfering ghost replicas. In the test phase, the whole SAR image is cut into chips by a sliding window and processed by the coarse- and fine-detection stages, through which ship targets are extracted from the whole SAR image. All SAR chips are input to the coarse-detection stage, and the chips that differ from background are extracted. To further mitigate ghost interference, the chips extracted by the coarse-detection stage are then discriminated in the fine-detection stage. Because sea chips are always present in large quantities, coarse detection eases the computational burden of the following step by removing most background chips, while the fine-detection stage focuses on the elimination of ghost interference. However, since the sliding step is smaller than the chip size, detections may overlap. We therefore used non-maximum suppression (NMS) [41] to post-process the coarse-detection results. The architecture of the H-CNN model is shown in Figure 5.
During the coarse-detection stage, the network is trained using target and background samples. This part of the network mainly focuses on ROI extraction from a large imaged scene. Since unwanted ghost replicas are the major interference sources that remain in the ROIs, the outputs of the coarse-detection stage are input to the fine-detection stage network, which discriminates between real targets and ghosts. The fine-detection stage network is trained using target and ghost samples. NMS is applied to all ROIs extracted during the coarse-detection stage. Based on this process, ship target detection in spaceborne SAR imagery can be realized.
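The test-phase flow (sliding-window chips, coarse detection, NMS on the ROIs, then fine detection) can be sketched as below. The two scoring functions stand in for the trained coarse and fine CNNs and are passed in as plain callables; the thresholds, the sliding step, the IoU threshold, and all function names are hypothetical choices for illustration rather than values from the paper. Only the 40-pixel chip size follows the training sample size.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.3):
    """Greedy NMS over (box, score) pairs, keeping the best box per cluster."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


def sliding_chips(height, width, chip=40, step=20):
    """Yield chip boxes; step < chip produces overlapping chips."""
    for y in range(0, height - chip + 1, step):
        for x in range(0, width - chip + 1, step):
            yield (x, y, x + chip, y + chip)


def crop(image, box):
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def hierarchical_detect(image, coarse_score, fine_score, chip=40, step=20,
                        coarse_thr=0.5, fine_thr=0.5, iou_threshold=0.3):
    """Two-stage detection: ROI extraction, NMS, then target/ghost discrimination."""
    height, width = len(image), len(image[0])
    # Stage 1: keep chips that differ from background
    rois = []
    for box in sliding_chips(height, width, chip, step):
        score = coarse_score(crop(image, box))
        if score >= coarse_thr:
            rois.append((box, score))
    rois = nms(rois, iou_threshold)  # remove overlapping ROIs
    # Stage 2: keep only ROIs classified as real ships rather than ghosts
    ships = []
    for box, _ in rois:
        score = fine_score(crop(image, box))
        if score >= fine_thr:
            ships.append((box, score))
    return ships
```

In this arrangement the fine-stage scorer is evaluated only on the ROIs that survive NMS, which mirrors the computational argument made above for removing background chips first.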

Dataset
To verify the effectiveness of the proposed method, we applied it to Sentinel-1 SAR data [27]. Sentinel-1 is an Earth observation mission of the European Space Agency Copernicus Project. It consists of two satellites, Sentinel-1A and Sentinel-1B, which carry C-band SAR and can provide continuous images in all-weather, all-day-and-night conditions. A series of operational services are supported by Sentinel-1 SAR data, including mapping of Arctic sea ice and daily sea ice, marine environment monitoring, ground motion risk monitoring, forest mapping, etc. In this study, we collected data in the IW mode, as shown in Figure 2. The resolution is 5 m × 20 m, the imaging swath width is 250 km, and the orbit altitude is 693 km.
To further ensure the reliability of the training samples, each ship in the target sample set was verified using information from the Australian Maritime Safety Authority (AMSA) [42]. The ship samples were collected in three Australian regions (North West, Great Australian Bight, and Bass Strait), which are indicated by white rectangles in Figure 6. To guarantee a high diversity of ship types, we identified ship types using information provided on the AMSA website. Six ship types were confirmed: cargo, tanker, dredging ship, fishing ship, tug, and other. Some samples of SAR target images and their corresponding optical images are shown in Figure 7. In most cases, cargo ships and tankers are larger than other ships, and thus their structures in SAR images are obvious. The dredging ship, on the other hand, is small, as indicated by both the SAR and optical images.

Ghost samples were extracted based on the corresponding target positions. Figure 8 shows a SAR image used in the test, where target chips and ghost replicas are highlighted by blue and yellow squares, respectively. To present the correspondence, the i-th target chip is labeled T-i and its ghost replica is labeled G-i. We can identify 23 ship targets and 4 ghost replicas in this image. On this basis, target chips, ghost replicas, and background chips were collected, with 350 samples of 40 pixels × 40 pixels for each class, and were used for H-CNN training. An additional 149 Sentinel-1 SAR images with a size of 670 pixels × 643 pixels were used to test the proposed network's performance. Altogether, 480 ship chips and 304 ghost replicas were present. To verify the effectiveness of H-CNN, training samples and test SAR images were acquired from different Sentinel-1 SAR data. The ship targets were confirmed by the maritime information on the AMSA website. Furthermore, we obtained approximate corresponding ghost information based on the spaceborne SAR imaging theory, Sentinel-1 system parameters, and maritime information. The ghost confirmation method is described in Section 2.

Discussion of Parameter Configuration of H-CNN
The key point of the proposed method is to mitigate the influence of ghost replicas on the detection performance of CNN models. In particular, the hyperparameters of the convolutional kernels play a key role in H-CNN performance. In this part, we studied H-CNN configurations with a variety of kernel hyperparameters to obtain the best detection performance. Details of the kernel hyperparameters involved in H-CNN are shown in Table 2. To present the different parameter configurations conveniently, we denote a network structure as H-i-j, meaning that structure cases i and j are used in the coarse- and fine-detection stages, respectively. We discuss in turn the influence of kernel number and kernel size in the coarse- and fine-detection stages on detection performance. In each layer, the structure is written in the form A@B × B-Maxpool C × C, where A is the number of kernels, B × B is the kernel size, and max-pooling is applied over each C × C region. All networks were trained on the same samples. We first changed only the kernel numbers of the coarse-detection stage while the other parameters were fixed, as shown in Table 2a. According to the detection results, we confirmed the optimal kernel numbers and sizes of the coarse-detection stage using the network comparisons shown in Table 2b. Similarly, the kernel numbers and sizes of the fine-detection stage were confirmed using the network comparisons shown in Table 2c,d. Detection performance was evaluated using four typical measures: figure of merit (FoM), precision, recall, and F-measure [19,43]. They are defined as follows:

$$\mathrm{FoM} = \frac{TP}{TP + TN + FP}$$

$$\mathrm{Precision} = \frac{TP}{TP + TN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FP}$$

$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the number of correctly detected targets, TN denotes the number of falsely detected targets, and FP is the number of undetected targets.
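As a minimal sketch of how the four measures relate to the three counts, the function below assumes the usual ratio forms: precision as correct detections over all detections, recall as correct detections over all true targets, FoM as correct detections over true targets plus false alarms, and F-measure as the harmonic mean of precision and recall. The function name `detection_metrics`, the argument names, and the example counts are hypothetical and are not taken from the paper; descriptive argument names are used because the text labels false alarms as TN and missed targets as FP.

```python
def detection_metrics(correct, false_alarms, missed):
    """Return FoM, precision, recall and F-measure from raw counts.

    correct      -- correctly detected targets (TP in the text)
    false_alarms -- falsely detected targets (called TN in the text)
    missed       -- undetected targets (called FP in the text)
    """
    if min(correct, false_alarms, missed) < 0:
        raise ValueError("counts must be non-negative")

    detected = correct + false_alarms   # everything the detector reported
    actual = correct + missed           # every true target in the scene
    total = correct + false_alarms + missed

    # Guard each ratio against an empty denominator.
    precision = correct / detected if detected else 0.0
    recall = correct / actual if actual else 0.0
    fom = correct / total if total else 0.0
    pr_sum = precision + recall
    f_measure = 2.0 * precision * recall / pr_sum if pr_sum else 0.0

    return {"fom": fom, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical counts, for illustration only.
for name, value in detection_metrics(correct=90, false_alarms=10, missed=5).items():
    print(f"{name}: {value:.4f}")
```

Because FoM penalizes false alarms and misses in a single ratio, it is never larger than either precision or recall for the same counts.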
We first compare networks H-1-1 to H-6-1. According to Table 2a, these networks differ only in the kernel numbers of stage 1, while the other parameters are the same. Hence, this comparison allows us to confirm the kernel number of the coarse-detection stage. The results in terms of FoM, precision, and F-measure show that H-1-1 is the best configuration. In terms of recall, H-1-1 is second best, but very close to H-5-1, which has the highest value in Figure 9a. This illustrates that, compared to 4, 6, 8, 9, and 12, 3 is the best kernel number for the coarse-detection stage. Therefore, the kernel number of the coarse-detection stage was set to 3 in the following comparison experiments.
According to Table 2b, only the kernel sizes of the coarse-detection stage differ among H-1-1, H-7-1, H-8-1, H-9-1, H-10-1, and H-11-1. Hence, we could confirm this parameter from the detection results shown in Figure 9b. The results of H-10-1 are slightly superior, i.e., the best kernel sizes are 11 × 11 and 8 × 8 in the two layers of the coarse-detection stage.
On this basis, we further compared the results using different kernel numbers during the fine-detection stage, as shown in Table 2c. The H-10-1 results are the best, indicating 3 as the kernel number for the fine-detection stage.
Finally, we discuss the influence of kernel size during the fine-detection stage on detection performance. According to Figure 9d, H-10-1 shows the best result, indicating that the kernel size of the two layers in the fine-detection stage should be 9 × 9.

Analyses of Feature Extraction by H-CNN
To analyze feature extraction quality, we first observed the feature maps of target, background, and ghost chips. Figure 10 shows some feature map examples of H-CNN during the test. The target and ghost chips have clear boundaries compared to the background chip; thus, the first-layer feature maps in both the coarse- and fine-detection stages show obvious texture for target and ghost chips. On the other hand, feature maps in the last layer present abstract semantic information. We found that the last-layer feature maps of target and ghost could hardly be discriminated during the coarse-detection stage. However, the feature map differences between target and background were obvious, so target and background were easy to discriminate during the coarse-detection stage. During the fine-detection stage, the last-layer feature map differences between target and ghost became more obvious, so discriminating between them became easier.
In this part, we investigated the feature extraction quality of target chips and ghost replicas. If the features are significantly different, the two kinds of chips are easier to distinguish. The chips introduced in Section 3 were processed by H-CNN and we collected their feature maps. The amplitude distribution was obtained by the same method. The distributions are shown in Figure 11. Most amplitudes concentrate in two regions, 0-0.02 and 0.96-1, which are enlarged in the figure. The two distributions are dissimilar, especially in these two enlarged parts. Compared to Figure 4, the distribution differences are obvious in Figure 11. This indicates that the features extracted by H-CNN are more distinguishable than the original chips.
To analyze feature extraction quality quantitatively, we introduced linear discriminant analysis (LDA). LDA aims to maximize the ratio of between-class scatter to within-class scatter. The two scatter matrices, called the within-class and between-class scatter matrices, are defined as [44]:

$$S_w = \sum_{i=1}^{c} P(\omega_i)\, E\left\{ (X_i - M_i)(X_i - M_i)^{T} \right\}$$

$$S_b = \sum_{i=1}^{c} P(\omega_i)\, (M_i - M_0)(M_i - M_0)^{T}$$

where $S_w$ is the within-class scatter matrix, $S_b$ is the between-class scatter matrix, $X_i$ denotes the samples of class $\omega_i$, $E\{\cdot\}$ is the mean value, $c$ is the number of classes, $P(\omega_i)$ is the ratio of the number of $\omega_i$ samples to the total number of samples, $M_i$ is the mean value matrix of the $\omega_i$ samples, and $M_0$ is the mean value matrix of all samples. Furthermore, two criteria, $J_1$ and $J_2$, are computed from these scatter matrices to evaluate feature extraction quality, where trace(·) denotes the matrix trace. According to LDA theory, the larger $J_1$ and $J_2$ are, the more distinguishable the features. Taking the fine-detection stage as an example, we calculated $J_1$ and $J_2$ of the feature maps in the two layers, as shown in Table 3. The criteria values of the L2 layer are clearly larger than those of the L1 layer. In other words, features in the L2 layer are more distinguishable than those in the L1 layer.
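The scatter matrices described above can be sketched in NumPy as follows. This is an illustrative sketch, assuming each chip's feature map has been flattened into one row vector. The separability score `trace(pinv(S_w) @ S_b)` is one common trace-based criterion and is an assumption here; it is not claimed to be the paper's $J_1$ or $J_2$. The function names and the toy data are hypothetical.

```python
import numpy as np


def scatter_matrices(features, labels):
    """Within-class (S_w) and between-class (S_b) scatter matrices.

    features -- array of shape (n_samples, n_dims), one flattened feature per row
    labels   -- array of shape (n_samples,), class label of each row
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n_samples, n_dims = features.shape
    overall_mean = features.mean(axis=0)            # M_0

    s_w = np.zeros((n_dims, n_dims))
    s_b = np.zeros((n_dims, n_dims))
    for cls in np.unique(labels):
        x = features[labels == cls]
        prior = x.shape[0] / n_samples              # P(omega_i)
        class_mean = x.mean(axis=0)                 # M_i
        centered = x - class_mean
        # Class covariance, i.e. the sample estimate of E{(X - M_i)(X - M_i)^T}.
        s_w += prior * (centered.T @ centered) / x.shape[0]
        diff = (class_mean - overall_mean)[:, None]
        s_b += prior * (diff @ diff.T)
    return s_w, s_b


def trace_separability(s_w, s_b):
    """Assumed criterion: trace(S_w^+ S_b); larger means more separable."""
    return float(np.trace(np.linalg.pinv(s_w) @ s_b))


# Toy two-class data: the second class is the first one shifted along x.
base = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
for shift in (1.0, 4.0):
    feats = np.vstack([base, base + np.array([shift, 0.0])])
    s_w, s_b = scatter_matrices(feats, labels)
    print(f"shift={shift}: score={trace_separability(s_w, s_b):.3f}")
```

The pseudo-inverse is used instead of a plain inverse because the within-class scatter matrix is often singular when the feature dimension exceeds the number of samples per class.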

Detection Result Comparison
Comparative analyses of CFAR, a traditional CNN, a low complexity CNN, and the proposed network are presented here to validate H-CNN. For the CFAR method, we used cell-averaging CFAR (CA-CFAR) to process the SAR images above [45], with the false alarm rate set to 1 × 10⁻³. The traditional CNN model consisted of two convolutional layers, two pooling layers, and one fully connected layer. Its parameter configuration was confirmed by comparing the detection results of multiple networks. The low complexity CNN was introduced in [17]. The H-CNN parameter configuration was set to the aforementioned H-10-1.
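For readers unfamiliar with the baseline, the following is a minimal two-dimensional CA-CFAR sketch in NumPy. It assumes square-law (intensity) data with exponentially distributed clutter, for which the threshold multiplier has the closed form N(Pfa^(-1/N) - 1) with N training cells. The guard and training window sizes, the function names, and the synthetic scene are hypothetical; the text specifies only the false alarm rate of 1 × 10⁻³.

```python
import numpy as np


def box_sum(image, half):
    """Sum over a (2*half+1) x (2*half+1) window centred on every pixel."""
    padded = np.pad(image, half, mode="reflect")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    size = 2 * half + 1
    return (integral[size:, size:] - integral[:-size, size:]
            - integral[size:, :-size] + integral[:-size, :-size])


def ca_cfar(intensity, guard=2, train=6, pfa=1e-3):
    """Boolean detection mask from cell-averaging CFAR.

    guard -- half-width of the guard window around the cell under test (assumed)
    train -- extra half-width of the training ring beyond the guard window (assumed)
    pfa   -- design false alarm rate
    """
    intensity = np.asarray(intensity, dtype=float)
    outer = guard + train
    n_train = (2 * outer + 1) ** 2 - (2 * guard + 1) ** 2
    # Training ring = outer window minus guard window (which holds the test cell).
    ring_sum = box_sum(intensity, outer) - box_sum(intensity, guard)
    clutter_mean = ring_sum / n_train
    # Closed-form multiplier for exponentially distributed intensity clutter.
    alpha = n_train * (pfa ** (-1.0 / n_train) - 1.0)
    return intensity > alpha * clutter_mean


# Synthetic scene: unit-mean exponential clutter plus one bright point target.
rng = np.random.default_rng(0)
scene = rng.exponential(1.0, size=(64, 64))
scene[32, 32] = 200.0
mask = ca_cfar(scene)
print("target detected:", bool(mask[32, 32]))
print("fraction of pixels flagged:", float(mask.mean()))
```

A detector of this kind thresholds on local brightness alone, which is why a bright ghost replica can pass it just as a real ship does.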
To observe the detection results of the different methods intuitively, we provide one example in Figure 12. It illustrates the detection results for the SAR image of Figure 8, where targets and ghosts are labeled. We can see that all targets were detected, but the performance on ghosts differed. Hence, let us focus on the detection results for ghost replicas. G-2 was correctly identified as a ghost replica by all four methods. Other ghosts may be falsely detected by CFAR, the traditional CNN, or the low complexity CNN. For example, G-4 was identified as a target by CFAR, and G-7 was identified as a target by both CFAR and the traditional CNN. Only H-CNN was able to identify G-19 as a ghost replica. In other words, H-CNN resists the interference of ghost replicas, and its detection performance exceeds that of the other detection methods.
Furthermore, we calculated statistical results to illustrate detection performance accurately. Detection results of CFAR, the traditional CNN, the low complexity CNN, and the proposed H-CNN are presented in Table 4. All the test data, consisting of 149 Sentinel-1 SAR images with 480 ship targets and 304 ghost replicas, were used here. The superiority of the proposed H-CNN is obvious.
More specifically, the proposed method achieved improvements of more than 13.83% and 4.57% compared to the CFAR technique and the traditional CNN model, respectively. In addition, compared with the low complexity CNN, H-CNN achieved increases of 3.51%, 3.47%, and 2.54% in FoM, recall, and F-measure, respectively.

Conclusions
In this paper, a ship target detection method for spaceborne SAR imagery was proposed based on a hierarchical CNN. Its major contributions are twofold. First, a hierarchical pattern was designed so that each stage attends to a single interference source in ship target detection, i.e., sea clutter or ghost replicas. Second, we adopted statistical analyses of the feature maps in the last layer, which may facilitate the understanding of these abstract features of ship targets and ghosts in spaceborne SAR images. Specifically, in the coarse-detection stage of H-CNN, ROIs are extracted from whole images. Ship targets are then detected against ghosts in the fine-detection stage. According to spaceborne SAR characteristics, we analyzed the ghost-generating principle, which conforms to the actual data. The H-CNN design was based on the amplitude information of SAR image chips in the space dimension, and the amplitude distribution differences between target and ghost were then discussed. The amplitude proportion differences were obvious, but the envelope forms of the two distributions were similar. In the

Figure 2. Geometry of the Sentinel-1 SAR satellite operation in interferometric wide (IW) mode: (a) imaging geometry of the Sentinel-1 SAR system; (b) interpretation of Sentinel-1 satellite in orbit.

Figure 3. Illustration of a ship target and corresponding ghost replicas in a Sentinel-1 SAR image.

Figure 4. Comparison of statistical amplitudes of ship target and ghost replica pixels in Sentinel-1 SAR images.

Figure 5. Architecture of the hierarchical convolutional neural network (H-CNN) model.

Figure 6. Location illustration: North West, Great Australian Bight, and Bass Strait of Australia, where the Sentinel-1 SAR images were collected for the experiments.

Figure 7. Samples of SAR and optical image chips of various types of ship targets.

Figure 8. Ship targets and ghost replicas in a test SAR image sample. Twenty-three ships and four azimuth ghosts are indicated by blue and yellow squares, respectively.

Figure 9. H-CNN detection performance comparison with respect to different kernel hyperparameters in terms of various assessment criteria: (a) number of kernels in the coarse-detection stage; (b) kernel size in the coarse-detection stage; (c) number of kernels in the fine-detection stage; and (d) kernel size in the fine-detection stage.

Figure 10. Feature map examples in the two stages of H-CNN for ship target, ghost replica, and sea clutter background in the test.

Figure 11. Comparison of statistical amplitudes of ship target and ghost feature maps.

Table 1. Some parameters of the Sentinel-1 satellite SAR system and a ship target in an imaged scene.

Table 2. Configurations of H-CNN with different network hyperparameters with respect to convolutional kernels: (a) number of kernels in the coarse-detection stage; (b) kernel size in the coarse-detection stage; (c) number of kernels in the fine-detection stage; and (d) kernel size in the fine-detection stage.

Table 3. Quantitative evaluation of feature extraction for fine-detection in different layers.

Table 4. Comparison of statistical detection results in terms of various assessment criteria.
