Fast SAR Autofocus Based on Ensemble Convolutional Extreme Learning Machine

Inaccurate Synthetic Aperture Radar (SAR) navigation information leads to unknown phase errors in SAR data. Uncompensated phase errors blur the SAR images. Autofocus is a technique that can automatically estimate phase errors from the data. However, existing autofocus algorithms suffer from either poor focusing quality or slow focusing speed. In this paper, an ensemble-learning-based autofocus method is proposed. A Convolutional Extreme Learning Machine (CELM) is constructed and utilized to estimate the phase error. However, the performance of a single CELM is poor. To overcome this, a novel metric-based combination strategy is proposed, combining multiple CELMs to further improve the estimation accuracy. The proposed model is trained with the classical bagging-based ensemble learning method. Both training and testing are non-iterative and fast. Experimental results on real SAR data show that the proposed method achieves a good trade-off between focusing quality and speed.


Introduction
Synthetic Aperture Radar (SAR) is an active microwave remote-sensing system. SAR has been widely applied to both military and civilian fields due to its all-time and all-weather observation abilities [1]. However, the imaging quality of SAR is usually degraded by undesired Phase Errors (PEs). These PEs usually come from trajectory deviations and the instability of the platform velocity [2]. Uncompensated PEs cause serious image blurring and geometric distortion in the SAR imagery [3]. The navigation system cannot provide precise information about these motion errors [4]. For high-quality imaging, especially high-resolution imaging, it is important to compensate for these PEs. Autofocus is a data-driven technique that can directly estimate the phase error from the backscattered signals [5].
In recent decades, many autofocus algorithms have been developed. These methods can be classified into the following three categories: sub-aperture-based, inverse-filtering-based, and metric-optimization-based algorithms. The sub-aperture autofocus algorithm is also called Map Drift Autofocus (MDA) [6]. MDA divides the full-aperture range-compressed data into equal-width sub-aperture data. Each sub-aperture datum is imaged separately to obtain a sub-map. The position offset is determined by finding the position of the cross-correlation peak between sub-maps [7]. The more sub-apertures that are divided, the higher the order of phase error that can be estimated [8]. Thus, the order of phase error that sub-aperture-based algorithms can correct is limited by the number of sub-apertures, which makes them unsuitable for high-order phase errors. The original MDA was developed to correct the phase errors in azimuth. Recent works focus on two-dimensional phase-error correction. In [9], the MDA was extended to highly squinted SAR by introducing a squinted-range-dependent map drift.
The remainder of this paper is organized as follows. In Section 2, the fundamental background of SAR autofocus is explained. Section 3 presents our approach to SAR autofocus. Section 4 describes the dataset, outlines the experimental setup, and presents the results. In Section 5, the results obtained in the performed experiments, the practical implications of the proposed method, and future research directions are discussed. Finally, Section 6 concludes the paper.

Fundamental Background
SAR autofocus is a data-driven parameter-estimation technique. It aims to automatically estimate the phase error from the SAR-received data. The residual phase error in the range direction is generally so small that it can be ignored after the correction of range cell migration. The phase errors that need to be corrected mainly occur in the azimuth direction [40]. The azimuth phase error estimation and compensation are usually performed in the range-Doppler domain. Suppose we have a complex-valued defocused image X ∈ C^{N_a×N_r}, where N_a, N_r are the numbers of pixels in azimuth and range, respectively. Denote X̃ as the range-Doppler domain data matrix of X. The one-dimensional azimuth phase error compensation problem can be formulated as [41]

Y_{n_a n_r} = (1/N_a) Σ_{k=0}^{N_a−1} X̃_{k n_r} exp(−jφ_k) exp(j 2π k n_a / N_a),   (1)

where Y ∈ C^{N_a×N_r} is the compensated image matrix; k is the frequency index in azimuth; n_a, n_r are the azimuth and range index subscripts of the matrix, respectively; and φ_k is the k-th element of the phase error vector φ ∈ R^{N_a×1}. Let D_φ be a square diagonal matrix composed of the elements of exp{−jφ} on the main diagonal, i.e., D_φ = diag(exp{−jφ}), where diag(·) represents the diagonalization operation. Thus, Equation (1) can be expressed in the form of matrix multiplication as follows:

Y = F_a^{−1} D_φ F_a X,   (2)

where F_a, F_a^{−1} represent the Fourier transform and the inverse Fourier transform in azimuth, respectively.
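As a concrete illustration, the compensation in Equation (2) amounts to an azimuth FFT, a per-row phase multiplication, and an inverse azimuth FFT. A minimal NumPy sketch (the function name and FFT normalization are our own choices):

```python
import numpy as np

def compensate_phase(X, phi):
    """Azimuth phase-error compensation: Y = F_a^{-1} diag(exp(-j*phi)) F_a X.

    X   : complex defocused image of shape (Na, Nr)
    phi : real azimuth phase-error vector of shape (Na,)
    """
    Xrd = np.fft.fft(X, axis=0)                       # range-Doppler domain data
    return np.fft.ifft(np.exp(-1j * phi)[:, None] * Xrd, axis=0)
```

Compensating with the same phase vector that produced the defocusing recovers the original image exactly, since the operation is a pure phase rotation in the azimuth frequency domain.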
The key problem of autofocus is how to estimate φ from the defocused image X. Phase Gradient Autofocus (PGA) is a simple and widely used autofocus algorithm. Denote X ∈ C^{N_a×N_r} as a defocused SAR image. First, find the dominant scatterers (targets with large intensities) of each range line. Then, center-shift these strong scatterers along the azimuth direction to obtain a center-shifted image Z. This method assumes that the complex reflectivities, except for the dominant scatterers, are distributed as zero-mean Gaussian random noise [41]. To accurately estimate the phase error gradient from these dominant targets, the center-shifted image Z is windowed. Denote Z̃ ∈ C^{N_a×N_r} as the range-Doppler domain data of Z (obtained by applying the azimuth Fourier transform to Z). The phase gradient estimation based on Maximum Likelihood (ML) can be formulated as

φ̂'_k = ∠( Σ_{n_r=0}^{N_r−1} Z̃*_{k−1, n_r} Z̃_{k, n_r} ),   (3)

where Z̃* is the complex conjugate of Z̃, φ̂' is the estimated phase error gradient vector, and ∠ is the phase operation. Another commonly used gradient estimation method is the Linear Unbiased Minimum Variance (LUMV) algorithm. Let G be the gradient matrix of Z̃ in azimuth, i.e., G_{k,:} = Z̃_{k,:} − Z̃_{k−1,:}, where k = 0, 1, · · · , N_a − 1 and Z̃_{−1,:} = 0 ∈ C^{1×N_r}. The LUMV-based phase error gradient estimation is expressed by

φ̂'_k = ( Σ_{n_r} Imag( Z̃*_{k, n_r} G_{k, n_r} ) ) / ( Σ_{n_r} |Z̃_{k, n_r}|² ),   (4)

where Imag(·) represents taking the imaginary part of a complex number. Different from PGA, the metric-based autofocus algorithms estimate phase errors by optimizing a cost function or a metric function. The cost function evaluates the focus quality of the image. In the field of radar imaging, entropy is usually used to evaluate the focusing quality of an image: the better the focus, the smaller the entropy.
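The two gradient estimators above (ML and LUMV) reduce to a few array operations on the range-Doppler data of the center-shifted image. A NumPy sketch (names are illustrative):

```python
import numpy as np

def pga_gradients(Zrd):
    """ML and LUMV phase-gradient estimates.

    Zrd : range-Doppler data of the center-shifted image, shape (Na, Nr)
    Returns two length-Na gradient vectors (index 0 has no predecessor, so 0).
    """
    Na = Zrd.shape[0]
    # ML: phase of the range-summed correlation between adjacent azimuth bins
    ml = np.zeros(Na)
    ml[1:] = np.angle(np.sum(np.conj(Zrd[:-1]) * Zrd[1:], axis=1))
    # LUMV: azimuth finite difference G (with Z_{-1,:} = 0, so G[0] = Zrd[0])
    G = Zrd.copy()
    G[1:] = Zrd[1:] - Zrd[:-1]
    lumv = (np.imag(np.conj(Zrd) * G).sum(axis=1)
            / (np.abs(Zrd) ** 2).sum(axis=1))
    return ml, lumv
```

For a pure phase ramp across azimuth bins, ML returns the exact phase increments while LUMV returns their sines, so the two estimators agree for small gradients.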
Denote X ∈ C^{H×W} as a complex-valued image; the entropy is defined as

E(X) = −Σ_{i=1}^{H} Σ_{j=1}^{W} ( |X|²_{ij} / C ) ln( |X|²_{ij} / C ),   (5)

where H, W are the height and width of the image, respectively, |X|_{ij} is the element in the i-th row and j-th column of the amplitude image |X| ∈ R^{H×W}, ln is the natural logarithm, and the scalar C ∈ R can be computed by [24]

C = Σ_{i=1}^{H} Σ_{j=1}^{W} |X|²_{ij}.   (6)

Contrast is another metric used to evaluate an image's focusing quality. In [30], contrast is defined as the ratio of the standard deviation of the target energy to the mean value of the target energy

C(X) = sqrt( E[ ( |X|² − E(|X|²) )² ] ) / E(|X|²),   (7)

where E(·) denotes the mathematical expectation operation. The better the image focus quality, the greater the contrast, and vice versa. The Minimum-Entropy-based Autofocus (MEA) algorithm aims at minimizing

min_φ E(Y) = −Σ_{n_a, n_r} ( |Y|²_{n_a n_r} / C ) ln( |Y|²_{n_a n_r} / C ) = ln C − (1/C) Σ_{n_a, n_r} |Y|²_{n_a n_r} ln |Y|²_{n_a n_r},   (8)

where φ is the phase error vector, and Y is the compensated image, which can be computed using Equation (1). Since C is a constant, minimizing Equation (8) is equivalent to minimizing the following equation

L(φ) = −Σ_{n_a, n_r} |Y|²_{n_a n_r} ln |Y|²_{n_a n_r}.   (9)
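The entropy and contrast metrics defined above translate directly into code; a NumPy sketch:

```python
import numpy as np

def image_entropy(X):
    """Entropy E = -sum(p * ln p) with p = |X|^2 / C and C = sum(|X|^2)."""
    power = np.abs(X) ** 2
    p = power / power.sum()
    p = p[p > 0]                       # skip zero bins to avoid log(0)
    return -np.sum(p * np.log(p))

def image_contrast(X):
    """Contrast: sqrt(E[(|X|^2 - E|X|^2)^2]) / E[|X|^2]."""
    power = np.abs(X) ** 2
    return np.sqrt(np.mean((power - power.mean()) ** 2)) / power.mean()
```

A single bright point gives zero entropy and maximal contrast, while a uniform image gives maximal entropy (ln of the pixel count) and zero contrast, matching the "better focus, smaller entropy, larger contrast" behavior described in the text.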
Utilizing the gradient descent method, one can optimize Equation (9); the iterative update formula can be expressed as

φ^{t+1} = φ^t − μ ∂L/∂φ^t,   (10)

where μ is the learning rate, φ^{t+1} is the updated phase error vector, t = 0, 1, · · · , N_iter is the iteration counter, and N_iter is the maximum number of iterations. The partial derivative of L with respect to φ_k can be formulated as

∂L/∂φ_k = −Σ_{n_a, n_r} ( 1 + ln|Y|²_{n_a n_r} ) ∂|Y|²_{n_a n_r}/∂φ_k,   (11)

where k = 0, 1, · · · , N_a − 1. According to [24], the final expression is

∂L/∂φ_k = 2 Imag( Σ_{n_r} exp(jφ_k) X̃*_{k n_r} F̃_{k n_r} ),   (12)

where F̃ can be calculated by the azimuth Fourier transform

F̃_{k n_r} = (1/N_a) Σ_{n_a} ( 1 + ln|Y_{n_a n_r}|² ) Y_{n_a n_r} exp(−j 2π k n_a / N_a) = (1/N_a) F_a[ ( 1 + ln|Y_{n_a n_r}|² ) Y_{n_a n_r} ].   (13)
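Putting the update rule and the analytic gradient together gives a compact iterative minimum-entropy autofocus for Equation (9). The sketch below uses NumPy's FFT conventions (the exact signs and normalization of the gradient depend on the FFT convention chosen); the step size and iteration count are illustrative, not tuned values from the paper:

```python
import numpy as np

def mea_autofocus(X, mu=1e-3, n_iter=100):
    """Minimum-entropy autofocus: gradient descent on L = -sum(P ln P), P = |Y|^2."""
    Na = X.shape[0]
    Xrd = np.fft.fft(X, axis=0)                      # range-Doppler data (fixed)
    phi = np.zeros(Na)
    for _ in range(n_iter):
        Y = np.fft.ifft(np.exp(-1j * phi)[:, None] * Xrd, axis=0)
        P = np.abs(Y) ** 2 + 1e-12                   # avoid log(0)
        # F_tilde = (1/Na) * F_a[(1 + ln P) * Y]
        F = np.fft.fft((1.0 + np.log(P)) * Y, axis=0) / Na
        # dL/dphi_k = 2 * Imag( sum_nr exp(j*phi_k) * conj(Xrd) * F )
        grad = 2.0 * np.imag(np.exp(1j * phi)[:, None] * np.conj(Xrd) * F).sum(axis=1)
        phi = phi - mu * grad
    Y = np.fft.ifft(np.exp(-1j * phi)[:, None] * Xrd, axis=0)
    return phi, Y
```

Each iteration costs two azimuth FFTs, which is why MEA is accurate but slow compared with the non-iterative method proposed in this paper.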
In general, for different types of phase error, φ can be modeled in different forms. Modeling φ reduces the number of parameters that need to be optimized and the complexity of the problem. In this paper, we focus on the polynomial-type phase error, which can be formulated as

φ = Σ_{q=2}^{Q} α_q p^q,   (14)

where p ∈ R^{N_a×1} is the azimuth frequency vector, which can be normalized to [−1, 1]; α = [α_2, α_3, · · · , α_Q]^T is the polynomial coefficient vector; and Q is the order of the polynomial. The constant and linear terms are omitted, since they only shift the image and do not cause defocusing. The minimum-entropy-based methods are not restricted by the assumptions of PGA, but require many iterations to converge. As a result, these methods are more robust than PGA and achieve a higher focus quality, but suffer from slow speed. In this paper, we focus on the development of a non-iterative autofocus algorithm based on machine learning. An ensemble-based machine-learning model is proposed to predict the polynomial coefficients. The azimuth phase errors are computed according to Equation (14). The SAR image can then be focused by compensating for the errors in azimuth using Equation (2).
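Following Equation (14), a small helper (the name and normalization choice are our own) that maps a coefficient vector α = [α_2, ..., α_Q] to a phase-error vector over a normalized azimuth axis:

```python
import numpy as np

def polynomial_phase(alpha, Na):
    """Build phi from polynomial coefficients alpha = [a_2, ..., a_Q] over a
    normalized azimuth frequency axis p in [-1, 1] (orders 0 and 1 omitted)."""
    p = np.linspace(-1.0, 1.0, Na)
    phi = np.zeros(Na)
    for i, a in enumerate(alpha):
        phi += a * p ** (i + 2)       # first coefficient multiplies p^2
    return phi
```

With Q = 7, as used in the experiments, the model predicts K = Q − 1 = 6 coefficients per image.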

Materials and Methods
In this section, ensemble learning and extreme learning machine are briefly introduced, and the proposed ensemble-learning-based autofocus method is described in detail.

Ensemble Scheme
Ensemble learning combines several weak but diverse models with certain combination rules to form a strong model. The keys to ensemble learning are diverse individual learners and an effective combination strategy. In ensemble learning, individual learners can be homogeneous or heterogeneous. A homogeneous ensemble consists of members built with a single type of base learning algorithm, such as the decision tree, support vector machine, or neural network, while a heterogeneous ensemble consists of members with different base learning algorithms. Homogeneous learners are most commonly used [42].
Classical ensemble methods include bagging, boosting, and stacking-based methods. These methods have been well-studied in recent years and applied widely in different applications [43]. The key idea of a boosting-based algorithm is: the samples used to train the current individual learner are weighted according to the learning errors of the previous individual learner. Thus, the larger the errors in a sample used by the previous individual learner, the greater the weight that is set for this sample, and vice versa [44]. Therefore, in the boosting-based algorithm, there is a strong dependence among individual learners.
It is not suitable for parallel processing and has a low training efficiency. The bagging (bootstrap aggregating) ensemble method is based on bootstrap sampling [37]. Suppose there are N training samples and M individual learners; then, N samples are randomly sampled with replacement from the original N samples to form a training set. M training sets for M individual learners can be obtained by repeating the sampling M times. Therefore, in the bagging-based method, there is no strong dependence between individual learners, which makes it suitable for parallel training. In this paper, the bagging-based ensemble method is utilized to form data diversity.
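The bootstrap sampling step reduces to drawing N indices with replacement, once per learner; a minimal sketch (function name is our own):

```python
import numpy as np

def bagging_indices(n_samples, n_learners, rng=None):
    """Bootstrap sampling: for each learner, draw n_samples indices with
    replacement from the original training set."""
    rng = np.random.default_rng(rng)
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_learners)]
```

Because sampling is with replacement, each bootstrap set typically contains duplicates and omits roughly a third of the original samples, which is what creates the data diversity among the individual learners.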
In ensemble learning, three combination strategies have been widely used: averaging, voting, and learning-based strategies [45]. For regression problems, the first method is usually utilized, i.e., averaging the outputs of the M individual learners to obtain the final output. The second strategy is usually used for classification problems: the winner is the candidate with the maximum total number of votes [46]. The learning-based method is different from the above two; it takes the outputs of the M individual learners as the inputs of a new learner, and the combination rules are learned automatically. To combine the results of multiple individual autofocus learners, we propose a metric-based combination strategy. In other words, the winner is the candidate with the optimal metric value (such as minimum entropy or maximum contrast). The framework of our proposed ensemble-learning-based autofocus algorithm is illustrated in Figure 1, where "PEC" represents the phase error compensation module, which is formulated by Equation (2). In Figure 1, there are M homogeneous individual learners. Each learner is a Convolutional Extreme Learning Machine (CELM). Denote X ∈ C^{N_a×N_r} as a defocused SAR image, where N_a, N_r are the numbers of pixels in azimuth and range, respectively. We can obtain M estimated phase error vectors φ^(1), φ^(2), · · · , φ^(M). These vectors are used to compensate for the defocused image X, and M focused images Y^(1), Y^(2), · · · , Y^(M) are obtained. Finally, our proposed metric-based combination strategy is applied to these images to obtain the final result. For example, if entropy is utilized as the metric, then the final focused image can be expressed as

Y = argmin_{Y^(m), m=1,··· ,M} E(Y^(m)).   (15)

Similarly, if contrast is utilized as the metric, then the final focused image can be expressed as

Y = argmax_{Y^(m), m=1,··· ,M} C(Y^(m)).   (16)
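The metric-based combination strategy (e.g., Equation (15) for entropy) simply scores each candidate image and keeps the best one; a generic sketch:

```python
import numpy as np

def combine_by_metric(candidates, metric, mode="min"):
    """Return the candidate image with the best metric value and its index.

    candidates : list of compensated images Y^(1), ..., Y^(M)
    metric     : callable scoring one image (e.g., entropy or contrast)
    mode       : "min" for entropy-like metrics, "max" for contrast-like ones
    """
    scores = np.array([metric(Y) for Y in candidates])
    best = int(scores.argmin() if mode == "min" else scores.argmax())
    return candidates[best], best
```

With an entropy function and mode="min" this realizes the minimum-entropy selection; with a contrast function and mode="max", the maximum-contrast selection. Unlike averaging, selecting a single winner cannot suffer from the phase-cancellation problem discussed later in the paper.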

Convolutional Extreme Learning Machine
The original ELM is a three-layer neural network (input, hidden, and output layers) designed for processing one-dimensional data. Denote x ∈ R^{d×1} as the input vector and L as the number of neurons in the hidden layer. Let a_i ∈ R^{d×1} represent the weight between the input x and the i-th neuron of the hidden layer, and let b_i ∈ R be the bias. The output of the i-th hidden layer neuron can be expressed as

h_i = g(a_i^T x + b_i),   (17)

where g is a nonlinear piecewise continuous function (the activation function in traditional neural networks). The L outputs of the L hidden layer neurons can be represented as h = [h_1, h_2, · · · , h_L]^T ∈ R^{L×1}. Denote β ∈ R^{L×K} as the weight matrix from the hidden layer to the output layer; K is the number of neurons in the output layer. For a classification problem, K is the number of classes; for a regression problem, K is the dimension of the vector to be regressed. The output of ELM can be formulated as

f(x) = β^T h.   (18)

Suppose there is a training set with N training samples {(x_n, t_n)}_{n=1}^{N}, where t_n ∈ R^{K×1} is the truth-value vector (for the classification problem, t_n is the one-hot class label vector). The hidden layer feature matrix of these N samples is H = [h_1, h_2, · · · , h_N]^T ∈ R^{N×L}, and the target matrix is T = [t_1, t_2, · · · , t_N]^T ∈ R^{N×K}. The classification or regression problem for ELM is to optimize

min_β ||β||^{σ_1}_p + λ ||Hβ − T||^{σ_2}_q,   (19)

where λ is the regularization factor. Equation (19) can be solved by an iterative method, the orthogonal projection method, or singular value decomposition [34,47]. When σ_1 = σ_2 = p = q = 2, Equation (19) has the following closed-form solution [32]

β = (H^T H + I/λ)^{−1} H^T T,   (20)

where I is an identity matrix. The process of solving β does not need iterative training, and it is very fast. The original ELM can only deal with one-dimensional data. A two-dimensional or higher-dimensional input is usually flattened to a vector. This flattening operation destroys the original spatial structure of the input data and leads ELMs to perform poorly in image-processing tasks. To overcome this problem, Huang et al. [48] proposed the Local Receptive-Fields-Based Extreme Learning Machine (ELM-LRF).
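The closed-form solution of Equation (20) is a single ridge-regression solve; a NumPy sketch with the regularization factor λ as in the text:

```python
import numpy as np

def elm_output_weights(H, T, lam=1.0):
    """Closed-form ELM output weights: beta = (H^T H + I/lam)^{-1} H^T T.

    H : (N, L) hidden-layer feature matrix
    T : (N, K) target matrix
    """
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(L) / lam, H.T @ T)
```

No iterative training is involved: one feature pass and one linear solve, which is the source of the speed advantage emphasized throughout the paper.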
Differing from the traditional Convolutional Neural Network (CNN), the size and shape of the receptive field (convolutional kernel) of ELM-LRF can be generated according to the probability distribution. In addition, CNN uses a back-propagation algorithm to iteratively adjust the weights of all layers, while ELM-LRF has a closed-form solution.
In this paper, we propose a Convolutional Extreme Learning Machine (CELM) method for phase error estimation. The network structure of a single CELM is illustrated in Figure 2. It contains a convolutional (Conv) layer, an Instance Normalization (IN) layer [49], a Leaky Rectified Linear Unit (LeakyReLU) nonlinearity [50], a Global Average Pooling (GAP) layer in range, a flattening layer, and an output layer. As mentioned above, in order to simplify the prediction problem, we use CELM to estimate the polynomial coefficients instead of the phase errors. In Figure 2, K denotes the number of polynomial coefficients and equals Q − 1, where Q is the order of the polynomial.
Figure 2. The structure of a single Convolutional Extreme Learning Machine for autofocus. The CKS in azimuth is set to 63; the convolution stride is 1.
The detailed configuration of CELM is shown in Table 1. Suppose there is a complex SAR image of 256 pixels in both height and width. Denote C o as the number of channels produced by convolution, and n as the number of images in a batch. The output size of each layer in CELM is also displayed in Table 1. As shown in Figure 2 and Table 1, there is only one convolutional layer in a CELM. The convolution stride is set to 1. In Figure 2, the convolution kernel sizes for azimuth and range are 63 and 1, respectively.

Table 1. The detailed configuration of CELM (layer number, layer type, and output size of each layer).

Denote X ∈ R^{N×C_i×N_a×N_r} as the input batch, where N is the number of inputs, and N_a, N_r, C_i are the height, width, and number of channels of X, respectively. In this paper, the convolution kernels between channels do not share weights. Denote A ∈ R^{C_o×C_i×H_k×W_k} as the weight matrix of the convolution kernels, where H_k, W_k are the height and width of the convolution kernel, and C_o is the number of channels produced by the convolution. The convolution between A and X can be formulated as

O_{n,c_o,:,:} = Σ_{c_i=0}^{C_i−1} A_{c_o,c_i,:,:} * X_{n,c_i,:,:},   (21)

where n = 0, 1, · · · , N − 1, * represents the classic two-dimensional convolution operation, X_{n,c_i,:,:} is the c_i-th channel of the n-th image of X, and O ∈ R^{N×C_o×H_o×W_o}. In this paper, C_i equals 2, since the defocused complex-valued SAR image is first converted into a two-channel image (real channel image and imaginary channel image) before being fed into the CELM. As the phase distortion is in azimuth, we use azimuth convolution to extract features. Thus, the weight of the convolutional layer is a matrix of size C_o × 2 × r_a × 1, where C_o is the number of channels produced by the convolution, 2 is the number of channels of the input image, and r_a is the kernel size in azimuth.
The convolved features O are then normalized by instance normalization:

Ô_{n,c_o,:,:} = ( O_{n,c_o,:,:} − μ_{n,c_o} ) / sqrt( σ²_{n,c_o} + ε ),   (22)

where C_o, H_o, W_o are the channels, height, and width of O, respectively, and the mean value μ and variance σ² are calculated per image and per channel by

μ_{n,c_o} = (1/(H_o W_o)) Σ_{h,w} O_{n,c_o,h,w},  σ²_{n,c_o} = (1/(H_o W_o)) Σ_{h,w} ( O_{n,c_o,h,w} − μ_{n,c_o} )².   (23)

After convolution and instance normalization, a LeakyReLU activation is applied to the normalized features Ô. Mathematically, the LeakyReLU function is expressed as

LeakyReLU(x) = x, if x ≥ 0; γx, otherwise,   (24)

where γ is the negative slope, set to 0.01 in this paper. Denote Õ = LeakyReLU(Ô) as the output features of the LeakyReLU nonlinearity. By applying the GAP operation to Õ in the range direction for dimension reduction, the features after pooling can be expressed as

H̃_{n,c_o,h} = (1/W_o) Σ_{w=0}^{W_o−1} Õ_{n,c_o,h,w},   (25)

where H̃ denotes the features after the range GAP. Thus, each feature map is reduced to a feature vector. For an image, C_o feature vectors are generated. These C_o feature vectors are flattened into a long feature vector h ∈ R^{L×1} by the flatten operation. Combining the N feature vectors h_1, h_2, · · · , h_N gives the feature matrix

H = [h_1, h_2, · · · , h_N]^T ∈ R^{N×L}.   (26)

Similar to ELM-LRF, the convolution layer weights are fixed after random initialization. The weights β from the hidden layer to the output (polynomial coefficients) can be solved by Equation (20).
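The full feature-extraction pipeline of a single CELM, culminating in the feature matrix of Equation (26), can be sketched in NumPy as follows; the loop-based 'valid' azimuth convolution is written for clarity, not speed:

```python
import numpy as np

def celm_features(X, A, gamma=0.01, eps=1e-5):
    """Feature matrix H of one CELM: azimuth convolution -> instance norm
    -> LeakyReLU -> range GAP -> flatten.

    X : (N, 2, Na, Nr) real/imaginary channels of N defocused images
    A : (Co, 2, ra, 1) fixed random convolution kernels (azimuth size ra)
    """
    N, Ci, Na, Nr = X.shape
    Co, _, ra, _ = A.shape
    Ho = Na - ra + 1                                   # 'valid' convolution in azimuth
    O = np.zeros((N, Co, Ho, Nr))
    for co in range(Co):
        for ci in range(Ci):
            k = A[co, ci, :, 0]                        # cross-correlation, as in DL conv layers
            for i in range(Ho):
                O[:, co, i, :] += np.tensordot(X[:, ci, i:i + ra, :], k, axes=([1], [0]))
    mu = O.mean(axis=(2, 3), keepdims=True)            # instance normalization
    var = O.var(axis=(2, 3), keepdims=True)
    O = (O - mu) / np.sqrt(var + eps)
    O = np.where(O >= 0, O, gamma * O)                 # LeakyReLU
    H = O.mean(axis=3)                                 # global average pooling in range
    return H.reshape(N, -1)                            # (N, L) with L = Co * Ho
```

One feature row per image is produced; these rows form the H matrix passed to the closed-form output-weight solve. (Padding conventions may differ from the paper's implementation; 'valid' convolution is an assumption of this sketch.)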

Model Training and Testing
In this paper, the classical bagging ensemble-learning method is applied to generate diverse data and train the CELMs. The model trained with the bagging-based method is called Bagging-ECELMs. Suppose there is a training dataset S_train = {X_n, α_n}_{n=1}^{N_train} and a validation dataset S_valid = {X_n, α_n}_{n=1}^{N_valid}, where X_n ∈ C^{N_a×N_r} is the n-th defocused image, α_n ∈ R^{K×1} is the polynomial phase error coefficient vector of X_n, and N_a and N_r are the numbers of pixels in azimuth and range, respectively. Denote M as the number of CELMs. In order to train the M CELMs, N samples are randomly selected from the training set S_train as the training samples of a single CELM, and M training sets are obtained by repeating this process M times. The validation dataset S_valid is utilized to select the best regularization factor λ in Equation (19). Assuming that N_λ regularization factors are set in the experiment, each CELM will be trained N_λ times.
The training of a single CELM consists of two main steps: randomly initializing the input weights A (the weights of the convolution layer) and calculating the output weights (Equation (20)). The input weights are randomly generated and then orthogonalized using Singular Value Decomposition (SVD) [48]. Assume there are C_o convolutional output channels and the convolution kernel size is r_a × 1, where r_a is the kernel size in azimuth and 1 is the kernel size in range. Firstly, generate 2C_o convolution kernel weights {w_i ∈ R^{r_a×1}}_{i=1}^{2C_o} from a standard Gaussian distribution. Secondly, combine these weights in order into a matrix

A_init = [w_1, w_2, · · · , w_{2C_o}] ∈ R^{r_a×2C_o}.   (27)

Thirdly, orthogonalize the weight matrix A_init ∈ R^{r_a×2C_o} with SVD, and obtain the orthogonalized weights Ā = [ā_1, ā_2, · · · , ā_{2C_o}] ∈ R^{r_a×2C_o}. Finally, reshape the weights Ā into a matrix of size C_o × 2 × r_a × 1 to obtain the final input weights A.
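The initialization procedure above can be sketched in NumPy (this sketch assumes r_a ≥ 2C_o so that all kernels can be made orthonormal; ELM-LRF orthogonalizes the transpose otherwise):

```python
import numpy as np

def init_conv_weights(Co, ra, rng=None):
    """Random Gaussian kernels orthogonalized via SVD (assumes ra >= 2*Co),
    reshaped to the (Co, 2, ra, 1) convolution-layer layout."""
    rng = np.random.default_rng(rng)
    A_init = rng.standard_normal((ra, 2 * Co))     # one column per kernel
    U, _, Vt = np.linalg.svd(A_init, full_matrices=False)
    A = U @ Vt                                     # nearest matrix with orthonormal columns
    return A.T.reshape(Co, 2, ra, 1)
```

The resulting kernels are mutually orthonormal, which, as in ELM-LRF, improves the diversity of the random features extracted by the fixed convolution layer.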
The pseudocode for training Bagging-ECELMs is summarized in Algorithm 1, where the entropy-based combination strategy is utilized (Equation (15)). The testing process of the Bagging-ECELMs model is very simple; see Algorithm 2 for details.

Algorithm 1: Training Bagging-ECELMs (entropy-based combination).
1: for m = 1 to M do
2:   draw a bootstrap training set S^(m)_train from S_train
3:   s_min ← +∞
4:   randomly generate the input weights A^(m)
5:   orthogonalize A^(m) utilizing SVD
6:   for n_λ = 1 to N_λ do
7:     compute the feature matrix H_train of S^(m)_train using Equation (26)
8:     compute the output weights β using Equation (20)
9:     compute the feature matrix H_valid of S_valid using Equation (26)
10:    compute the estimated phase error coefficients H_valid β
11:    compute the phase error vector using Equation (14) and focus each validation image using Equation (2)
12:    compute the entropy s of all the focused images
13:    if s < s_min then
14:      β^(m) ← β
15:      s_min ← s

Algorithm 2: Testing Bagging-ECELMs. For each of the M CELMs, compute the estimated phase error coefficients, compute the phase error using Equation (14), and focus X using Equation (2); output the focused image with the minimum entropy.

Experimental Results
This section presents the results obtained with the proposed autofocus method. Firstly, the datasets used are described in detail. Secondly, implementation details, together with the obtained results, are presented and discussed. All experiments were run in PyTorch 1.8.1 on a workstation equipped with an Intel E5-2696 2.3 GHz CPU, 64 GB of RAM, and an NVIDIA 1080TI GPU. Our code is available at https://github.com/aisari/AutofocusSAR (accessed on 25 June 2021).

Dataset Description
The data used for this work were acquired by the Advanced Land Observing Satellite (ALOS) in fine mode. The ALOS satellite was developed by the Earth Observation Research Center of the Japan Aerospace Exploration Agency; it entered service in 2006 and was retired in 2011. ALOS is equipped with a Phased Array L-band Synthetic Aperture Radar (PALSAR).
The PALSAR has three working modes: fine mode, scanning mode, and polarimetric mode. Specific parameters of the PALSAR in fine mode are shown in Table 2, where PRF represents the Pulse Repetition Frequency, i.e., the sampling rate in azimuth. As shown in Table 2, there are two resolution modes in fine mode: high-resolution (HR) and low-resolution (LR). In the high-resolution mode, the azimuth resolution is about 5 m, the slant-range resolution is up to 5 m, and the ground resolution is about 7 m. Nine groups of SAR raw data were used in the experiment, covering the areas of Vancouver, Xi'an, Heiwaden, Hefei, Florida, Toledo, and Simi Valley. More detailed information, including the scene name, acquisition date, effective velocity (V_r), and Pulse Repetition Frequency (PRF), is given in Table 3. All the raw data can be acquired from https://search.asf.alaska.edu/ (accessed on 25 May 2018) by searching the scene name. A world map of the nine areas is available from our code repository.
The range-Doppler algorithm is utilized to process the raw data. Since the original images are very large, we selected a subregion with a size of 8192 × 8192 from each image. The imaging results of the nine sub-images, processed by the range-Doppler algorithm, are shown in Figure 3. The selected areas include sea surface, urban areas, rural areas, mountains, and other terrains with varying texture complexity, which is an important guarantee for verifying the performance of the autofocus algorithms. We generated azimuth phase errors by simulating an estimation error of the equivalent velocity. Of course, the phase errors could also be generated by directly generating polynomial coefficients. The velocity estimation error was set within the interval [V_r − 25 m/s, V_r + 25 m/s] with a sampling interval of 2 m/s, and the range-Doppler algorithm was used for imaging. Thus, for every SAR raw data matrix, 25 defocused complex-valued SAR images were generated. The images corresponding to sequence numbers 2, 3, 4, 5, and 8 in Table 3 were used to construct the training dataset. The images corresponding to sequence numbers 6 and 7 in Table 3 were used to construct the validation dataset. The images corresponding to sequence numbers 1 and 9 in Table 3 were used to construct the testing dataset. Image patches with a size of 256 × 256 were selected from these images to create the dataset. We randomly selected 20,000 image patches for training from the 5 × 25 = 125 defocused training images. A total of 8000 validation image patches were selected from the 2 × 25 = 50 defocused validation images, and 8000 testing image patches were selected from the 2 × 25 = 50 defocused testing images.
The entropies of the above unfocused training, validation, and testing images were 9.9876, 10.2911, and 10.0474, respectively. The contrast levels in the above unfocused training, validation, and testing images were 3.3820, 1.9860, and 3.4078, respectively.

Performance of the Proposed Method
In this experiment, the degree of the polynomial (Equation (14)) was set to Q = 7; thus, each CELM had K = 6 output neurons. An entropy-based combination strategy was used in this experiment. To analyze the influence of the number of CELMs on focusing performance, M was chosen from M = {1, 2, 4, 8, 16, 32, 64}. All CELMs had the same modules, as illustrated in Figure 2. The number of convolution kernels was set to C_o = 32. The regularization factor λ was chosen from {0.01, 0.1, 1, 10, 100}. For each CELM, 3000 samples were randomly chosen from the above training dataset for training. The batch size was set to 10. The NVIDIA 1080TI GPU was utilized for training and testing.
Firstly, we analyzed the influence of the convolution kernel size (CKS) r_a on the performance of the proposed model. In this experiment, the number of CELMs was set to 1, and the kernel size in azimuth was chosen from {1, 3, · · · , 63}. After training, the entropy and contrast metrics were evaluated on the training, validation, and testing datasets, respectively. The results are illustrated in Figure 4. As we can see from Figure 4a,b, the performance was best when r_a = 17. The corresponding entropy and contrast on the testing dataset were 9.9931 and 3.7952, respectively. Secondly, the influence of the number of CELMs with the same CKS on focusing performance was analyzed. In this experiment, the number of CELMs was chosen from the set M. The CKS in azimuth of all CELMs was set to 3 and 17, respectively. The training time of the model on the 1080TI GPU device (see Algorithm 1 for training details) is displayed in Tables 4 and 5. After training, we tested the trained model on the testing dataset. Then, the entropy, contrast, and testing time were evaluated; the results are shown in Tables 4 and 5. It can be seen from Tables 4 and 5 that the higher the number of CELMs, the better the focusing quality, but the longer the focusing time. Furthermore, regardless of the number of CELMs, the performance of Bagging-ECELMs with CKS 17 is much better than that of Bagging-ECELMs with CKS 3. Thirdly, the influence of the number of CELMs with different CKSs on focusing performance was analyzed. Suppose there are M CELMs; the azimuth CKS of the m-th CELM is set according to Equation (28). After training all the CELMs, our proposed model was evaluated on the above training, validation, and testing datasets. The results are illustrated in Figure 5 and Table 6. In Figure 5, when the number of CELMs is 0, no autofocus is applied. As is known, the smaller the entropy and the greater the contrast, the better the focusing quality.
We can conclude that the higher the number of individual learners (CELMs), the higher the focusing quality. The autofocus time of the proposed model is approximately linear with the number of CELMs. However, when the number of CELMs is large, increasing the number of individual learners has little effect on the focus quality.
The detailed numerical results are given in Table 6. The entropy, contrast, and testing time (Algorithm 2) metrics are evaluated on the testing dataset. The training time metric is evaluated on the training and validation datasets; see Algorithm 1 for details. As we can see from Table 6, the training time of the proposed model is directly proportional to the number of individual learners. Comparing the results in Tables 4-6 and Figure 4, it can be found that the convolution kernel size has a great influence on the performance of the model. When the optimal kernel size is unknown, using different kernel sizes can yield more optimal solutions. Finally, to verify the effectiveness of the proposed combination strategy, the classical average combination strategy, which averages the outputs of the M CELMs, was tested. In this experiment, different CKSs were used, computed by Equation (28). The performance with different numbers of CELMs on the testing dataset is shown in Table 7. The training time, evaluated on the training and validation datasets, is also provided. From Tables 6 and 7, we can conclude that our proposed entropy-based combination strategy obtains a higher focus quality. The reason the averaging method does not work well is that the phase errors predicted by different CELMs may cancel each other out.

Comparison with Existing Autofocus Algorithms
In this experiment, we compared the proposed method with the existing autofocus methods PGA-ML, PGA-LUMV [16], and MEA [51]. The training, validation, and testing datasets described in Section 4.1 were used. In the original PGA algorithm, the window size is set manually; if not set properly, the algorithm will not converge. However, it is difficult to manually set the window size for the above 8000 test images, so we implemented an adaptive method to determine the window size. Denote Z as the complex-valued image data in which the dominant scatterers are center-shifted. The threshold value T, which determines the window size, is calculated from the azimuth power profile of Z in decibels, v = 20 log₁₀(·), where N_a, N_r are the numbers of pixels in azimuth and range. Denote i_s, i_e as the positions that satisfy the threshold condition; the window is taken between them. The results of the different autofocus algorithms on the testing dataset are shown in Table 8. In Table 8, MEA-1, MEA-10, and MEA-100 represent the MEA algorithm with learning rates 1, 10, and 100, respectively. As is known, an image with lower entropy and higher contrast has a better focus quality. As shown in Table 8, our proposed method and MEA have a better focus quality than the PGA-based methods. In order to intuitively show the focusing performance of the different methods, three scenes with different texture complexities and defocusing levels were selected in the experiment. Figure 6 shows the autofocus results of the PGA-LUMV, MEA, and proposed autofocus algorithms. It can be seen from the figure that the proposed algorithm and the MEA algorithm are suitable for different scenes. However, the phase-gradient-based methods depend on strong scattering points, so PGA-LUMV fails for the scene without strong scattering points, as shown in Figure 6j. The phase error curves of the three scenes, estimated by the above three methods, are shown in Figures 7-9, respectively. It can be seen from Figures 7 and 9 that the 1st and 3rd images have large phase errors and are seriously defocused.
However, the 2nd image has small phase errors. We can see that the phase errors estimated by our proposed method are the closest to the results of MEA. In the experiment, we also evaluated the focusing speed of the above four algorithms on the testing dataset. The NVIDIA 1080TI GPU and the Intel E5-2696 CPU were used for these algorithms; the results are shown in Tables 9 and 10, respectively. It should be noted that the PGA-based algorithms performed more slowly on the GPU than on the CPU. This is because the center-shifting of dominant scatterers cannot be effectively parallelized. It is well known that PGA has fast convergence and a sufficient performance for low-frequency errors, but is not suitable for estimating high-frequency phase errors [41]. Meanwhile, MEA requires more iterations and more time to converge, but can obtain a more accurate phase error estimate. From the results in Tables 8-10, we can conclude that our proposed algorithm achieves a good trade-off between focusing speed and quality.

Discussion
SAR autofocus is a key technique for obtaining high-resolution SAR images. Minimum-entropy-based algorithms usually achieve high focusing quality but suffer from slow focusing speed. Phase-gradient-based methods are fast but perform poorly (or even fail) in scenes where no dominant scatterer exists. Our proposed machine-learning- and ensemble-learning-based autofocus algorithm (Bagging-ECELMs) offers a good trade-off between focusing quality and speed. The experimental results presented in Section 4.3 provide evidence for these conclusions. In Section 4.2, the performance of our proposed method is thoroughly analyzed. Firstly, we found that the size of the convolution kernel has a great influence on the performance of the model. Traversing all kernel sizes is often inefficient and sometimes impossible; instead, using different kernel sizes across the ensemble yields performance closer to the optimal solution (see Tables 4-6). Secondly, our proposed metric-based combination strategy is much more effective than the classical average-based combination strategy: the phase errors predicted by different CELMs may have opposite signs, so averaging them can lead to phase-error cancellation. Last, but not least, we can easily conclude that our proposed Bagging-ECELMs method performs much better than a single CELM.
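The metric-based combination idea can be illustrated with a minimal sketch: among the phase-error estimates produced by the individual CELMs, keep the one whose compensated image has the lowest entropy, rather than averaging estimates whose signs may cancel. The paper's actual strategy may differ in detail; `compensate` (a user-supplied compensation routine) and all names here are illustrative assumptions.

```python
import numpy as np

def image_entropy(intensity):
    """Entropy of the normalized intensity image (lower = better focused)."""
    p = intensity / intensity.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def metric_combine(signal, predictions, compensate):
    """Metric-based combination: keep the CELM phase-error estimate whose
    compensated image has the lowest entropy, instead of averaging the
    estimates (averaging can cancel estimates with opposite signs)."""
    best_phi, best_H = None, np.inf
    for phi in predictions:
        H = image_entropy(compensate(signal, phi))
        if H < best_H:
            best_phi, best_H = phi, H
    return best_phi
```

For example, with a 1-D azimuth signal and `compensate(s, phi) = |ifft(s * exp(-1j*phi))|**2`, the estimate that best refocuses the signal is returned.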
However, our proposed Bagging-ECELMs method has the following three disadvantages. Firstly, the model can only be used for phase errors that can be modeled as polynomials. Secondly, a large number of samples is needed for training. Finally, the focusing quality is slightly worse than that of the minimum-entropy-based method. Bagging-ECELMs can replace PGA when it is used to correct polynomial-type phase errors. When higher image focusing quality is required and the type of phase error is unknown, the MEA method should be used. The prediction of Bagging-ECELMs can also serve as the initial value for MEA, to accelerate the convergence of MEA. In summary, Bagging-ECELMs is more suitable for real-time autofocus applications, while MEA is more suited to high-quality autofocus applications. Unlike MEA and PGA, Bagging-ECELMs requires no parameter tuning at the testing phase and is easier to use.
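The warm-start idea can be illustrated with a toy 1-D minimum-entropy refinement: gradient descent on image entropy started from the Bagging-ECELMs prediction rather than from zero. The finite-difference gradient and all names below are illustrative stand-ins under simplified assumptions, not the MEA variant of [51].

```python
import numpy as np

def entropy(I):
    """Entropy of the normalized intensity image."""
    p = I / I.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def focus(signal, phi):
    # compensate the azimuth phase error and form the intensity image
    return np.abs(np.fft.ifft(signal * np.exp(-1j * phi)))**2

def mea_refine(signal, phi0, lr=0.5, iters=20, eps=1e-4):
    """Refine a warm-start phase estimate phi0 (e.g. the Bagging-ECELMs
    prediction) by finite-difference gradient descent on image entropy.
    A good phi0 leaves only a small residual error to remove, so far
    fewer iterations are needed than when starting from zero."""
    phi = phi0.copy()
    for _ in range(iters):
        grad = np.empty_like(phi)
        for k in range(phi.size):  # numerical gradient, one bin at a time
            d = np.zeros_like(phi)
            d[k] = eps
            grad[k] = (entropy(focus(signal, phi + d))
                       - entropy(focus(signal, phi - d))) / (2 * eps)
        phi = phi - lr * grad
    return phi
```

Practical MEA implementations use an analytic entropy gradient instead of finite differences; only the warm-start principle is the point here.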
In future research, our work will focus on three aspects. Firstly, the proposed algorithm will be extended to correct sinusoidal phase errors. Secondly, boosting- or divide-and-conquer-based ECELMs will be developed. Finally, although the method proposed in this paper already offers a good trade-off between focusing quality and speed, it may be further improved through better combination strategies and network structures.

Conclusions
In this paper, we propose a machine-learning-based SAR autofocus algorithm. A Convolutional Extreme Learning Machine (CELM) is constructed to predict the polynomial coefficients of the azimuth phase error. To improve the prediction accuracy of a single CELM, a bagging-based ensemble learning method is applied. Experimental results on real SAR data show that this ensemble scheme effectively improves the accuracy of phase-error estimation, and that the proposed algorithm offers a good trade-off between focusing quality and speed. Future work will focus on sinusoidal phase-error correction, novel combination strategies, and ECELMs based on boosting or divide-and-conquer. Faster and more accurate SAR autofocus algorithms based on deep learning will also be studied.

Acknowledgments:
The authors wish to acknowledge the anonymous reviewers for providing helpful suggestions that greatly improved the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: