A Novel Method for Sea-Land Clutter Separation Using Regularized Randomized and Kernel Ridge Neural Networks

Classification of clutter, especially in the context of shore based radars, plays a crucial role in several applications. However, the task of distinguishing and classifying the sea clutter from land clutter has been historically performed using clutter models and/or coastal maps. In this paper, we propose two machine learning, particularly neural network, based approaches for sea-land clutter separation, namely the regularized randomized neural network (RRNN) and the kernel ridge regression neural network (KRR). We use a number of features, such as energy variation, discrete signal amplitude change frequency, autocorrelation performance, and other statistical characteristics of the respective clutter distributions, to improve the performance of the classification. Our evaluation based on a unique mixed dataset, which is comprised of partially synthetic clutter data for land and real clutter data from sea, offers improved classification accuracy. More specifically, the RRNN and KRR methods offer 98.50% and 98.75% accuracy, outperforming the conventional support vector machine and extreme learning based solutions.


Introduction
In the context of target tracking and target identification, radar returns from unwanted objects, termed as clutter, compete with the intended true returns. The exact candidates for clutter may vary depending on the context. For instance, when ground objects are of interest, radar returns from nearby sea, waves, and targets in the sea can be considered as clutter. However, when tracking sea-borne targets, returns from ground objects can be considered as clutter. Regardless of the origin of the clutter, their variability is often substantially high so as to render the tracking and identification of targets a difficult process. This difficulty in tracking and identification of targets becomes even more severely challenging when dealing with multiple clutter types. This scenario is very common at and near the interfaces of multiple surface types, such as air and land, air and sea, or land and sea, where the actual targets of concern are bound only to a single surface. For example, a ground based radar should focus only on ground based targets, and thus, returns from sea-bound targets should be treated as clutter. For this reason, discriminating between different clutter types is crucially vital for robust tracking and identification of objects from one surface type [1]. In this paper, we focus on deriving a novel, machine learning method for discriminating sea-land clutter. frequency ground wave radar. Another example is [17], where OCSVM was used to detect ship wakes using active sonar systems. In [18], another variant of SVM, cost sensitive support vector machine (2C-SVM), was used to handle clutter from multiple models of clutter, in particular the Gaussian and K-distributed clutters, and approximate the Neyman-Pearson detector for a specific false alarm rate. More recently, in [19], a KNN classifier was proposed to yield a pixel-level sea-land segmentation of the scene. The different solution types mentioned above are compared in Table 1. In this paper, we propose two novel techniques for discriminating and classifying the sea and land clutters using two supervised machine learning techniques, namely the regularized randomized neural network (RRNN) and kernel ridge regression (KRR). Although our approaches are supervised learning based, the overall focus here is the classification of clutter as opposed to suppression of clutter or direct target discrimination [2,9,[14][15][16][17][18]20]. As such, we use all possible information towards the classification process rather than discarding useful information in favor of target discrimination. Furthermore, the key challenges of distinguishing and classifying the sea-land clutter revolve around high quality features that can provide this aid in the classification between these two classes of clutter. To render the classification process more meaningful, we pay special attention to the feature space, on which the training process relies. In particular, the proposed feature space for the sea and land clutters includes classic parameters covering the statistical characteristics of clutters, short time energy, amplitude variation, and the frequency and properties around autocorrelation. The feature space we propose here, to the best of our knowledge, is novel and different from that of [2][3][4][5][6]. In doing these, we make the following contributions: 1. We propose a novel set of features for improving the classification performance, namely the energy, zero-crossing rate, the autocorrelation, and the shape parameter V (ECAV). 2. We present two novel classification systems based on the RRNN and KRR utilizing the ECAV feature space; and 3. We conduct an in-depth, exhaustive analysis of the proposed methods and compare against the routine baselines, namely the SVM and ELM, to observe the generalization and robustness of the proposed system.
The rest of the paper is organized as follows: In Section 2, we provide the essential background surrounding clutter models and the RRNN and KRR techniques. We then introduce our proposed techniques in Section 3. In Section 4, the experimental results are discussed, and the complexity of the proposed algorithm is analyzed. We finally conclude the paper in Section 5.

Clutter Modeling
Among many different approaches for modeling clutter [21], modeling them using a statistical (or probabilistic) distribution is a robust and proven technique. Example clutter models include Rayleigh, log-normal, Weibull, and K distributions [6,22].
In general, if a radar looks at a natural surface with a low resolution and at a not very low grazing angle so that no dominant shadowing and multi-path effects are experienced, the resultant clutter distribution is very likely to model the Rayleigh distribution. However, if the resolution cell becomes very small or if illuminated areas contain discrete man-made scatters, the exact number of effective scatters may decrease. With this, the distribution usually becomes much wider than the Rayleigh distribution. Furthermore, if the low grazing angle scatters, multi-path effects, and shadowing are accounted for, the amplitude statistics deviate from the Rayleigh distribution. In these cases, the log-normal, Weibull, and K distributions are the most commonly used distributions to approximate the resulting clutter distribution [6,22].
With this, the amplitude probability density function of a Weibull distribution is: where δ is the scale parameter and α is the shape parameter of the statistic model, which are usually used to model the sea, weather, and land clutter. The amplitude probability density function of the K distribution is: where Γ(·) is the gamma function and K v−s (·) is the S-order modified Bessel function. The model is characterized by the scale parameter β and shape parameter v. It is worth noting that the shape parameter is the measure of clutter spikiness. In other words, the clutter becomes spikier with v decreasing and becomes Gaussian distributed when v approaches infinity. The K distribution describes the clutter as a compound distribution, which consists of two components: the speckle and the local power. It is the particular case of the compound Gaussian (CG) model.

RRNN and KRR
Feed forward neural networks have been extensively used in many domains, primarily for their ability to approximate complex non-linear functions. More specifically, feed forward neural networks are a popular choice for classification tasks. However, there are several aspects that determine the overall performance, which is often judged by the accuracy of the classification. First, the training process, and thus the resulting classification performance on data other than training, relies on the amount of training data used for training, which, in our case, are a scarce resource. Secondly, the overall network performance depends on the optimal selection of the network parameters, which are often referred to as hyperparameters. The parameters for optimal tuning include the number of layers, the activation functions of the layers, the learning rate, the number of neurons in each layer, the loss function used to quantify the convergence, and the optimization function used to minimize the loss function. This is particularly a problem when the dimensionality of the hyperparameter space is large. A large body of work that investigates the hyperparameter optimization problem can be found in [23,24].
The RRNN is, however, a feed forward neural network with a single hidden layer. The weights of the hidden layer are chosen randomly, as opposed to relying on back-propagation or gradient descent, and the output layer is trained by a simple linear learning rule. This model reduces the dimensionality of the hyperparameter space, and the process of finding the optimal values for the hyperparameters becomes rather simplified. In other words, given an RRNN and a feature space, the network is trained by randomized selection of features during a single cycle of training. In particular, a nonlinear squashing function is used to compute the weighted sum of the inputs and to map the input data into the feature space. This does not imply that the parameters describing the hidden nodes, which are independent of the training data, are unconstrained. The parameters for the hidden nodes are randomly generated by adhering to a given continuous probability distribution function.
This approach, outlined and discussed in [25], is different from multi-layer based neural networks and offers excellent efficiency. Once the input weights and biases are chosen during the training, the overall output weight (of the output layer) setup can be considered as a linear system. Furthermore, the output weights (linking the hidden layer to the output layer) can analytically be determined through a simple generalized inverse operation of the hidden layer output matrices.
Consider a basic RRNN model with L neurons in the hidden layer and a training set with N arbitrary distinct samples of (x j , t j ). The output function for the network can then be expressed as [25]: where x j , w i , and β i are the input, input-to-hidden weight, and hidden-to-output weight vectors, respectively. Furthermore, g (.) and b i are the nonlinear piece-wise activation function and the bias of a hidden node i.
we render a nonlinear feature mapping between the feature space and the outputs. While the input weights are expressed between a pair of input and hidden nodes, the output weights connect the i-th hidden node to the output nodes. Thus, Equation (3) becomes: The output weights vector β should be determined by minimizing the L 2 norm. In other words, where T is the target matrix of the training dataset and H is the output matrix of hidden layers, given by: The approximation of β can be given by β * = H † T, where H † is the Moore-Penrose generalized inverse of H. In order to obtain better generalization performance of the learning network, the smallest training error and the smallest norm of the output weights must be sought as in [25], i.e., where C is the regularization parameter between the training error and the norm of the output weights. If N ≥ L (i.e., H has more rows than columns), then the approximated value of β can be given by: where I is an identity matrix of dimension L. Conversely, if N < L (i.e., H has more columns than rows), then the approximated value of β can be given by: where I is an identity matrix of dimension N.
Ridge regression is one of the classical statistical algorithms, and it is often used to transform a non-linear regression function in the original input space as a linear regression function in the feature space. Hence, the dual version of the ridge regression algorithm allows the use of kernel functions. Furthermore, the ridge regression is easy to generalize to KRR using the kernel technique. Assume that we have a training set {( . . , N}. For a linear system y = ωx, ridge regression minimizes the following cost [26]: where λ > 0 and fixed. Furthermore, λ controls the trade-off between the bias and variance of the estimate. The predictionŷ given by the ridge regression on a new unlabeled data point x is [26]: where K is matrix of dot products of the vectors, and κ is vector of dot products of x and vectors of the training set such that: and In fact, these are also the kernel functions of interest. When using feature vectors, each data point x i is replaced by a corresponding feature vector Φ(x i ). Thus, Using this in Equation (11), the representation of the output weights and the estimated output of the system can be expressed as: and: respectively. In this paper, we employ the RRNN and KRR algorithms and the ECAV based feature extractor to build a novel classification system. Figure 1 shows the overall organization of the system, consisting of four key steps, namely dataset management, preprocessing, feature extraction, and classification. The detailed aspects of these steps are as follows:

Overview of the System
• Dataset management: In general, real clutter data are rarely available due to the nature of the study. However, in our case, we obtained the sea clutter from the IPIX radar. The land clutter, however, has to be simulated. This component of the system handles the aspect of simulating the land clutter and mixes with the real sea clutter. More specifically, the land clutter is modeled as the Weibull amplitude compound Gaussian distribution.
• Preprocessing: To facilitate easier signal processing operations, the radar returns (whether simulated or real) have to be expressed in the quasi-stable state form. This is feasible only if the signals are pre-treated for short time interval information. In this stage, the sea and land clutter data would be separated into a number of short time interval fragments using framing and windowing techniques.

•
Feature extraction: The features of sea and land returns, based on their inherent nature and amplitude statistics, are extracted from the short time interval information. These features can reflect the different properties between the two clutters in the time domain, such as the energy, the zero-crossing rate, the autocorrelation function, and the shape parameters of the distribution. • Classification: Two popular machine learning algorithms, namely RRNN and KRR, are used to classify the sea and land clutter. Both algorithms rely on the features that have been extracted above.
In the following subsections, each of these is covered in detail.

Sea and Land Clutter Preparing
In this paper, sea clutter data were acquired from the IPIX (Intelligent PIXel) radar shared by McMaster University, Canada, with a carrier frequency of 9.3 GHz, pulse repetition frequency of 1 kHz, and range resolution of 30 m. The dataset also includes four polarization states. The datasets include a large number of pure sea clutter with targets embedded wherever possible. The dataset also includes pure sea clutter (without any targets). However, the dataset does not include any land clutter [27].
With the dataset not containing land clutter, we addressed this issue by complementing the dataset with land clutter through simulation. However, modeling land clutter is a rather complex process. First, the land clutter is not spatially uniform. This stems from the fact that land surfaces contain natural and man-made structures, such as uniform terrains, farmlands, vegetation, buildings, roads, and skyscrapers. Secondly, the backscatter may vary with the weather conditions. These two, collectively, render any development of a complete theoretical or precise mathematical model for land clutter a difficult process. Otherwise, the theoretical model under certain simplified conditions can effectively assist research and engineering applications. There are two types of land clutter modeling methods, namely the method based on electromagnetic scattering mechanism, such as the Georgia Institutes of Technology (GIT) model, the Ulaby model, and alike models, and the method based on statistics [28]. However, using models based on statistics is easier, efficient, and versatile. For this reason, statistics based methods are considerably common when modeling land clutter. Among these, the distribution of the amplitude can describe the temporal and spatial variation of land clutter. The experimental results mentioned in [28,29] indicated that the corresponding statistical distributions are Weibull, which is a particular case of the compound Gaussian (CG) distribution. More specifically, the Weibull distribution can be used to describe the land clutter amplitude of uniform terrain for L and X band radars, almost mimicking high resolution radars for land clutters The CG distribution is widely used for high resolution radar echoes. It can be presented as the product of a non-negative random process and a Gaussian process. The former is known as a texture component, while the latter is known as a speckle component [21,22]. Thus, where τ(n) and x(n) are the real non-negative and complex zero-mean Gaussian process components, respectively. In [30], the compound Gaussian (CG) model was used to describe clutter returns as a complex process. The real, nonnegative modulation component is τ; hence, τ > 0. The corresponding modulating variate has the following probability density function (PDF). where: , and: where τ > 0, α is the shape parameter, and the range 0 < α ≤ 2 guarantees that the generated clutter conforms to the Weibull distribution. δ is the scale parameter, and Γ(·) is the gamma distribution. r is the independent variable of the density function.

Preprocessing of Clutter Data
Conventional and common signal processing algorithms, such as the discrete Fourier transform (DFT) and the fast Fourier transform (FFT), are suitable when the signal to be processed is in the steady state. However, radar clutter is often time-varying, particularly when considered across the whole time-span. Thus, direct application of FFT or DFT and relevant techniques cannot be used directly on radar clutter returns. However, the signal can be considered piecewise steady state signals over a number of time segments. In other words, by segmenting the whole time series into a number of segments, each segment represents a (temporary) steady state process, on which the existing techniques can be applied. This property of relative stability, which is referred to as the quasi-steady state of signal in the short time interval, is an important aspect of preprocessing of clutter data, in particular prior to the feature extraction. More specifically, the overall time-series must be segmented into a number of frames so that they can be framed into short time intervals [31].
One of the key aims (apart from establishing the near steady state signal) of windowing is to avoid the feature mutation between consecutive frames. This is achieved by using a frame shift by s points apart. Let x(t) be the sampled signal x(t) with a length of l time steps. Let d be the length of each frame. Then, the signal x(t) can be segmented into N frames, where: Furthermore, let there be a windowing function w(t). The discrete signal x(m) can be preprocessed as follows: where 0 ≤ m ≤ l and * is the convolution operator. In other words, Equation (19) represents the unit impulse response FIR filter. With this, the signal in the i-th frame f (x i (k)) is: where 1 ≤ i ≤ N and k = 1, 2, · · · , d. The selection of frame length d plays an important role in the extraction of relevant information. In our case, we determined the appropriate value for d using experimental evaluation.

Short Term Time-Domain Signal Processing and Parameter Estimation
Radar returns from sea, land, and targets of interest are always mixed. Although sea-land maps may be of help, as discussed before, it is not always possible to make use of pre-loaded maps, or they may not offer the full capability to distinguish between clutter types and targets. As such, the selection of features that would aid the classification algorithms is of paramount importance. There exists a body of work on the aspects of feature selection for clutter classification. Features of interest include characteristics of the sea clutter, such as the amplitude value, the shape parameter of the distribution, and high order features like kurtosis, skewness, and reflection coefficients. The features we opt for in this paper are based on windowed signals and include the signal short time energy, the zero-crossing rate, the autocorrelation function, and the shape parameters to structure the feature space. We discuss these features in detail below. The simulation results based on the IPIX processing radar datasets in 1998 and simulated land clutter data are demonstrated for the ease of understanding.

Short time energy:
The radar echo intensity, which is representative enough of the return energy, can be used to discriminate between the two classes of clutter [22]. From the signal processing perspective, both the short time amplitude function and the energy function can characterize the energy information of the quasi-steady state signals. The latter have a magnifying effect on different signal characteristics. In our work, we extract the time domain short time energy as a distinguishing indicator and compare it with the amplitude feature, as in [2]. The mathematical formulation is as follows: Assume that the short time energy of the signal in the i-th frame is E(i). The general formulation of the short time energy of the whole signal x(t) is defined as: where N is the number of frames. Figure 2 shows the raw and short-term signals of the sea-land clutter ((a) and (b)). We also show the distribution of the energy amplitude as a histogram for each of these clutters by binning energy amplitudes accordingly. From the histograms, it can observed that the energy for sea clutter is distributed from 5.5 × 10 5 to 7.0 × 10 5 J, while the energy for the land clutter is distributed from 0.95 × 10 6 to 1.15 × 10 6 J. The range of the distributions can easily facilitate identifying the classes of these clutters.   2. Short time zero-crossing rate: Given a radar system with a predetermined configuration, such as the frequency, polarization, and resolution, the echo from the clutter would vary depending on a number of parameters as discussed before. As for returns from sea and land, it is worth noting that the sea surface is partially homogeneous, while the land surface is heterogeneous with complex terrain conditions, in particular including the discrete scatters such as buildings and other structures. These differences lead to different amplitude profiles for the clutters from sea and land [29]. One of the important metrics for estimating the frequency value of a signal is zero-crossing (ZC) [32], which essentially accumulates a number of times a given signal crosses the zero-line. Albeit that it is simple, it is an effective technique: it not only has much fewer calculation requirements than the traditional fast Fourier transform (FFT) approach, but also demonstrates a high sensitivity to the amplitude change of the signal [32]. In this work, the ZC based feature space is proposed to distinguish the clutter from sea and land. Let C(i) denote the short time zero-crossing rate (STZCR) of the signal in the i-th frame, namely f (x i (k)), which is defined as: where 1 ≤ i ≤ N, 1 ≤ k ≤ d, and sgn(·) is a symbolic function.
The general formulation of the STZCR of the whole signal x(t) is: Figure 3 shows the short time zero-crossing rate and the histogram of the amplitude variation for the sea and land clutter. Although both signals appear to be similar, the distribution of the zero-crossing rate for the sea clutter is more concentrated around 250, when compared against the land clutter. This is primarily due to the homogeneous nature of the sea surface. 3. Short time autocorrelation function: Although radar echoes from sea and ground surfaces pose different challenges to processing, autocorrelation both in the spatial and temporal domain may provide more insight into the behavior of the signals. In this paper, we focus on the autocorrelation in the time-domain. Let A(i) be the autocorrelation function (ACF) of the signal in the i-th frame, f (x i (k)). A(i) can be expressed as: where d is the length of the frame and m is the delay. The general formulation of the whole signal x(t) is: Figure 4 shows the result of the autocorrelation for one of the frames (100th frame), of the sea and land clutter signals. We also show the corresponding amplitude histogram for both cases.
Although the peaks of the result are the same, the attenuation periods (to zero) for the of sea and land clutter are different. 4. Shape parameter of the clutter distribution: Although the Weibull and K distribution are two different distributions, they both can be derived from the same general form with different values in their parameters [33]. The shape parameter v represents the texture of the clutter. Homogeneous clutters lead to high values for v, while strong returns such as land clutter are represented by small values [10] of the same. Hence, the shape parameter can be a useful feature in the classification of clutter. The probability density distribution function of different distributions with different shape parameters can be found in [6]. In this paper, we obtained this feature using the moment estimation approach outlined in [5].
The general formulation of the shape parameterV(x(t)) of the whole signal x(t) is: (1),v(2), · · · · · · ,v(N)} (26) Figure 5 shows the distribution of the clutters (both sea and land), computed as described above.  With the four features outlined above, one may build the overall feature space F(x) over an input signal x(t) as follows: where M is the number of features of the clutter and N is the number of samples, which is the same as the number of frames in the data. The raw inputs are not suitable to be used as features for the RRNN and KRR classifiers. However, the preprocessed version eliminates the dimensionality issues, hence being suitable to be utilized by the RRNN and KRR classifiers.

RRNN Based Classification Algorithms
Both the RRNN and KRR based algorithms outlined above are two, single hidden layer based feed forward neural networks. Given a training dataset D = {x i , t i }, i = 1, 2, ..., N, and activation function g (·), a high-level algorithmic description of the RRNN is given in Algorithm 1, where H is the output matrix of hidden layer, β * is the hidden-to-output weights, and w and b are the input-to-hidden weights and bias, respectively. What needs to be emphasized here is that once the node parameters of the hidden layer are randomly generated, they will be fixed during training and solving the output weight with the least-squares approach. H ← Equation (6) 10: if N ≥ L then 11: β * ← Equation (8) 12: else 13: β * ← Equation (9) 14: end if 15: end for

Experimental Results and Analysis
To evaluate the effectiveness of the proposed classification algorithms, we performed a number of experiments in a methodical way. We discuss the procedure and the experimental results in the subsections that follow. We also discuss the results in detail in the relevant subsections. In all these cases, the data for the tests were as discussed in Section 3, where we mixed the real sea clutter data from the IPIX radar with simulated ground clutter data. In all the experiments, the dataset was split into an 80:20% ratio for training and testing. We also used a modest computing system with an Intel Core-2 processor (San Francisco, CA, USA), clocked at 2.30 GHz with 16 GB RAM. The dataset used for the evaluation in the following subsections contains the sea clutter from an IPIX radar with 60,000 measurements over a region R 1 and the land clutter, which is modeled as Weibull amplitude compound-Gaussian distribution. We used the same dataset configuration (referred to as DS-0) across all of the experiments below, unless otherwise stated.

Impact of Frame Length on Classification Performance
To understand the impact of frame length on the classification performance, we varied the frame length among {128, 256, 512, 1024}, while using the proposed features, namely the energy, the zero-crossing rate, the autocorrelation, and the shape parameter v (ECAV), for the RRNN and KRR networks. We show the resulting performance in Table 2 and Figure 6. In general, the training time decreased with the frame length, for both methods. For a fixed number of frames, when the frame length increased, the length of the signal decreased, and thus, the resulting feature set became smaller (see Equation (17)), leading to fewer computations during the classification. Furthermore, the best testing accuracy corresponded to the frame length d = 512, for both RRNN and KRR. Although, in theory, this can provide a more detailed description of the signal characteristics, a signal with a limited length leads to fragmented information. The rest of the experiments outlined in the following subsections used a frame length of d = 512.

Evaluation of the RRNN Classifier
The overall performance of the RRNN classifier varied with a number of parameters, even when the frame length was fixed. The two main important parameters of interest were the number of nodes in the hidden layer L and the regularized parameter C, where L ∈ {100, 200, 300, 400, 500, 700, 1000, 1500} and C ∈ {10 −5 , 10 −4 , . . . , 10 5 }. The overall search space was L × C. To understand the overall performance behavior of the RRNN, we performed a grid search over the space of L × C.
The search results are shown in Figure 7. We first show the variation of the performance across the L × C search space (Figure 7a). We then extracted the optimal L value, for which the performance was the best and show the variation of the performance across the range of C values (Figure 7b). We then repeated the same for the optimal C value ( Figure 7c).
The results show that a near-optimal performance can be sustained over a large range of the parameter space. In other words, the overall performance was smooth and was insensitive to the variation of the parameters in this range. In the non-smooth region of the performance, the parameter L was a dominant factor of the performance for C ∈ {0, . . . , 10 5 }. Parameters L and C jointly influenced the overall performance for C ∈ {10 −4 , . . . , 10 −1 }.
From Equation (6) , the value of the parameter L determined the hidden layer output matrix H, which in turn influenced the cost error between the estimated output and target output. From Equation (7), the regularized parameter C was a balanced parameter between the training error and the norm of the output weight.

Evaluation of the KRR Classifier
Similar to the RRNN, the overall performance of KRR also depends on the selection of hyperparameters. However, in the case of KRR, the parameters of interest are the width of the Gaussian kernel σ and regularized parameter C, where σ ∈ {2 −13 , 2 −12 , . . . , 2 13 } and C ∈ {10 −5 , 10 −4 , . . . , 10 5 }. The overall search space is σ × C, and as before, we performed a grid search on this space.
The Gaussian kernel function used by the KRR is a type of radial basis function, representing the nonlinear feature mapping from the hidden layer to the outputs. The width parameter σ of the KRR controls the radial scope of the function. Carefully adjusting the radial scope can provide a compromise between over-fitting and under-fitting. In addition to σ, KRR obtained a better generalized performance through the regularization parameter C.
We show the overall performance results in Figure 8. We first show the variation of the performance across the σ × C search space (Figure 8a). We then extract the optimal σ value, for which the performance was the best, and show the variation of the performance across the range of C values (Figure 8b). We then repeat the same for the optimal C value (Figure 8c). As can be observed, the best performance was almost the same as for the RRNN with the relevant parameters. However, the KRR classifier was more sensitive to parameter variation than the RRNN classifier. With sharp, non-smooth variation, the performance of the classifier varied rather sharply and thus indicated that a careful selection of parameters is important for achieving the best possible performance. Although the performance varied rather sharply, improved performance was observed over a subset of the parameter space σ 0 × C 0 , where C 0 ∈ {10 0 , 10 1 , 10 2 , 10 3 } and σ 0 ∈ {2 2 , 2 3 , . . . , 2 13 }.

Impact of Features on the Classification Performance
To assess the performance of the proposed feature selection, we evaluated the impact of different features on the overall performance of the RRNN and KRR networks.
Although a large set of features can be engineered, we opted for three common feature sets, extracted through two different wavelets, namely the db3 and f k6 wavelets, and the amplitude, shape parameter V, and correlation function (AVC). These features are common in the context of clutter separation and commonly used in signal separation tasks [12,13,[34][35][36]. Both the db3 and f k6 wavelets were used to extract four feature vectors from each frame.
Both the RRNN and KRR classifiers were tested on the same dataset, using different features, and the resulting performance is reported in Table 3. It is worth noting that we also compared the performance of the networks with no features specified (marked as "None"). In other words, the networks consumed only the raw data, and no features were specified explicitly. We also illustrate this in Figure 9, showing both the classification performance and training times. As can be observed, the proposed feature set offered the best possible performance. In addition to this, the training time for KRR was better than the RRNN.

Comparison against Other Classifiers
To assess the relative performance of RRNN and KRR classifiers, we used the support vector machine (SVM) and extreme learning machine (ELM) classifiers as the baselines due to [14]. To make the comparison fair, we maintained the same feature set across all these classifiers and maintained d = 512.
First, we performed the evaluation over the existing dataset configuration, which was used in all experiments hitherto. We show the overall resulting performance in Figure 10. As can be observed, RRNN offered the best and KRR the second best classification performance. The testing accuracy and training time of RRNN were both better than ELM and SVM. To further test the overall performance of the classifiers, we introduced two additional dataset configurations, resulting in an exhaustive evaluation of the classifiers over three different dataset configurations, namely: • Dataset Configuration-1 (DS-1): sea clutter from an IPIX radar with 120,000 measurements over a region R 1 . The land-clutter is modeled as a Weibull amplitude compound-Gaussian distribution.
In other words, the sea-clutter has twice as much as DS-0. • Dataset Configuration-2 (DS-2): sea clutter from an IPIX radar with 60,000 measurements over a region R 2 . The land-clutter is modeled as a Weibull amplitude compound-Gaussian distribution. • Dataset Configuration-3 (DS-3): sea clutter from an IPIX radar with 60,000 measurements over a region R 1 . The land-clutter is modeled as a log-normal distribution.
We show the overall performance of the RRNN and KRR classifiers against the SVM and ELM classifiers tested over dataset configurations DS-1, DS-2, and DS-3 in Figure 11. As can be observed, the performance of the KRR and RRNN classifiers was almost the same and was better than ELM and SVM in training accuracy, which is because ELM achieves the optimal result by minimizing the cost error between the estimated output and target output, while SVM obtains high-dimensional feature mapping by selecting the appropriate kernel function. Meanwhile both RRNN and KRR consider the trade-off between the output weight and cost error, which improved the classifier's performance. However, the impact of training a network on an unbalanced dataset is not clear. When comparing the training time of the classifiers, the RRNN offered the best advantage over the others, and it was insensitive to different datasets. However, KRR was sensitive to different datasets, such as a significant increasing in training time under DS-1. Two recent methods, namely SVM and ELM, showed slightly worse accuracy performance than the method we adopted. Despite the training times being similar, this was due to SVM adopting the sequential minimal optimization strategy to solve the quadratic programming problem, and RRNN fixed the random parameters in the hidden layer rather than iteratively updating them to minimize the output error, which is the common pattern of traditional neural networks. We also can see that DS-1 and DS-2 were more suitable for the RRNN, and DS-2 was more suitable for KRR; however, both classifiers' testing accuracy were reduced under DS-3.

Conclusions
The identification and separation of clutter from true measurements is one of the essential operations in many target tracking applications. One of the specific problems that is encountered in many target tracking applications is the classification of clutter types, particularly at the interfaces, such as sea-land or air-land, among many others.
Conventionally, these problems are often addressed using model based approaches, such as binary hypothesis testing, or by using well-formulated statistical models of the clutter types. However, with clutter (and target) statistics being highly variable, measurements from radar rarely fit these statistical models. In other words, unless the models are data-adaptive, they are unlikely to perform very well on classification.
Machine learning based techniques are known to be data adaptive and have been used to classify clutter types [2,5,6,[8][9][10], often by exploiting certain features of the datasets. In this paper, we proposed two different approaches to this problem, namely the regularized randomized neural network (RRNN) and the kernel ridge regression neural network (KRR), along with a novel feature set, collectively referred to as ECAV, which captures the energy, the zero-crossing rate, the autocorrelation function, and the distribution shape parameter V.
The exhaustive evaluation based on a number of mixed datasets showed that the proposed classifiers offer superior classification performance when used against the proposed feature set and when compared against other classifiers, namely support vector machine and extreme learning machine.