Using Vector Agents to Implement an Unsupervised Image Classification Algorithm

Unsupervised image classification methods conventionally use the spatial information of pixels to reduce the effect of speckled noise in the classified map. To extract this spatial information, they employ a predefined geometry, i.e., a fixed-size window or segmentation map. However, this coding of geometry lacks the necessary complexity to accurately reflect the spatial connectivity within objects in a scene. Additionally, there is no unique mathematical formula to determine the shape and scale applied to the geometry; these parameters are usually estimated by expert users. In this paper, a novel geometry-led approach using Vector Agents (VAs) is proposed to address the above drawbacks in unsupervised classification algorithms. Our proposed method has two primary steps: (1) creating reliable training samples and (2) constructing the VA model. In the first step, the method applies the statistical information of an image classified by k-means to select a set of reliable training samples. Then, in the second step, the VAs are trained and constructed to classify the image. The model is tested on three high-spatial-resolution images. The results show the enhanced capability of the VA model to reduce noise in images that contain complex features, e.g., streets and buildings.


Introduction
In the remote sensing context, the purpose of image classification is to extract meaningful information (land cover categories) from an image [1]. This process is generally performed via a supervised or unsupervised method [2]. In supervised classification, the algorithm is first trained using ground data and then applied to classify the image pixels. Unsupervised algorithms only use the information that is contained in the image to classify pixels without requiring any training data [3,4]. As the algorithm does not require training samples, it is easy to perform with minimum human intervention and cost [5,6]. The algorithms use the pixel values through a set of statistical rules or cost functions to label pixels in the feature space [7]. The process is typically performed via an iterative mechanism that minimises the spectral distances (similarities) between pixel values and the calculated cluster centres in the feature space, regardless of their locations in the image. This causes a speckled appearance (also known as salt-and-pepper noise) in which there are isolated pixels or small regions of pixels in the scene.
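The spectral-only iterative labelling described above can be sketched as a minimal k-means loop; the toy two-band pixels, the cluster count, and the seeding are illustrative assumptions, not the paper's data:

```python
import numpy as np

def kmeans_spectral(pixels, k, n_iter=100, seed=0):
    """Cluster pixel feature vectors purely in spectral space.

    pixels: (n, d) array of spectral values; pixel location is ignored,
    which is exactly what produces the salt-and-pepper effect."""
    rng = np.random.default_rng(seed)
    centres = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance of every pixel to every cluster centre
        d = np.linalg.norm(pixels[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned pixels
        for j in range(k):
            if (labels == j).any():
                centres[j] = pixels[labels == j].mean(axis=0)
    return labels, centres

# Toy 2-band "image" with two spectrally distinct groups of pixels
pixels = np.array([[0.10, 0.20], [0.15, 0.22], [0.90, 0.80], [0.88, 0.85]])
labels, centres = kmeans_spectral(pixels, k=2)
```

Because only spectral distance enters the assignment step, two spatially adjacent pixels with different spectra can receive different labels, which is the source of the speckle discussed above.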
The above drawback is typically addressed by incorporating the spatial information between pixels into the clustering process [1]. To determine the neighbouring pixels, the algorithms use a fixed-size window or segmented objects [8]. In the fixed-size window structure, algorithms such as the hidden Markov model [1] or textural methods [9,10] apply the spectral-spatial information extracted from a fixed-size window centred on each pixel to label image pixels. These methods show better classification outcomes in comparison to the conventional clustering algorithms. However, the clustering results are not always adequate because there is no specific rule to determine the optimum size and shape of these local windows [8,11]. As a result, the classified maps may contain significant amounts of misclassification, especially when there are heterogeneous or complex features in a scene, e.g., roads or buildings. The main assumption for these methods is that objects have similar geometric structures; thus, a fixed neighbourhood distance can model all possible spatial interactions between and within objects in a scene.
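As a minimal sketch of the fixed-size-window idea, the following computes the mean over a w × w window centred on each pixel, the kind of spatial feature such methods attach to every pixel before clustering; the single toy band and the window size are illustrative assumptions:

```python
import numpy as np

def window_mean(band, w=3):
    """Mean of a w x w window centred on each pixel (edge-padded).

    The same fixed window is used everywhere, regardless of whether the
    pixel sits inside a homogeneous field or on a narrow road."""
    pad = w // 2
    padded = np.pad(band, pad, mode="edge")
    out = np.zeros_like(band, dtype=float)
    rows, cols = band.shape
    for r in range(rows):
        for c in range(cols):
            out[r, c] = padded[r:r + w, c:c + w].mean()
    return out

# A homogeneous region next to a one-pixel-wide bright feature
band = np.array([[1.0, 1.0, 9.0],
                 [1.0, 1.0, 9.0],
                 [1.0, 1.0, 9.0]])
smooth = window_mean(band, w=3)
```

Note how the window blurs the narrow bright column into its neighbours, illustrating why a fixed geometry struggles with thin, complex features.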
To tackle this limitation, the algorithms usually utilise image segmentation for extracting spatial information, which is specified by the geometry of segmented objects. These methods employ different techniques, such as graph-cut-based segmentation [3], edge features [12], statistical region merging (SRM) [13], or robust fuzzy c-means (RFCM) [14], to create segments and extract the spatial information from images. The structure of these algorithms allows them to increase the accuracy of output maps without setting any fixed neighbourhood distances. Despite the advantages that these methods offer, such as flexible neighbourhood distances, there is no unique solution for the image segmentation problem [15,16]. This is usually solved based on parameters, e.g., scale and colour weight, typically determined based on trial and error [17]. This can lead to poor results when the algorithm uses an over- or under-segmented map to extract the spatial information.
The issue is typically addressed via a hierarchical network of segmented or classified images generated at different scales in the image or feature space [18][19][20][21][22]. For example, Gençtav et al. [19] proposed a hierarchical segmentation algorithm to determine segments. They used homogeneity and circularity of segments to formulate the spatial and spectral relationship between small segments in the lower level of the hierarchy and meaningful segments in the higher level of the hierarchy. Hesheng and Baowei [20] updated the fuzzy C-means (FCM) function to integrate the spatial information at different scales into spectral information to label pixels in the feature space via an iterative process. Kurtz et al. [18] used segmented images at different resolutions to classify images at different levels of spatial detail in an urban area. Fang et al. [21] implemented a multi-model deep learning framework formulated based on an over-segmentation map and semantic labelling to classify an image. The method utilised an iterative mechanism via a fixed geometry to merge segments into each other and relabel pixels to produce a real land cover map. In contrast, Yin et al. [22] proposed a top-down structure using graphs for an unsupervised hierarchical segmentation process. The spatial connectivity between pixels was formulated via nodes and edges, and the algorithm used the average intensity of each region to tune the weight of each edge. These methods showed that they can generate better clustering maps even when there are complex features in an image. However, there is no unique setting to formulate the concept of scale and the hierarchical structure between objects at different scales [23]. In other words, spatial relationships between objects in the scene can be subject to the parameters defined by a human expert, which reduces the flexibility of the neighbourhood system.
Some advanced classification algorithms use a cyclic mechanism of image segmentation and classification to provide a flexible neighbourhood system for extracting spatial information and reducing noise. These approaches use an iterative process of image segmentation and classification to integrate spatial and spectral information, use expert knowledge, and reduce noise within objects in an image [24][25][26][27]. The approach allows the algorithms to alter the geometry of objects during the classification process. For example, Baatz et al. [24] used a set of geometric operators to enable segments to change their geometry during a classification process. Hofmann et al. [27] proposed a classification method that enables objects to negotiate at object and pixel levels to change their geometry. This structure is mainly applied by supervised approaches, as they require training samples and expert knowledge.
All the above methods have two main geometric characteristics in common to create a flexible neighbourhood system. First, they construct the internal and external boundaries of an object in an image separately. This is because the geometry applied by these methods lacks the necessary complexity to take advantage of topological relationships between and within objects in a scene, e.g., car and street, or chimney and roof. For example, to address a street object which includes cars, the algorithm segments the street into multiple parts and then the street object is formed by merging the segments, regardless of the spatial relationship between cars and streets. Second, they use a segmentation process to define the initial geometry of objects. Thus, the geometric changes are restricted to the object level, not the pixel. For example, when the initial geometry of forest and shadow objects are formed in a scene, the forest object cannot capture a shadow pixel located in the boundary between two classes, and vice versa.
To overcome the above drawback, we propose a dynamic and unified geometry constructed on the Vector Agents (VAs). The VAs are a distinctive type of Geographic Automata (GA) [28] that can draw and find their geometry and state and interact with each other and their environment in a dynamic fashion [29]. The dynamic structure of the VAs enables them to support a flexible neighbourhood system where objects determine their neighbourhood distances, rather than a human expert. The method also applies a unified geometric structure that allows the interior (holes) and exterior boundaries of objects to be modelled simultaneously in a geographical area, i.e., a scene represented by remote sensing images. This geometry gives the power to the VAs to automatically identify and remove isolated pixels and regions in the image when they lie within objects. The proposed method distinguishes itself from other classification methods based on the following spatial capabilities:
i. Construct and change the interior and exterior geometry of objects in an image simultaneously;
ii. Describe the topological relationships between objects in the image;
iii. Support geometric changes of objects at the pixel and object level with minimum human intervention;
iv. Remove salt-and-pepper noise using the geometry of objects in the image.
The remainder of this paper is organised as follows: Section 2 demonstrates the structure of the VA model, and Section 3 presents the clustering results of the proposed method. Experimental results are discussed in Section 4. Finally, Section 5 concludes this paper.

Proposed Method
The proposed approach works in two main steps: unsupervised creation of training samples and construction of the VA model. The method first selects a collection of reliable samples from the image that is clustered by the k-means algorithm in feature space. In the construction step, the proposed method applies the selected samples to train the classifier set inside the VA model. The VAs are then automatically created and added to the image to model the cluster objects in the image.

Creation of Training Samples
Let X = (x_{1,1}, x_{1,2}, x_{1,3}, ..., x_{R,C}) ∈ R^d denote a multispectral d-dimensional image with R × C pixels, where x_{r,c} is the feature vector of size d at position (r, c) in the image (Figure 1a). Let Y = (y_1, y_2, y_3, ..., y_m) denote the set of land-cover labels in the image X, where y_i ∈ k, k = {1, 2, ..., K}, is a set of K class labels and K is already known. Let Z = (z_{1,1}, z_{1,2}, z_{1,3}, ..., z_{R,C}) ∈ R^2, z_{r,c} ∈ k, be the classified image (Figure 1b) that represents the ground data.
The clusters are first constructed through k-means using the Euclidean distance with 100 iterations. The number of clusters, K, and the dimension of the feature space, d, are set to 3 and 4 in Figure 1, respectively. The Euclidean distance is calculated based on a vector of spectral values specified by the location of pixels and cluster centres, which are randomly selected in feature space. The algorithm calculates the spectral distance at each iteration, reassigns the pixels to the nearest cluster centre, and then updates the vector. This structure allows the algorithm to minimise the variance within each cluster based on parameters, such as the number of clusters and iterations, specified by an expert user. The following rule is then applied to identify reliable training samples for each cluster from X:

| p_i − p̄_{l,i} | ≤ λ σ_{p,i}, for every band i, (1)

where p̄_{l,i}, σ_{p,i}, and p_i are the mean reflectance and standard deviation in band i of all pixels in each cluster, and the value of pixel p ∈ X in band i, respectively. λ is a constant that can be set in the range 1–3. λ in Equation (1) determines how close or far the selected samples can be from the centroid of the cluster; for example, if λ is set to 1, the selected pixels are close to the cluster centroid. Considering Equation (1), reliable training samples (x_i, z_i), i = 1, ..., l, are determined, where l is the number of initial training samples, z_i ∈ Z and x_i ∈ X (Figure 1c).
VAs use these samples to train the Support Vector Machines (SVMs) within the VA model to implement transition rules. The LIBSVM classification library for support vector machines developed by [30] was applied. In our case, the SVM classifier is trained with the Radial Basis Function (RBF) kernel. The regularisation parameter C and the bandwidth G of the RBF kernel are selected using the n-fold cross-validated grid search algorithm, where n is set to 10.
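The selection rule of Equation (1) and the subsequent RBF-SVM training might be sketched as below; scikit-learn's SVC and GridSearchCV (SVC wraps LIBSVM internally) stand in for the LIBSVM library cited in the paper, and the synthetic two-cluster data and λ value are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_reliable_samples(pixels, labels, lam=1.0):
    """Eq. (1) sketch: keep pixels whose value in every band lies within
    lam standard deviations of their cluster's band mean."""
    keep = np.zeros(len(pixels), dtype=bool)
    for k in np.unique(labels):
        members = labels == k
        mu = pixels[members].mean(axis=0)
        sigma = pixels[members].std(axis=0)
        inside = np.abs(pixels - mu) <= lam * sigma + 1e-12
        keep |= members & inside.all(axis=1)
    return keep

# Synthetic 4-band pixels forming two spectrally separated clusters
rng = np.random.default_rng(0)
pixels = np.vstack([rng.normal(0.2, 0.05, (50, 4)),
                    rng.normal(0.8, 0.05, (50, 4))])
labels = np.repeat([0, 1], 50)
mask = select_reliable_samples(pixels, labels, lam=1.5)

# 10-fold cross-validated grid search over C and the RBF bandwidth (gamma)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.1, 1, 10]}, cv=10)
grid.fit(pixels[mask], labels[mask])
```

Only the pixels closest to each cluster centroid survive the mask, so the SVM is trained on the most reliable k-means labels rather than on every (possibly mislabelled) pixel.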

Construction of the VA Model
In general, VAs are a distinctive type of geographic automata (GA) [28] that can change their geometry and interact with other objects in a simulation space [29]. The main elements of the VA model are geometry (L) and geometry rules (M_L), state (S) and transition rules (T_S), and neighbourhood (N) and neighbourhood rules (R_N), as defined by [31]. In the construction step, we update these components to classify the image. Figure 2 demonstrates the elements of the VA model for image classification. In each iteration, the VAs use their sensors to capture information from the feature and image space simultaneously. This information includes the location and geometric structure of the VAs in the image and the pixels' spectral information in the feature space. The VAs then apply this information to execute rules and strategies in the image via their effectors, i.e., point, line, and polygon. The spatial relationship between these elements is defined and formulated in the following section. We used a Java implementation of the Repast V.2.5 (Recursive Porous Agent Simulation Toolkit, the University of Chicago, Chicago, IL, USA) modelling framework [32] along with a generic Vector Agent library developed by [30] to implement the proposed method.


Geometry and Geometry Methods
In the context of image classification, the geometry component (L) of a VA stores the vertices that define the boundary ∂X_VA of the VA. X_VA is the connected subset of the image space formed by the pixels belonging to the VA. In contrast to the conventional agent-based classification approaches that use a predefined geometry to classify an image [33,34], the VAs can automatically construct their geometry. The geometric methods M_L enable VAs to change the boundary ∂X_VA and interact with other VAs in the simulation domain. These methods can be summarised as follows:

• Vertex displacement: This places a new vertex and connects two vertices together by two half-edges, specified according to a single direction (Figure 3a).
Figure 4 illustrates how the VA uses the above rules to create a polygon that includes a hole. These holes can be applied to define the geometry of isolated pixels or regions (connected pixels) that may exist inside the information classes in the image, e.g., the blue pixels within the green area in Figure 1b. Through this geometry, the VAs can reduce the salt-and-pepper noise in the image and increase the quality of the classified outputs.
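A plain ring representation illustrates the unified exterior/interior geometry: a VA polygon is one exterior ring plus zero or more interior rings (holes), and a hole's area is simply subtracted; the coordinates below are illustrative, and the shoelace formula stands in for the model's actual vector machinery:

```python
def ring_area(ring):
    """Unsigned shoelace area of a closed coordinate ring."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def va_area(exterior, holes):
    """Area of a VA polygon: exterior ring minus its interior rings.

    A hole is how the VA represents an isolated region of another class
    lying inside it (e.g., the blue pixels inside the green area)."""
    return ring_area(exterior) - sum(ring_area(h) for h in holes)

exterior = [(0, 0), (10, 0), (10, 10), (0, 10)]   # 10 x 10 object
hole = [(4, 4), (6, 4), (6, 6), (4, 6)]           # 2 x 2 isolated region
area = va_area(exterior, [hole])
```

Because the exterior and interior rings live in one structure, the VA can reason about the hole (e.g., absorb or reclassify it) without segmenting the object into parts.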


State and Transition Rules
The state S of a VA is the VA class that is determined via the SVM classifier trained on the selected samples generated in the creation step. T_S rules allow the VAs to find and update their classes and evaluate pixels in the image. A VA is added to the image as a point if it can satisfy the following rule:

• The candidate pixel x_c and its immediate neighbours in the image must be members of the same class. VAs use the SVM classifier to evaluate such membership.
The VA points then start evolving geometrically by capturing nearby pixels. In each iteration, the VA adds a new pixel as a point to its geometry if the new pixel and the VA are in the same class. In the event that a new point is added to the VA, it updates its geometry (L) and state (S) using the M_L and T_S rules. If the candidate pixel belongs to another VA, and these VAs are in the same class, then the VA applies the joining/removing rule to create a new geometry. In this event, all attributes of the removed VA are transferred to the active VA in the image. Figure 5a shows different sizes and shapes for the exterior boundaries of the information classes, while the size of the holes is not restricted, e.g., the large holes in Figure 5c.
Figure 5. (a-c) display the evolving process of the soil VA, water VA, and tree VA at different successive time steps in the image by using the transition and geometric rules.
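The pixel-capturing behaviour can be sketched as class-constrained region growing on the pixel grid; the 4-connected neighbourhood, the precomputed label map (standing in for the SVM's class decisions), and the seed location are illustrative assumptions, since the paper's VAs grow through vector geometry methods rather than a raster flood fill:

```python
from collections import deque

def grow_va(class_map, seed):
    """Grow a VA from a seed pixel, capturing 4-connected neighbours
    that are assigned to the same class as the seed."""
    rows, cols = len(class_map), len(class_map[0])
    cls = class_map[seed[0]][seed[1]]
    captured, frontier = {seed}, deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in captured
                    and class_map[nr][nc] == cls):
                captured.add((nr, nc))
                frontier.append((nr, nc))
    return captured

# 0 = water, 1 = tree; the lone 0 inside the 1-region is not captured
class_map = [[1, 1, 1],
             [1, 0, 1],
             [1, 1, 1]]
tree_va = grow_va(class_map, seed=(0, 0))
```

The isolated centre pixel stays outside the growing VA and ends up enclosed, which is precisely the situation the hole geometry of the previous section is designed to represent.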


Neighbourhood and Neighbourhood Rules
The neighbourhood component N for a VA is the collection of VAs that are located within the neighbourhood distance of the VA in the image, i.e., those that satisfy the neighbourhood rule (R_N) in Equation (2):

d(VA_{i,t}, VA_{j,t}) ≤ d_N, (2)

where d(VA_{i,t}, VA_{j,t}) is the Euclidean distance in simulation space between VA_i and VA_j at time step t, and d_N is the neighbourhood distance. VAs use this rule to interact with each other geometrically through a set of advanced geometric methods, such as joining/removing or growing/shrinking, defined and implemented by [33]. As the spectral information of pixels is the only available information, VAs solely use the joining/removing method to geometrically interact with each other in the image. Figure 6 shows an example of the joining/removing method in which two VAs implement R_N to merge into each other.
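Equation (2)'s test and the joining rule can be sketched as a distance threshold between VA reference points; representing each VA by a centroid and the value of the neighbourhood distance d_n are illustrative assumptions:

```python
import math

def is_neighbour(va_i, va_j, d_n):
    """R_N sketch: VA_j is a neighbour of VA_i at time t if the Euclidean
    distance between them is at most the neighbourhood distance d_n."""
    (xi, yi), (xj, yj) = va_i["centroid"], va_j["centroid"]
    return math.hypot(xi - xj, yi - yj) <= d_n

def try_join(va_i, va_j, d_n):
    """Joining/removing sketch: neighbouring VAs of the same class merge,
    and the removed VA's pixels transfer to the active one."""
    if va_i["cls"] == va_j["cls"] and is_neighbour(va_i, va_j, d_n):
        va_i["pixels"] |= va_j["pixels"]
        return True
    return False

a = {"cls": "water", "centroid": (0, 0), "pixels": {(0, 0), (0, 1)}}
b = {"cls": "water", "centroid": (2, 0), "pixels": {(2, 0)}}
joined = try_join(a, b, d_n=3.0)
```

After a successful join the active VA would recompute its geometry and state via the M_L and T_S rules described earlier.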


Implementation of VA for Unsupervised Classification
The flexibility of the VA approach to both segment and classify objects dynamically allows a comprehensive image classification approach whereby VAs are born, evolve, and sometimes die throughout the image space until the objects are extracted from the image. The process starts by seeding a desired number of VAs as points in the image whose coordinates correspond to the centres of image pixels. This seeding process can obey a specific sampling scheme (e.g., fully random, stratified random, systematic unaligned random, etc.). In this paper, a systematic unaligned random sampling scheme is chosen so that VAs are seeded for every class at various locations throughout the image, as prescribed in the previous section. These points, or VAs, then evolve via an iterative process into polygons by capturing nearby pixels based on the transition and neighbourhood rules via their geometry methods. In the event that all VAs are passive, the algorithm only considers the geometric location of pixels as the spatial criterion to capture unclassified pixels. Figure 7 displays the classified map that is generated by the VA model.
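The seeding step can be sketched as a simplified systematic unaligned random scheme, one random pixel centre per grid cell, so seeds spread over the whole image; the cell size and image dimensions are illustrative assumptions:

```python
import random

def seed_vas(rows, cols, cell, rng=random.Random(0)):
    """Simplified systematic unaligned random seeding: one random pixel
    centre per grid cell, guaranteeing coverage of the whole image."""
    seeds = []
    for r0 in range(0, rows, cell):
        for c0 in range(0, cols, cell):
            r = rng.randrange(r0, min(r0 + cell, rows))
            c = rng.randrange(c0, min(c0 + cell, cols))
            seeds.append((r + 0.5, c + 0.5))  # centre of the chosen pixel
    return seeds

# Pavia-sized image (199 x 199) with one seed per 50 x 50 cell
seeds = seed_vas(rows=199, cols=199, cell=50)
```

Each seed would then be accepted as a VA point only if it passes the transition rule of the previous section (the pixel and its immediate neighbours belong to the same class).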

Experiments and Results
To study the performance of the VA model, three high-spatial-resolution remotely sensed images have been used. They are described below, along with the details of the clustering methods and the metrics applied to measure the effectiveness of the proposed method.

Datasets
The proposed approach was experimentally tested on three hyperspectral images: the Pavia Centre and Pavia University datasets collected by the ROSIS sensor, and an AVIRIS scene of Salinas Valley, California (Figure 8). The number of spectral bands is 102 for Pavia Centre, 103 for Pavia University, and 224 for Salinas Valley. Considering computational efficiency, four bands for each image are selected using the PCA function in MATLAB. In the first experiment, a subset of 199 × 199 pixels is cut from the Pavia Centre image with a pixel size of 1.3 m (Figure 8a). The image includes four information classes: water (C1), tree (C2), bare soil (C3), and bridge (C4) (Figure 8b).
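The band-reduction step might be sketched with scikit-learn's PCA standing in for MATLAB's pca function; the random Pavia-sized cube is an illustrative placeholder for real data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder hyperspectral cube: rows x cols x bands (Pavia Centre-like)
rng = np.random.default_rng(0)
cube = rng.normal(size=(199, 199, 102))
flat = cube.reshape(-1, cube.shape[2])   # one spectrum per pixel

# Keep the four components that explain the most spectral variance
pca = PCA(n_components=4)
reduced = pca.fit_transform(flat).reshape(199, 199, 4)
```

The four retained components replace the original bands as the pixel feature vectors (d = 4) used by the k-means and SVM stages.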

Experiments and Results
To study the performance of the VA model, three high-resolution satellite remotely sensed images have been used. They are described below along with the details of clustering methods and the metrics applied to measure the effectiveness of the proposed method and the outputs of the proposed method.

Datasets
The proposed approach was experimentally tested on three hyperspectral images: Pavia Centre and Pavia University datasets collected by a ROSIS sensor and an AVIRIS scene of Salinas Valley, California (Figure 8). The number of spectral bands is 102 for Pavia Centre, 103 for Pavia University, and 224 for Salinas Valley. Considering the computational efficiency, four bands for each image are selected using the PCA function in MATLAB. In the first experiment, a subset of 199 × 199 is cut from the Pavia Centre image with a pixel size of 1.3-metres (Figure 8a). The image includes four information classes: water (C1), tree (C2), bare soil (C3), and bridge (C4) (Figure 8b).

Image Clustering
To evaluate the performance of the proposed approach for removing noises from the classified images, we compared the results of the VA model with the conventional classification approaches in two modes.
Pavia Centre and Pavia University datasets collected by a ROSIS sensor and an AVIRIS scene of Salinas Valley, California (Figure 8). The number of spectral bands is 102 for Pavia Centre, 103 for Pavia University, and 224 for Salinas Valley. Considering the computational efficiency, four bands for each image are selected using the PCA function in MATLAB. In the first experiment, a subset of 199 × 199 is cut from the Pavia Centre image with a pixel size of 1.3-metres (Figure 8a). The image includes four information classes: water (C1), tree (C2), bare soil (C3), and bridge (C4) (Figure 8b). For the second experiment, we used a subset of 199 × 199 cut from the University of Pavia image with a pixel size of 1.3-metres (Figure 8c). It covers an urban area including five information classes: buildings and asphalt (C1), shadow (C2), bare soil (C3), meadows (C4), and painted metal sheets (C5) (Figure 8d). In the third experiment, a subset of hyperspectral AVIRIS with the size of 198 × 198 pixels was applied to test the VA method (Figure 8e). The geometric resolution is 3.7 m. The scene covers an agricultural zone in California, containing seven information classes: Vinyard_untrained (C1), Bro-coli_green_weeds_2 (C2), Grapes_untrained (C3), Fallow_rough_plow (C4), Fal-low_smooth (C5), Stubble (C6), and Celery (C7). (Figure 8f).

Image Clustering
To evaluate the performance of the proposed approach in removing noise from the classified images, we compared the results of the VA model with conventional classification approaches in two modes.

i. Unsupervised
We assume that the only available information is the number of clusters and that there is no semantic information. In this setting, a classical k-means algorithm is applied, and spatial information is imposed on the algorithm at two different stages, pre- and post-classification, as follows:
• Spectral-Spatial Classification (SSC): The algorithm first utilises the density function to group pixels together in the feature space, using a combination of spectral similarity and spatial proximity between pixels. It then applies the k-means algorithm to label pixels, based on a new vector determined by the pixel values and the features associated with their corresponding regions in the segmented image.

• Majority Filtering (MF): The algorithm uses a three-by-three fixed window to relabel pixels in the k-means classified map according to the majority of their contiguous neighbouring pixels. The number of neighbours is set to eight to retain edges.
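As a minimal illustration of this MF step (our own NumPy sketch, not the authors' implementation), a pixel's label is replaced only when a strict majority of its eight neighbours agree on another label:

```python
import numpy as np
from collections import Counter

def majority_filter(labels):
    """3x3 majority filter over a 2-D label map: each interior pixel takes the
    most frequent label among its 8 neighbours when that label holds a strict majority."""
    out = labels.copy()
    rows, cols = labels.shape
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = labels[r-1:r+2, c-1:c+2].ravel().tolist()
            window.pop(4)                       # drop the centre pixel itself
            label, count = Counter(window).most_common(1)[0]
            if count > 4:                       # strict majority of the 8 neighbours
                out[r, c] = label
    return out
```

With a strict majority, an isolated single-pixel speckle is relabelled, while a pixel on a straight boundary keeps its label because five of its eight neighbours share it.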
ii. Semi-supervised
The algorithms use a limited number of training samples at two levels, pixel and object, to train the SVM model and classify the images.
• SVM: We used the SVM method described in Section 2 to label the pixels without using spatial information. Training samples were manually selected from the ground truth data to train the SVM.

• Mean Shift Segmentation (MSS): The method first uses the MSS algorithm to segment the image. The SVM model is then applied to label the segments. In this scenario, we manually selected the training objects.

• Multiresolution Segmentation (MRS): The algorithm employs the MRS function, which is formulated based on a combination of spectral homogeneity and shape homogeneity to segment the image. As with the MSS method, the training objects are manually selected to train the SVM classifier.
In the above methods, the segmentation parameters are defined manually. Since there are no rules to determine these parameters, we conducted a series of segmentation experiments to initialise them so that the generated segments contained less speckle.

Evaluation Metrics
To evaluate the results quantitatively, the VA objects are compared with their corresponding reference objects in the ground truth maps. We use the following metrics to assess the accuracy of the VA maps [35]: True Positive (TP), False Positive (FP), and False Negative (FN) are correctly detected pixels, wrongly detected pixels, and unrecognised pixels, respectively.
We also evaluate the spatial connectivity and fragmentation of the objects in the classified maps. The Perimeter/Area (P/A) ratio of classified objects is applied to assess the local neighbourhood connectivity between pixels within clusters in the classified maps.
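From these counts, the precision, recall, F-score, and Jaccard index reported later follow directly; the sketch below uses the standard formulas, and the P/A helper reflects our reading of the ratio as perimeter divided by area:

```python
def classification_scores(tp, fp, fn):
    """Precision, recall, F-score and Jaccard index from TP/FP/FN pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    jaccard = tp / (tp + fp + fn)
    return precision, recall, f_score, jaccard

def pa_ratio(perimeter, area):
    """Perimeter/Area ratio of a classified object: lower values indicate
    less fragmented, more spatially connected objects."""
    return perimeter / area
```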

Results
We first used the k-means algorithm to cluster the images. From the classified maps, 15 pixels were randomly selected for each cluster using Equation (1), where λ is set to 1. The VA model then applies the selected samples to train its SVM classifier. The parameters of the k-means and SVM algorithms were set as defined in Section 2. The trained VAs are then added to the image to extract clusters from the images (Figure 9).
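The sample-selection step can be sketched as follows. Equation (1) is not reproduced here, so the criterion used below, keeping pixels whose spectral distance to their cluster mean lies within λ standard deviations of the mean distance, is our reading of "statistical information of a classified image"; treat the `lam` threshold as an assumption:

```python
import numpy as np

def select_samples(pixels, labels, n_per_cluster=15, lam=1.0, seed=0):
    """Pick reliable training pixels per k-means cluster: keep pixels whose
    distance to the cluster mean is within lam standard deviations of the mean
    distance (assumed criterion), then sample n_per_cluster of them at random."""
    rng = np.random.default_rng(seed)
    selected = {}
    for k in np.unique(labels):
        members = np.where(labels == k)[0]
        d = np.linalg.norm(pixels[members] - pixels[members].mean(axis=0), axis=1)
        reliable = members[d <= d.mean() + lam * d.std()]  # hedged threshold
        if len(reliable) == 0:
            reliable = members
        size = min(n_per_cluster, len(reliable))
        selected[int(k)] = rng.choice(reliable, size=size, replace=False)
    return selected
```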
For accuracy assessment among the different methods, the initial clusters were mapped into information classes. This is because, in unsupervised classification, each information class may contain more than one cluster. For example, in the University of Pavia dataset, the numbers of clusters and information classes were seven and five, respectively.
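The cluster-to-class mapping can be sketched as below; the paper does not spell out the assignment procedure, so the majority-overlap rule against the reference data is our assumption:

```python
import numpy as np

def map_clusters_to_classes(cluster_map, reference_map, ignore=-1):
    """Assign each cluster label the majority information class among its
    pixels that have reference data (reference value != ignore)."""
    mapping = {}
    valid = reference_map != ignore
    for k in np.unique(cluster_map):
        mask = (cluster_map == k) & valid
        if not mask.any():
            continue  # cluster has no ground truth coverage
        classes, counts = np.unique(reference_map[mask], return_counts=True)
        mapping[int(k)] = int(classes[np.argmax(counts)])
    return mapping
```

Applying this mapping collapses, for example, seven clusters of the University of Pavia image onto its five information classes before the per-class metrics are computed.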

Discussion
The main contribution of the proposed work is to introduce a geometry-led approach with minimum human intervention to reduce salt-and-pepper noise in classified maps. To assess the performance of the methods, different conventional spectral-spatial clustering methods were applied (described in Section 3.2). In the next section, the results of the proposed method will be discussed in more detail using different datasets.

Speckled Noise Analysis
From Figure 10b,c,e-g, it can be seen how the use of spatial information reduces the number of isolated pixels within the classified images. For example, the water object, C1, in Figure 10a,d, where only spectral information is applied for classification, contains some isolated bridge pixels, shown by the black circle. The black circles in Figure 10a,e show that using the MF method effectively reduces noise within the C1 object. However, the MF method was unable to completely remove the isolated region (grey pixels) within the black circle due to its fixed geometric structure. An additional filtering step could be applied to reduce this noise; on the other hand, it can affect the quality of boundaries, especially if there are low local variations between objects with different classes.

The objects generated by the SSC, MSS, and MRS methods are more patch-like than those created by the MF, SVM, and k-means methods. The yellow circles illustrate the ability of these methods to deal with heterogeneous objects, i.e., C2 objects. This can be explained by the fact that these methods use a flexible neighbourhood system to classify the image. Whilst these methods provide a better appearance than the fixed neighbourhood systems, the results are subject to the function and parameters specified by a human expert. For example, the MSS map provided a better appearance for the bridge objects (the red circles in Figure 10) than the MRS method. In contrast, the tree objects are more homogeneous in the MRS map. Another issue for the geometry applied by these methods is that it lacks the ability to directly model the objects in the scene during classification. Consequently, these methods can use spatial relationships only once the image is segmented.

Conversely, the geometry of the VA model allows the algorithm to directly link an object in the simulation space to its corresponding object in the scene. It provides an actual geometry that allows the method to automatically formulate the topological relationships between and within objects without setting any geometric parameters. The VA map in Figure 10b shows how the use of this geometry improves the appearance of the classified map and removes speckles within objects; for instance, there are no isolated pixels within the C1 and C4 objects.

Accuracy Analysis
In Table 1, the highest values of each metric for each class are marked in bold and the lowest values are underlined. As can be seen from Table 1, the best results are achieved by the VA model. The OA value for the VA model is 99.97%, which indicates that most pixels are correctly detected. However, this improvement is not significant, as the OA of the k-means is 99.83%. This is due to the imbalanced distribution of the ground data: almost 85% of the ground data belong to the C1 class. Thus, the accuracy of the C1 class overshadows the inaccuracy of the C3 and C4 classes. Moreover, the ground data do not cover the whole C4 object, i.e., the bridge, even though it has a sharp and clear exterior boundary. The OA of the SVM model in this experiment is slightly lower than that of the k-means because the C2 and C3 pixels are misclassified. However, the recall values show the better performance of the SVM model in reducing the isolated pixels within C1, C2, and C4 compared to the k-means. Table 1 also shows the high performance of the VA model in identifying pixels in C4 objects. The results of the VA model are better than those of the MRS method, with an almost 7% improvement in the precision and Jaccard-index values. This is because the parameters specified by an analyst form the geometry of the segments, and the method models the interior and exterior boundaries of objects separately. Thus, some isolated pixels or regions exist within segments, which can change features' spectral behaviour, especially when they are close to borders in the feature space. The results in Table 1 also confirm the improved performance of the VA model compared to the MSS method, by more than 3%, in extracting the C4 pixels.
Although the recall values for the C4 class are 100%, the classified maps show some isolated pixels (see Figure 10d,e). Thus, the C4 recall values do not accurately reflect the actual geometric structure of the bridge object in the classified maps. This is because the ground data in Figure 8b do not cover the whole bridge structure. The F-score of the VA model is better than that of the MRS and MSS algorithms for the C2 and C3 objects. This indicates that the proposed geometry of the VA can accurately model heterogeneous objects such as C2 and C3 in the scene.
For a better analysis of the performance of the VA model, we used the geometric parameters of two heterogeneous classes, C3 and C4, in this experiment. In Table 2, the P/A ratio is calculated using the geometric information of the objects in the C3 and C4 classes. Table 2 shows that the VA algorithm used 17 objects to extract the C2 and C3 pixels. This number for the SSC map is 63, while the difference between the calculated values for the C2 and C3 classes in Table 1 is less than 1% for these two methods. This means the C2 and C3 objects are less fragmented and more connected in the VA map. The numbers of regions for the C2 and C3 classes in the ground truth map (Figure 8b) are 17 and 3, respectively. From Table 2, it can be seen that there is an almost 3% difference between the total areas of the C3 objects in the SSC and VA maps, while there is a 66% difference between the perimeters of the C3 objects. This means that the C3 objects in the SSC map are more fragmented. The reason for this is the isolated pixels and regions within the C3 objects, which increase fragmentation. These holes could be removed via an additional process; for example, the algorithm could use an additional filter to identify the holes within objects and cut them out. However, the results would be subjective to the size and shape of the filters and the shapes of the holes. Additionally, this process would decrease the level of automation of unsupervised algorithms.
The geometric information of the C2 objects in Table 2 illustrates that the VA algorithm reduces the noise within C2 objects better than the SSC method. The MRS method applied 22 patches to cover an area of 10,225.76 m², while the VA model used 11 patches to extract C2 objects with an area of 9335.86 m². The MSS and VA algorithms generated six patches to address the C3 objects. However, the P/A ratio of the VA model is 0.02% lower than that of the MSS method for the C3 objects. This means the connectivity between C3 pixels in the VA map is better than in the MSS map.

Dataset 2

Speckled Noise Analysis
It can be clearly observed from Figure 11b,e-g how the use of spatial information significantly reduces the speckled noise and produces a better visual appearance in the classified maps.

Figure 11 illustrates that the MF algorithm creates less noise within the C2 objects than the k-means algorithm. For example, the size of the isolated regions within the shadow object, located next to the building in the south of the classified map, is decreased in the MF map. The MSS and MRS algorithms perform better than the MF method in removing the isolated pixels within shadow objects. This is because the shadow objects have sharp and crisp exterior boundaries, and the size of the isolated regions is relatively small. However, the distributions of the shadow objects within the MSS and MRS maps are different because these methods use different functions to segment the image.

Figure 11 shows that the VA map generates a better appearance for the roof objects in the C1 class than the other methods. For example, the red circles in Figure 11 illustrate no isolated pixels or regions within the C3 area in the VA map. This is because the VAs use a unified geometry that enables them to simultaneously model objects' inner and outer boundaries in the image. If there is a geometric hole in a VA, it implies that the object inside the VA has a different class from the VA; the VA then removes this object by simply reconstructing its geometry. In contrast, the geometry applied by the MSS and MRS methods lacks this capability to determine the spatial connectivity within objects in an image. Therefore, an additional step is usually required to reformulate the neighbourhood system to address these holes after classification. Figure 11 confirms no isolated pixels or regions within the classified maps for the C2 and C3 objects.
For the C4 objects, the SVM method shows better performance than the other methods (Figure 11d). For example, the VA map displays a few isolated regions within the C4 area, highlighted by the blue circle. There is no one-to-one relationship between information classes and clusters in the scene (Figure 9f). Thus, neighbouring VAs in the same class but with different clusters cannot join together to remove these regions.
The classified maps show no isolated pixels or regions within the C5 objects, as they have distinctive spectral behaviours. However, a visual assessment between the classified maps in Figure 11 and the ground truth map in Figure 8d illustrates that the MSS and MRS methods significantly change the exterior boundaries of the C5 objects.

Table 3 includes the recall, precision, Jaccard index, and F-score values of the classified maps; the highest and lowest values are highlighted. It can be seen that the VA model generated better recall values for all clusters compared to the MSS and MRS methods. For example, there is a 10% improvement in the VA recall value for the C2 objects compared to those in the MRS map. Although there are no isolated pixels and regions within the C2 objects in the MRS map, the method could not accurately delineate the exterior boundaries of the shadow objects in the classified map. This is because the static geometry of the method does not allow the classifier to change the boundaries of objects during the classification step. The recall value for the C1 objects in the VA map is 99.44%, a 10% improvement compared to the MSS results. A visual assessment between the C1 objects in Figures 8d and 11f illustrates that the MSS algorithm could not accurately delineate the exterior boundaries of the C1 objects. This is because the method uses a unique formula, initialised by a human expert, to model different shapes and neighbourhood distances. The red circles in the classified maps show no isolated regions or pixels within the roof objects in the VA map. Table 3 also shows that the results of the SSC method are better than those of the MSS and MRS methods. For instance, the F-score of the SSC map for the C5 objects is 20% higher than that of the MRS method. The use of an advanced method that allows the classifier to modify the boundaries of segments during the classification step and to use different parameters could improve the results for these methods. However, such changes are against the nature of unsupervised approaches, which aim to increase the level of automation of image classification.

Table 3 shows the better performance of the SVM method compared to the other methods in identifying the C3 pixels. For instance, the recall value of the VA model is 89.40%, while this number is 98.93% for the SVM method. This is because the C4 class in the VA map contains more than one cluster. As a result, the adjacent VAs could not merge with each other in the C4 zones (Figure 9f) in order to remove the isolated C3 areas (the blue circles in Figure 11b). This also decreases the recall value of the VA model for the C4 class compared to the SVM method. As the C4 class has the highest number of training samples, more than 60% of the total ground pixels, the accuracy of the VA model is reduced in comparison to the SVM method. From Table 3, it can be seen that the VA model returns better results for objects in the C5 class than the other methods.

Accuracy Analysis
The overall accuracy (OA) of the VA model and the k-means is 89.19% and 79.56%, respectively; compared with the k-means, the VA model improves the OA to a large degree by reducing the noise. This confirms that using VAs can improve the accuracy of the results by almost 10%. However, the results of the SSC and VA approaches differ by almost 4%. This is because the C4 class contains more than one cluster; accordingly, the C4 VAs could not join together and remove the isolated C3 areas in the scene.

As most ground truth samples belong to the C4 class, this significantly affects the performance of the VA model compared to the SVM method in this experiment. Additionally, the ground pixels do not entirely cover the man-made objects. For instance, there are many isolated pixels and regions within the building objects generated by the SVM and SSC methods, while there are no ground pixels for these areas.
Like the first experiment, we use the geometric information of the C1 and C4 classes for a better assessment of the VA model (Table 4). The VA model used 40 patches to address the C1 objects. This number is 342 patches for the SSC method, while the area difference between the C1 objects in these two methods is less than 3%. This means that the spatial connectivity between C1 objects in the VA map is higher than in the SSC map. Although the MSS and MRS methods used fewer patches to address the C1 and C4 objects, they were unable to delineate the boundaries within these clusters accurately.

Dataset 3

Speckled Noise Analysis

As with the previous experiments, we use a combination of unsupervised and semi-supervised methods to assess the performance of the proposed method. It can be clearly observed from Figure 12 that the VA, MSS, and MRS methods provide more homogeneous areas than the other methods. For instance, the k-means map contains a large amount of salt-and-pepper noise for Class 1 and Class 3, which gives the output a speckled appearance. Figure 12 also demonstrates that the VA map offers a better visual appearance than the SSC map; there are many isolated pixels within the SVM and SSC maps. The VA map shows no speckles within the C2 object. In contrast, a few speckles, especially near the boundaries, can be observed in the other classified maps. The MF map illustrates that the use of filters is effective in reducing noise. However, it can significantly affect the boundaries of objects in the scene. The classified maps show that there are a few isolated regions or pixels within the C4-C7 objects. However, the exterior boundaries of the C4-C7 objects are different, due to the various geometric structures applied by these methods.

Table 5 lists the precision, Jaccard index, F-score, and recall values of the classification maps for the different methods; in the table, the highest and lowest values are marked. (The total number of C1-C7 pixels in Figure 8f is 27,659.) The values in Table 5 confirm that the use of spatial information can significantly improve the performance of the k-means method. The highest recall value for Class 1 belongs to the MSS method, indicating that most C1 pixels are identified. Although the recall value of the VA model is 20% lower than that of the MSS method, the difference between the F-score values of these two methods is 7%. This indicates that the FP value of the MSS is higher than that of the VA model for Class C1, which decreases the recall value of the MSS map for the C3 class by 13%. Although the recall value of the VA for the C1 class is lower than the MSS method, Figure 12 shows that there are no isolated regions within the C1 object, while the MSS map displays a few isolated areas within the C1 and C3 objects. Figure 12 shows a few isolated pixels within the classified maps for Class 2 and Classes 4-7. However, the VA model exhibits better performance for the objects in Class 2 and Classes 4-7 than the MSS, MRS, and SSC methods.
The reason for this is that the misclassified pixels lie on the boundaries between objects, and the boundaries between objects are determined prior to classification. Additionally, the boundary changes during classification are limited to merging segments within the same class. In contrast, the geometry of the VA model allows the objects to affect the geometry of each other at both pixel and object levels, as demonstrated in Figure 6.

Accuracy Analysis
The OA of the SVM method and the VA model is 76.34% and 80.01%, respectively. The difference can be explained by the high spatial distribution of C1 pixels within the C3 class in the SVM map, and vice versa; moreover, the SVM method only applies spectral information to label pixels. As we expected, the VA model improves the OA of the k-means by almost 4.5%. The VA model also demonstrates better performance than the SSC method, with more than a 2% improvement. The main reason for this improvement is the geometry of the VAs, which gives them the power to spatially reduce the noise in the image. Although the OA value of the MSS method is 2% higher than that of the VA method, the recall values of the VA method for the C2-C7 objects are higher than those of the MSS method. This indicates that the proposed geometry of the VA can effectively remove the speckled noise.
We extracted the geometry information of the C1 and C3 classes in the classified maps to calculate the P/A ratio. In contrast to the previous experiments, the ground pixels are concentrated in one part of the GT map and cover the whole C1 and C3 objects in the image (Figure 8f).
From Table 6, it can be clearly observed that the VA map has a lower level of fragmentation compared to the SSC method. For example, the P/A value for the C1 class is 0.09 in the VA map, against 0.21 in the SSC map. This is because the small and isolated patches in the classified maps, i.e., Figure 12c,d, increase the perimeter of features in the image. To address the C1 objects in the image, the VA method generated only 15 patches, the lowest number among all methods in Table 6. This can be explained by the fact that the VA model classifies the pixels in the scene instead of the feature space. Additionally, the unified geometric structure of the VAs allows them to remove the small holes within the objects in the image, as shown in Figures 4 and 5.

For all the above experiments, the processing was carried out using Repast running on an Intel CPU at 3.40 GHz with 16 GB of memory. The average time for the VA method to classify a whole image was 450 s. Although the proposed algorithm is somewhat slow, its level of automation is higher than that of the MSS, MRS, MF, and SSC methods, which are formulated based on geometric parameters specified by a human expert. As the neighbourhood distances are defined automatically by the VAs, not by a human expert, the results of the proposed method can be more consistent and robust.

Conclusions
Conventional k-means methods use pixels in isolation to cluster an image. Because of this, they are unable to overcome limitations such as salt-and-pepper noise. This drawback is usually addressed through a combination of spectral and spatial information. To extract the spatial information around each pixel, classification algorithms generally apply a static geometry formulated on a local fixed window or an irregular polygon, whose properties are generally defined by an expert user. The algorithm then applies this spatial information, together with the spectral information of the pixels, to classify the image or relabel pixels. The primary assumption in this structure is that there is a unique mathematical formula that can be applied to formulate the spatial connectivity between all features in an image. If this assumption is violated, e.g., when there are complex or heterogeneous features in a scene, the clustering results may contain a significant number of misclassifications.
In this paper, we presented a geometry-led approach, constructed on the VA model, to remove salt-and-pepper noise in unsupervised image classification without setting any geometric parameters, e.g., scale. In the presented algorithm, we applied a unified and dynamic geometry, instead of a predefined geometry or a hierarchical structure, to create an actual, flexible neighbourhood system for extracting spatial information and removing speckle noise. The experimental results demonstrated the desirable performance of the VA model. For example, the P/A values in Tables 2, 4 and 6 highlighted that the VAs increase the spatial connectivity between pixels and provide a better visual appearance for the classified maps. Tables 1, 3 and 5 also indicate the better performance of the VA model in removing noise compared to the MSS and MRS methods.
For future research, we plan to improve the performance of the proposed method by reducing its processing time. This can be achieved by adding learning capabilities to the VAs so that they find the shortest route when determining boundary lines, which can help the algorithm save memory and reduce simulation time. Another avenue for future research is to adapt the proposed method for object extraction from remotely sensed imagery, e.g., road extraction.