Robust Object Categorization and Scene Classification over Remote Sensing Images via Features Fusion and Fully Convolutional Network

Abstract: Recent advances in vision technologies have had an evident impact on remote sensing scene classification. Scene classification is one of the most challenging yet important tasks in understanding high-resolution aerial and remote sensing scenes. In this discipline, deep learning models, particularly convolutional neural networks (CNNs), have achieved outstanding results, and deep feature extraction from a CNN model is a frequently used technique. Although CNN-based techniques have achieved considerable success, there is still ample room for improvement in their classification accuracy. In particular, fusion with other features has the potential to substantially improve the performance of remote sensing scene classification. This paper therefore offers an effective hybrid model based on feature-level fusion. We use the fuzzy C-means segmentation technique to segment the various objects in remote sensing images. The segmented regions of the image are then labeled using a Markov random field (MRF). After segmentation and labeling, classical and CNN features are extracted and combined to classify the objects. After the objects are categorized, object-to-object relations are studied. Finally, these objects are passed to a fully convolutional network (FCN) for scene classification along with their relationship triplets. Experimental evaluation on three publicly available standard datasets demonstrates the strong performance of the proposed system.


Introduction
Recent advances in imaging technology mean that remote sensing (RS) imagery now has a higher resolution than was previously available. RS images are currently being employed in a variety of research disciplines, including object categorization [1], image reconstruction [2], change detection analysis [3], land-use classification [4], scene classification [5], and environmental monitoring [6]. Scene classification for RS images, which aims to assign a scene category to each RS image on the basis of its semantic content, is critical in practical applications.
Generally, accurate aerial scene classification requires excellent feature extraction. Apart from classic methods based on hand-crafted features [7], recent years have seen strong performance from deep convolutional neural network (CNN)-based approaches [8]. CaffeNet [9], AlexNet [10], VGG Net [11], GoogLeNet [12], and ResNet [13] are all regularly used CNN models, and CNNs have exhibited an exceptional capacity to extract discriminative features from aerial scenes. Despite the outstanding results obtained using CNN-based approaches, extracting useful features from aerial scene imagery continues to face several difficulties.
To begin, in comparison to natural scenes, aerial scene images exhibit a significant degree of intraclass diversity. Specifically, objects belonging to the same scene type may appear in a variety of sizes and orientations. Additionally, the appearance of the same scene may vary with the imaging conditions, such as the altitude of the image-capturing equipment and the solar elevation. Secondly, scene images from distinct classes may contain identical objects and similar structures, resulting in a low degree of interclass dissimilarity. In general, a strong representation of aerial imagery is critical for gaining a competitive edge in this field. As a result, the features that we employ and how we apply them are becoming increasingly significant in the domain of aerial scene classification.
In this paper, we present an effective framework to enhance the classification accuracy for remote sensing imagery. Initially, we apply fuzzy C-means segmentation to partition the scene into homogeneous regions that correspond to the different objects in the scene. After segmentation, a Markov random field (MRF) model is adopted as a postprocessing and labeling technique. During postprocessing, the segmented regions of the image are more clearly separated, as disconnected parts are converted to connected components, and unique labels are then assigned to the segmented objects using a probabilistic approach. Once the segments have been labeled, features are extracted from them using classical and CNN-based methods. As a deep feature extractor, we deploy a pretrained CNN, while super-pixel patterns, spectral-spatial features (SSFs), and Haralick texture features are extracted as classical features. A parallel feature fusion is incorporated to fuse all the extracted features. The fused feature set is passed to multiple kernel learning (MKL) for object categorization in the remote sensing imagery. These categorized objects are then analyzed for the object-to-object relationships (OORs) present in the scene imagery. Finally, these relationship triplets and the categorized objects are fed to a fully convolutional network (FCN) for scene classification. We evaluated our system on three publicly available datasets, and the comparison of our results with various state-of-the-art (SOTA) methods demonstrates significant improvements. The key contributions of this research are as follows:

• We employ an MRF as a postprocessing and labeling technique after segmentation to mitigate the problems that other segmentation techniques encounter, which in turn supports accurate scene classification.
• CNN and classical features, including Haralick features, spectral-spatial features, and super-pixel patterns, are fused to improve the classification accuracy.
• MKL-based categorization significantly enhances the performance of object categorization.
• Probability-based OORs are introduced to contextually analyze the relationships between the objects present in the remote sensing scenes.
• After object categorization and OOR exploration, an FCN is applied for remote sensing scene classification.
The rest of the paper is organized as follows: Section 2 discusses related works. Section 3 provides an overview of the proposed method, which includes segmentation, labeling, feature extraction, and their fusion. Section 4 gives the details of the datasets used, the experimental design, and the outcomes. Lastly, in Section 5, we provide the conclusions of this study.

Related Work
Exploring the locations among several objects, their calibration and positioning, and the impact of scenic imagery are complicated issues in the domain of aerial and remote sensing images. We conducted a literature review across multiple domains, including object categorization, object segmentation, labeling, and scene classification to develop appropriate dynamics and metrics for the presented approach.

Object Categorization
The area of object categorization involves various challenges for researchers, including locating objects, detecting and analyzing their relationships, finding occluded components, and separating classes for desirable outcomes. Over the last decade, the bag-of-features model has undoubtedly been the most popular and effective paradigm for imagery categorization and classification. Numerous intriguing works have focused on the bag-of-features concept [14]. Martin et al. [15] developed a Bayesian inference model to assess each object's previous knowledge to track several objects. It then revised the potential mass function to allow for more precise object recognition and convergence rate for accurate classification. In [16], they offered a unique class-specific illustration technique for object categorization. Initially, they used a Gaussian mixture model (GMM) to describe the features of images inside that class. Image and GMM models were then compared in terms of their respective Euclidean distances, which were utilized to represent each image. This was achieved by concatenating the representations of all the classes. In this method, they could express an image by combining the class-specific features, as well as the visual components. In [17], an effective technique was presented to classify the indoor-outdoor scenes by employing multi-object categorization. They used two different approaches to segment the images, and then object categorization was performed using multiple kernel learning (MKL) by considering local descriptors with the combination of signatures of a specific region. After finding the object relationships, they applied multiclass logistic regression to classify the scenes.
Wong et al. [18] presented an approach for online object detection and classification of the image's object classes. They proposed using kernel learning to rapidly track all the objects in a scene rather than relying on past knowledge of a single object. The Neo-vision2 tower benchmark dataset was used to develop a biologically inspired approach for detecting an object's contours and motion. Sumbul et al. [19] developed methods that incorporated the attention of a multisource region network that computed the pre-source feature illustration and was then distributed across the network's members on the basis of their representations of object locations. They employed multispectral approaches to achieve better accuracy.

Scene Classification
Previously published research utilized low-level cues to categorize objects and scenes. These low-level cues include histograms of gradients [20], statistical analysis of structural information for texture discrimination [21], GIST [22], and the scale-invariant feature transform (SIFT) [23]. However, these solutions depended on engineering skill and expert knowledge to design feature representations, which limits their ability to represent large amounts of scene data. To overcome the shortcomings of low-level feature-based classification, several approaches have been devised that aggregate the collected local low-level visual cues into a mid-level scene representation. One of the most extensively used systems based on mid-level visual features is the bag of visual words (BoVW) [24]. It constructs a visual dictionary using k-means clustering, and the mid-level visual information is then obtained through feature encoding. BoVW and its advanced versions have been used to classify scenes on numerous occasions. Additionally, other mid-level features based on traditional approaches exist, including spatial pyramid matching [25], the improved Fisher kernel [26], and vectors of locally aggregated descriptors [27].
However, the previously described approaches, which rely on low- and mid-level features retrieved from RS imagery, are not particularly sophisticated and, hence, cannot adequately reflect the semantic information contained in images. Recent research has demonstrated that deep learning approaches, particularly CNNs, perform exceptionally well in computer vision applications due to their great feature extraction capacity. RS image scene classification is a high-level image processing task that is strongly connected to computer vision. Early RS images had a poor resolution, and the scenes to be identified were large-area land cover, in contrast to the natural images used in computer vision, which focus on small-scale objects. As a result, it was difficult to incorporate deep learning-based algorithms into the categorization of RS image scenes. However, RS images now have a high spatial resolution, and the disparity between RS and natural images has narrowed; hence, the possibility of incorporating computer vision techniques into remote sensing image processing has increased. Numerous CNN-based algorithms for scene classification have been introduced recently [28]. Rather than relying on low- and mid-level cues, CNN-based approaches can extract hierarchical features from RS images. Additionally, the majority of CNN-based approaches make use of models that have been pretrained on ImageNet [29], including AlexNet [10], VGG [11], ResNet [13], and DenseNet [30]. Hu et al. [31] validated the efficiency of CNN models utilizing convolutional layer features. Li et al. [32] suggested a unique filter bank for simultaneously capturing local and global information in order to improve the classification results. They investigated the effect of various training procedures on the categorization process. Their system includes three different training approaches: feature extraction with a pretrained CNN, fine-tuning of a pretrained CNN, and training a network from scratch. The experimental findings revealed that the fine-tuning strategy achieved a better classification accuracy than the other two procedures.

Proposed System Methodology
This section presents a novel object categorization and scene classification (OCSC) model that categorizes the objects present in remote scene imagery and classifies the scenes on the basis of these categorized objects. Initially, a remote sensing image is segmented by employing the fuzzy C-means (FCM) algorithm. Then, the segmented objects are further processed to improve the segments and are labeled via an MRF. Features are extracted from the labeled objects by a CNN, while classical features including Haralick features, SSFs, and super-pixel patterns are also extracted. After the fusion of these extracted features, MKL is applied to categorize the unique objects in the remote scene images. Once the categories of the objects are separated, the OOR is computed on the basis of probability triplets. Finally, these OOR probabilities and object categories are taken as the input of an FCN for remote sensing scene classification. Figure 1 illustrates the hierarchical view of our system.

Preprocessing Stage
Un-sharp masking [33] is applied during preprocessing to sharpen the image and enhance its edges. Un-sharp masking is controlled by three parameters: amount, radius, and threshold. The amount parameter adjusts the contrast added at the edges and is typically specified as a percentage. The radius defines the thickness of the enhanced edges, which grows as the radius is increased. The threshold is used to control the image's brightness level. We set the radius and amount parameters to 0.75% and 1.25%, respectively, during our study. The following formula can be used to obtain a sharper image:
$$I_{sh} = I_{o} + amt \cdot (I_{o} - I_{b})$$
where $I_{sh}$ represents the sharpened image, $I_{o}$ specifies the original image, $I_{b}$ represents the blurred image, and $amt$ is the amount parameter, which denotes the strength of the sharpening effect.
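As an illustration only, the sharpening step can be sketched in a few lines of Python. The sketch assumes the standard un-sharp masking form, in which the sharpened image is the original plus `amount` times the difference between the original and a blurred copy. The function names, the mean (box) filter used in place of whatever blur the study used, and the default parameter values are all assumptions made for the example, not the implementation used in the experiments.

```python
def box_blur(image, radius=1):
    """Mean-filter a 2-D grayscale image given as a list of rows.

    Pixels near the border average over the part of the window inside the image.
    """
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            blurred[y][x] = sum(window) / len(window)
    return blurred


def unsharp_mask(image, amount=1.0, radius=1, max_value=255.0):
    """Return I_sh = I_o + amount * (I_o - I_b), clipped to [0, max_value]."""
    blurred = box_blur(image, radius)
    return [
        [min(max_value, max(0.0, o + amount * (o - b))) for o, b in zip(row_o, row_b)]
        for row_o, row_b in zip(image, blurred)
    ]


# A vertical step edge: the bright side of the edge is pushed above 100,
# the dark side is pushed below 0 and clipped, so the edge contrast increases.
edge = [[0, 0, 100, 100] for _ in range(3)]
sharpened = unsharp_mask(edge)
```

Flat regions are left unchanged because the blurred and original values coincide there; only pixels near intensity transitions are modified.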

Object Segmentation via Fuzzy C-Means
This section describes the fuzzy C-means (FCM) approach [34,35] used for segmentation. The pixels of the image are treated as data points, and homogeneous components are identified among them. Rather than being assigned to a single cluster, each pixel is, following fuzzy logic, considered to be a member of several clusters to varying degrees. FCM segments the image by iteratively minimizing the squared-error objective function $A_N(P, Q)$:
$$A_N(P, Q) = \sum_{i=1}^{c} \sum_{j=1}^{n} (p_{ij})^{r} \lVert x_j - q_i \rVert^{2}$$
where n is the number of data points, r is a real number greater than 1, c denotes the number of clusters, $p_{ij}$ reflects the membership of pixel $x_j$ in the i-th cluster, and $q_i$ expresses the centroid of the i-th cluster.
$A_N(P, Q)$, which accumulates the membership-weighted squared distance between each pixel and each cluster center, may be calculated from P and Q. A pixel at a small distance from a cluster center is allocated a high membership value for that cluster. The typical FCM approach has a high computational complexity because the distance from every cluster center to every pixel must be recomputed at each iteration. Figure 2 shows the outcomes of segmenting the images from the UCM dataset.
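The alternating updates of FCM can be sketched as follows. This is a minimal sketch under stated assumptions, not the segmentation code used in the study: it clusters scalar intensities rather than full pixel feature vectors, uses the textbook membership and centroid updates with fuzzifier `r`, and initializes the centroids deterministically over the value range. The names `fuzzy_c_means` and `hard_labels` are hypothetical.

```python
def fuzzy_c_means(values, n_clusters=2, r=2.0, n_iter=50, eps=1e-9):
    """Cluster scalar intensities with fuzzy C-means.

    Returns (centroids, memberships), where memberships[i][j] is the degree
    to which values[j] belongs to cluster i. r > 1 is the fuzzifier.
    """
    lo, hi = min(values), max(values)
    # Assumed initialisation: centroids spread evenly over the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / n_clusters for i in range(n_clusters)]
    memberships = [[0.0] * len(values) for _ in range(n_clusters)]
    exponent = 2.0 / (r - 1.0)
    for _ in range(n_iter):
        # Membership update: p_ij = 1 / sum_k (d_ij / d_kj)^(2 / (r - 1)).
        for j, x in enumerate(values):
            dists = [abs(x - q) + eps for q in centroids]
            for i in range(n_clusters):
                memberships[i][j] = 1.0 / sum((dists[i] / d) ** exponent for d in dists)
        # Centroid update: q_i = sum_j p_ij^r x_j / sum_j p_ij^r.
        for i in range(n_clusters):
            weights = [p ** r for p in memberships[i]]
            centroids[i] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centroids, memberships


def hard_labels(memberships):
    """Assign each value to the cluster with the highest membership."""
    n_clusters, n_values = len(memberships), len(memberships[0])
    return [max(range(n_clusters), key=lambda i: memberships[i][j]) for j in range(n_values)]


centroids, memberships = fuzzy_c_means([10, 12, 11, 200, 205, 198])
labels = hard_labels(memberships)
```

On this toy input the two centroids settle near the means of the dark and bright groups, and the memberships of each value sum to one across clusters. The nested loop over pixels and clusters in every iteration is the source of the computational cost noted above.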

Labeling via Markov Random Field
A Markov random field (MRF) [36,37] can be described formally by a set of sites S = {1, . . . , N}, which correspond to the N pixel locations. A collection of random variables $\{w_n\}_{n=1}^{N}$ and a set of neighbors $\{\mathcal{N}_n\}_{n=1}^{N}$ are associated with the N sites. To qualify as a Markov random field, the model must satisfy the following Markov property:
$$Pr(w_n \mid w_{S \setminus n}) = Pr(w_n \mid w_{\mathcal{N}_n})$$
That is, each variable is conditionally independent of all other variables given its neighbors. As a result, an MRF can be considered to be an undirected model that specifies the joint probability of the variables as a product of potential functions such that
$$Pr(w_1, \ldots, w_N) = \frac{1}{Z} \prod_{j=1}^{J} \phi_j\left[w_{C_j}\right]$$
where $\phi_j[\bullet]$ is the j-th of J potential functions, which never yields a negative value. Its value is determined by the state of a subset of the variables $C_j \subset \{1, \ldots, N\}$. This subset is referred to as a clique. The partition function, denoted by Z, is a normalizing constant that ensures the result is a valid probability distribution. We used the MRF for postprocessing of the segmented regions. Segmented regions with discontinuities are first connected by identifying multiple key points on their boundaries and linking these key points, so that the regions are accurately separated. Then, pixels with similar features inside each connected boundary are grouped together and assigned a unique label. Figure 3 illustrates the results of MRF labeling on a selection of images from the AID.
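To make the role of neighborhood potentials concrete, the following sketch refines a label map with iterated conditional modes (ICM) under a Potts prior on the 4-neighborhood. It is a generic illustration of MRF-based label smoothing, not the key-point-based postprocessing described above; the function name `icm_potts`, the unary cost layout, and the penalty `beta` are assumptions made for the example.

```python
def icm_potts(unary, beta=1.0, n_sweeps=5):
    """Refine a label map with iterated conditional modes under a Potts MRF.

    unary[y][x][k] is the cost of giving label k to pixel (y, x); beta is the
    penalty paid for each 4-neighbour that carries a different label.
    """
    height, width, n_labels = len(unary), len(unary[0]), len(unary[0][0])
    # Start from the label with the lowest unary cost at each pixel.
    labels = [
        [min(range(n_labels), key=lambda k: unary[y][x][k]) for x in range(width)]
        for y in range(height)
    ]
    for _ in range(n_sweeps):
        changed = False
        for y in range(height):
            for x in range(width):
                neighbours = [
                    labels[yy][xx]
                    for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= yy < height and 0 <= xx < width
                ]

                def local_energy(k):
                    # Unary cost plus one penalty per disagreeing neighbour.
                    return unary[y][x][k] + beta * sum(k != n for n in neighbours)

                best = min(range(n_labels), key=local_energy)
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break
    return labels


# A 3x3 patch where only the centre pixel weakly prefers label 1.
unary = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
unary[1][1] = [0.6, 0.0]
smoothed = icm_potts(unary, beta=1.0)
unsmoothed = icm_potts(unary, beta=0.0)
```

With `beta=1.0` the isolated centre label is overruled by its four neighbours, whereas with `beta=0.0` each pixel keeps its own preferred label. This is the mechanism by which an MRF turns fragmented segments into connected, consistently labeled regions.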

Feature Extraction
To categorize the objects in remote sensing imagery, various classical and deep features are analyzed. Classical features including Haralick texture features, SSFs, and super-pixel patterns are computed on the basis of statistical techniques while deep learning-based features are extracted using a pretrained CNN model. This section covers the feature computation, fusion, and selection processes in detail.

CNN Features
To extract CNN features [38], VGG-16 (a pretrained CNN model) is incorporated. Deng et al. [39] trained this model on the ImageNet dataset. The model is simple and comprises an input layer, 13 convolutional layers, and five max-pooling layers, followed by three fully connected layers. The input layer takes images with dimensions of 320 × 320 × 3. The window size for max pooling is 2 × 2. The rectified linear unit (ReLU) is used as the activation function in the hidden layers. To extract effective CNN features, a transfer learning method is applied that exploits the already learned features rather than training a new model from scratch. The general architecture of CNN feature extraction is shown in Figure 4.
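The layer counts above and the size of the resulting convolutional feature map can be checked with a short calculation. The sketch below assumes the standard VGG-16 layout (3 × 3 convolutions with padding 1, so that only the 2 × 2 max-pooling layers with stride 2 change the spatial size); the names `VGG16_CONFIG` and `vgg16_feature_shape` are hypothetical.

```python
# Standard VGG-16 convolutional layout: integers are the output channels of a
# 3x3 convolution, "M" marks a 2x2 max-pooling layer with stride 2.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_feature_shape(input_size=320):
    """Return ((height, width, channels), n_conv_layers, n_pool_layers)."""
    size, channels, n_conv, n_pool = input_size, 3, 0, 0
    for layer in VGG16_CONFIG:
        if layer == "M":
            size //= 2          # each pooling layer halves the spatial size
            n_pool += 1
        else:
            channels = layer    # padded 3x3 conv keeps the spatial size
            n_conv += 1
    return (size, size, channels), n_conv, n_pool


shape, n_conv, n_pool = vgg16_feature_shape(320)
```

Under these assumptions, a 320 × 320 × 3 input yields a 10 × 10 × 512 feature map after the last pooling layer, since five halvings divide each side by 32.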

Haralick Features
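Remote sensing images of different objects may appear identical in color but have distinct texture patterns. This inspired us to integrate texture features that behave as local descriptors. To obtain texture features, we used a co-occurrence matrix. The local features derived from the co-occurrence matrix are termed Haralick features [40]. Haralick assumed that this matrix contains the texture information, and the texture features are subsequently computed from it. Fourteen features can be derived from the co-occurrence matrix; however, only four are commonly used. These four texture features, energy (E), contrast (C), correlation (Cor), and entropy (H), are computed by the following equations, given here in their standard form:
$$E = \sum_{i}\sum_{j} p(i,j)^{2}$$
$$C = \sum_{i}\sum_{j} (i - j)^{2}\, p(i,j)$$
$$Cor = \frac{\sum_{i}\sum_{j} (i - \mu_x)(j - \mu_y)\, p(i,j)}{\sigma_x \sigma_y}$$
$$H = -\sum_{i}\sum_{j} p(i,j) \log p(i,j)$$
where $p(i,j)$ is the (i, j)-th entry of the normalized co-occurrence matrix, and $\mu_x$, $\mu_y$, $\sigma_x$, and $\sigma_y$ are the means and standard deviations of its row and column marginals. It was demonstrated that these four parameters were sufficient to produce acceptable results in a classification test. These four parameters are listed with their values in Table 1.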
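As a hedged illustration, the four texture features can be computed from a small quantized image as follows. The sketch assumes the textbook definitions of energy, contrast, correlation, and entropy on a normalized, symmetric gray-level co-occurrence matrix, a single horizontal offset, and the natural logarithm for entropy; the function names and these choices are assumptions rather than the settings used in the study.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalised, symmetric grey-level co-occurrence matrix for offset (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            yy, xx = y + dy, x + dx
            if 0 <= yy < height and 0 <= xx < width:
                i, j = image[y][x], image[yy][xx]
                counts[i][j] += 1.0
                counts[j][i] += 1.0  # count each pair in both directions
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("image too small for the requested offset")
    return [[c / total for c in row] for row in counts]


def haralick_features(p):
    """Energy, contrast, correlation, and entropy of a normalised GLCM p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    energy = sum(v * v for _, _, v in cells)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)
    mu_x = sum(i * v for i, _, v in cells)
    mu_y = sum(j * v for _, j, v in cells)
    sd_x = math.sqrt(sum((i - mu_x) ** 2 * v for i, _, v in cells))
    sd_y = math.sqrt(sum((j - mu_y) ** 2 * v for _, j, v in cells))
    if sd_x == 0 or sd_y == 0:
        correlation = 0.0  # undefined for a constant image; 0 is a convention
    else:
        cov = sum((i - mu_x) * (j - mu_y) * v for i, j, v in cells)
        correlation = cov / (sd_x * sd_y)
    return {"energy": energy, "contrast": contrast,
            "correlation": correlation, "entropy": entropy}


# A two-level checkerboard: every horizontal neighbour pair differs.
checkerboard = [[(x + y) % 2 for x in range(4)] for y in range(4)]
features = haralick_features(glcm(checkerboard, levels=2))
```

For the checkerboard, all co-occurrence mass lies off the diagonal, which gives maximal contrast for two gray levels and a correlation of -1; a uniform patch would instead give zero contrast and zero entropy.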

Spectral-Spatial Features (SSFs)
Mathematical morphology [41] is one of the well-known paradigms that equips operators with the ability to generate high-quality SSFs [42]. Erosion and dilation are basic mathematical morphology operations that examine an image's geometrical structures by comparing them to small patterns called structuring elements.
Attribute filters (AFs): Flat regions of the image, i.e., areas of the image that have comparable gray levels, are used to extract various types of information, specified by the feature names. An equivalent tree representation of the image can be used to build attribute filters efficiently, as in [43]. By applying a threshold to all of the image's mapped values, the following upper- and lower-level sets (i.e., flat zones) are created, which can be further classified into subcategories: where ConComp denotes the connected components of the generic image. An inclusion relationship [33] exists between the connected components obtained from either the lower- or the upper-level sets.
Attribute profiles (APs): APs define a generic collection of profiles that make use of the attribute filter's flexibility to conduct a more thorough investigation of the scene.
Extended attribute profiles (EAPs): Because hyperspectral sensors acquire data across many spectral bands, EAPs based on morphological attribute filters are used to analyze hyperspectral high-resolution images. The EAPs are obtained by applying the APs to hyperspectral data:
$$EAP = \{AP(PC_1), AP(PC_2), \ldots, AP(PC_m)\}$$
where $PC_k$ denotes the k-th principal component obtained by applying principal component analysis to the data, and m is the number of retained components.
Extended multi-attribute profiles (EMAPs): Multiple attributes can be used to extract spatial elements more effectively; hence, EMAPs combine multiple EAPs into a single data structure.
The spatial information extracted by the EMAP is substantially richer than that of a single EAP, while the additional computation remains limited, as the max-tree and min-tree are generated just once for each PC and are then processed with the various attributes at multiple stages. The visual results of SSFs over aerial images are presented in Figure 5.

Super-Pixel Pattern
We create super-pixels following [44], using a method that is faster and more memory-efficient than alternative approaches, exhibits state-of-the-art boundary adherence, and improves segmentation efficiency. Simple linear iterative clustering (SLIC) is a modification of k-means for super-pixel creation with two critical differences. First, the optimization time is reduced by restricting the search area to a region proportional to the super-pixel size, which leads to significantly fewer distance calculations. Second, the complexity does not depend on the number of super-pixels k and is therefore linear in the number of pixels N. The size and compactness of the super-pixels can be regulated by combining color and spatial distance into a weighted distance metric.
Super-pixels correspond to clusters in the combined color and image-plane space. This causes an issue in defining the distance measure $Dist_F$, which is used to compute the distance between a pixel i and a cluster center $C_k$. The color of every pixel is represented in the color space $[l\ a\ b]^T$, whose range of values is known. The pixel's position $[x\ y]^T$, on the other hand, takes values whose range varies with the size of the image. We therefore compute two distances, a color distance $d_{col}$ and a spatial distance $d_{spt}$, normalize them by their respective maximum distances within a cluster, $Nor_{col}$ and $Nor_{spt}$, and combine them into a single measure. In doing so, $Dist_F$ is written as
$$Dist_F = \sqrt{\left(\frac{d_{col}}{Nor_{col}}\right)^{2} + \left(\frac{d_{spt}}{Nor_{spt}}\right)^{2}}$$
Results of super-pixel patterns computed over some remote sensing images from the UCM dataset are presented in Figure 6.
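The combined distance can be illustrated with a short sketch. It assumes the SLIC-style form in which the Euclidean color distance and the Euclidean spatial distance are each divided by their normalizers and then combined under a square root. The tuple layout `(l, a, b, x, y)` and the function names are assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def slic_distance(pixel, center, nor_col, nor_spt):
    """Normalised colour-plus-spatial distance between a pixel and a centre.

    pixel and center are (l, a, b, x, y) tuples; nor_col and nor_spt are the
    maximum colour and spatial distances expected within a cluster.
    """
    d_col = euclidean(pixel[:3], center[:3])
    d_spt = euclidean(pixel[3:], center[3:])
    return math.sqrt((d_col / nor_col) ** 2 + (d_spt / nor_spt) ** 2)


def nearest_center(pixel, centers, nor_col, nor_spt):
    """Index of the cluster centre with the smallest combined distance."""
    return min(
        range(len(centers)),
        key=lambda k: slic_distance(pixel, centers[k], nor_col, nor_spt),
    )


pixel = (50, 0, 0, 3, 4)
centers = [(50, 0, 0, 0, 0), (50, 0, 0, 30, 40)]
dist_to_first = slic_distance(pixel, centers[0], nor_col=10, nor_spt=5)
assigned = nearest_center(pixel, centers, nor_col=10, nor_spt=5)
```

Raising `nor_col` relative to `nor_spt` makes spatial proximity dominate and yields more compact super-pixels, while lowering it lets the clusters follow color boundaries more closely.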

Feature Fusion
The CNN features, Haralick features, SSFs, and super-pixel patterns are computed separately as $Feature_{CNN}$, $Feature_{Haralick}$, $Feature_{SS}$, and $Feature_{SP}$, respectively. Each feature vector is first normalized so that the elements of one feature vector do not dominate those of the others. Once normalization is performed, the CNN, Haralick, SSF, and super-pixel feature vectors are merged as in [45] to form a complete fused feature vector.
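A minimal sketch of the normalize-then-merge step is given below. Two points are assumptions, since the text does not specify them: min-max scaling of each feature vector to [0, 1] is used as the normalization, and the merge is implemented as simple concatenation. The function names are hypothetical.

```python
def min_max_normalize(vector):
    """Scale a feature vector to [0, 1]; a constant vector maps to zeros."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        return [0.0 for _ in vector]
    return [(v - lo) / (hi - lo) for v in vector]


def fuse_features(*feature_vectors):
    """Normalise each feature vector separately, then concatenate them."""
    fused = []
    for vector in feature_vectors:
        fused.extend(min_max_normalize(vector))
    return fused


# Toy stand-ins for two feature groups with very different scales.
fused = fuse_features([0, 5, 10], [100, 300])
```

Without the per-group normalization, the second group in this toy example would dominate any distance or kernel computed on the fused vector simply because of its larger numeric range.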
Feature analysis with a two-level decomposition of complex images yields a high-dimensional feature vector, and classification performance degrades when the input feature vectors have high dimensions. Therefore, reducing the size of the feature vectors is important in order to reduce computational costs and improve performance. For this purpose, genetic algorithm (GA)-based [46] feature selection is employed to obtain the reduced-dimensional feature vector $Feature_{Fin}$.

Object Categorization: Multiple Kernel Learning
The proposed system employs MKL [17] to achieve object categorization on the basis of multiple regions and the signatures of those regions in complex remote sensing imagery, as shown in Figure 7. During object categorization, an image I with c clusters, obtained from the segmented and labeled objects and presented in distinct colors, is used to extract the descriptors D I , which describe the region R of the image I. To compute the signature x I , a function f R that maps the local descriptors to the signature, f R : D I → x I , is incorporated. Mathematically, f R can be written as follows: where Center c represents the center of cluster c, |c| denotes the total number of descriptors in cluster c over all the images from a class, D icI denotes the descriptors of image I that belong to cluster c, µ c denotes the mean of the centered descriptors belonging to c, and µ I,c is the signature computed from the image I. Then, a vector VEC I,c is obtained from µ I,c . The signature vector x I of image I is computed by concatenating VEC I,c over all c.
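For readers unfamiliar with MKL, its core idea can be sketched as a convex combination of base kernels, one per feature group, that is then handed to a kernel classifier. The sketch below only builds the combined kernel with fixed weights; a real MKL solver would learn the weights jointly with the classifier. The RBF base kernels, the weight values, and the function names are assumptions and are not taken from [17].

```python
import math


def rbf_kernel(samples, gamma):
    """Gram matrix of the RBF kernel exp(-gamma * ||u - v||^2)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v))) for v in samples]
        for u in samples
    ]


def combine_kernels(kernels, weights):
    """Return K = sum_m beta_m * K_m with beta_m >= 0 and sum_m beta_m = 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two samples described by two hypothetical feature groups.
texture_group = [[0.0], [1.0]]   # the samples differ in this group
colour_group = [[0.0], [0.0]]    # the samples are identical in this group
combined = combine_kernels(
    [rbf_kernel(texture_group, gamma=1.0), rbf_kernel(colour_group, gamma=1.0)],
    weights=[0.5, 0.5],
)
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, the combined matrix can be used wherever a single kernel would be, and the weights indicate how much each feature group contributes to the similarity between two regions.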

Probability-Based Object-to-Object Relations (OORs)
After multiple objects are recognized in a complex scene, the relationships between these objects are identified. To enhance the scene recognition performance, object-to-object relations (OORs) [47] are computed on the basis of contextual information regarding the objects. As complex scenes comprise multiple co-occurring visual features, these OORs help to recognize the patterns needed to understand the scenes. For instance, a car is likely to be seen on roads instead of in the sky or water, while a ship is likely to be in the sea or water instead of on roads. To determine the OORs, several features and the relative positions of the objects are considered. Initially, to find the weight of the j-th target object for j ∈ {1, 2, . . . , n} with respect to another relevant i-th object for i ∈ {1, 2, . . . , n}, a dot product is computed as follows: where the visual cues of the j-th and i-th objects are represented by f j and f i , respectively. The distance between the j-th and i-th objects is denoted by (j, i). Lastly, to determine the relation of the j-th object with other objects, the following expression is used: where the visual features of the i-th object are denoted by f n i . Once the relations between the objects are computed, the scene labels are predicted by the classifier on the basis of these OORs. Figure 8 presents a schematic view of OORs.

Scene Recognition: Fully Convolutional Network
Once the OOR is determined, object triplets and probabilities are forwarded to the FCN that classifies the scenes by incorporating the object category and contextual relationship between those objects. FCN [48] is an architecture that is mostly used for semantic segmentation. FCN employs locally connected layers, including convolution, pooling, and up-sampling, in a variety of ways. Avoiding dense layers results in fewer parameters (i.e., making the networks faster to train). Additionally, because all connections are local, an FCN can be used with varying image sizes. The network is composed of a down-sampling path for extracting and interpreting context, as well as an up-sampling path for localization.
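The claim that a network with only local connections accepts varying image sizes can be illustrated with a toy classification head. The sketch assumes a 1 × 1 convolution that scores every spatial position, followed by global average pooling. This is one common way to obtain a scene-level prediction from a fully convolutional model and is an assumption here, not a description of the network used in the study. The function name `fcn_head` is hypothetical.

```python
def fcn_head(feature_map, weights, biases):
    """Class scores from a 1x1 convolution followed by global average pooling.

    feature_map[y][x] is a list of C channel values; weights[k] holds the C
    weights of class k. No layer depends on the spatial size of the input.
    """
    height, width = len(feature_map), len(feature_map[0])
    scores = []
    for w_k, b_k in zip(weights, biases):
        total = 0.0
        for row in feature_map:
            for pixel in row:
                # 1x1 convolution: a per-position dot product over channels.
                total += sum(w * v for w, v in zip(w_k, pixel)) + b_k
        scores.append(total / (height * width))  # global average pooling
    return scores


weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.5]
small_map = [[[1.0, 2.0] for _ in range(2)] for _ in range(2)]   # 2 x 2 x 2
large_map = [[[1.0, 2.0] for _ in range(5)] for _ in range(3)]   # 3 x 5 x 2
small_scores = fcn_head(small_map, weights, biases)
large_scores = fcn_head(large_map, weights, biases)
```

The same weights produce a score vector of the same length for both the 2 × 2 and the 3 × 5 feature maps, whereas a dense layer would require a fixed input size.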
A fully convolutional network (FCN) with the following hyperparameters is used to classify the remote sensing scenes: a learning rate of 0.01, a batch size of 16, and 32, 64, 128, 264, and 512 filters in conv_block1 through conv_block5, respectively. Although the learning rate could take any floating-point value between 0.0001 and 0.1, a learning rate of 0.01 led to the best results during training for remote scene classification over the benchmark datasets under consideration, i.e., the UCM, AID, and RESISC45 datasets. Similarly, the batch size can range from 1 to 100, but a power of 2 is usually chosen; we chose 16 (2^4) because it performed better during training. Figure 9 depicts the results of scene classification on a benchmark dataset.

Experimental Results
To evaluate the training/testing performance of the proposed model, we used the leave-one-out cross-validation method on three publicly available datasets: the AID, the RESISC45 dataset, and the UCM dataset.
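The splitting scheme of leave-one-out cross-validation can be sketched as follows. This only enumerates index splits under the usual definition (each sample is held out exactly once); how samples are grouped in the actual experiments is not specified here, and the function name is hypothetical.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_indices) with each sample held out once."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


splits = list(leave_one_out_splits(3))
```

Each of the `n_samples` rounds trains on all remaining samples and tests on the single held-out one, and the reported score is the average over all rounds.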

Aerial Images Dataset
The Aerial Images Dataset (AID) [49] is a newly created large-scale aerial image collection. The AID comprises 10,000 images in 30 classes. Each class is composed of 200-400 images, and every image contains at least two and at most eight objects. The dataset covers the following aerial scene types: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. Figure 10 presents some example images from the AID.

RESISC45 Dataset
The RESISC45 dataset [50] is one of the well-known benchmarks for remote sensing image scene classification. This dataset was created by Northwestern Polytechnical University (NWPU); therefore, it is also named NWPU-RESISC45. It consists of 31,500 remote sensing images of 45 scene classes. Each class comprises 700 images, with a minimum of two and a maximum of 10 objects in each class. These classes include airplane, airport, baseball diamond, basketball court, beach, bridge, forest, and golf course. Figure 11 shows a few classes of the NWPU-RESISC45 dataset.

UCM Dataset
The UCM dataset [51] is a benchmark that is publicly available for research purposes. The dataset comprises 21 classes with 100 images in each class. The number of objects in each class may vary from two to seven depending on the class scenario. The dimensions of the images are 256 × 256 pixels. The images were manually extracted from the USGS National Map Urban Area Imagery collection for several cities across the United States. The classes are labeled as agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis court.

Experimental Evaluation
In this section, we present the recognition accuracies based on the confusion matrices computed over three complex datasets: the AID, UCM dataset, and RESISC45 dataset. For OCSC, we used an FCN as a classifier, and the proposed system was evaluated by the leave-one-subject-out (LOSO) cross-validation technique. Figure 13 demonstrates the results over the UCM dataset with an average of 98.75% scene classification accuracy. Figure 14 presents a classification accuracy of 97.73% over the AID, and Figure 15 demonstrates an average accuracy of 96.57% over the RESISC45 dataset.
Class-wise accuracies can be read from the color code assigned to each class label on the left of the graph. A mixture of different colors on the right of a class indicates that other classes appear in its result, i.e., misclassification. A misclassification appears either as a color in the graph line above the specific class, which is a false positive (FP), or as a color in the mixture below the original class, which is a false negative (FN). For instance, the FL class in the AID shows both FPs and FNs in the graph along with correct predictions, where CH and DS are FPs, while FR and IN are FNs.
The recognition results on the UCM dataset show that GC, HB, and PG had lower accuracies than the other scene classes. However, the overall recognition accuracy was better than or comparable with that of other state-of-the-art methods. Of the 21 classes in the UCM dataset, we achieved remarkable performance on 18 classes, while the other three classes had good results, nearly equivalent to those of existing methods.
Similar to the UCM dataset, we observed better performance over the AID compared to other SOTA techniques, as presented in Figure 14. Most classes demonstrated remarkable accuracy. High accuracy was achieved by more than 20 classes, including RV, SQ, SM, ST, and VD, while some other classes (DR, DS, FL, and PD) still need improvement. For instance, the IN class achieved an accuracy of 90%, as shown in Figure 14: 2% of its cases were incorrectly recognized as FL, and 8% were misclassified as DS. As before, class-wise accuracies can be read from the color assigned to each class label on the left of the graph, where a mixture of different colors on the right denotes misclassification.
As with the UCM dataset and AID, the OCSC model demonstrated excellent performance when evaluated over the RESISC45 dataset. Figure 15 illustrates that most classes showed exceptional recognition accuracy, including PG and MW with accuracies of 99%; MW was misclassified as MK 1% of the time. The lowest accuracy was noted for the CL class, which was misclassified as CM, DR, and DS 9%, 8%, and 4% of the time, respectively.
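The per-class readings above can be reproduced from a confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the function name `per_class_report` is hypothetical, the IN row uses the percentages quoted for the AID (90% correct, 2% as FL, 8% as DS), and the FL and DS rows are made-up values for illustration only. It uses the standard convention in which off-diagonal entries in a class's row are its false negatives and off-diagonal entries in its column are its false positives.

```python
def per_class_report(labels, matrix):
    """Summarize a confusion matrix where matrix[i][j] counts samples of
    true class i that were predicted as class j."""
    report = {}
    n = len(labels)
    for i, name in enumerate(labels):
        row_total = sum(matrix[i])
        accuracy = matrix[i][i] / row_total if row_total else 0.0
        # Row i, off-diagonal: samples of this class assigned to another class (FN).
        false_negatives = {labels[j]: matrix[i][j]
                           for j in range(n) if j != i and matrix[i][j]}
        # Column i, off-diagonal: samples of other classes assigned to this class (FP).
        false_positives = {labels[j]: matrix[j][i]
                           for j in range(n) if j != i and matrix[j][i]}
        report[name] = {
            "accuracy": accuracy,
            "false_negatives": false_negatives,
            "false_positives": false_positives,
        }
    return report


labels = ["IN", "FL", "DS"]
matrix = [
    [90, 2, 8],   # IN row: percentages quoted in the text for the AID
    [0, 97, 3],   # FL row: hypothetical values, for illustration only
    [1, 0, 99],   # DS row: hypothetical values, for illustration only
]

report = per_class_report(labels, matrix)
mean_accuracy = sum(r["accuracy"] for r in report.values()) / len(labels)
print(report["IN"], round(mean_accuracy, 4))
```

Averaging the per-class accuracies in this way gives a mean accuracy of the kind reported per dataset, assuming every class contributes equally.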
We also compared classifiers on the three benchmarks: the AID, UCM dataset, and RESISC45 dataset. First, the CNN and classical features (i.e., SSFs, Haralick features, and super-pixel patterns) were given to a commonly used classifier, an artificial neural network (ANN), and its results were obtained. Then, the same features were given to a deep belief network (DBN) for recognition. Finally, the recognition results of these conventional approaches were compared with those of the proposed OCSC model using the FCN. Tables 2-4 present the precision, recall, and F1 score over the AID, RESISC45 dataset, and UCM dataset, respectively.
Although a few classes over the AID had comparable results, we observed a significant overall improvement compared to the other well-known classifiers. A few classes, including BR and DR, showed better recall using the DBN, and PD had better precision using the DBN; however, the proposed model produced excellent results in all classes, and the mean values of precision, recall, and F1 score were highest when applying the FCN (proposed model). A similar pattern was observed when we applied the three classifiers over the RESISC45 dataset. When the DBN was applied to this dataset, the BR and ST classes had better precision, and the AT, BC, BH, FW, and OP classes had better recall, compared to the proposed method. Nevertheless, the proposed FCN again achieved the highest mean precision, recall, and F1 score among the three classifiers.
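For reference, the three measures compared above can be derived per class from a confusion matrix and then averaged. The sketch below assumes the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two) and an unweighted mean across classes; the function names and the two-class toy matrix are hypothetical and do not come from the paper's tables.

```python
def precision_recall_f1(matrix):
    """Return a (precision, recall, f1) tuple per class, where matrix[i][j]
    counts samples of true class i that were predicted as class j."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # rest of row k
        fp = sum(matrix[r][k] for r in range(n)) - tp   # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_mean(scores):
    """Unweighted mean of precision, recall, and F1 across classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Toy two-class matrix, for illustration only.
toy = [
    [8, 2],
    [1, 9],
]
scores = precision_recall_f1(toy)
means = macro_mean(scores)
print(scores, means)
```

Because precision depends on a class's column and recall on its row, one classifier can win on precision for a class while another wins on recall, which is the pattern noted for the DBN on a few classes.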
For a comprehensive evaluation, we compared the proposed system with various existing state-of-the-art (SOTA) methods, including the self-attention feature selection module represented by SAFENet [52], label augmentation via ResNet18 + LA + KL [53], ACNet [54] for exploring local and global features integrated with attention techniques for remote scene classification, ARCNet-VGGnet16 [55], Deep Fusion [56] using a two-stream deep architecture for high-resolution aerial image classification, Fusion by Addition [57], and Siamese ResNet50 [58]. We compared the mean accuracy of scene classification, and the results are presented in Table 5. The proposed OCSC system outperformed the other reported methods in terms of mean accuracy. Specifically, the comparison with BoVW and SAFENet shows an increase in scene classification accuracy, which validates the effectiveness of feature fusion in our model. Furthermore, scene classification accuracy also increased compared to ACNet over the AID and RESISC45 dataset, whereas a slightly lower but comparable accuracy was observed on the UCM dataset.

Ablation Study
We presented various features, including CNN, Haralick, spectral-spatial, and super-pixel patterns. Here, we examine whether each feature adds something new to the system, i.e., whether all of these features are essential for the OCSC system. To answer this, we conducted experiments to validate the influence of feature fusion, using a greedy approach that incrementally added features to the system, starting with the best one, i.e., CNN. Initially, we ran experiments with CNN features only and achieved scene recognition accuracies of 91.37%, 91.88%, and 90.55% over the AID, UCM dataset, and NWPU-RESISC45 dataset, respectively. Then, we added super-pixel patterns (SPPs) and fused them with the CNN features, which improved performance from 91.37% to 92.69% for the AID, 91.88% to 93.19% for the UCM dataset, and 90.55% to 93.57% for the NWPU-RESISC45 dataset. This improvement motivated us to further increase the number of fused features. Next, we added SSFs to the previously fused set of CNN and SPP features, which produced better accuracy than the previous set: recognition accuracy increased from 92.69% to 94.19%, 93.19% to 94.99%, and 93.57% to 95.25% over the AID, UCM dataset, and NWPU-RESISC45 dataset, respectively. Finally, we fused another classical feature, the Haralick feature, with the already fused features and performed experiments for object categorization and scene classification. Combining all the features produced the best recognition performance, with overall recognition accuracies of 97.73%, 98.75%, and 96.57% for the AID, UCM dataset, and NWPU-RESISC45 dataset, respectively.
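The contribution of each fusion step can be tabulated directly from the accuracies reported above. The short sketch below only does this arithmetic; the names `STAGES`, `ACCURACY`, `stage_gains`, and `total_gain` are illustrative and not part of the OCSC system.

```python
# Accuracies (%) reported in the ablation study, in the order features were added.
STAGES = ["CNN", "CNN+SPP", "CNN+SPP+SSF", "CNN+SPP+SSF+Haralick"]
ACCURACY = {
    "AID": [91.37, 92.69, 94.19, 97.73],
    "UCM": [91.88, 93.19, 94.99, 98.75],
    "NWPU-RESISC45": [90.55, 93.57, 95.25, 96.57],
}


def stage_gains(accuracies):
    """Gain in percentage points contributed by each newly fused feature."""
    return [round(b - a, 2) for a, b in zip(accuracies, accuracies[1:])]


def total_gain(accuracies):
    """Gain in percentage points of the full fusion over CNN features alone."""
    return round(accuracies[-1] - accuracies[0], 2)


for dataset, accuracies in ACCURACY.items():
    gains = dict(zip(STAGES[1:], stage_gains(accuracies)))
    print(dataset, gains, "total:", total_gain(accuracies))
```

Read this way, the largest single-step gain on the AID and UCM dataset comes from adding Haralick features (3.54 and 3.76 points), whereas on the NWPU-RESISC45 dataset it comes from adding super-pixel patterns (3.02 points).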
It is clear from the results presented in Figure 16 that the fusion of CNN and classical features produced results comparable to those of CNN. The UCM dataset was a slight exception: there, our approach had lower but acceptable accuracy when the computational complexity of both techniques is taken into account. The well-known CNN models are computationally complex compared to FCM. The details of computational time are given in Table 6. We tested these algorithms on an Intel system with 32 GB RAM and an Intel(R) Core(TM) i7-1065G7 CPU @ 1.30 GHz 1.50 GHz, along with an NVIDIA GeForce GPU. The proposed model required the least computational time for the segmentation of remote sensing images compared to CNN.

Discussion
The proposed OCSC system was designed to achieve object categorization and scene recognition over remote sensing imagery. In this article, we developed a framework that uses FCM for the segmentation of RS images and an MRF for labeling the segmented images. The labeled images were then analyzed to extract features, including CNN features and classical features (Haralick features, spectral-spatial features, and super-pixel patterns). The CNN features were extracted using a pretrained CNN model (VGG16), while the classical features were extracted through machine learning techniques and mathematical formulation. These extracted features were then combined using a parallel fusion mechanism and optimized before being passed to MKL as input, where the various categories of objects were determined. Once the objects were categorized, the object-to-object relationships were determined, and a fully convolutional network was employed to classify the scenes.
Segmentation is the fundamental module for properly classifying remote sensing imagery. Therefore, an effective FCM segmentation mechanism was incorporated to obtain reliable segmented regions from the complex high-resolution scene images. After obtaining the segmented regions, an MRF was applied as a postprocessing step to obtain the labeled objects for subsequent feature extraction. During this labeling phase, the segmented regions were analyzed on the basis of their connectivity (connected or disconnected), and postprocessing was performed to more accurately isolate the boundaries of the regions segmented in the previous phase. These improved segmented regions were then labeled on the basis of a perceptual grouping mechanism, where each segmented region was assigned a unique label (color).
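To make the segmentation step concrete, the following is a minimal sketch of the standard fuzzy C-means update rules applied to a flat list of pixel intensities. It is not the authors' implementation: the function names, the evenly spaced initialization, the fuzzifier m = 2, the fixed iteration count, and the toy intensities are all assumptions, and the MRF labeling stage is not shown.

```python
def _memberships(values, centers, power, eps):
    """Fuzzy membership of every value in every cluster (each row sums to 1)."""
    rows = []
    for x in values:
        dists = [abs(x - c) + eps for c in centers]  # eps avoids division by zero
        rows.append([1.0 / sum((d_i / d_j) ** power for d_j in dists)
                     for d_i in dists])
    return rows


def fuzzy_c_means(values, n_clusters=2, m=2.0, n_iter=50, eps=1e-9):
    """Standard FCM on 1-D data: alternate membership and center updates."""
    lo, hi = min(values), max(values)
    # Assumed deterministic initialization: centers spread evenly over the range.
    centers = [lo + (hi - lo) * (i + 0.5) / n_clusters for i in range(n_clusters)]
    power = 2.0 / (m - 1.0)
    for _ in range(n_iter):
        u = _memberships(values, centers, power, eps)
        for i in range(n_clusters):
            weights = [row[i] ** m for row in u]
            centers[i] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centers, _memberships(values, centers, power, eps)


# Toy intensities with one dark and one bright group, for illustration only.
pixels = [10, 12, 11, 200, 205, 198]
centers, u = fuzzy_c_means(pixels, n_clusters=2)
# Hard segmentation: assign each pixel to its highest-membership cluster.
segments = [max(range(len(row)), key=row.__getitem__) for row in u]
print(centers, segments)
```

In an image setting, the values would be per-pixel intensities or feature vectors, and the soft memberships would be the input that a subsequent labeling step refines.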
This complementary labeling module significantly enhanced object categorization. We conducted experiments for both configurations, i.e., employing only FCM for segmentation and additionally applying the MRF for postprocessing and labeling of the segmented regions. When only FCM-based segmentation was performed, object categorization on the benchmark datasets achieved lower accuracy; adding postprocessing and labeling of the objects using the MRF before feature extraction significantly increased the object categorization accuracy. The details of these experimental results are given in the ablation experiment section. Moreover, our approach of fusing the extracted CNN features and classical features had an impact on scene recognition accuracy, which led to the overall enhanced scene classification. The effect of the different features on object categorization and scene recognition is illustrated in detail in the ablation experiment section.
We applied an ANN and a DBN for remote sensing scene classification and compared the results with the FCN (proposed) model. Although a few classes over the AID had comparable results, we observed a significant overall improvement compared to the other well-known classifiers. A few classes, including BR and DR, showed better recall using the DBN, and PD had better precision using the DBN; however, the proposed model produced excellent results in all classes. Similarly, the mean values of precision, recall, and F1 score were highest when applying the FCN (proposed model).
A similar pattern was observed when we applied the three models over the RESISC45 dataset. With the DBN, the BR and ST classes had better precision, and the AT, BC, BH, FW, and OP classes had better recall, compared to the proposed method. Nevertheless, the proposed FCN achieved the highest mean precision, recall, and F1 score among the three classifiers.
Despite the strong performance of the OCSC model, we also encountered some limitations and constraints. Some tiny objects, such as people and animals, eluded our classification. Similarly, multiple vehicles were sometimes recognized as a single vehicle when they were occluded by more than 50% in terms of pixels.

Conclusions
The proposed OCSC system was designed to achieve object categorization and scene classification over various complex aerial scene images from publicly available benchmark datasets. In this paper, we incorporated FCM followed by an MRF to segment and label the aerial images from different remote sensing benchmark datasets. We then analyzed these labeled images to extract classical and deep features, which were taken as input for object categorization by employing MKL. After the categorization of the multiple objects present in the remote sensing scene images, the inter-object relationships were computed to finally classify the scenes by applying the FCN. The results show that the proposed model outperformed the SOTA remote sensing scene classification techniques.