A Hybrid OSVM-OCNN Method for Crop Classification from Fine Spatial Resolution Remotely Sensed Imagery

Abstract: Accurate information on crop distribution is of great importance for a range of applications including crop yield estimation, greenhouse gas emission measurement and management policy formulation. Fine spatial resolution (FSR) remotely sensed imagery provides new opportunities for crop mapping at a detailed level. However, crop classification from FSR imagery is known to be challenging due to the great intra-class variability and low inter-class disparity in the data. In this research, a novel hybrid method (OSVM-OCNN) was proposed for crop classification from FSR imagery, which combines a shallow-structured object-based support vector machine (OSVM) with a deep-structured object-based convolutional neural network (OCNN). Unlike pixel-wise classification methods, the OSVM-OCNN method operates on objects as the basic units of analysis and, thus, classifies remotely sensed images at the object level. The proposed OSVM-OCNN harvests the complementary characteristics of the two sub-models: the OSVM with effective extraction of low-level within-object features and the OCNN with capture and utilization of high-level between-object information. By using a rule-based fusion strategy based primarily on the OCNN's prediction probability, the two sub-models were fused in a concise and effective manner. We investigated the effectiveness of the proposed method over two test sites (i.e., S1 and S2) that have distinctive and heterogeneous patterns of different crops in the Sacramento Valley, California, using FSR Synthetic Aperture Radar (SAR) and FSR multispectral data, respectively. Experimental results illustrated that the proposed OSVM-OCNN approach increased markedly the classification accuracy for most crop types in S1 and all crop types in S2, and consistently achieved the greatest accuracy in comparison with its two object-based sub-models (OSVM and OCNN) as well as the pixel-wise SVM (PSVM) and CNN (PCNN) methods. Our findings, thus, suggest that the proposed method is an effective and efficient approach to the challenging problem of crop classification using FSR imagery (including from different remotely sensed platforms). More importantly, the OSVM-OCNN method is readily generalisable to other landscape classes and, thus, should provide a general solution to the complex FSR image classification problem.

In the proposed approach, image segmentation was first used to partition the agricultural landscape into basic crop patches (objects), on which the SVM and CNN models were respectively applied to allocate a label to each object. The outputs of the two models were combined subsequently through a rule-based fusion strategy according to the prediction probability output from the CNN. Such a fusion strategy allows the rectification of CNN predictions with low confidence using SVM predictions at the object level. The major contributions of this research can be summarised as follows: (1) the shallow-architecture SVM and the deep-architecture CNN were found, for the first time, to be complementary to each other in terms of crop classification at the object level; (2) a straightforward rule-based decision fusion strategy was developed to effectively fuse the results of the OSVM and OCNN. We investigated the effectiveness of the proposed approach over two study sites with heterogeneous agricultural landscapes in California, USA, using FSR UAVSAR and RapidEye imagery.
The remainder of this paper is organised into five sections: Section 2 elaborates the proposed methods in detail. Section 3 describes the study area and datasets and presents the model structure and experimental results. A thorough discussion of the observed results is made in Section 4, and the conclusions of this research are drawn in Section 5.

Overview of the Support Vector Machine (SVM)
The principle of the SVM is to determine an optimal classification hyperplane by which a maximum margin can be achieved to separate the dataset into a predefined number of classes [43]. For non-linearly separable data, a kernel function is usually adopted to map the input vectors into a higher-dimensional feature space through a mapping Φ(X).
Suppose there is a set of data (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) distributed in the multi-dimensional feature space X, where x_i denotes a sample vector with y_i ∈ {−1, +1} as the corresponding target. The hyperplane in the transformed space can be defined as:

$$\omega^{T}\Phi(x) + b = 0$$

where ω denotes the weight vector of the hyperplane, and b represents the offset of the hyperplane. The SVM cost function is defined as:

$$\min_{\omega,\, b,\, \varepsilon} J(\omega, b, \varepsilon) = \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{m}\varepsilon_{i}$$

subject to:

$$y_{i}\left(\omega^{T}\Phi(x_{i}) + b\right) \geq 1 - \varepsilon_{i}, \quad \varepsilon_{i} \geq 0, \quad i = 1, \ldots, m$$

where ε_i denotes the slack variables, and C refers to the penalty parameter used to control the trade-off between empirical risk and model complexity.
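To make the formulation concrete, the following minimal sketch fits a soft-margin RBF SVM of exactly this form using scikit-learn; the feature matrix and labels are random placeholders, not data from this study.

```python
# Minimal sketch (not the authors' code): soft-margin RBF SVM as formulated
# above. X holds per-object feature vectors and y the crop labels; both are
# hypothetical placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))          # e.g., mean + std of 9 feature bands
y = rng.integers(0, 10, size=200)       # e.g., 10 crop classes

# C is the penalty parameter; gamma shapes the RBF kernel mapping Phi(x).
svm = SVC(kernel="rbf", C=1000.0, gamma=0.1, probability=True)
svm.fit(X, y)
probs = svm.predict_proba(X[:5])        # per-class membership probabilities
```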

Overview of Convolutional Neural Networks (CNNs)
The CNN is a feed-forward neural network that includes an input layer, multiple hidden layers and an output layer, which are connected to each other with the output of the previous layer being the input of the next layer. High-level features contained in the raw data are extracted gradually through implementation of both a convolutional layer and a pooling/subsampling layer. To learn nonlinear representations of input data, a nonlinear activation function (e.g., sigmoid, rectified linear units) is adopted [31]. In general, the operations performed in a CNN can be summarised as:

$$O_{l} = \mathrm{pool}_{p}\left(\sigma\left(O_{l-1} \ast w_{l} + b_{l}\right)\right)$$

where O_{l−1} represents the input to the lth layer, w_l and b_l are the weights and biases of the layer, respectively, σ(·) indicates the non-linearity function and the symbol ∗ denotes linear convolution; a pooling operation (pool_p) with a window size p is often performed following the convolution operation to extract invariant features of the input map, forming the output (O_l) of the current (lth) layer. The feature maps outputted by the last pooling layer are then flattened into a one-dimensional array and classified using logistic regression (LR). A softmax activation function is employed in the LR so that the prediction probabilities of the output units across classes sum to one.
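As an illustration, the sketch below assembles one convolution-pooling block of the form above, followed by the flattening and softmax LR layer, using Keras; the patch size, band count and class count are hypothetical.

```python
# Minimal sketch (Keras, hypothetical sizes): one block implementing
# O_l = pool_p(sigma(O_{l-1} * w_l + b_l)), then flatten + softmax LR.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 5)),          # hypothetical 5-band image patch
    layers.Conv2D(32, 3, activation="relu"),  # sigma(O_{l-1} * w_l + b_l), ReLU
    layers.MaxPooling2D(pool_size=2),         # pool_p with window size p = 2
    layers.Flatten(),                         # flatten the final feature maps
    layers.Dense(10, activation="softmax"),   # LR output; probabilities sum to 1
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
```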

Hybrid Object-based SVM and CNN (OSVM-OCNN) Approach
We propose a novel hybrid object-based SVM and CNN (OSVM-OCNN) approach for crop classification from FSR remotely sensed imagery. In brief, the trained SVM and CNN models were used to predict the class of each segmented object, respectively, and a fusion strategy was applied subsequently to combine the two classifications to achieve the final classification map. Figure 1 shows the workflow of the presented OSVM-OCNN methodology, which comprises four steps, namely (1) image segmentation, (2) SVM and CNN model training, (3) SVM and CNN model inference and (4) decision fusion of SVM and CNN predictions, details of which will be elaborated in the following sections.
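The following sketch summarises the four-step workflow in compact form; the segmentation and training routines are passed in as hypothetical callables, since the concrete components are detailed in the sections below.

```python
def osvm_ocnn(image, segment, train_svm, train_cnn, labels, alpha):
    """Sketch of the OSVM-OCNN workflow; all callables are caller-supplied
    stand-ins, not the authors' implementation."""
    # (1) image segmentation into crop objects
    objects = segment(image)
    # (2) train the two sub-models on the labelled objects
    svm, cnn = train_svm(objects, labels), train_cnn(objects, image, labels)
    # (3) per-object inference with each sub-model
    svm_labels = [svm(obj) for obj in objects]
    cnn_preds = [cnn(obj) for obj in objects]     # (label, probability) pairs
    # (4) decision fusion: trust the CNN only when it is confident
    return [c if p >= alpha else s
            for (c, p), s in zip(cnn_preds, svm_labels)]
```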


Image Segmentation
Image segmentation is the fundamental step of the OSVM-OCNN as the prediction procedures of both the SVM and CNN modules are based on segmented image objects (Figure 1). In this research, the widely used multi-resolution segmentation (MRS) algorithm was adopted to partition the imagery into crop patches (i.e., objects) with spectrally and spatially homogeneous information [44]. For the fully polarimetric UAVSAR data, three raw linear polarizations (bands HH, HV, VV) together with polarimetric parameters from the Cloude-Pottier (entropy, anisotropy and alpha angle) and Freeman-Durden (fractions of double-bounce, single-bounce and volume scattering) decompositions [45,46] were combined as input data for image segmentation. As for the optical RapidEye imagery, all five multispectral bands (Blue, Green, Red, Red Edge and Near Infrared) were used as input for segmentation.
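MRS itself ships with the proprietary eCognition software, so the sketch below stacks the nine UAVSAR-derived layers and substitutes scikit-image's SLIC (version ≥ 0.19) as a stand-in segmenter purely for illustration; the random arrays are placeholders for the real bands.

```python
# Minimal sketch (not MRS): stack the nine layers and produce object labels
# with SLIC as a stand-in segmenter.
import numpy as np
from skimage.segmentation import slic

# Placeholders for HH, HV, VV plus the six Cloude-Pottier and Freeman-Durden
# decomposition parameters.
rng = np.random.default_rng(0)
stack = rng.random((500, 500, 9)).astype(np.float32)

# Each label in `segments` marks one object; parameter values are illustrative.
segments = slic(stack, n_segments=2000, compactness=0.1, channel_axis=-1)
print(segments.max() + 1)   # number of objects produced
```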


SVM and CNN Model Training
In this research, the radial basis function (RBF) SVM was selected owing to its capacity to address complicated non-linear classification problems [47]. The SVM model was trained using the spectral (or polarimetric) information within the segmented patches. Two types of feature were extracted from each object for classification, including the mean and standard deviation of feature bands. All these object-based hand-crafted features were fed into the SVM model for classification. Different from the SVM model, the image patches used to train the CNN model were extracted using a pre-defined square input window rather than segmented patches. The input window size and a range of parameters of the CNN model were tuned empirically, as detailed in Section 3.
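A minimal sketch of the two hand-crafted feature types (per-object mean and standard deviation of each feature band) might look as follows; the function and variable names are ours, not the authors'.

```python
import numpy as np

def object_features(stack, segments):
    """stack: (rows, cols, n_bands) image; segments: (rows, cols) object labels."""
    feats = []
    for obj_id in np.unique(segments):
        pixels = stack[segments == obj_id]        # (n_pixels, n_bands)
        # mean and standard deviation of every band within the object
        feats.append(np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)]))
    return np.asarray(feats)                      # (n_objects, 2 * n_bands)
```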
The trained SVM and CNN models were then used in the following model inference stage.

SVM and CNN Model Inference
At the model inference stage, the trained SVM was used directly to predict the label of each segmented object based on the hand-crafted features mentioned above. The inference procedure of the CNN model consists of two steps: the convolutional position of an object is first located to acquire the CNN input image patch; the label of the object is then predicted by applying the trained CNN model to the image patch at the located convolutional position. To acquire representative features of crop patches, the object convolutional position should be located at the centre of each object; in this research, the convolutional position of each object was determined by its geometric centroid [48]. Figure 2 provides two examples of locating the object convolutional position.
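A sketch of this two-step inference, with the convolutional position taken as the geometric centroid of the object's pixels, is given below; it assumes the window fits within the image bounds and uses hypothetical names.

```python
import numpy as np

def centroid_patch(stack, segments, obj_id, window):
    """Crop the CNN input window centred on the object's geometric centroid."""
    rows, cols = np.nonzero(segments == obj_id)
    r, c = int(rows.mean()), int(cols.mean())     # geometric centroid of the object
    half = window // 2                            # e.g., window = 40 for S1
    return stack[r - half:r + half, c - half:c + half]   # patch fed to the CNN
```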
For a specific object, its crop class is inferred by the trained CNN model; at the same time, the SVM model also allocates a class label to the object. Thus, each object has two predictions coming from the SVM and CNN models.


Decision Fusion of the SVM and CNN Models
For each object, the predictions of the SVM and CNN models are m-dimensional vectors P = (p_1, p_2, ..., p_m), where m is the number of classes, and each dimension x ∈ {1, 2, ..., m} denotes the predictive probability of the xth class. Ideally, the prediction probability should be 1 for the target class and 0 for the others. However, this is unlikely in consideration of the complexity of remotely sensed data. The probability for each class satisfies:

$$p_{x} \in [0, 1], \quad \sum_{x=1}^{m} p_{x} = 1$$

The SVM and CNN models simply classify each object into the class C with the maximum membership across all classes:

$$class(C) = \arg\max_{x \in \{1, \ldots, m\}} p_{x}$$

For a specific segmented object, the SVM model uses only the features that fall completely within the object (within-object information) for classification. As a result, objects with distinctive low-level features (e.g., light regions in Figure 3b) can be separated easily by the SVM, regardless of the size of objects. However, SVMs cannot identify accurately those objects with similar within-object features (e.g., dark regions in Figure 3b), due to the lack of contextual information in the classification process. In contrast, the CNN model can extract deep high-level features (between-object information) for classification and, thus, is superior to the SVM in identifying complex objects. Note that the CNN uses a pre-defined square input window to extract features and predict labels of objects. As a result, for a specific patch, there are two situations to consider: (1) if the size of the target object (e.g., small-sized) mismatches with the scale of the input window (i.e., a large area of other crop types appears as contextual information in the input window), the prediction probability of the object tends to be low (e.g., dark patches in Figure 3c); (2) if the input window covers only a homogeneous region, the probability tends to be large (e.g., light patches in Figure 3c). In light of the above-mentioned complementarities of the SVM and CNN, a rule-based fusion strategy can be presented to combine the two models for increased classification accuracy.
The fusion output gives credit to the CNN if its prediction probability is greater than or equal to a predefined threshold (α); otherwise, it trusts the output of the SVM. Assume an image is segmented into N objects. For a given segmented object O_i, where i = 1, 2, ..., N, the decision fusion strategy determines the class label class(O_i) of the object as follows:

$$class(O_{i}) = \begin{cases} class_{cnn}(O_{i}), & prob_{i}^{cnn} \geq \alpha \\ class_{svm}(O_{i}), & prob_{i}^{cnn} < \alpha \end{cases}$$

where class_cnn and class_svm denote the predictions of the CNN and SVM models, respectively, and prob_i^cnn represents the probability of the predicted class for object i achieved by the CNN model. Here, the threshold α is estimated using a grid search approach [49]; that is, the threshold yielding the greatest classification accuracy is regarded as the optimal α.
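A minimal sketch of this rule (hypothetical inputs: per-object OCNN labels and confidences, and OSVM labels) is:

```python
import numpy as np

def fuse(cnn_labels, cnn_probs, svm_labels, alpha):
    """Keep the OCNN label where its confidence reaches alpha, else use the OSVM's."""
    cnn_labels, svm_labels = np.asarray(cnn_labels), np.asarray(svm_labels)
    confident = np.asarray(cnn_probs) >= alpha    # prob. of the predicted class
    return np.where(confident, cnn_labels, svm_labels)
```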
To test the performance of the proposed OSVM-OCNN method, four benchmarks including the object-based SVM (OSVM), object-based CNN (OCNN), pixel-based SVM (PSVM) and pixel-based CNN (PCNN) were compared in this research.


Study Area and Data
In this research, two typical crop areas (Figure 4), S1 and S2, located in the middle of the Sacramento Valley in northern California, were selected as case study sites. California is one of the most productive agricultural states in the United States, accounting for about 15% of national crop receipts [50]. The two study sites are heterogeneous and differ from each other in crop composition, and are thus ideal for testing remote sensing image classification algorithms. Based on the Cropland Data Layer (CDL) produced by the United States Department of Agriculture (USDA) [51], 10 dominant crop classes were found within S1 (Table 1), including walnut, almond, alfalfa, hay, clover, winter wheat, corn, sunflower, tomato and pepper, and nine major crop classes (Table 1) in S2, namely walnut, almond, fallow, alfalfa, winter wheat, corn, sunflower, tomato and cucumber.

In S1, the Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR) image was captured on 29 August 2011 (the peak biomass stage). The UAVSAR, an airborne polarimetric interferometric radar system, operates in the L-band with a wavelength of 23.84 cm [52]. The range and azimuth pixel spacings in the single look complex imagery are 1.66 m and 1 m, respectively. The UAVSAR data used in S1 are in the GRD (georeferenced) format, in which the calibrated complex data were multilooked and projected to ground coordinates. The data have a fine spatial resolution of 5 m and a spatial extent of 3474 × 2250 pixels. No additional filtering was applied to the image, since multiplicative noise was reduced by the multilook procedure [53]. Three raw linear polarizations (HH, HV and VV), as well as six parameters (stated in Section 2.3.1) from the Cloude-Pottier and Freeman-Durden decompositions, were extracted for crop classification.
In S2, a cloud-free RapidEye image (Level 3A Ortho product) was acquired on 10 July 2016. RapidEye is a constellation of five satellites that are equally spaced in the same orbital plane, producing a ground sampling distance (GSD) of 6.5 m at nadir [54]. The RapidEye imagery used in S2 is an Ortho product, with sensor, radiometric and geometric corrections applied using level 1 digital terrain elevation data, and was delivered resampled to a spatial resolution of 5 m. The image employed in this research has a spatial extent of 3222 × 2230 pixels and five optical bands, namely blue (440-510 nm), green (520-590 nm), red (630-685 nm), red edge (690-730 nm) and near infrared (760-850 nm). To obtain surface reflectance, the image was atmospherically corrected using the atmospheric and topographic correction method supported by the ERDAS IMAGINE software.
We acquired sample points from the USDA CDL data by means of stratified random sampling. The CDL data are widely used as a ground reference owing to their very high quality [10,55]. Patches of major crop types in each site were outlined [10] and split randomly into two equal subsets, one used to generate training samples and the other to collect testing samples, so as to ensure that training and testing samples come from different crop patches. To acquire sufficient representative samples, the sample size for each crop class was set at around 200 over the two study sites (Table 1). A total of 2268 and 2020 samples were acquired for S1 and S2, respectively. Note that 80% of the training samples were used to train the individual classification methods and the remaining 20% (the validation set) were employed to select the optimal hyper-parameters of the classifiers.

Table 1. Crop classes, number of crop patches and sample sizes over the two study sites.

S1
Class          Patches  Training  Testing  Total
Walnut              31       112      112    224
Almond              33       110      110    220
Alfalfa             55       125      125    250
Hay                 26       101      101    202
Clover              41       110      110    220
Winter wheat        68       120      120    240
Corn                45       108      108    216
Sunflower           47       122      122    244
Tomato              58       120      120    240
Pepper              32       106      106    212

S2
Class          Patches  Training  Testing  Total
Walnut              39       108      108    216
Almond              45       115      115    230
Fallow              30        90       90    180
Alfalfa             35       124      124    248
Winter wheat        40       116      116    232
Corn                22        93       93    186
Sunflower           57       130      130    260
Tomato              63       141      141    282
Cucumber            21        93       93    186
To further test the generalisation of the proposed method, additional scenes of UAVSAR (03 October 2011) at S1 and RapidEye (07 September 2016) at S2 were acquired and preprocessed as described previously. Three linear polarizations (HH, HV and VV) of the UAVSAR and four spectral bands (i.e., blue, green, red, red edge) of the RapidEye were extracted, respectively, for crop classification.

Segmentation Parameter
We implemented the multi-resolution segmentation (MRS) algorithm in eCognition Developer [56]. Three control parameters, namely scale, colour/shape and smoothness/compactness, were tuned by means of a systematic trial-and-error process. A relatively small value of the scale parameter was set to produce a small amount of over-segmentation, thus assuring the homogeneity of the segmented objects. The optimal combinations of image segmentation parameters over the two study sites are summarised in Table 2.

Model Structure and Parameter Settings
The object-based SVM (OSVM) model involves two major parameters that need to be pre-defined, the penalty parameter (C) and the kernel parameter (γ), each of which has been shown to influence model outputs [57]. The former determines the trade-off between model complexity and training error, while the latter controls the shape of the hyperplane. To search for the best parameters for the model, a grid search on C and γ with exponentially growing sequences (i.e., $10^{-2}, 10^{-1}, \ldots, 10^{3}$) using five-fold cross-validation was performed [49]. The optimal combination of parameters over both study sites was found to be C = 1000 and γ = 0.1, with which the OSVM delivered the best classification results.
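A sketch of this grid search with scikit-learn, using random placeholder training data, follows; the paper reports C = 1000 and γ = 0.1 as the optimum.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 18))              # placeholder object features
y_train = rng.integers(0, 10, size=300)           # placeholder crop labels

param_grid = {"C": [10.0**k for k in range(-2, 4)],      # 10^-2 ... 10^3
              "gamma": [10.0**k for k in range(-2, 4)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # five-fold CV
search.fit(X_train, y_train)
print(search.best_params_)
```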
For the object-based CNN (OCNN) model, a range of pre-defined parameters need to be tuned, including the input window size, the number of layers and the number of convolutional filters. The input window size of the OCNN was determined through cross-validation from a series of window sizes {24 × 24, 32 × 32, 40 × 40, 48 × 48, 56 × 56, 64 × 64}, and 40 × 40 and 32 × 32 were found to be the optimal sizes for S1 and S2, respectively. To balance network complexity and generalization ability, the number of network layers was tuned to six (Figure 5), and a 2 × 2 max pooling layer following each convolutional layer was used to further generalise the extracted features. The other parameters were designated as follows: the filter size was 3 × 3 for the convolutional layers (except for the first layer, which was 5 × 5); the number of filters in each convolutional layer was 32; and the learning rate and the number of epochs were 0.01 and 500, respectively, to fully extract the high-level features contained in the images. The cross-entropy loss was employed as the objective function. To train the entire network, mini-batch stochastic gradient descent with a batch size of 20 samples was adopted to minimise the loss function. The CNN was built using the Keras library with a TensorFlow backend.
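One plausible Keras realisation of this network for S1 is sketched below; the exact arrangement of the six layers is our reading (three convolution-pooling blocks), since the paper does not spell out the layer order, and the data-loading step is omitted.

```python
from tensorflow.keras import layers, models, optimizers

ocnn = models.Sequential([
    layers.Input(shape=(40, 40, 9)),            # 40 x 40 window, 9 UAVSAR layers
    layers.Conv2D(32, 5, activation="relu"),    # 5 x 5 filters in the first layer
    layers.MaxPooling2D(2),                     # 2 x 2 max pooling after each conv
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),     # 10 crop classes in S1
])
ocnn.compile(optimizer=optimizers.SGD(learning_rate=0.01),
             loss="categorical_crossentropy", metrics=["accuracy"])
# ocnn.fit(patches, labels, batch_size=20, epochs=500)  # training data assumed
```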

Pixel-wise Classifiers and Their Parameters
The RBF SVM model was used for traditional pixel-wise SVM classification. The two control parameters (C and γ) were optimised using a "grid-search" approach as mentioned above [49], and the optimal combination of parameters was found to be 100 and 1.
The traditional pixel-wise CNN also requires a pre-defined series of control parameters. The input window size was selected from {16 × 16, 24 × 24, 32 × 32, 40 × 40 and 48 × 48} and 24 × 24 was found to be the optimal patch size at both the S1 and S2 sites. The number of layers was tuned to six and the number of filters at each convolutional layer was set to 32. The size of convolutional filters was 5 × 5 for the first convolutional layer and 3 × 3 for the other layers, the same as for the OCNN. The learning rate and the maximum number of iterations were designated as 0.01 and 500, respectively.

Decision Fusion Parameters
A rule-based decision fusion approach was performed based on the OCNN's prediction probability and the classification results of both the OSVM and OCNN models. As mentioned above, the parameter of the decision fusion rule was optimised by a grid search approach through cross-validation. The optimal threshold (α) was found to be 0.98 at S1 and 0.91 at S2.
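A sketch of this threshold search on a validation set is given below; the grid spacing and variable names are assumptions.

```python
import numpy as np

def best_alpha(cnn_labels, cnn_probs, svm_labels, y_true,
               grid=np.arange(0.50, 1.0001, 0.01)):
    """Return the fusion threshold alpha that maximises validation accuracy."""
    cnn_labels, svm_labels = np.asarray(cnn_labels), np.asarray(svm_labels)
    cnn_probs, y_true = np.asarray(cnn_probs), np.asarray(y_true)

    def accuracy(alpha):             # accuracy of the fused labels at this alpha
        fused = np.where(cnn_probs >= alpha, cnn_labels, svm_labels)
        return np.mean(fused == y_true)

    return max(grid, key=accuracy)
```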


Classification Maps and Visual Assessment
The classification maps achieved by the OSVM-OCNN were examined at both study sites. We compared the new OSVM-OCNN method with its two sub-models (OSVM and OCNN), as well as with the PSVM and PCNN. To provide a clear visualization, Figures 6 and 7 show subsets of the classification maps over the two study sites for visual inspection. It is clear that the PSVM produced undesirable results (salt-and-pepper noise), as demonstrated in Figures 6 and 7. Moreover, tomato and pepper, as well as walnut and almond, were frequently misclassified as each other, as shown in Figure 6a,c. The PCNN, however, has certain advantages over the PSVM in discriminating these crop classes with similar spectral characteristics. For example, as illustrated by Figures 6c and 7a, walnut and alfalfa were better distinguished from almond and tomato, respectively, in comparison to the PSVM classifications. Additionally, the salt-and-pepper noise was reduced to some extent due to the use of contextual information. Nevertheless, salt-and-pepper noise still existed in the PCNN classifications (especially the UAVSAR-based classification), and the misclassifications between pepper and tomato and between walnut and almond were still present, as illustrated in Figures 6 and 7.

Figure 6. Three representative subsets (a-c) from the UAVSAR imagery with the corresponding classification maps; the first column shows the UAVSAR images (bands VV, HV and HH), and the following columns illustrate the classification maps achieved by the PSVM, PCNN, OSVM, OCNN and the proposed OSVM-OCNN, respectively; regions with correct and incorrect classification results are labelled with yellow and red circles, respectively.
In contrast to the pixel-wise SVM and CNN, the classification maps generated by the object-based SVM and CNN exhibited a very smooth visual appearance, and the salt-and-pepper noise was removed, as shown in Figures 6 and 7. The classification of fruit crops (walnut and almond), forage crops (alfalfa and hay) and summer crops (corn, tomato and pepper) was also improved to some extent, as shown by the yellow circles in Figures 6 and 7. Specifically, parts of the tomato class were misclassified by the OCNN, whereas these areas were accurately classified by the OSVM (Figure 6a). In contrast, the OSVM was less accurate than the OCNN when identifying hay and tomato (Figure 6b,c). Similarly, the OSVM was more accurate than the OCNN in identifying wheat and tomato, while the OCNN showed certain advantages over the OSVM in discriminating alfalfa, walnut and cucumber (Figure 7).

When checking the classification maps of the OSVM-OCNN, most of the aforementioned misclassifications of the OSVM and OCNN were revised while keeping the smoothness of the classifications. For example, the OSVM-OCNN corrected the misclassifications of the OSVM for pepper (Figure 6a) and for sunflower and walnut (Figure 7), benefitting from the accurate classification of the OCNN. Moreover, the OSVM-OCNN revised the classification errors of the OCNN for tomato (Figures 6a and 7b) and wheat (Figure 7a). More importantly, some mutual misclassifications between the OSVM and OCNN were effectively resolved. For example, as illustrated in Figure 6b,c, some wheat and walnut patches were misclassified as hay and almond, respectively, in both the OSVM and OCNN classifications; however, they appeared at different places, and nearly all the mislabelled patches were rectified when combining the two classification results using the decision fusion strategy provided in this research.

Classification Accuracy Assessment
In addition to visual assessment, we further investigated the classification accuracy of the proposed OSVM-OCNN and the other benchmark methods, including the PSVM, PCNN, OSVM and OCNN, over the two study sites. Tables 3 and 4 list the detailed classification accuracies of the methods in S1 and S2 using the overall accuracy (OA), Kappa coefficient (κ) and per-class mapping accuracy. As shown in the tables, the OSVM-OCNN acquired the greatest OA of 90.74% at S1 and 86.63% at S2, with κ of 0.90 and 0.85, respectively, consistently greater than the OCNN (86.86% and 81.68% OA with κ of 0.85 and 0.79, respectively) and the OSVM (86.42% and 81.39% OA with corresponding κ of 0.85 and 0.79, respectively). The increase in classification accuracy was much more conspicuous when compared with the pixel-wise classifiers, such as the PCNN (81.31% and 79.11% OA with κ of 0.79 and 0.76, respectively) and the PSVM (72.75% and 70.20% OA with corresponding κ of 0.70 and 0.66, respectively). In addition, a McNemar test for pair-wise comparison further demonstrated that the proposed OSVM-OCNN achieved significantly increased classification accuracy in comparison with the PSVM and PCNN, as well as the OSVM and OCNN, with z-values of 12.56, 7.44, 4.35 and 4.92 in S1 and 10.76, 5.63, 6.57 and 4.32 in S2, respectively (Table 5). However, there was no significant difference between the OSVM and OCNN classifications over either study site, despite the OAs of the OCNN being slightly higher than those of the OSVM.

Table 3. Overall accuracy and per-class accuracy achieved by the PSVM, PCNN, OSVM, OCNN and OSVM-OCNN methods with the UAVSAR image in S1; the greatest classification accuracy per row is highlighted in bold font.

The superiority of the OSVM-OCNN method was also checked with class-wise accuracy assessment (Tables 3 and 4). As shown in the tables, the OSVM-OCNN achieved the most accurate class-wise classification for most of the crop types in S1 and all types in S2. The largest increase was up to 8.70% for pepper in S1 and 10.27% for almond in S2, when compared with the OCNN. The accuracy increase was also notable for sunflower (7.66%) and tomato (6.28%) in S1 and fallow (8.45%) and winter wheat (8.26%) in S2. In comparison to the OSVM, most crop classes in S1 and all classes in S2 were classified with greater accuracy by the OSVM-OCNN. Specifically, walnut exhibited the greatest increase in accuracy over both study sites, up to 11.75% at S1 and 11.87% at S2. As for winter wheat, sunflower and tomato in S1, the accuracy of the OSVM-OCNN was slightly less than that of the OSVM, without significant differences. The accuracy increase of the OSVM-OCNN tended to be more obvious in comparison to the PSVM and PCNN: the OSVM-OCNN was consistently superior to the PCNN and PSVM at the class-wise level, with the largest increases up to 18.58% and 24.81% for winter wheat and hay in S1 and 16.11% and 26.11% for almond and walnut in S2, respectively. Among the four benchmark methods themselves (i.e., the PSVM, PCNN, OSVM and OCNN), the OCNN achieved the greatest accuracy, followed by the OSVM and PCNN, while the PSVM was the least accurate. In S1, the two object-based methods (OSVM and OCNN) were significantly more accurate than the two pixel-wise methods (PSVM and PCNN), as demonstrated by the McNemar test (Table 5).
In S2, the accuracies of the OSVM and OCNN were significantly greater than that of the PSVM (z = 7.43 and 7.40, respectively), but only slightly (about 2%) greater than that of the PCNN, with no significant difference (z = 1.61 and 1.80, respectively). Between classifiers of the same type, the PCNN performed significantly more accurately than the PSVM (z = 5.98 and 5.88, respectively), while there was no significant difference between the OSVM and OCNN (z = 0.35 and 0.21, respectively) at either study site, as shown in Table 5.
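For reference, one common form of the McNemar test for paired classifiers, computed from the discordant counts on a shared test set, is sketched below; whether the authors applied a continuity correction is not stated, so this exact form is an assumption.

```python
import numpy as np

def mcnemar_z(pred_a, pred_b, y_true):
    """z-score from samples that exactly one of the two classifiers gets right."""
    pred_a, pred_b, y_true = map(np.asarray, (pred_a, pred_b, y_true))
    a_only = np.sum((pred_a == y_true) & (pred_b != y_true))  # A right, B wrong
    b_only = np.sum((pred_a != y_true) & (pred_b == y_true))  # B right, A wrong
    return (a_only - b_only) / np.sqrt(a_only + b_only)       # |z| > 1.96: p < 0.05
```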

Crop Classification Using Additional Scenes
The proposed OSVM-OCNN method and the other benchmark comparators were also validated using the additional scenes of UAVSAR and RapidEye imagery at the S1 and S2 study sites. The classification accuracy assessment, including the overall accuracy (OA) and Kappa coefficient (κ), is summarised in Table 6. The OA and κ at both study sites accord with the previous experimental results: the hybrid OSVM-OCNN achieved the greatest OA, of 70.28% at S1 and 76.44% at S2, consistently larger than the two sub-modules (OSVM and OCNN), the PCNN and the PSVM (Table 6). Such consistency in classification accuracy further demonstrates the generalisability of the proposed method.

Influence of the Decision Fusion Parameter
In this subsection, the contribution of the decision fusion parameter α (i.e., the threshold on the prediction probability of the OCNN model) in combining the classification results of the two sub-modules (OSVM and OCNN) is investigated (Figure 8). Figure 8a shows the relationship between the parameter α and the final (fused) classification accuracy in S1 (dots in orange) and S2 (dots in blue), respectively, whereas Figure 8b illustrates the area percentage of the OCNN predictions retained in the fused classification map as a function of α over the two study sites. From Figure 8a, it can be seen that, although there was a difference in accuracy between the two sites resulting from the different types of remotely sensed images, the general tendencies in overall accuracy as influenced by α over S1 and S2 were similar: the accuracy increased continuously until reaching the maximum (α = 0.98 in S1 and α = 0.91 in S2), and then tended to decrease with further increases in α. Thus, α = 0.98 and α = 0.91 were found to be the optimal decision fusion parameters in S1 and S2, respectively. From Figure 8b, it is clear that when α was small, OCNN predictions dominated the fused outputs with little contribution from the OSVM; on the contrary, too large a value of α resulted in a rapid decrease in the area percentage of CNN predictions, leading to a sharp decrease in overall accuracy (Figure 8a). As α initially approached the optimal value, the CNN predictions with low confidence were gradually replaced by accurate SVM predictions, resulting in a rapid increase in accuracy (Figure 8a). The selection of the optimal α value, thus, clearly demonstrates the complementary properties of the two sub-modules exploited by the proposed decision fusion strategy.


Discussion
Accurate classification of FSR remotely sensed images is considered a major challenge within the remote sensing community [57]. Combination of different classifiers is an effective means to solve the complex FSR image classification problem, where single classifiers should be as unique as possible, so as to produce different decision boundaries [38]. However, traditional classifier fusion methods by integrating classifiers at the pixel level are unsuitable for processing FSR imagery, given the potential for large amounts of noise (see the salt-and-pepper noise in the PSVM classifications, Figures 6 and 7).
In this research, a novel method (OSVM-OCNN) was proposed for the first time by fusing the outputs of the object-based SVM (OSVM) and CNN (OCNN) at the object level for crop classification from FSR images. The OSVM determines the decision boundaries among classes based entirely on low-level within-object information (e.g., spectral, polarimetric, texture; [24]). In such a manner, the OSVM can identify objects with salient spectral properties (i.e., light regions in Figure 3b), but has difficulty handling objects with similar within-object information (e.g., the misclassifications between the two types of forage crops, alfalfa and hay; Figure 6b). This is due mainly to the unavailability of high-level between-object information. In fact, a large crop parcel is normally segmented into several small objects due to strong spectral and spatial variation. If only the within-object information is utilised, some of these segmented objects might be misclassified. However, if the between-object (contextual) information is also taken into account, sufficiently representative information can be achieved for the objects, thus markedly increasing the chance of correctly identifying them. The OCNN extracts hierarchical features from images via an input window using multiple convolution and pooling operations [24]; thus, both low-level and high-level features are incorporated into the classification process. However, with a fixed input window, the OCNN is incapable of accurately extracting the key within-object information of particular objects (e.g., small-sized and linearly shaped objects) due to the mismatch between the observational scale of the OCNN and the scale of the objects themselves. For example, as shown in Figure 3c, the OCNN's prediction probability for some small-sized objects (usually with distinctive within-object information) tends to be relatively low. In fact, as a state-of-the-art deep learning classifier, the OCNN is especially distinguished in representing spatial contextual (i.e., between-object) information, whereas the OSVM is superior in extracting within-object information. As a consequence, the shallow-structured OSVM and the deep-structured OCNN have intrinsically complementary characteristics in terms of remotely sensed image classification, as illustrated by Figure 8a. It should be noted that the incorporation of both within- and between-object information is normally necessary to identify and classify complex landscapes. This explains why the proposed hybrid OSVM-OCNN method consistently and significantly outperformed its sub-modules (the OSVM and OCNN) as well as the traditional pixel-wise classifiers (the PSVM and PCNN) over both study sites (Table 5, Figures 6 and 7).
Searching for the optimal parameter combination of decision fusion rules is a tedious and time-consuming process [41]. In the proposed OSVM-OCNN, a novel decision fusion strategy was developed to integrate the two sub-models, primarily based on the prediction probability of the OCNN in consideration of its superiority in image classification. That is, the OCNN is regarded as the base classifier, and it is given credit as long as the key information of the target object is acquired (i.e., a high prediction probability); otherwise, the prediction of the OSVM is trusted. The combination of the two classifiers (OSVM and OCNN), therefore, represents a new rule-based decision fusion strategy that incorporates this key principle. Such a fusion strategy, which captures exactly the complementarity between the two sub-modules even with different types of data (optical and SAR images), is straightforward and efficient in comparison to previous methods (in which two or more parameters are usually employed, e.g., [38,58]), since only one parameter (α) is required. Moreover, there are some other parameters that need to be finely tuned, including those used in the sub-modules and in image segmentation. The control parameters of the SVM and CNN can be tuned relatively easily according to previous research. In contrast, the parameters of segmentation algorithms are usually hard to determine. In the MRS image segmentation algorithm, the scale parameter is considered the most important, as it directly controls the relative size of the segmented objects. In practice, it is almost impossible to select an optimal scale value that accurately segments all of the ground patches with their boundaries retained completely, so a relatively small value is always the preferred alternative (e.g., [24]). Taking the UAVSAR experiment as an example, the impact of segmentation on the overall accuracy of the proposed method is illustrated in Figure 9. It can be seen from the figure that the OSVM-OCNN consistently outperformed the two sub-modules, regardless of how the scale parameter was tuned. The scale parameter selected in this research (i.e., scale = 25), which achieves a small amount of over-segmentation, is suitable for crop classification. If the value is too small (e.g., scale = 20 in Figure 9), one crop patch may be partitioned into many very small objects; and if it is too large (e.g., scale = 30 in Figure 9), one segmented object may contain many crop patches. Obviously, both cases exert a negative impact on the classification results (Figure 9). Therefore, the segmentation parameters selected in this research by trial and error are close to optimal. Algorithms that automatically determine segmentation parameters (e.g., [59]) could be integrated into the proposed method in future research.
The proposed hybrid OSVM-OCNN approach achieved promising crop classification results for FSR images. In fact, the proposed method, which makes full use of both within-object and between-object feature representations, has wide potential applicability for a range of complex classification tasks (e.g., mangroves, [60]; land use, [38]). The proposed classification method, therefore, provides a general solution to the complex FSR image classification problem. It should be mentioned that the effectiveness of the OCNN, a sub-model of the OSVM-OCNN, is constrained by its fixed-size input window, as stated previously. A variable-size input window that adjusts dynamically according to the size of objects, thus, deserves to be introduced to the OCNN. This will be investigated in detail in future research.

Conclusions
In this research, a novel hybrid method (OSVM-OCNN) was proposed by fusing a shallow-structured object-based SVM (OSVM) and a deep-structured object-based CNN (OCNN) at the object level for crop classification from FSR imagery. The OSVM has advantages in extracting low-level within-object features, while the OCNN is remarkable in generalising high-level between-object information. The proposed OSVM-OCNN method, thus, captures the complementary characteristics of both the OSVM and OCNN models through a set of rules with only one fusion parameter required; the two sub-models were thereby combined in a concise and effective manner. We investigated the effectiveness of the proposed method over two study sites with distinctive crop compositions using two types of FSR images (UAVSAR and RapidEye), respectively. The OSVM-OCNN consistently achieved the most accurate classification results in comparison to the two sub-models (i.e., OSVM and OCNN), as well as the standard pixel-wise SVM (PSVM) and CNN (PCNN). Thus, we conclude that the presented OSVM-OCNN method is an effective and efficient approach for accurate crop classification (and classification of other complex landscapes) using FSR remotely sensed images, and that it is suitable for different types of FSR remotely sensed imagery.
