Precise Crop Classification of Hyperspectral Images Using Multi-Branch Feature Fusion and Dilation-Based MLP

The precise classification of crop types using hyperspectral remote sensing imagery is an essential application in the field of agriculture, and is of significance for crop yield estimation and growth monitoring. Among deep learning methods, Convolutional Neural Networks (CNNs) are the premier models for hyperspectral image (HSI) classification owing to their outstanding local contextual modeling capability, which facilitates spatial and spectral feature extraction. Nevertheless, existing CNNs have fixed kernel shapes and are limited to restricted receptive fields, which makes it difficult to model long-range dependencies. To tackle this challenge, this paper proposes two novel classification frameworks, both built from multilayer perceptrons (MLPs). Firstly, we put forward a dilation-based MLP (DMLP) model, in which a dilated convolutional layer replaces the ordinary convolution of the MLP, enlarging the receptive field without losing resolution and keeping the relative spatial positions of pixels unchanged. Secondly, the paper proposes a feature fusion network combining multi-branch residual blocks and DMLP after principal component analysis (PCA), called DMLPFFN, which makes full use of the multi-level feature information of the HSI. The proposed approaches are evaluated on two widely used hyperspectral datasets, Salinas and KSC, and two practical crop hyperspectral datasets, WHU-Hi-LongKou and WHU-Hi-HanChuan. Experimental results show that the proposed methods outshine several state-of-the-art methods, outperforming CNN by 6.81%, 12.45%, 4.38% and 8.84%, and outperforming ResNet by 4.48%, 7.74%, 3.53% and 6.39% on the Salinas, KSC, WHU-Hi-LongKou and WHU-Hi-HanChuan datasets, respectively. These results confirm that the proposed methods offer remarkable performance for precise hyperspectral crop classification.


Introduction
Hyperspectral imaging instruments can capture rich spectral signatures and intricate spatial information of observed scenes [1]. The plentiful spectral signatures and spatial information of hyperspectral images (HSIs) offer great potential for fine crop classification [2,3] and detection [4,5], since hyperspectral remote sensing can capture spectral characteristics and their differences more comprehensively and meticulously than panchromatic remote sensing [6]. Therefore, this paper uses hyperspectral techniques to finely classify crops and to promote specific applications of hyperspectral techniques in agricultural remote sensing, such as monitoring agricultural development and optimizing the management of the agricultural industry.
Many methods have been applied to hyperspectral image classification in recent years. Early-stage classification methods include the support vector machine (SVM) [7], random forest (RF) [8], multiple logistic regression [9] and the decision tree [10], which can provide promising results. The main contributions of this paper are summarized as follows:
1. MLP, as a less constrained network, can eliminate the negative effects of translation invariance and local connectivity. Therefore, this paper modifies the MLP by combining it with dilated convolution to fully obtain the spectral-spatial features of each sample and improve HSI remote sensing scene classification performance; the resulting model is called DMLP. The dilated convolutional layer replaces the ordinary convolution of the MLP, which enlarges the receptive field without losing resolution and keeps the relative spatial positions of pixels unchanged.
2. This paper composes multi-branch residual blocks and DMLP to form a multi-level feature fusion network, called DMLPFFN. Firstly, the residual structure can retain the original characteristics of the HSI data and avoid the problems of gradient explosion and gradient disappearance in the training process. In addition, DMLP can improve the feature extraction capability of the residual blocks and strengthen the model with essential features while retaining the original features of the hyperspectral data. In DMLPFFN, three branches of features are fused to obtain a feature map with more comprehensive information, which integrates the spectral information, spatial context information, spatial feature information and spatial location information of the HSI to improve classification accuracy.
3. Comprehensive experiments are designed and executed to prove the effectiveness of DMLPFFN on different hyperspectral datasets. DMLPFFN achieves better classification performance and generalization ability for fine crop classification.
The rest of this article is organized as follows. Section 2 describes the proposed classification approach in detail. Section 3 reports the experimental results and evaluates the performance of the proposed method. The application of the model to fine crop classification is given in Section 4. Section 5 analyzes how to choose the experimental parameters of DMLPFFN, and Section 6 gives the conclusion.

The Proposed MLP-Based Methods for HSI Classification
Figure 1 shows the overall framework of the proposed DMLPFFN for HSI classification, taking the WHU-Hi-LongKou dataset as an example. First, principal component analysis (PCA) is applied to the original HSI to reduce its spectral dimension, which weakens the Hughes phenomenon and decreases the burden of model training. Then, the DMLP is structured by replacing the normal convolution in the local perceptron module of the MLP with dilated convolution, which aggregates contextual information without losing feature map resolution and thus improves the classification performance on hyperspectral features.
In addition, DMLPFFN combines residual blocks of different sizes with DMLP to obtain three feature extraction branches, which fuse three different levels of features and yield feature maps with more comprehensive information. In DMLPFFN, multiscale features of the HSI are extracted by hierarchical different-scale feature extraction branches at different stages of the network. The low-level feature extraction branch of DMLPFFN extracts texture information such as the color and edges of ground objects, the middle-level branch extracts regional information and the high-level branch extracts semantic information with DMLP. Feature fusion is then performed by element-wise summation of the results of the three branches, which yields feature maps with more comprehensive information. Then, global average pooling transforms the feature maps into feature vectors, and the classification results are subsequently obtained by the softmax function.
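The PCA preprocessing step described above can be sketched in a few lines of NumPy. This is a minimal illustration of reducing the spectral dimension of an HSI cube; the band and component counts below are toy values, not the settings used in the paper:

```python
import numpy as np

def pca_reduce(hsi, n_components):
    """Reduce the spectral dimension of an HSI cube (H, W, B) to n_components."""
    H, W, B = hsi.shape
    X = hsi.reshape(-1, B).astype(np.float64)   # one row per pixel
    X -= X.mean(axis=0)                          # centre each band
    cov = np.cov(X, rowvar=False)                # band covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # sort by descending variance
    components = eigvecs[:, order[:n_components]]
    return (X @ components).reshape(H, W, n_components)

# Toy cube: 8 x 8 pixels, 20 bands, reduced to 3 principal components
cube = np.random.default_rng(0).normal(size=(8, 8, 20))
reduced = pca_reduce(cube, 3)
print(reduced.shape)  # (8, 8, 3)
```

The first component retains the largest share of spectral variance, which is why a handful of components suffice as network input.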
Figure 2 shows the overall architecture of the proposed DMLP for HSI classification. The network consists of the global perceptron module, the partition perceptron module and the local perceptron module. Since the MLP has a more powerful representation than convolution, we propose DMLP to accurately represent the feature location information and retain spatial resolution without loss of detail information.

The Proposed Dilation-Based MLP (DMLP) for HSI Classification

The Global Perceptron Module Block
It is assumed that the HSI dataset is of size H × W × nBand, where H and W represent the spatial height and width, and nBand is the number of bands. First, each pixel of the hyperspectral image is processed with a fixed window of size y × x, and a single sample with a shape of y × x × nBand is generated. The global perceptron uses shared parameters for different partitions, reducing the number of parameters required for computation and increasing the connection and correlation between the partitions. The global perceptron module block consists of two branches. The first branch splits up the input hyperspectral feature image: the hyperspectral feature map changes from (H1, W1, C1) to (h1, w1, O), where H1, W1 and C1 indicate the height, width and number of channels of the input hyperspectral feature map, and h1, w1 and O respectively represent the height, width and number of output channels of the split hyperspectral feature image.
In the second branch, the original feature map (H1, W1, C1) is average pooled, and the size of the hyperspectral feature map becomes (h, w, O), where h and w indicate the height and width of the hyperspectral feature image after average pooling. The second branch uses h and w to obtain a pixel for each hyperspectral feature image, and then feeds them through batch normalization (BN) and a two-layer MLP: the hyperspectral feature map (h, w, O) is sent to a BN layer and two fully connected layers. The Rectified Linear Unit (ReLU) function is introduced between the two fully connected layers to effectively avoid gradient explosion and gradient disappearance. For a fully connected layer with input X(in) and output X(out), the kernel W ∈ R^(Q×P) defines the matrix multiplication (MMUL) as follows:

X(out) = MMUL(X(in), W) = X(in) · W^T.

The hyperspectral vector is transformed into (1, 1, C1) by the BN layer and the two fully connected layers. Then, the hyperspectral feature images are obtained after all branches are added. Next, we directly feed the input hyperspectral feature into the partition perceptron and local perceptron without splitting.
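The two-layer MLP of the global perceptron reduces to two matrix multiplications with a ReLU in between. A minimal NumPy sketch (BN omitted; all sizes illustrative, not the paper's configuration):

```python
import numpy as np

def mmul_fc(x_in, weight):
    """Fully connected layer as matrix multiplication (MMUL):
    x_in has P features per row, weight is (Q, P), output has Q features."""
    return x_in @ weight.T

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))       # 4 pooled feature vectors, P = 16
w1 = rng.normal(size=(32, 16))     # first FC layer: 16 -> 32
w2 = rng.normal(size=(16, 32))     # second FC layer: 32 -> 16
hidden = np.maximum(mmul_fc(x, w1), 0.0)  # ReLU between the two FC layers
out = mmul_fc(hidden, w2)
print(out.shape)  # (4, 16)
```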

The Partition Perceptron Module Block
The partition perceptron module block contains a BN layer and a group convolution. The input of the partition perceptron is (h, w, O). After the BN layer and group convolution processing, (h, w, O) is restored to the original hyperspectral feature input size (H1, W1, C1). The output hyperspectral feature Y(out) ∈ R^(C1×H1×W1) is obtained by the group convolution with padding p, where p is the number of padded pixels, F ∈ R^(C1/g×K×K) is the convolution kernel and g indicates the number of convolution groups.
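Group convolution splits the channels into g groups and filters each group independently, which cuts the parameter count by a factor of g. A minimal NumPy sketch using 1 × 1 kernels (so each group reduces to a per-pixel matrix multiply); the sizes are illustrative only:

```python
import numpy as np

def group_conv_1x1(x, kernels, g):
    """Grouped 1x1 convolution.
    x: (C, H, W); kernels: list of g arrays, each (C_out_per_group, C/g)."""
    C, H, W = x.shape
    per = C // g
    outs = []
    for i in range(g):
        xi = x[i * per:(i + 1) * per].reshape(per, -1)    # (C/g, H*W)
        outs.append((kernels[i] @ xi).reshape(-1, H, W))  # filter this group only
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 5, 5))                          # 8 input channels
kernels = [rng.normal(size=(2, 4)) for _ in range(2)]   # g = 2 groups
y = group_conv_1x1(x, kernels, g=2)
print(y.shape)  # (4, 5, 5)
```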

The Local Perceptron Module Block
To enhance the extraction of high-level semantic information from hyperspectral feature maps without multiplying the calculation parameters, the local perceptron module introduces a dilated convolutional layer [25] and a BN layer. First, the local perceptron module simultaneously sends the segmented hyperspectral feature image (h, w, O) to the dilated convolution layer. Then, the feature map is fed into the BN layer. Finally, the outputs of all convolution branches and the partition perceptron are summed to form the final result.
Specifically, the dilated convolutional layer stacks odd-even mixed dilation rates in each chain, resulting in an expanded receptive field. In addition, under the premise of the same receptive field, dilated convolution with an increased dilation rate consumes fewer training parameters than extending the receptive field with a large convolution kernel. The size of the dilated convolution kernel and the receptive field are calculated by Formulas (4) and (5), respectively:

f_n = f_k + (f_k − 1) × (D_r − 1), (4)

l_m = l_{m−1} + (f_n − 1) × ∏_{i=1}^{m−1} S_i, (5)

where f_k represents the size of the original convolution kernel; f_n represents the size of the dilated convolution kernel; D_r represents the dilation rate; l_{m−1} represents the receptive field size of layer (m − 1); l_m is the receptive field size of layer m after the convolution; and S_i represents the stride of layer i. The equivalent fully connected (FC) kernel of a dilated convolution kernel is the result of the convolution on an identity matrix with proper reshaping. Formula (6) shows exactly how to build W^(F,p) from F and p.
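Formulas (4) and (5) can be evaluated directly. A short sketch that plugs in a 3 × 3 kernel with dilation rate 2 (values illustrative):

```python
def dilated_kernel_size(f_k, d_r):
    """Formula (4): effective size of a kernel of size f_k with dilation d_r."""
    return f_k + (f_k - 1) * (d_r - 1)

def receptive_field(l_prev, f_n, strides):
    """Formula (5): receptive field after one more layer with kernel size f_n,
    given the strides of all preceding layers."""
    prod = 1
    for s in strides:
        prod *= s
    return l_prev + (f_n - 1) * prod

# A 3x3 kernel with dilation rate 2 behaves like a 5x5 kernel,
# and grows a 3-pixel receptive field to 7 pixels at stride 1.
f_n = dilated_kernel_size(3, 2)
l = receptive_field(l_prev=3, f_n=f_n, strides=[1])
print(f_n, l)  # 5 7
```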
W^(F,p) = reshape(DilatedConv(I, F, p), (Chw, Ohw))^T, (6)

where I is an identity matrix reshaped into Chw input feature maps. Stacking multiple dilated convolutions may lead to a gridding effect, as shown in Figure 3: some pixels between the sampled positions are never touched, resulting in the loss of local information and undermining the continuity of information. Considering the gridding effect, the design of the dilation rates in the DMLP model proposed in this paper follows Equation (7).
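The identity-matrix construction behind Formula (6) can be checked numerically. The sketch below is a simplified one-dimensional version under our own assumptions (hand-rolled "same"-padded dilated convolution, toy sizes): each column of the FC kernel W is the network's response to one unit impulse, so by linearity the matrix product reproduces the convolution exactly.

```python
import numpy as np

def dilated_conv1d(x, f, d):
    """'Same'-padded 1D dilated convolution (correlation) with kernel f."""
    k = len(f)
    pad = d * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([sum(f[j] * xp[i + j * d] for j in range(k))
                     for i in range(len(x))])

n, f, d = 7, np.array([1.0, 2.0, 1.0]), 2
# Column i of W is the conv response to the i-th unit impulse (identity row).
W = np.stack([dilated_conv1d(np.eye(n)[i], f, d) for i in range(n)], axis=1)

x = np.arange(n, dtype=float)
assert np.allclose(W @ x, dilated_conv1d(x, f, d))  # FC kernel and conv agree
```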

M_i = max[M_{i+1} − 2r_i, 2r_i − M_{i+1}, r_i], (7)

where r_i is the dilation rate of layer i and M_i is the maximum dilation rate of layer i. Mixed dilated convolution requires that the stacked dilation rates cannot have a common divisor greater than 1. As shown in Figure 4, in this paper, the method of mixed parity dilation rates is used to expand the convolution kernel, and the dilation rates are set to the cyclic structure [1,2,5], which can cover every pixel on the image to avoid information loss.
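The [1,2,5] cycle can be sanity-checked with a short script: stacking 3-tap dilated convolutions and recording which input offsets can reach an output pixel shows that mixed-parity rates leave no holes, whereas rates sharing a common divisor skip pixels. This is our own illustrative check, not code from the paper:

```python
def covered_offsets(rates):
    """Input offsets that reach the centre output after stacking 3-tap
    dilated convolutions with the given dilation rates."""
    offsets = {0}
    for r in rates:
        offsets = {o + t for o in offsets for t in (-r, 0, r)}
    return offsets

def has_gridding(rates):
    """True if some pixel inside the receptive field is never touched."""
    cov = covered_offsets(rates)
    span = sum(rates)  # half-width of the stacked receptive field
    return any(o not in cov for o in range(-span, span + 1))

print(has_gridding([1, 2, 5]))  # False: mixed parity covers every pixel
print(has_gridding([2, 2, 2]))  # True: common divisor 2 skips odd pixels
```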


The Proposed DMLPFFN Model for HSI Classification
Features at various levels contain diverse information. The lower-level features contain rich spatial structure information, but their high resolution leads to weak global background information. The higher-level features have rich semantic information and can effectively classify hyperspectral images, but their poor resolution lacks spatial details [26]. For this reason, fusing these different levels of feature information can significantly strengthen the classification accuracy of hyperspectral images. This paper proposes DMLPFFN, which extracts sufficiently different levels of features by fusing three feature extraction branches, as shown in Figure 1.

Fusion of Multi-Branch Features
As the layers of the network deepen, the feature information obtained during the feature extraction of the convolutional network will differ for each branch. Figure 5 shows the structure of the residual blocks with DMLP, called the adjacent edge low-level feature extraction branch (the left branch in Figure 1), which is used to obtain texture characteristic information such as the color and border of the ground target.
The residual block is introduced to connect each layer to other layers in a feed-forward fashion. According to the structure of the residual unit, x represents the input, H(x) represents the output and F(x) represents the residual function. The residual unit carries out identity mapping of the input at each layer from top to bottom, and the features of the input are learned to form the residual function [27]. Then, the output of the residual unit becomes H(x) = F(x) + x. Therefore, the residual function can deal with more advanced abstract features when the number of network layers increases, and is easier to optimize. The calculation process of the residual unit is shown in Formula (8):

F(x) = W_2 σ(W_1 x), (8)

where σ stands for the nonlinear function ReLU and W_1 and W_2 are the weights of layer 1 and layer 2, respectively. Then, the residual unit goes through a shortcut and a second ReLU layer to obtain the output H(x):

H(x) = σ(F(x) + x). (9)

When the dimension sizes of the input and the output need to be changed, a linear transformation W_s can be performed in the shortcut operation, as shown in Formula (10):

H(x) = σ(F(x) + W_s x). (10)

By stacking multiple residual blocks, the extracted features become increasingly discriminative. Then, we connect the output of the residual block to the input of the DMLP. X(in) and X(out) represent the input and output, and the kernel W ∈ R^(Q×P) is the matrix multiplication (MMUL) defined as follows:

X(out) = MMUL(X(in), W) = X(in) · W^T.

This structure extracts more abstract features and discards redundant information through the DMLP module. The introduction of DMLP brings fewer parameters and higher operational efficiency and speed compared to simply increasing the depth of the residual network. In addition, it improves the global feature learning capability and the nonlinearity of the model, resulting in a better abstract representation.
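Formulas (8)-(10) can be condensed into a few lines. A minimal NumPy sketch of one residual unit (biases and BN omitted; dimensions illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2, w_s=None):
    """F(x) = W2 * relu(W1 * x)  (Formula (8));
    H(x) = relu(F(x) + shortcut) (Formulas (9)/(10)).
    When dimensions change, the shortcut applies a linear map W_s."""
    f = w2 @ relu(w1 @ x)
    shortcut = x if w_s is None else w_s @ x
    return relu(f + shortcut)

rng = np.random.default_rng(3)
x = rng.normal(size=16)
w1, w2 = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
h = residual_unit(x, w1, w2)
print(h.shape)  # (16,)

# With all-zero weights the unit reduces to relu(x): the identity shortcut
# is what keeps information (and gradients) flowing through deep stacks.
z = np.zeros((16, 16))
assert np.allclose(residual_unit(x, z, z), relu(x))
```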
The middle-level branch focuses on extracting regional information with a similar structure of a low-level feature extraction branch. Middle-level features focus more on regional features than lower-level features, which is of great significance to the extraction of spatial structure features of HSIs. The high-level branch uses DMLP to extract global features, which keeps the relative spatial position of pixels unchanged and obtains the context information of the HSIs.
In detail, assume that O_1, O_2 and O_3 refer to the outputs of the low-, middle- and high-level feature extraction branches, which have 16, 32 and 64 feature maps, respectively. The resultant maps of the three branches are then convolved with 64 kernels of size 1 × 1. By means of such convolution operations, the numbers of feature maps of O_1, O_2 and O_3 all become 64. Eventually, feature fusion can be conveniently performed by element summation as follows:

T = Pooling(f_1(O_1) + f_2(O_2) + f_3(O_3)),

where T represents the fused features, f_1, f_2 and f_3 are the dimension matching functions and Pooling is the global averaging function. The proposed DMLPFFN model enhances the resemblance between the same hyperspectral feature objects and the variability between different objects to accomplish high-precision classification of crop species.
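The fusion step above can be sketched in NumPy: each 1 × 1 convolution acts as a per-pixel matrix multiply that matches channel counts, the three maps are summed element-wise, and global average pooling yields the feature vector. Patch size and weights below are illustrative, not the trained model's:

```python
import numpy as np

def fuse_branches(branches, proj):
    """Match channel counts with 1x1 convolutions (per-pixel matmuls),
    sum element-wise, then global-average-pool to a feature vector."""
    mapped = []
    for o, w in zip(branches, proj):      # w: (64, C_in) plays the 1x1 kernel
        c, h, wdt = o.shape
        mapped.append((w @ o.reshape(c, -1)).reshape(64, h, wdt))
    fused = sum(mapped)                    # element-wise summation of branches
    return fused.mean(axis=(1, 2))         # global average pooling -> (64,)

rng = np.random.default_rng(4)
# O1, O2, O3 with 16, 32 and 64 feature maps on a 9x9 patch
branches = [rng.normal(size=(c, 9, 9)) for c in (16, 32, 64)]
proj = [rng.normal(size=(64, c)) for c in (16, 32, 64)]
t = fuse_branches(branches, proj)
print(t.shape)  # (64,)
```

The resulting 64-dimensional vector would then feed the softmax classifier.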

Feature Output Visualization and Analysis
In order to better analyze the characteristics of feature extraction of DMLPFFN, this paper visualizes the feature maps of different branches, as shown in Figure 6.
Figure 6b-d shows the feature output plots for the adjacent edge low-level feature extraction branch, the localized region middle-level feature extraction branch and the global extent high-level feature extraction branch, respectively. As shown in the red frame in Figure 6b, detailed features such as the edges and textures of trees and farmland are highlighted. Figure 6c shows that the crop regionality is enhanced and this branch extracts the regional information of the image. In Figure 6d, the global and abstract nature of the extracted features of the image is more apparent. In summary, Figure 6 shows the difference of the extracted features in each branch, and it is necessary to fuse multi-branch features to fully exploit the spatial and spectral features of the HSI.

Public HSI Dataset Description
In order to verify the effectiveness of the proposed method, classification experiments were performed on two standard hyperspectral datasets (Salinas and KSC) [28,29]. The details of each dataset are as follows. The Salinas dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas Valley, California, and consists of 512 × 217 pixels and 224 spectral reflectance bands. The number of bands was reduced to 204 by removing the bands covering the water-absorption region (108-112, 154-167, 224). The ground truth contains 16 types of land cover. The KSC dataset was collected by the AVIRIS sensor flying over the Kennedy Space Center in Florida. The number of spectral bands is 176, and the size is 512 × 614 pixels with 13 categories.
Tables 1 and 2 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite images and ground truth maps.
The number of spectral bands is 176, and the size is 512 × 614 pixels with 13 categories. Tables 1 and 2 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite image and ground truth map. up by AVIRIS sensors flying over the Kennedy Space in Florida. The number of spectral bands is 176, and the size is 512 × 614 pixels with 13 categories. Tables 1 and 2 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite image and ground truth map. bands is 176, and the size is 512 × 614 pixels with 13 categories. Tables 1 and 2 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite image and ground truth map. bands is 176, and the size is 512 × 614 pixels with 13 categories. Tables 1 and 2 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite image and ground truth map. Tables 1 and 2 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite image and ground truth map. Tables 1 and 2 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite image and ground truth map.

Experimental Parameter Setting
All experiments were performed on an Intel(R) Xeon(R) 4208 CPU @ 2.10 GHz processor and an Nvidia GeForce RTX 2080Ti graphics card. To reduce experimental errors, the model randomly selected a limited number of samples from the training set for training, and all experimental results were averaged over 10 runs. Overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (K) were used as evaluation indexes to measure the performance of each method. The initial learning rate was 0.1 and was divided by 10 when the error plateaued. The networks were trained for 200 epochs (2 × 10^4 iterations) with a minibatch size of 100, a weight decay of 0.0001 and a momentum of 0.9.
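All three evaluation indexes can be derived from a single confusion matrix. The following pure-Python sketch illustrates the computation; it is our illustration rather than the authors' code, and `hsi_metrics` is a hypothetical helper name:

```python
def hsi_metrics(conf):
    """Compute OA, AA and Kappa from a square confusion matrix.

    conf[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(conf)
    n = sum(sum(row) for row in conf)
    # OA: fraction of all samples on the diagonal
    oa = sum(conf[i][i] for i in range(k)) / n
    # AA: mean of per-class recall (diagonal / row total)
    per_class = [conf[i][i] / sum(conf[i]) for i in range(k)]
    aa = sum(per_class) / k
    # Kappa: agreement corrected for chance; pe is the expected
    # agreement from the row and column marginals
    pe = sum(sum(conf[i]) * sum(conf[j][i] for j in range(k))
             for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

The same quantities are available in common libraries (e.g. scikit-learn's `cohen_kappa_score`); the sketch only makes the definitions explicit.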

Comparison of the Proposed Methods with the State-of-the-Art Methods
The experiment compares the classification performance of the proposed DMLP and DMLPFFN algorithms with the Radial Basis Function Support Vector Machine (RBF-SVM) [30], the Extended Morphological Profile SVM (EMP-SVM) [31], a Convolutional Neural Network (CNN) [32], a Residual Network (ResNet) [33], the MLP-Mixer, RepMLP [34] and the Deep Feature Fusion Network (DFFN) [35] on the hyperspectral datasets. Ten percent of the total sample number was used for training, as shown in Tables 3-5. Compared with the other methods, the DMLPFFN method proposed in this paper achieves the highest classification accuracy on both datasets. To fully analyze the effect of the water-absorption bands on the experimental results, we downloaded the Salinas dataset with the water-absorption bands from the official website and conducted an experimental analysis on it. As shown in Table 4, compared with RBF-SVM, the OA, AA and Kappa coefficients of DMLPFFN increased by 16.89%, 15.10% and 14.61%, and improved by 3.80%, 3.15% and 3.81% compared with DFFN, respectively. All the experimental results show that the proposed DMLPFFN is superior to the other methods on the Salinas dataset with the water-absorption bands.
Besides the quantitative classification results, we also visualized the classification maps of the methods discussed above, as shown in Figures 7 and 8. It can be seen that the RBF-SVM result has the most misclassified pixels of all the classification maps, with salt-and-pepper noise throughout, and classification confusion is unavoidable in every part. Taking the Salinas dataset as an example, in Figure 7b-d a large amount of noise is generated in the upper left corner, and part of the Vinyard_untrained area was misclassified as Grapes_untrained. The classification confusion between Grapes_untrained and Fallow_rough_plow in the middle part is serious. Compared with the SVM, CNN and ResNet classification methods, the classification effect of MLP-Mixer, RepMLP and DFFN is improved, but some misclassification remains. Figure 7i,j show the classification maps of our algorithms; an obvious observation is that the classification map of the proposed method is the closest to the reference ground truth, producing less internal noise and cleaner boundaries. The experiments show that the proposed methods can effectively extract more refined features from both datasets, and the cross-dimensional information interaction focuses on the more important features, thus improving the classification accuracy.

Application in Fine Classification of Crops
In order to verify the classification performance and generalization ability of the DMLP and DMLPFFN, the WHU-Hi-LongKou and WHU-Hi-HanChuan hyperspectral datasets were selected in this paper for fine crop classification [36,37].
The WHU-Hi-LongKou dataset covers a simple agricultural area and was captured by an 8 mm focal length Headwall Nano-HyperSpec sensor mounted on a DJI Matrice 600 Pro UAV platform, with six kinds of crops. The image size is 550 × 400 pixels, with 270 bands between 400 and 1000 nm. The WHU-Hi-HanChuan dataset was collected in HanChuan, Hubei Province, using a 17 mm focal length Headwall Nano-HyperSpec sensor installed on a Leica Aibot X6 UAV V1 platform. The trial area contains seven kinds of crops and has a size of 1217 × 303 pixels with 274 bands ranging from 400 to 1000 nm. Tables 6 and 7 report the detailed number of pixels available in each class for the two datasets, respectively, and show the false-color composite image and ground truth map.

In the LongKou dataset, soybean occupies a prominent position, and its plots are continuous and extensive. Sesame and cotton are interlaced around the corn planting fields. As shown in Table 8, among all the compared methods, the proposed DMLPFFN achieves the highest OA, AA and Kappa coefficient, reaching 99.16%, 98.59% and 96.88, respectively. Compared with RBF-SVM, EMP-SVM, CNN, ResNet, MLP-Mixer, RepMLP, DFFN and DMLP, its OA increased by 10.00%, 6.95%, 4.38%, 3.53%, 2.84%, 1.58%, 1.19% and 0.91%, respectively.
As shown in Table 9, the HanChuan dataset contains only a small number of soybean samples (1335 pixels), which affects the classification performance of the various algorithms. For soybean, which other algorithms classify poorly, the two algorithms proposed in this paper reach accuracies of 93.37% and 94.16%, indicating that DMLP and DMLPFFN are suitable for separating similar ground objects. The proposed methods effectively address spectral variation and heterogeneity within the same object. The experimental results of the different classification algorithms are shown in Figures 9 and 10 for the HanChuan and LongKou datasets, respectively. As shown in Figure 9, a large amount of salt-and-pepper noise remains in the RBF-SVM and EMP-SVM results. The CNN and ResNet results show that the noise is greatly reduced once contextual information is considered. The ResNet, MLP-Mixer, RepMLP and DFFN results show that, in the middle region of the dataset, a large portion of the strawberry, cowpea, soybean and water oat samples are incorrectly assigned to other categories. This is because sowing at the edge of a field is less compact than sowing in its center: the sparse distribution of plants exposes bare land, so the networks misclassify crops at the margins of some plots. Moreover, soybeans and cowpeas are crops of the same origin and exhibit highly similar spectral properties over a certain wavelength range, which further burdens the classification. Nevertheless, our approaches produce barely any misclassification of dense plants, either in the marginal areas or in the centers of the plots, indicating that our method effectively discriminates between crop classes confused by spectral variation.
The color and edge features extracted from the low-level branch enable one to distinguish more conveniently between different types of crops and amplify the differences between different crops, whereas the regional features extracted in the middle-level branch lead to more apparent boundaries between crops in different places and perform better identification of crop areas and non-crop areas. The global features extracted at the high level minimize the clutter between sophisticated backdrops and crops to a certain extent and provide a better assessment of the overall crop area. Multi-level feature fusion can sufficiently extract and leverage the feature information of crops and fine classifications of them. Consequently, DMLPFFN is considered suitable for fine crop classification.

Discussion
In order to find the optimal network structure, it is necessary to experiment with different parameters, which play a crucial role in the size and complexity of the proposed DMLPFFN. In this paper, the optimal parameter combination is determined by analyzing the influence of each parameter on the classification accuracy, including the number of principal components, the expansion rate of the dilated convolution, the percentage of training samples and the number of branches in the feature fusion strategy.

The Number of Principal Components
The first parameter is the number of principal components retained when applying PCA to the HSI, which extracts the main spectral components to improve the algorithm's efficiency and reduce noise interference. A controlled-variable approach was used for all datasets: the number of training samples, the expansion rate and the deep feature fusion strategy were held fixed. As shown in Figure 11, OA increased and then stabilized as the number of principal components grew for the four HSI datasets. Most of the information in a hyperspectral image resides in the first few principal components; using many principal components did not further improve performance.
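As a concrete illustration, PCA on an HSI cube reduces the band dimension before spatial patches are extracted. The following numpy sketch is our illustration under that assumption, not the authors' code; `pca_reduce` is a hypothetical helper name:

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project an HSI cube of shape (H, W, B) onto its first
    n_components principal components along the band axis."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(float)
    X -= X.mean(axis=0)                       # center each band
    # SVD of the centered data; rows of Vt are the principal directions,
    # ordered by decreasing singular value (explained variance)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return (X @ Vt[:n_components].T).reshape(H, W, n_components)
```

Covariance-based eigendecomposition gives the same result; SVD on the centered data is simply the numerically stabler route.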

The Expansion Rate of Dilated Convolution
The second parameter is the distribution of the expansion rate. In this experiment, seven structures with expansion rate distributions of [1,1,2], [2,2,2], [1,2,2], [1,2,3], [1,2,4], [1,2,5] and [1,2,6] were selected for comparative analysis, as shown in Figure 12. Comparing the experimental results, the classification accuracy of the expansion rate distribution [1,1,2] is lower than that of [1,2,2]. The receptive field size of [1,1,2] is 9 × 9. With [2,2,2], although the receptive field increases to 13 × 13, the classification accuracy is lower than the average overall accuracy of [1,1,2]. This is because the superposition of three dilated convolutions causes more feature information to be omitted.
The latter five experiments combined one ordinary convolution layer with two dilated convolution layers, giving receptive fields of 11 × 11, 13 × 13, 15 × 15, 17 × 17 and 19 × 19. Although [1,2,6] has the largest receptive field, as the expansion rate increases the input sampling becomes sparser and sparser, resulting in local information loss and damaged information continuity. According to the experimental results in Figure 12, the four HSI datasets obtain the optimal classification results when the expansion rate distribution is [1,2,5].
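These receptive field sizes follow from a simple recurrence: each stride-1 3 × 3 convolution with dilation rate d enlarges the receptive field by 2d pixels. A short sketch (`receptive_field` is a hypothetical helper, not from the paper):

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of stride-1 convolutions with the
    given dilation rates; each layer adds (kernel - 1) * d pixels."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf
```

For example, `receptive_field([1, 1, 2])` gives 9 and `receptive_field([1, 2, 5])` gives 17, matching the 9 × 9 and 17 × 17 fields discussed here.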

The Percentage of Training Samples
The third parameter is the proportion of training samples to the total number of samples. We carried out experiments on the practical crop hyperspectral datasets LongKou and HanChuan, as shown in Figure 13, selecting 0.4%, 0.6%, 0.8%, 1.0% and 1.2% of the samples of each dataset for training. At first, the classification accuracy increased with the number of training samples. When the training sample proportion of the LongKou and HanChuan datasets reached 1.2%, the OA value essentially peaked and then flattened or even declined. Once the number of training samples is sufficient to represent the distribution of all pixels in the studied area, further increasing it does not improve the classification accuracy. Therefore, 1.2% was chosen as the percentage of training samples, and the proposed DMLPFFN method consistently outperformed the comparison methods.

The Number of Branches in Feature Fusion Strategy
The fourth parameter is the number of branches in the feature fusion strategy. This paper analyzes the correlation and complementarity of information in the deep network using multi-branch feature fusion. DMLPFFN2, DMLPFFN3, DMLPFFN4 and DMLPFFN5 refer to methods that fuse two, three, four and five hierarchical branches, respectively; DMLPFFN2 represents the fusion of the lower-level and higher-level branches. It can be seen from Figure 14 that, across the datasets, DMLPFFN3 obtains precision values superior to DMLPFFN2, DMLPFFN4 and DMLPFFN5. Taking the LongKou dataset as an example, compared with DMLPFFN2, the OA, AA and Kappa values of the DMLPFFN3 fusion strategy increased by 3.05%, 9.5% and 1.97%, respectively. This is because the features extracted by DMLPFFN2 contain only detail and global information, and regional feature information is dropped. To some extent, fusing multiple layers improves the classification results. However, DMLPFFN5 has the lowest classification accuracy, which shows that too many fusion layers may introduce redundant information and significantly reduce performance; in particular, middle-level information overlap can cause accuracy degradation. Therefore, the DMLPFFN method proposed in this paper uses three branches for feature fusion, with the structure shown in Figure 1.
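One common way to realize such a multi-branch fusion is to pool each level's feature map to a common spatial size and concatenate along the channel axis. The numpy sketch below is our illustration under that assumption (`fuse_branches` is a hypothetical helper), not the authors' implementation:

```python
import numpy as np

def fuse_branches(feats):
    """Fuse feature maps of shape (C_i, H_i, W_i) from different levels:
    average-pool each to the smallest spatial size, then concatenate
    along the channel axis."""
    h = min(f.shape[1] for f in feats)
    w = min(f.shape[2] for f in feats)
    pooled = []
    for f in feats:
        C, H, W = f.shape
        fh, fw = H // h, W // w
        # non-overlapping average pooling down to (h, w)
        p = f[:, :fh * h, :fw * w].reshape(C, h, fh, w, fw).mean(axis=(2, 4))
        pooled.append(p)
    return np.concatenate(pooled, axis=0)
```

With three branches of 8, 16 and 32 channels, the fused map has 56 channels at the coarsest branch's resolution; a 1 × 1 convolution or fully connected classifier would typically follow.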

The Number of Classes for HSI Classification
We conducted experiments on the KSC dataset with different numbers of classes. The KSC dataset was acquired by the AVIRIS sensor over the Kennedy Space Center in Florida; it has 176 spectral bands and a size of 512 × 614 pixels with 13 classes. Table 10 shows the OA, AA and Kappa values of the DMLPFFN method when the number of classes is 10, 11, 12 and 13. The results show that classification accuracy decreases as the number of classes is reduced, and the highest accuracy is achieved with all 13 original classes. This indicates that if the classes of the experimental dataset are not utilized in full, the accuracy of the experimental results will decrease.
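The OA, AA and Kappa metrics reported in Table 10 follow the standard definitions based on a confusion matrix. A minimal sketch (the toy confusion matrix values are made up for illustration):

```python
import numpy as np

def oa_aa_kappa(conf):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa
    from a confusion matrix whose rows are true classes."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                          # fraction classified correctly
    aa = (np.diag(conf) / conf.sum(axis=1)).mean()   # mean per-class recall
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy two-class confusion matrix (illustrative values only).
print(tuple(round(v, 3) for v in oa_aa_kappa([[4, 1], [2, 3]])))  # (0.7, 0.7, 0.4)
```

Kappa discounts the agreement expected by chance (pe), which is why it drops faster than OA when class predictions are imbalanced.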

Time Consumption and Computational Complexity
In order to comprehensively compare the methods proposed in this paper with current research methods, this paper analyzes the average training time, average test time and total number of parameters of the different methods. Table 11 reports the time consumption and computational complexity of each method. In terms of running time, taking the HanChuan dataset as an example, although DMLP has a larger receptive field that extracts more delicate features and therefore consumes more training time than RepMLP, its total parameters are reduced by 22.98%. Moreover, compared with ResNet and the MLP-Mixer, the training time of DMLP is reduced by 59.65% and 15.15%, respectively, while achieving better classification accuracy. The results also show that, compared with ResNet, DMLP and DMLPFFN have fewer parameters on all datasets. Compared with CNN and the MLP-Mixer, the proposed method has slightly more parameters because of its greater depth and width, but its accuracy is the highest. Furthermore, compared with DFFN, DMLPFFN has a shorter training time on all four datasets because DMLPFFN improves training efficiency by combining the fusion strategy with MLP. Taking the LongKou dataset as an example, the training and test times of DMLPFFN are reduced by 25.20% and 25.28%, respectively, compared with DFFN. In addition, among all deep learning methods, DMLPFFN has the lowest training and test times after CNN, while achieving better OA than the other classification algorithms.
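The percentage reductions quoted above (e.g., 22.98% fewer parameters, 59.65% less training time) follow the usual relative-reduction formula. A minimal sketch; the baseline and proposed counts below are made-up numbers, not values from Table 11:

```python
def percent_reduction(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline

# Hypothetical parameter counts chosen only to reproduce the 22.98% figure.
print(round(percent_reduction(1_000_000, 770_200), 2))  # 22.98
```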

Conclusions
In this paper, two MLP-based classification frameworks are proposed: DMLP and DMLPFFN. Firstly, in order to enlarge the receptive field and aggregate multi-branch contextual information without losing feature map resolution, we introduced a dilated convolution layer in place of ordinary convolution. Secondly, to fully exploit the features of HSI and improve classification efficiency, we fuse residual blocks with the DMLP mechanism to extract deeper features and obtain state-of-the-art performance. Finally, we designed and executed comprehensive experiments on different hyperspectral datasets to demonstrate the effectiveness of DMLPFFN and to show that it has better classification performance and generalization ability for agricultural classification.
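The effect of dilation can be illustrated with a minimal 1-D example (a sketch only; the paper's networks operate on 2-D feature maps): a kernel of size k with dilation d spans a receptive field of d·(k−1)+1 input samples while keeping just k parameters.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1-D convolution whose taps are spaced `dilation` samples apart."""
    k = len(w)
    span = dilation * (k - 1) + 1  # effective receptive field in input samples
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(10.0)
w = np.array([1.0, 1.0, 1.0])
# With k = 3 and dilation = 2, each output sees 5 consecutive input positions
# even though the kernel still has only 3 weights.
print(dilated_conv1d(x, w, dilation=2))
```

Setting dilation back to 1 recovers ordinary convolution, which is why the dilated layer is a drop-in replacement that enlarges the receptive field without downsampling.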
The proposed DMLP and DMLPFFN were tested on two public datasets (Salinas and KSC) and two real HSI datasets (LongKou and HanChuan). Compared with the classical methods (RBF-SVM and EMP-SVM) and deep learning-based methods (CNN, ResNet, MLP-Mixer, RepMLP and DFFN), the experiments show that the proposed DMLP algorithm