A Dual Neural Architecture Combining SqueezeNet with OctConv for LiDAR Data Classification

Light detection and ranging (LiDAR) is a frequently used data acquisition technique that is widely applied in diverse practical applications. In recent years, deep convolutional neural networks (CNNs) have shown their effectiveness for LiDAR-derived rasterized digital surface model (LiDAR-DSM) data classification. However, many excellent CNNs have too many parameters due to their depth and complexity. Meanwhile, traditional CNNs suffer from spatial redundancy because different convolution kernels scan and store information independently. SqueezeNet replaces a part of the 3 × 3 convolution kernels in CNNs with 1 × 1 convolution kernels, decomposes the original single convolution layer into two layers, and encapsulates them into a Fire module. This structure reduces the number of parameters of the network. Octave Convolution (OctConv) first pools some feature maps and stores them separately from the feature maps of the original size. It can reduce spatial redundancy by sharing information between the two groups. In this article, in order to improve the accuracy and efficiency of the network simultaneously, the Fire modules of SqueezeNet are used to replace the traditional convolution layers in OctConv, forming a new dual neural architecture: OctSqueezeNet. Our experiments, conducted on two well-known LiDAR datasets against several classical state-of-the-art classification methods, show that the proposed classification approach based on OctSqueezeNet provides competitive advantages in terms of both classification accuracy and computational cost.


Introduction
Light detection and ranging (LiDAR) technology is an active remote sensing measurement technology which obtains ground object information by emitting laser pulses at the target [1]. Compared with optical imaging and infrared remote sensing, LiDAR can obtain high-resolution three-dimensional spatial point clouds, elevation models, and other raster-derived data independently of weather conditions through its laser beam [2,3]. The data adopted in this article are LiDAR-derived rasterized digital surface models (LiDAR-DSM), which are obtained by rasterizing the point cloud data acquired by the LiDAR system [4]. There is a large body of research on LiDAR-DSM data classification. Priestnall et al. examined methods for extracting surface features from DSM produced by LiDAR [5]. Song et al. modified the crown shape in spectrum by using DSM [6]. Zhou et al. used minimum description length (MDL) and morphology to recognize buildings [7]. Zhou combined LiDAR height and intensity data to accurately map urban land cover [8].

SqueezeNet Design Architecture
The main objective of SqueezeNet is to maintain competitive accuracy with few parameters. To achieve this goal, three main strategies were adopted. First, part of the 3 × 3 filters were replaced by 1 × 1 filters, which have fewer parameters. Second, the number of input channels to the 3 × 3 filters was reduced. Third, subsampling operations were performed in the later stages of the network so that the convolution layers have large activation maps. SqueezeNet drew on the idea of the Inception module [23] to design a Fire module with a squeeze layer and an expand layer. The structure of the Fire module is shown in Figure 1. In order to reduce the number of channels of the input elements, the squeeze layer uses a 1 × 1 convolution kernel to compress the input. The expand layer uses 1 × 1 and 3 × 3 convolution kernels for multi-scale learning and concatenates their outputs.
The operation process of the Fire module is shown in Figure 2. The size of the input feature maps is h × w × n. First, the input feature maps pass through the squeeze layer, producing output feature maps of size h × w × s1: the spatial size is unchanged, but the number of channels is reduced from n to s1. The output feature maps of the squeeze layer are then sent to the 1 × 1 and 3 × 3 convolution kernels of the expand layer, respectively, and the convolution results are concatenated, so that only the number of channels changes, to e1 + e3. In order for the output activations of the 1 × 1 and 3 × 3 filters of the expand layer to have the same height and width, zero padding of 1 pixel is applied to the input of the 3 × 3 filters. A Rectified Linear Unit (ReLU) is applied to the activations of the squeeze layer and the expand layer. Meanwhile, there is no fully connected layer in SqueezeNet.
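To make the Fire-module bookkeeping concrete, here is a minimal NumPy sketch (not the authors' implementation; the naive convolution loop and the channel numbers s1 = 4, e1 = e3 = 8 are illustrative assumptions):

```python
import numpy as np

def conv2d(x, w, pad=0):
    """Naive 2-D convolution over an (H, W, C_in) input with (k, k, C_in, C_out) weights."""
    if pad:
        x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    k = w.shape[0]
    H, W = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = x[i:i + k, j:j + k, :]  # (k, k, C_in) window
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def fire_module(x, s1, e1, e3, rng):
    """Fire module: 1x1 squeeze to s1 channels, then parallel 1x1/3x3 expand, concatenated."""
    n = x.shape[2]
    relu = lambda z: np.maximum(z, 0.0)
    sq = relu(conv2d(x, rng.standard_normal((1, 1, n, s1))))            # h x w x s1
    ex1 = relu(conv2d(sq, rng.standard_normal((1, 1, s1, e1))))         # h x w x e1
    ex3 = relu(conv2d(sq, rng.standard_normal((3, 3, s1, e3)), pad=1))  # zero padding keeps h x w
    return np.concatenate([ex1, ex3], axis=2)                           # h x w x (e1 + e3)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))            # h x w x n
y = fire_module(x, s1=4, e1=8, e3=8, rng=rng)
print(y.shape)                                  # (8, 8, 16): spatial size unchanged, channels e1 + e3
```

As in the text, the spatial size is preserved throughout, and only the channel count changes, from n to s1 and then to e1 + e3.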

Octave Convolution
Natural images can be decomposed into low spatial frequency components and high spatial frequency components. Similarly, the feature maps of the convolutional layers can also be decomposed into feature components of different spatial frequencies. The low frequency components are used to describe the structure with the smooth changes, and the high frequency components are used to describe the fine details of the fast changes.
As shown in Figure 3, the high- and low-frequency components of the image are written as X^H and X^L, and their corresponding outputs after the convolution operation are Y^H and Y^L, where α ∈ (0, 1) represents the ratio of low-frequency channels. The upper and lower arrows indicate that feature maps of the same frequency update their own information by convolution, while the crossed arrows exchange information between the two frequencies through pooling, upsampling, and addition operations. In the convolution operation, the responses Y^H come both from X^H and from X^L, that is, W^H = [W^{H→H}, W^{L→H}]. W^{H→H} is a traditional convolution, with input and output of the same size; W^{L→H} first upsamples the input and then performs a traditional convolution, while W^{H→L} first pools the input. As shown in Equations (1) and (2), the high- and low-frequency feature maps are stored in different groups. Sharing information between adjacent locations allows the spatial resolution of the low-frequency group, and hence the spatial redundancy, to be reduced.
The specific calculation process is written as Equations (3) and (4).
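The octave update (intra-frequency convolutions plus cross-frequency exchange) can be sketched in NumPy as follows; 1 × 1 convolutions stand in for general kernels, and α = 0.25 with the channel counts below are illustrative assumptions, not values from the article:

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling on an (H, W, C) tensor."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling on an (H, W, C) tensor."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(x, w):
    """1x1 convolution = per-pixel channel mixing; w has shape (C_in, C_out)."""
    return x @ w

def oct_conv(x_h, x_l, w_hh, w_lh, w_hl, w_ll):
    """Octave convolution: same-frequency updates plus pooled/upsampled exchange."""
    y_h = conv1x1(x_h, w_hh) + upsample2(conv1x1(x_l, w_lh))   # H->H plus upsampled L->H
    y_l = conv1x1(x_l, w_ll) + conv1x1(avg_pool2(x_h), w_hl)   # L->L plus pooled H->L
    return y_h, y_l

rng = np.random.default_rng(0)
alpha, c_in, c_out = 0.25, 16, 32
c_l = int(alpha * c_in); c_h = c_in - c_l        # split input channels by alpha
cl_out = int(alpha * c_out); ch_out = c_out - cl_out
x_h = rng.standard_normal((32, 32, c_h))         # high-frequency group, full resolution
x_l = rng.standard_normal((16, 16, c_l))         # low-frequency group, half resolution
y_h, y_l = oct_conv(
    x_h, x_l,
    rng.standard_normal((c_h, ch_out)), rng.standard_normal((c_l, ch_out)),
    rng.standard_normal((c_h, cl_out)), rng.standard_normal((c_l, cl_out)),
)
print(y_h.shape, y_l.shape)                      # (32, 32, 24) (16, 16, 8)
```

The low-frequency group is stored at half resolution throughout, which is exactly where the saving in spatial redundancy comes from.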

OctSqueezeNet for LiDAR Classification
As shown in Figure 4, we rescaled the feature maps of different resolutions to the same spatial resolution and concatenated their feature channels, forming a dual neural architecture with multiple parallel branches, called OctSqueezeNet [39], for LiDAR data classification.
First, a randomly selected 20% of the input feature maps underwent a 2 × 2 max-pooling operation, halving their size to 16 × 16, while the remaining feature maps kept the original size of 32 × 32. The two parts were sent separately to different Fire modules to obtain their respective output feature maps; the size of the feature maps does not change through a Fire module. We repeated the above operation for the 32 × 32 output feature maps. At the same time, 80% of the 16 × 16 output feature maps were sent to a Fire module, and a 2 × 2 upsampling operation was performed on the corresponding output to restore the size to 32 × 32; the remaining 20% were sent to another Fire module. Feature maps of the same size in the different branches were combined to form dual outputs, which were then sent to average pooling layers to obtain output feature maps of size 16 × 16 and 8 × 8.

As shown in the lower part of Figure 4, the above operation was repeated for the 16 × 16 output feature maps. Meanwhile, 80% of the 8 × 8 output feature maps were sent directly to a Fire module; the remaining 20% were sent to another Fire module, whose outputs then passed through a 2 × 2 upsampling layer to restore the size to 16 × 16.
The feature maps of the same size in the different branches were combined into 16 × 16 and 8 × 8 groups and sent to Fire modules, respectively, as shown in the lower part of Figure 4. The 8 × 8 feature maps were then upsampled to 16 × 16, and the 16 × 16 feature maps were merged to obtain a single output and downsampled to 8 × 8. Finally, softmax was used for data classification.

Adaptive Learning Optimization Algorithm
The Adam optimization algorithm is an extension of the stochastic gradient descent (SGD) algorithm: a step-size optimization algorithm that adaptively estimates low-order moments of the stochastic objective function's gradient from the training data. It can replace SGD and update the weights of the neural network iteratively. The method derives an adaptive learning rate for each parameter from the first-order moment, m_t, and the second-order moment, n_t, of the gradient. The process is shown as Equations (5)-(7). The bias-corrected estimates m̂_t and n̂_t approximate the expected unbiased moments, require no additional memory, and are adjusted dynamically according to the gradient; the term -m̂_t/(√n̂_t + ε) forms a dynamic constraint on the learning rate with a clear range.
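Equations (5)-(7) amount to the following minimal NumPy sketch of one Adam step; the quadratic test function and the learning rate are illustrative, and the defaults β1 = 0.9, β2 = 0.999 follow the original Adam formulation rather than this article's exact settings:

```python
import numpy as np

def adam_step(theta, grad, m, n, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, constrained step."""
    m = beta1 * m + (1 - beta1) * grad           # first-order moment m_t
    n = beta2 * n + (1 - beta2) * grad ** 2      # second-order moment n_t
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected m_t
    n_hat = n / (1 - beta2 ** t)                 # bias-corrected n_t
    theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
    return theta, m, n

# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array(3.0)
m = n = np.zeros_like(theta)
for t in range(1, 501):
    theta, m, n = adam_step(theta, 2 * theta, m, n, t)
print(float(theta))   # approaches 0, the minimiser
```

Note how m̂_t/(√n̂_t + ε) bounds the effective step near the learning rate regardless of the raw gradient magnitude, which is the "dynamic constraint" described above.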

Loss and Activate Function
The activation function of the structure in this article is ReLU. As shown in Equation (9), it sets part of the neuron outputs to zero, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting; it also makes the computation of the whole process considerably faster. Because our datasets are multi-category, softmax is adopted as the final classifier of the network model. As shown in Equation (10), softmax applies an exponential operation, which widens the contrast between large and small values and improves learning efficiency.
where z_j^L represents the input of the jth neuron of the Lth layer (usually the last layer), a_j^L represents the output of the jth neuron in the Lth layer, e represents the natural constant, and ∑_k e^{z_k^L} represents the sum of the exponentiated inputs of all neurons in the Lth layer. Therefore, as in Equation (11), the corresponding loss function is a combination of softmax and cross-entropy loss.
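Equations (9)-(11) can be sketched as follows (illustrative inputs; subtracting the maximum before exponentiation is a standard numerical-stability trick not spelled out in the text):

```python
import numpy as np

def relu(z):
    """Equation (9): zero out negative activations, sparsifying the network."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Equation (10): exponentiate the inputs z_j^L and normalise by their sum."""
    e = np.exp(z - z.max())   # max subtracted for numerical stability
    return e / e.sum()

def cross_entropy(probs, label):
    """Equation (11): negative log-probability assigned to the true class."""
    return -np.log(probs[label])

h = relu(np.array([-1.0, 2.0]))   # negative activation clipped to 0
z = np.array([2.0, 1.0, 0.1])     # inputs z_j^L of the last layer
a = softmax(z)                     # outputs a_j^L form a probability distribution
print(h, a.sum())                  # [0. 2.] 1.0
print(cross_entropy(a, 0))         # loss is small when the true class dominates
```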

Datasets Description
This article conducted experiments on two different datasets, Bayview Park and Recology, to evaluate the performance of the proposed classification method. They are public datasets from the 2012 IEEE GRSS Data Fusion Contest and were collected in the city of San Francisco, CA, USA. The Bayview Park dataset has a size of 300 × 200 pixels with a spatial resolution of 1.8 meters and contains 7 land classes. The Recology dataset consists of 200 × 250 pixels with a spatial resolution of 1.8 meters and contains 11 land classes. Figure 5 shows the DSM maps and the ground-truth maps for the Bayview Park and Recology datasets, respectively.

Experimental Set-Up
The experiments used TensorFlow under Windows as the backend, coded with Keras and Python. We divided each dataset into two parts: training samples and test samples. The training samples (400, 500, 600, or 700) were selected randomly, and the rest were used as the test set. We adopted overall accuracy (OA), average accuracy (AA), and the Kappa coefficient as objective evaluation criteria. In the experiments, SqueezeNet used only convolution kernels of size 1 × 1 and 3 × 3, and the input features of both datasets were 32 × 32 pixels. The DSM data were linearly mapped to (−1, 1), and Adam was selected as the gradient optimization algorithm. The kernel function of the SVM was set to the radial basis function (rbf) with its coefficient left at the default (auto), and the penalty parameter of the error term was 100. The initial learning rate for both the Bayview Park and Recology datasets was 0.001 in the CNN, and 0.0005 and 0.001, respectively, in OctConv; in SqueezeNet and OctSqueezeNet, the initial learning rates for the two datasets were both set to 0.0005. For the two datasets, the ratio of pooled feature maps in the first input layer and the middle layers was set to 0.2; the last output layer did not perform pooling.
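The preprocessing described above, linearly mapping the DSM values to (−1, 1) and randomly splitting the samples into training and test sets, can be sketched as follows (the raster values are synthetic and the sample count is an illustrative stand-in for the labeled-pixel count, not the exact pipeline code):

```python
import numpy as np

def linear_map(dsm, lo=-1.0, hi=1.0):
    """Linearly map DSM values to the interval (lo, hi)."""
    d = (dsm - dsm.min()) / (dsm.max() - dsm.min())  # first rescale to [0, 1]
    return lo + d * (hi - lo)

def random_split(n_samples, n_train, seed=0):
    """Randomly pick n_train training indices; the remainder form the test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    return perm[:n_train], perm[n_train:]

dsm = np.random.default_rng(1).uniform(0.0, 60.0, size=(300, 200))  # toy elevation raster
mapped = linear_map(dsm)
train_idx, test_idx = random_split(300 * 200, 700)   # e.g. 700 training samples
print(mapped.min(), mapped.max())     # -1.0 1.0
print(len(train_idx), len(test_idx))  # 700 59300
```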


Bayview Park Dataset
The experiments ran on a 3.2-GHz CPU with a GTX 1060 GPU card. As shown in Table 1, OctSqueezeNet achieved the best results on OA, AA, and Kappa for every number of training samples. When 700 samples were selected, the best OA was 95.42%, an increase of 1.91%, 2.21%, 6.32%, and 19.17% over OctConv, SqueezeNet, CNN, and SVM, respectively. Figure 6 uses bar charts to show the accuracy of each comparison method for the different numbers of samples; it can be seen intuitively that the accuracy of the proposed OctSqueezeNet was always higher than that of the other methods. Figure 7 shows the classification accuracy of each class; OctSqueezeNet classified the different land classes well. Figure 8 visualizes the classification results of the different networks on Bayview Park through false-color maps. As the precision increased, the misclassified areas gradually decreased.

Recology Dataset
As shown in Table 3, OctSqueezeNet also achieved the best results on OA, AA, and Kappa on the Recology dataset for every number of training samples. When 700 samples were selected, the best OA was 95.91%, an increase of 0.63%, 0.97%, 3.22%, and 17.98% over OctConv, SqueezeNet, CNN, and SVM, respectively. Figure 9 shows that the accuracy of the five methods grew steadily as the number of samples increased, and the accuracy of OctSqueezeNet was always the highest. Table 4 and Figure 10 also show that OctSqueezeNet had better classification results on the different classes. Figure 11 shows the classification results for the land classes; the misclassified areas likewise shrank as the precision increased.

Selection of Experimental Parameters
The number of parameters and the model size of the proposed method in this article were compared with the classical methods. In order to adapt to the Bayview Park and Recology datasets, we made some adjustments to the structure of these classic algorithms. Next, the adjustments made in each architecture are described.
(1) SqueezeNet: The first change was the addition of batch normalization, which was not present in the original architecture. Second, dropout was not used in the last convolution layer. Finally, we changed the kernel size of the first convolution layer from the original 7 × 7 to 3 × 3.

(2) AlexNet: Its original architecture contained eight weighted layers; the first five were convolutional and the rest were fully connected. In the comparison experiments, the stride of the first convolution layer was changed from 4 to 1. The output of the last fully connected layer was fed into a 7-way softmax for the Bayview Park dataset and an 11-way softmax for the Recology dataset.
(3) ResNet-34: The original ResNet-34 had four sections composed of 3, 4, 6, and 3 identity blocks, respectively, with 64, 128, 256, and 512 filters in the identity blocks of the four sections. In our experiments, the kernel size of the first convolution layer was changed from 7 × 7 to 3 × 3, and the numbers of filters were reduced to 16, 28, 40, and 52, respectively. The output of the last fully connected layer was configured in the same way as for AlexNet.
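The effect of the kernel-size adjustment described in (1) can be illustrated with a simple parameter count. The sketch below is only illustrative: the channel counts (3 input bands, 96 filters) are assumed for the example and are not taken from the article.

```python
# Illustrative parameter count for a k x k convolution layer:
# weights = k * k * c_in * c_out, plus one bias per output channel.
def conv_params(k, c_in, c_out, bias=True):
    """Number of learnable parameters of a k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# First convolution layer with assumed channel counts:
original = conv_params(7, 3, 96)  # original 7 x 7 kernel
adjusted = conv_params(3, 3, 96)  # adjusted 3 x 3 kernel, as used here

print(original, adjusted)  # the 3 x 3 variant needs far fewer weights
```

Under these assumed channel counts, shrinking the first kernel from 7 × 7 to 3 × 3 cuts that layer's weights by roughly a factor of five, which is consistent with the parameter reductions reported in this section.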
As shown in Table 5, compared with several classical CNN algorithms, the proposed OctSqueezeNet achieved higher precision while significantly reducing the number of parameters and the model size. When 700 training samples were selected, the OA of OctSqueezeNet on the Bayview Park dataset reached 95.42%, an improvement of 2.33%, 1.67%, and 2.21% over AlexNet, ResNet-34, and SqueezeNet, respectively; yet OctSqueezeNet had only 0.32M parameters, which is 29.67M, 0.1M, and 0.92M fewer than AlexNet, ResNet-34, and SqueezeNet, respectively. Its model size was only 1.38M, which is 113M, 0.45M, and 3.45M smaller than those of AlexNet, ResNet-34, and SqueezeNet, respectively. The best OA of OctSqueezeNet on the Recology dataset was 95.91%, which is 1.64%, 0.86%, and 0.97% higher than AlexNet, ResNet-34, and SqueezeNet. Its parameter count (0.33M) was the smallest of the four methods; AlexNet, ResNet-34, and SqueezeNet had 114.51M, 1.84M, and 4.85M parameters, respectively. The model size of OctSqueezeNet was also the smallest (1.37M) of the four methods, 113.14M, 0.47M, and 3.48M smaller than those of AlexNet, ResNet-34, and SqueezeNet. Additionally, the training and test times are shown in Table 5. Although the parameters were reduced, the structural branches were more complicated, which affected the transfer time. However, even though OctSqueezeNet was slower than SqueezeNet itself, this was acceptable given the greatly improved accuracy and the reduced number of parameters and model size; OctSqueezeNet was much faster than AlexNet and ResNet-34.

Conclusions
This article designed a dual neural architecture, OctSqueezeNet, to classify LiDAR-DSM data. When SqueezeNet was used alone, without being combined with OctConv, the network had more parameters and a larger model size and, most importantly, its classification accuracy on the datasets dropped significantly. This is because SqueezeNet replaces many of the traditional 3 × 3 convolution kernels with 1 × 1 convolution kernels, which causes a relatively larger loss of extracted features. OctConv does not change the size of the convolution kernels. By reducing the size of a part of the feature maps, it effectively enlarges the receptive field and captures more global feature information, while the feature maps kept at the original size, whose receptive field is unchanged, extract more detailed feature information. This compensates for the loss SqueezeNet incurs in feature extraction. Because only 1 × 1 and 3 × 3 convolution kernels are used, the network also has fewer parameters and a smaller model size.
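The parameter savings of the Fire module mentioned above can be made concrete with a back-of-the-envelope count: a Fire module first squeezes the input with 1 × 1 kernels and then expands with parallel 1 × 1 and 3 × 3 kernels, so most channels never meet a 3 × 3 kernel. The channel counts below (128 input channels, a squeeze width of 16, and 64 + 64 expand filters) are illustrative assumptions, not values from the article.

```python
# Rough comparison: one plain 3 x 3 convolution vs. a Fire module
# (squeeze 1x1 -> expand 1x1 and 3x3) producing the same 128 output
# channels. Biases are omitted for simplicity.
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution (no bias)."""
    return k * k * c_in * c_out

def fire_params(c_in, s1, e1, e3):
    """Weight count of a Fire module: squeeze s1, expand e1 + e3."""
    squeeze = conv_params(1, c_in, s1)
    expand = conv_params(1, s1, e1) + conv_params(3, s1, e3)
    return squeeze + expand

plain = conv_params(3, 128, 128)      # one ordinary 3 x 3 layer
fire = fire_params(128, 16, 64, 64)   # Fire module, 128 outputs total

print(plain, fire)  # the Fire module uses far fewer weights
```

With these assumed widths, the Fire module needs roughly a tenth of the weights of the plain 3 × 3 layer, which is why replacing OctConv's internal convolutions with Fire modules shrinks the overall parameter count and model size reported above.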
Overall, combining SqueezeNet with OctConv yielded better classification precision and required less storage space for the model. The experimental results indicate that OctSqueezeNet achieved OAs of 95.42% and 95.66% on the Bayview Park and Recology datasets, respectively, when 700 training samples were used, which were better than those of the other classification methods. At the same time, for the two datasets, OctSqueezeNet had 0.32M and 0.33M parameters and model sizes of 1.38M and 1.37M, respectively, both lower than those of the other methods. The combination of SqueezeNet and OctConv opens a new window for LiDAR data classification by fully extracting spatial information.