Leaf Counting with Multi-Scale Convolutional Neural Network Features and Fisher Vector Coding

Abstract: The number of leaves on a maize plant is one of the key traits describing its growth conditions. It is directly related to plant development, and leaf counts also give insight into changing plant development stages. Compared with traditional solutions, which need excessive human intervention, computer vision and machine learning methods are more efficient. However, leaf counting with computer vision remains a challenging problem, and more and more researchers are trying to improve its accuracy. To this end, an automated, deep-learning-based approach for counting leaves in maize plants is developed in this paper. A Convolutional Neural Network (CNN) is used to extract leaf features. The CNN model in this paper is inspired by Google Inception Net V3, which uses multi-scale convolution kernels in one convolution layer. To compress the feature maps generated by some middle layers of the CNN, the Fisher Vector (FV) is used to reduce redundant information. Finally, these encoded feature maps are used to regress the leaf numbers by using Random Forests. To boost related research, a single maize image dataset (different growth stages, 2845 samples, of which 80% are used for training and 20% for testing) was constructed by our team. On this single maize dataset, the proposed algorithm achieves a Mean Square Error (MSE) of 0.32.


Introduction
Precision agriculture, which focuses on optimizing production by accounting for variabilities and dealing with uncertainties in agricultural systems, has been under active research in recent years [1]. Feature monitoring and plant phenotyping are essential parts of precision agriculture. They can help in modeling the growth process of plants and guide farmers to obtain higher yields with appropriate fertilizer, irrigation, and disease control [2,3].
Traditional plant phenotyping involves a large number of manual measurements, and this has been identified as the current bottleneck in modern plant breeding and research programs [4]. The number of leaves of a plant is considered one of the critical phenotypic metrics related to its development and growth stages [5,6], flowering time [7], and water condition. Traditional manual measurement is slow, tedious, and expensive. Therefore, several image-based and machine learning technologies have been introduced for leaf counting. However, counting leaves automatically is challenging [8] because of a plant's rapid growth, leaf occlusion, and illumination problems. Moreover, most studies on leaf counting are based on rosette plants, and the relevant algorithms are not suitable for maize plants. Considering this, we designed a model suitable for counting maize leaves.
In this study, we estimate the number of leaves on a maize plant at different growth stages. The problem is posed as a nonlinear regression problem, which does not require segmenting individual

Related Work
It has been shown that second-order statistics significantly improve classification performance [17]. Several CNN architectures that incorporate second-order statistics or coding methods have been proposed. The symmetric positive definite matrix network (SPDNet) was proposed in [18]. Modeled on the structure of a CNN, it uses bilinear mapping layers and eigenvalue layers instead of convolution layers and rectified linear units. In [19], the authors proposed a hybrid deep-learning architecture that encodes CNN features with the log-Euclidean Fisher Vector (LE FV).
The leaf counting methods used in recent studies are mainly of two types: counting via object segmentation and direct counting via a nonlinear regression model. Counting via object segmentation. This method involves separating the foreground and background points of the image and filtering out the background before counting. In particular, the end-to-end instance segmentation method [20] combined with long short-term memory [21] segments one leaf at a time. In [22], the authors used a segmented image mask to generate a plant skeleton and then extracted skeleton features such as skeleton length, convex hull circularity, and the number of skeleton branch points. These features were used to regress the leaf count. A 3D color histogram is commonly used to segment plants; in [23], a threshold selected from the color histogram was used to segment the plant. The segmented plant then generates a distance map with highlighted peaks, which serve as leaf center points, and the number of center points is taken as the number of leaves. Because of illumination and the shooting angle, some background points may remain in the segmented images. These noise points are likely to affect the leaf counting result.
Direct count. Here segmentation is not needed, and the original image is used directly for counting. In [24], the architecture of the ResNet50 model was modified so that the network took as input an RGB (red, green, blue) image of a rosette plant and output a leaf count prediction. Aich and Stavness [25] used the VGGNet architecture to regress the number of leaves. The input data had four channels (segmentation + RGB; the leaf counting competition offered segmented samples). There are many other similar approaches, which differ in the structure of the selected network. The advantage of these methods is that the image segmentation step is omitted and the original image information can be used directly to regress the leaf numbers, so no additional noise is introduced.

Materials
This experiment was carried out in a laboratory environment. The samples selected in this study were Zhengdan No. 958 plants grown in pots under indoor conditions. Zhengdan 958 is the most popular cultivar in China and is grown in Henan, Hebei, and Shandong provinces. The average plant height is 240 cm, and the average panicle height is 100 cm. The pots used in this study were 0.5 m high, with an upper diameter of 0.4 m and a bottom diameter of 0.3 m. The soil type was medium loam.
In this research, different degrees of water content were set. In a field environment, maize is likely to suffer drought because of water shortage. The purpose of this research is to detect the number of leaves in different environments. Therefore, using experimental samples with different degrees of water content better simulates the natural environment and shows the effect of water content on the number of leaves. Table 1 shows the moisture control. The growth stages of the maize selected in this paper ranged from V8 (eight visible leaves) to VT (the last spike was visible). According to some research, the water supply of maize in the two weeks before and after the pollination period determines the final yield [26]. Therefore, it is more meaningful to study and detect the phenotype of maize in this period. Table 2 shows the growth stages and their descriptions. Figure 1 shows some example images of maize at different growth stages. The image collection equipment was a Canon EOS 700D, a camera with 18 million effective pixels. The maize image samples were collected at a resolution of 5184 × 3456. However, the original images were downsampled to 441 × 441 for the calculation to improve computational efficiency. The camera angle and focal length were adjusted as the maize grew. A picture was taken every 5 minutes from 5:30 to 18:30.

Setting Label According to the Number of Maize Leaves
We first tried to use a classical CNN structure to regress the number of leaves directly, but the results were not suitable for maize plants: the regression result often approached the average leaf number of the training samples. We therefore assigned labels according to ranges of leaf numbers, giving the same label to samples with similar leaf numbers. By observing different samples of the maize plant, we found that image samples of the same species with similar leaf numbers often share many similarities in shape, size, and shooting angle. Therefore, we grouped samples with similar numbers of leaves under the same label and then used the CNN to learn the common features of samples of the same class; a CNN can effectively learn such features from the image.
Before we assigned labels to each sample, the distribution of leaf numbers over all samples was counted manually, because one type of sample may occupy a large proportion of the data set. Disregarding the leaf number distribution would lead to an unbalanced class distribution in the training samples, which would significantly influence network training. The leaf number distribution in the sample set is shown in Figure 2.

The samples selected in this experiment correspond to four irrigation methods. Except for the first one, the other three were grown under different degrees of drought. Therefore, the distributions of the maximum and minimum leaf numbers differ between the sample groups. From the left panel of Figure 2, we can see that the number of leaves is mainly concentrated at 6, 7, and 8. As the number of leaves increases, the corresponding number of samples gradually decreases. This is because the number of leaves of a plant under suitable moisture increases steadily with time, whereas a plant in a drought state grows slowly, so its leaf number is lower than that of a plant with suitable water content. Therefore, we needed to reset the labels of the original samples and ensure that the reset labels were uniformly distributed. The correspondence between leaf number and label can be found in Table 3.
Table 3. Label range corresponding to the number of leaves.
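The relabeling idea described above can be sketched as equal-frequency binning: contiguous leaf counts are grouped so that every label covers a similar number of samples. This is only an illustration. The helper names `balanced_label_ranges` and `to_label`, the greedy rule, and the toy histogram are assumptions, and the ranges actually used in this study are those listed in Table 3.

```python
from collections import Counter


def balanced_label_ranges(leaf_counts, n_labels):
    """Greedily group sorted leaf counts into contiguous ranges that hold
    roughly equal numbers of samples (hypothetical helper)."""
    hist = Counter(leaf_counts)
    values = sorted(hist)
    target = len(leaf_counts) / n_labels  # ideal number of samples per label
    ranges, current, filled = [], [], 0
    for i, value in enumerate(values):
        current.append(value)
        filled += hist[value]
        remaining_values = len(values) - i - 1
        remaining_labels = n_labels - len(ranges) - 1
        # Close the range once the cumulative target is reached, or when
        # every remaining leaf count needs a label of its own.
        if remaining_labels > 0 and (
            filled >= target * (len(ranges) + 1)
            or remaining_values == remaining_labels
        ):
            ranges.append((current[0], current[-1]))
            current = []
    if current:
        ranges.append((current[0], current[-1]))
    return ranges


def to_label(leaf_count, ranges):
    """Map one leaf count to the index of the range that contains it."""
    for label, (low, high) in enumerate(ranges):
        if low <= leaf_count <= high:
            return label
    raise ValueError(f"leaf count {leaf_count} is outside every range")


# Toy histogram (made up): most plants have 6 to 8 leaves, few have more.
toy_counts = [6] * 30 + [7] * 30 + [8] * 30 + [9] * 12 + [10] * 9 + [11] * 6 + [12] * 3
toy_ranges = balanced_label_ranges(toy_counts, n_labels=4)
print(toy_ranges)                # [(6, 6), (7, 7), (8, 8), (9, 12)]
print(to_label(10, toy_ranges))  # 3
```

In the toy data each of the four labels ends up with 30 samples, which is the kind of uniform label distribution the paragraph above asks for.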

Leaf Count Net
Our network follows the structure of Google Inception Net V3. With this structure, the depth of the network can be maintained while the number of parameters and the risk of over-fitting are effectively reduced. In our network, convolutional kernels of different sizes are used to extract multi-scale features. After the convolution operation, the feature maps are concatenated. However, if all the feature maps are concatenated, their number becomes too large, which increases the computational complexity. Therefore, a 1 × 1 convolution is introduced to reduce the dimensionality of the feature maps: the number of output convolution kernels is smaller than the number of input feature maps. Figure 3 shows the process of reducing 256 feature maps to 128 feature maps with a 1 × 1 convolution kernel. Another way to reduce parameters is to replace a 3 × 3 convolution with a two-layer convolution of 1 × 3 and 3 × 1, which reduces the number of parameters by 33%.
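The parameter savings mentioned above can be checked with simple arithmetic. The sketch below counts convolution weights only (biases ignored). The helper name `conv_params` and the channel width of 128 used for the factorisation example are assumptions for illustration, not values taken from the network in Figure 4.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one 2-D convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


# 1 x 1 convolution that reduces 256 feature maps to 128 (the case in Figure 3).
reduce_1x1 = conv_params(1, 1, 256, 128)

# Replacing one 3 x 3 convolution by a 1 x 3 followed by a 3 x 1 convolution,
# with the same (assumed) number of input and output channels.
channels = 128
full_3x3 = conv_params(3, 3, channels, channels)
factorised = conv_params(1, 3, channels, channels) + conv_params(3, 1, channels, channels)
saving = 1 - factorised / full_3x3

print(reduce_1x1)            # 32768
print(full_3x3, factorised)  # 147456 98304
print(round(saving, 2))      # 0.33
```

The saving is independent of the channel width: the factorised pair uses 3 + 3 = 6 weights per input-output channel pair instead of 9, which is the 33% reduction stated in the text.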

When training was finished, we extracted feature maps from the middle of the network; these feature maps have different scales. The green frames in Figure 4 mark the extracted feature maps, whose dimensions are (53 × 53 × 128), (25 × 25 × 288), and (3 × 3 × 64). These feature maps are used as multi-scale features to predict the final number of leaves. However, a secondary extraction of features is required before fitting, because all the feature maps have a large dimensionality, which would make the model very difficult to train. Therefore, we use FV to encode the features and convert the feature maps to vectors.

Coding Multi-Scale Feature Maps by Using Fisher Vector(FV)
We extract multi-scale feature maps from middle layers instead of from a single layer. Because the feature maps are compressed when they pass through a pooling layer, some features may be lost in this process, and the missing features may include information that is useful for the final regression. As is well known, the value of one point in a feature map represents a receptive field in the original image. With such a large-scale representation, there is a high probability of missing some detailed information: for a local region of an image, the value of a single point is not specific enough to represent the whole region. Therefore, extracting feature maps from different layers of the network can reduce the feature loss caused by pooling and can better describe the original image at different scales. These feature maps can be regarded as local feature descriptors, similar to SIFT [27]. Figure 5 describes the FV encoding process. After the feature maps at all three scales are coded, high-level features are obtained, and all of the vectors are fused to form the feature vector that we finally use for prediction.
We were inspired by [28], in which the authors used the SIFT operator to extract descriptors of a face image and then used the FV to code the descriptors, so that each face image was finally represented by one feature vector. In this paper, H, W, and D denote height, width, and dimension, respectively. Therefore, for each feature map, the number of feature points is H × W, and the feature dimension of each feature point is D.
The FV encoding is based on the Fisher kernel, which groups a dense set of local features into a high-dimensional descriptor (features are better distinguished in a high-dimensional space) representing the image-level features. The descriptor is built from the gradient of the log-likelihood of the local features under a probability model. In general, this is performed by fitting a parametric generative model, e.g., the Gaussian mixture model (GMM), and then encoding the derivatives of the log-likelihood of the model with respect to the model parameters [28,29]. FV considers not only the gradient with respect to the weights but also the derivatives with respect to the mean and standard deviation [30,31]. We assume that the feature points are independent and identically distributed, and we use the GMM to represent their distribution. The GMM parameters are $\lambda = \{\omega_i, \mu_i, \Sigma_i\}$, $i = 1, 2, \ldots, k$, where $\omega_i$ is the probability that a feature point belongs to the $i$-th Gaussian distribution, $\mu_i$ is the mean of the feature points of the $i$-th Gaussian distribution, and $\Sigma_i$ is the covariance between feature points; $\sigma_i$ is the standard deviation, with $\sigma_i^2 = \mathrm{diag}(\Sigma_i)$. Equations (1)-(5) describe the specific solution process. In Equations (1) and (2), $\gamma(i, k)$ represents the probability that the sample point $x_t$ belongs to the $k$-th Gaussian model. Subsequently, the partial derivatives with respect to the GMM parameters are calculated.
Then we obtain the gradient vector $U_x$, where $d$ indexes the feature dimension. In $U_x$, the three gradient components (with respect to the weights, means, and standard deviations) have dimensions $k$, $k \times D$, and $k \times D$, respectively, where $k$ is the number of Gaussian distributions. Because the weights satisfy the constraint $\sum_i \omega_i = 1$, there is one fewer free variable, so the dimension of $U_x$ is $(2D+1) \times k - 1$. The gradient is then normalized using the Fisher information matrix to obtain the Fisher feature vector. Equations (6)-(8) give the final representation of the Fisher vectors. The detailed derivation can be found in [30].
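The encoding described above can be sketched in a few lines. The function below follows one common formulation of the Fisher vector for a diagonal-covariance GMM (posterior-weighted gradients with respect to weights, means, and standard deviations, each divided by the square root of a diagonal approximation of the Fisher information). It is an assumption that this matches Equations (1)-(8) term by term. The function name `fisher_vector`, the use of the first component as the reference for the weight constraint, and the toy GMM parameters are illustrative, and in practice the GMM would first be fitted to training descriptors with EM.

```python
import math


def fisher_vector(points, weights, means, sigmas):
    """Encode T descriptors of length D with a K-component diagonal GMM.
    Returns a list of length (2 * D + 1) * K - 1 (illustrative sketch)."""
    T, K, D = len(points), len(weights), len(means[0])
    g_w = [0.0] * K
    g_mu = [[0.0] * D for _ in range(K)]
    g_sigma = [[0.0] * D for _ in range(K)]

    for x in points:
        # log(w_k * N(x | mu_k, diag(sigma_k^2))) for every component k.
        log_p = []
        for k in range(K):
            s = math.log(weights[k])
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]
                s -= 0.5 * z * z + math.log(sigmas[k][d]) + 0.5 * math.log(2 * math.pi)
            log_p.append(s)
        # Posterior probability of each component (log-sum-exp for stability).
        top = max(log_p)
        norm = sum(math.exp(v - top) for v in log_p)
        gamma = [math.exp(v - top) / norm for v in log_p]

        # Accumulate the log-likelihood gradients over all descriptors.
        for k in range(K):
            g_w[k] += gamma[k] / weights[k]
            for d in range(D):
                diff = x[d] - means[k][d]
                g_mu[k][d] += gamma[k] * diff / sigmas[k][d] ** 2
                g_sigma[k][d] += gamma[k] * (diff ** 2 / sigmas[k][d] ** 3 - 1 / sigmas[k][d])

    fv = []
    # Weight block: K - 1 entries because the weights sum to one
    # (component 0 is used as the reference).
    for k in range(1, K):
        info = T * (1 / weights[k] + 1 / weights[0])
        fv.append((g_w[k] - g_w[0]) / math.sqrt(info))
    # Mean block: K * D entries.
    for k in range(K):
        for d in range(D):
            fv.append(g_mu[k][d] / math.sqrt(T * weights[k] / sigmas[k][d] ** 2))
    # Standard-deviation block: K * D entries.
    for k in range(K):
        for d in range(D):
            fv.append(g_sigma[k][d] / math.sqrt(2 * T * weights[k] / sigmas[k][d] ** 2))
    return fv


# Toy 2 x 2 x 3 feature map: H * W = 4 feature points, each of dimension D = 3.
feature_map = [[[0.1, 0.4, 0.3], [0.9, 0.2, 0.5]],
               [[0.3, 0.8, 0.1], [0.7, 0.6, 0.2]]]
points = [pixel for row in feature_map for pixel in row]
weights = [0.5, 0.5]                           # made-up GMM with K = 2
means = [[0.2, 0.5, 0.2], [0.8, 0.4, 0.4]]
sigmas = [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]
fv = fisher_vector(points, weights, means, sigmas)
print(len(fv))  # 13, i.e. (2 * 3 + 1) * 2 - 1
```

The length of the output depends only on D and K, not on H × W, which is why feature maps of very different spatial sizes can be encoded into fixed-length vectors and then concatenated.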
For each set of feature maps, a feature vector descriptor of dimension $(2D+1) \times k - 1$ is obtained. We encode the feature maps of the three middle layers in our network. Finally, the random forest algorithm is used to fit the encoded features and predict the number of maize leaves. The detailed algorithm flow chart is illustrated in Figure 6.
Figure 6. All the steps of the algorithm.

Results and Discussion
In this section, we evaluate the proposed leaf counting approach on an image data set of maize plants. First, we discuss the experimental settings and parameters. Next, we present the details of the training and testing samples. Then we compare the experimental results with existing methods to demonstrate the effectiveness of our algorithm. In this paper, we assume that samples with similar numbers of leaves have similar leaf features. The features learned by the CNN for samples with the same label were highly similar. When only the feature map of the last layer of the network was used, the features of samples with the same label were hard to discriminate, which is not conducive to the subsequent leaf number fitting (samples with the same label do not necessarily have the same leaf number). Therefore, we also extracted feature maps from the middle layers in the same way; the differences among these feature maps were greater than those among the last-layer feature maps. Meanwhile, these features are a good complement in terms of detail, which also explains why we used multi-scale features.
Symmetry 2019, 11, 516

Implementation Details
(1). The network was built with Python and TensorFlow. Some training parameters are shown in Table 4.
(2). Our CNN was trained and tested under a 64-bit Windows 10 operating system on an Intel Core i7-8700 at 3.2 GHz with 32 GB of RAM. The GPU was a GTX 1080Ti.
(3). Finally, we used random forests to predict the number of maize leaves; the number of trees in the forest was set to 70 and the maximum number of features was set to 44 (sqrt(features)).

Image Data
In 2.4, we presented the method used to obtain the image data samples. The samples of the four water levels numbered 701, 644, 851, and 649 (some poor-quality samples were removed because of problems with shooting angle and illumination). The total number of samples was 2845, of which 80% were training samples and 20% were testing samples. The final numbers of training and testing samples were 2276 and 569, respectively.
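The sample counts above can be reproduced with a simple shuffled 80/20 split. How the split was actually drawn is not stated, so the random shuffle, the fixed seed, and the helper name `split_indices` are assumptions made for illustration.

```python
import random


def split_indices(n_samples, train_percent=80, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # seeded for reproducibility
    n_train = n_samples * train_percent // 100    # integer arithmetic, no rounding issues
    return indices[:n_train], indices[n_train:]


samples_per_water_level = [701, 644, 851, 649]
total = sum(samples_per_water_level)
train_idx, test_idx = split_indices(total)
print(total, len(train_idx), len(test_idx))  # 2845 2276 569
```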

Experimental Results and Comparison with Other Methods
The training accuracy and training loss are shown in Figure 7. As the number of training epochs increased, the training loss converged quickly and the training accuracy was close to 100% within 200 epochs. This indicates that the features learned by the CNN are suitable for classification and that it is reasonable to assign the same label to maize samples with similar leaf numbers. By training on the classification of maize samples, the model can roughly determine the range of leaf numbers of a single maize sample, since the label assigned to a sample represents a range of leaf numbers. Therefore, the CNN model extracts features by learning to predict the range of plant leaf numbers. A high accuracy rate is very helpful for the subsequent encoding of feature maps, because the accuracy directly reflects the feature extraction ability of our network. Test samples were not reserved at this stage because our aim was not classification itself. Finally, to avoid over-fitting, the weights obtained at the 200th iteration were saved and used to extract the features from the middle layers.
In FV coding, there is an important parameter, k (the number of Gaussian distributions), which must be fixed for the whole process. To select a reasonable value of k, samples from the four water levels at different growth stages were selected; this subset contains 650 samples. The result can be seen in Figure 8, which shows that k = 77 gives the best performance.
The results of the comparison are shown in Table 5. "CountDiff" refers to the mean and standard deviation of the difference in count averaged over images, "AbsCountDiff" is the absolute value of "CountDiff," and "MSE" is the abbreviation for mean-square error [25]. Table 5 compares the results of our algorithm with methods that fit the leaf number directly with a deep neural network. Method (1) [32] and method (2) [33] show that directly fitting the leaf number with a deep network yields a high mean-square error. In the experiment, we found that the results of these methods are close to the mean of the training samples, especially for samples with a large number of leaves. It can be seen from (3) and (4) that the deep neural network is more powerful for feature extraction than traditional local feature extraction algorithms such as SIFT.
In method (4), there is a large gap between the training result and the test result, which indicates serious over-fitting. These results imply that extracting multi-scale features from a CNN and combining them with traditional machine learning is more advantageous than a single CNN for estimating the number of maize leaves. From Figure 9, we can see that most of the prediction errors are within one leaf. Comparing (a) and (b), the error distributions of the training set and the test set cover a consistent range, and there is no large fluctuation in the distribution, which indicates that our model is stable and has practical application value.
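The metrics reported in Table 5 can be computed from predicted and true leaf counts as in the sketch below. The sign convention (predicted minus actual), the use of the population standard deviation, and the extra "within_one_leaf" fraction are assumptions for illustration; the toy counts are made up and are not results from this study.

```python
from statistics import mean, pstdev


def counting_metrics(predicted, actual):
    """CountDiff, AbsCountDiff (mean, std) and MSE for paired leaf counts."""
    diffs = [p - a for p, a in zip(predicted, actual)]
    abs_diffs = [abs(d) for d in diffs]
    return {
        "CountDiff": (mean(diffs), pstdev(diffs)),
        "AbsCountDiff": (mean(abs_diffs), pstdev(abs_diffs)),
        "MSE": mean([d * d for d in diffs]),
        # Share of images whose prediction is off by at most one leaf.
        "within_one_leaf": sum(1 for a in abs_diffs if a <= 1) / len(diffs),
    }


# Made-up example: four images.
metrics = counting_metrics(predicted=[8, 9, 11, 7], actual=[8, 10, 10, 9])
print(metrics["CountDiff"][0])     # -0.5
print(metrics["AbsCountDiff"][0])  # 1
print(metrics["MSE"])              # 1.5
print(metrics["within_one_leaf"])  # 0.75
```

CountDiff shows whether a method over- or under-counts on average, while AbsCountDiff and MSE measure the size of the error regardless of sign; MSE penalises large miscounts more heavily.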
To verify the robustness of the proposed approach, a cross-validation experiment was designed. The samples of one water level were reserved as the validation set, and the samples of the other three water levels formed the training set, so that each of the four water levels was used as the validation set in turn. Each group of experiments was repeated five times. The error bars of the MSE are shown in Figure 10.
As can be seen from Figure 10, our model performed well for the different water level samples, and performance was worst in the first validation. In the first validation, samples 1, 2, and 3 formed the training set and sample 4 was the validation set. Sample 4 was the most severely affected by drought stress, and its leaves were fewer than those of the other samples in the same period. This indicates that the feature vectors of sample 4 were quite different from those of the other three sample types.
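The cross-validation scheme discussed above is a leave-one-group-out split over the four water levels. A minimal sketch follows; the helper name `leave_one_group_out`, the assignment of the counts 701, 644, 851, and 649 to levels 1 to 4 in that order, and the order in which folds are produced are assumptions for illustration.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, validation_indices) per group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        validation = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, validation


# One water-level id per image (assumed ordering of the four sample groups).
water_level = [1] * 701 + [2] * 644 + [3] * 851 + [4] * 649
folds = list(leave_one_group_out(water_level))
for held_out, train, validation in folds:
    print(held_out, len(train), len(validation))
```

Each fold would then be trained and evaluated five times to obtain the error bars; holding out a whole water level tests how well the model transfers to a moisture condition it has not seen.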

Misclassified Image Analysis
Some samples with incorrect regression results are shown in Figure 11. Panels (a)-(f) are samples with incorrect leaf counts at different maize growth stages. The counting errors of (a)-(d) are within 2 leaves. As can be seen from the samples, because of the illumination some leaves in (a)-(d) are hard to distinguish from the white pots. These leaves are located in the lower part of the maize plant. In (c) and (d), some leaves are withered because of the water shortage. These factors increase the difficulty of leaf counting and lead to counting errors. The counting errors of (e) and (f) are more than 2 leaves. These samples have more leaves than (a)-(d) and the leaves occlude each other; therefore, the counting error increases dramatically.
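The split into errors within 2 leaves and errors of more than 2 leaves can be computed directly from true and predicted counts. The sketch below is illustrative only: the function name `error_summary` is hypothetical, and rounding the real-valued regression output to the nearest integer before comparison is an assumption, since the paper does not state how predictions are discretized.

```python
from collections import Counter
from typing import Dict, List


def error_summary(true_counts: List[int], predicted: List[float]) -> Dict[str, object]:
    """Summarize absolute leaf-counting errors.

    Assumption: each regression output is rounded to the nearest integer
    before it is compared with the true count.
    """
    errors = [abs(round(p) - t) for t, p in zip(true_counts, predicted)]
    n = len(errors)

    def fraction_within(k: int) -> float:
        # Share of samples whose absolute error is at most k leaves.
        return sum(1 for e in errors if e <= k) / n

    return {
        "histogram": dict(Counter(errors)),   # absolute error -> number of samples
        "within_1": fraction_within(1),
        "within_2": fraction_within(2),
        "over_2": 1.0 - fraction_within(2),
    }
```

Samples that fall in the `over_2` share would correspond to cases such as panels (e) and (f), where many mutually occluding leaves are present.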

The Relationship Between Maize Leaf Number and Water Content
The number of leaves can reflect the water content of a maize plant. Figure 12 shows a line chart of the change in the number of leaves over time for the four samples. The observation period was 32 days. The blue line represents the sample with suitable moisture. We can see that its number of leaves increases stepwise with time and that it is clearly separated from the three water-deficient samples. The green line represents moderate drought. Although it overlaps with the other two water-deficient samples, its total leaf number also rises stepwise; the rate of increase is much lower than that of the suitable-moisture sample and higher than that of the other two water-deficient samples. The difference between samples 3 and 4 was small, and the number of leaves in sample 4 first increased and then decreased with time. By inspecting the actual images, we found that the leaves of sample 4 drooped severely as they dried out. Sample 3 exhibited the same condition. Our algorithm does not detect the drooping leaves because their color is close to that of the soil. Therefore, the curves of samples 3 and 4 are similar in Figure 12.
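The comparison of rising rates can be made quantitative by fitting a straight line to each leaf-count curve and by checking whether a curve peaks before the end of the observation period. The following sketch assumes a least-squares slope as the measure of the rising rate, which is not stated in the paper. The function names are hypothetical, and any input values are to be supplied by the reader, as no measurements from Figure 12 are reproduced here.

```python
from typing import List


def growth_rate(days: List[float], leaf_counts: List[float]) -> float:
    """Least-squares slope of leaf count against time, in leaves per day."""
    n = len(days)
    mean_day = sum(days) / n
    mean_count = sum(leaf_counts) / n
    covariance = sum((d - mean_day) * (c - mean_count)
                     for d, c in zip(days, leaf_counts))
    variance = sum((d - mean_day) ** 2 for d in days)
    return covariance / variance


def rises_then_falls(leaf_counts: List[float]) -> bool:
    """True if the maximum count occurs strictly inside the series,
    i.e. the count first increases and later decreases."""
    peak = max(leaf_counts)
    first_peak_index = leaf_counts.index(peak)
    return 0 < first_peak_index and leaf_counts[-1] < peak
```

A curve like that of the suitable-moisture sample would give a clearly positive slope, while a curve that first increases and then decreases, as described for sample 4, would give a slope near zero and be flagged by `rises_then_falls`.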

Conclusions
In this paper, an approach that combines deep learning with traditional machine learning is proposed. The CNN is responsible for extracting multi-scale features from different layers. Multi-scale feature extraction can compensate for the loss of features caused by pooling layers. The FV maps the multi-scale features to a higher-dimensional space, which enhances the expressive ability of the CNN features and improves model performance. Our method does not require segmentation, so noise regions caused by segmentation errors are not introduced. The experimental results demonstrate that this method counts maize leaves effectively. However, for samples with abnormal illumination and leaf occlusion, there are still large counting errors, which indicates that further work is needed.
In future extensions of this work, we plan to enrich our data set by collecting more images of new maize species. Moreover, current related studies are mainly conducted in a laboratory environment; future work should focus on the field environment. In the preprocessing, the sample labels for different leaf numbers need to be marked manually to ensure that the partitioned samples have similar morphological features and a uniform distribution. However, manual operation is inconvenient, and an automatic partitioning method needs to be developed. Furthermore, to avoid redundant information, feature maps from only three layers were extracted according to the CNN structure. We cannot guarantee that these three levels of feature maps are the best combination. In subsequent work, we will continue to study how to select the optimal combination.