Vision-Based Classification of Mosquito Species: Comparison of Conventional and Deep Learning Methods

This study proposes a vision-based method to classify mosquito species. To investigate the efficiency of the method, we compared two classification approaches: a conventional method based on handcrafted features and a deep learning method based on convolutional neural networks. For the conventional method, 12 types of handcrafted features were extracted, and a support vector machine was adopted for classification. For the deep learning method, three types of architectures were adopted for classification. We built a mosquito image dataset containing 14,400 images of three mosquito species: 12,000 images for training, 1500 for testing, and 900 for validation. Experimental results revealed that the accuracy of the conventional method was at most 82.4%, obtained with the scale-invariant feature transform algorithm, whereas the accuracy of the deep learning method reached 95.5% with a residual network trained using data augmentation. These results indicate that deep learning is effective for classifying the mosquito species in the proposed dataset, and that data augmentation improves the accuracy of mosquito species classification.


Introduction
Mosquitoes are among the most important disease vectors, responsible for many deaths among children and adults. A study conducted in 2015 revealed that more than 430,000 people die of malaria annually [1]. Experts in tropical medicine affirm that the human blood-sucking mosquitoes comprise the Aedes, Anopheles, and Culex genera. These mosquitoes differ in their active hours, behavioral habits, and the infections they mediate. The classification of species is important because the prevention and extermination of infectious diseases differ for each species.
The identification of mosquito species is difficult for a layman. However, if vision-based classification of mosquito species from pictures captured by a smartphone or mobile device becomes available, the classification results can be used to educate people with inadequate knowledge of mosquitoes about their mediated infectious diseases, especially in Africa, Southeast Asia, and Central and South America, where several such infectious diseases are epidemic.
So far, some vision-based studies have been conducted to identify bug species. Fuchida et al. [2] presented the design and experimental validation of an automated vision-based mosquito classification module. The module can distinguish mosquitoes from other bugs, such as bees and flies, by extracting morphological features, followed by support vector machine (SVM)-based classification. Using multiple classification strategies, experimental results involving the classification of mosquitoes against a predefined set of other bugs demonstrated the efficacy and validity of the proposed approach.

Dataset Construction
This section describes the preconditions and the dataset construction. As discussed in the introduction, Aedes, Anopheles, and Culex are the human blood-sucking mosquitoes. The most infectious species among the Aedes genus are Aedes albopictus and Aedes aegypti. The most infectious species among the Anopheles genus is Anopheles stephensi. The most infectious species among the Culex genus is Culex pipiens pallens.
The active hours, behavioral habits, and mediated infectious diseases of Aedes albopictus and Aedes aegypti are similar. Even experts may find it challenging to distinguish their appearance, even with the aid of a microscope. Therefore, of the two species, we targeted Aedes albopictus because of its large global population.
We utilized images captured with a single-lens reflex camera for training, while images captured with a smartphone were used for testing and further validation. In addition, we assumed that the illuminance varied from 380 to 1340 lx and that the background was white. Table 1 presents the details of the shooting equipment. Figure 1 depicts an example of the captured images. The region of the mosquito was manually clipped from the captured images. Figures 2 and 3 show the mosquitoes clipped from the images captured with a single-lens reflex camera and a smartphone, respectively.
Each clipped image was rotated by 90°, and all the rotated images were added to the dataset; thus, the number of images increased four-fold. Consequently, for each mosquito species we obtained 4000 images for training, 500 images for testing, and 300 images for validation. In total, the dataset with three types of mosquitoes comprised 12,000 images for training, 1500 images for testing, and 900 images for validation. Figure 4 illustrates the system flow. We constructed a dataset of mosquito species and conducted classification using handcrafted features or deep learning. Classification using handcrafted features and classification using deep learning are hereinafter referred to as conventional classification and deep classification, respectively.
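The four-fold rotation step described above can be sketched as follows (a minimal illustration; the function name is our own):

```python
import numpy as np

def augment_with_rotations(image):
    """Return the image together with its 90, 180, and 270 degree rotations."""
    return [np.rot90(image, k) for k in range(4)]

# A 2x3 single-channel stand-in for a clipped mosquito image.
img = np.arange(6).reshape(2, 3)
rotated = augment_with_rotations(img)
print(len(rotated))      # → 4: a four-fold increase per original image
print(rotated[1].shape)  # → (3, 2): a 90° rotation swaps height and width
```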

Classification Method

Conventional Classification
In this section, we present the details of the conventional classification. Twelve types of handcrafted features, covering shape, color, texture, and frequency, were adopted for feature extraction, and the SVM method was adopted for classification.
For the shape features, we extracted speeded-up robust features (SURF) [8], scale-invariant feature transform (SIFT) [9], dense SIFT [10], histogram of oriented gradients (HOG) [11], co-occurrence HOG (CoHOG) [12], extended CoHOG (ECoHOG) [13], and local binary pattern (LBP) [14]. SURF and SIFT comprise two stages: feature point detection and feature description. Feature point detection is robust to scale changes and noise, and feature description is robust to illumination changes and rotation. Dense SIFT places feature points on a grid pattern and uses the same feature description as SIFT. An example of the detected key points is depicted in Figure 5. The features extracted by SURF, SIFT, and dense SIFT were each clustered using k-means, and vector quantization was conducted using a bag-of-features representation [15]. HOG is a histogram of luminance gradients, and CoHOG is a histogram of the appearance frequency of combinations of luminance gradients. ECoHOG is a histogram of the intensity of combinations of luminance gradients. LBP transforms the image into an LBP image that is robust to illumination changes, and its luminance histogram is used as the feature.
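The bag-of-features step named above — clustering local descriptors with k-means and vector-quantizing each image against the resulting codebook — can be sketched as follows. This is a toy illustration with random stand-in descriptors, not the authors' implementation; real SIFT descriptors are 128-dimensional and would be computed with a library such as OpenCV.

```python
import numpy as np

def kmeans(descriptors, k, iters=20, seed=0):
    """Plain k-means over local descriptors; returns the codebook (cluster centers)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center, then recompute the centers.
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bag_of_features(image_descriptors, centers):
    """Vector-quantize one image's descriptors against the codebook and
    return a normalized histogram of visual-word counts."""
    d = np.linalg.norm(image_descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
all_desc = rng.normal(size=(200, 8))   # stand-in for SIFT descriptors (128-D in practice)
codebook = kmeans(all_desc, k=5)
vec = bag_of_features(all_desc[:30], codebook)
print(vec.shape)  # → (5,): a fixed-length vector that can be fed to the SVM
```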

Deep Classification
In this section, we give the details of the CNN-based deep classification. Three types of architectures, including AlexNet [19], Visual Geometry Group Network (VGGNet) [20], and Deep residual network (ResNet) [21], were adopted for classification.
These architectures are often used for object recognition. AlexNet was used as it has been proposed. Note that VGGNet uses 16 layers of network and that ResNet uses 18 layers of network structure.
The CNN learning method was conducted using a stochastic gradient descent method, and the learning rate schedule was obtained using simulated annealing. The ImageNet pre-trained weights [22] were used as initial weights to accelerate the convergence of learning.
For the color features, we calculated color histograms in each of the red, green, blue (RGB), hue, saturation, value (HSV), and L*a*b* color systems [16]. The RGB feature uses a simple concatenation of the R, G, and B channel histograms. HSV is a color feature that uses the HSV color specification system; when the saturation is low in the HSV color system, the hue value becomes unstable. Therefore, the histogram of each channel excludes values whose saturation is 10% or less. L*a*b* is a color feature that uses the L*a*b* color system and a three-dimensional histogram of L*, a*, and b*.
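The low-saturation exclusion described for the HSV feature can be sketched as follows. This is a toy illustration (the function name and bin count are our own choices), assuming channels normalized to [0, 1]:

```python
import numpy as np

def hsv_hue_histogram(hsv, bins=16, sat_threshold=0.10):
    """Histogram of the hue channel, excluding pixels whose saturation is
    10% or less, where the hue value is unstable."""
    h, s = hsv[..., 0].ravel(), hsv[..., 1].ravel()
    keep = s > sat_threshold
    hist, _ = np.histogram(h[keep], bins=bins, range=(0.0, 1.0))
    total = hist.sum()
    return hist / total if total else hist.astype(float)

# Toy 2x2 HSV image; the last pixel is nearly gray (saturation 5%), so it is excluded.
hsv = np.array([[[0.1, 0.9, 0.5], [0.5, 0.8, 0.5]],
                [[0.9, 0.7, 0.5], [0.3, 0.05, 0.5]]])
hist = hsv_hue_histogram(hsv)
print(hist.sum())  # → 1.0, built from the 3 sufficiently saturated pixels only
```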
For the texture features, we calculated a gray level co-occurrence matrix (GLCM) feature. We used contrast, dissimilarity, homogeneity, energy, correlation, and angular second moment as the statistical values of the GLCM [17].
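As a minimal illustration of a GLCM and two of the listed statistics (contrast and energy, i.e., the angular second moment), the following sketch assumes a single horizontal pixel offset; in practice a library implementation over several offsets and angles would be used:

```python
import numpy as np

def glcm(gray, levels=4):
    """Gray level co-occurrence matrix for horizontally adjacent pixels
    (offset (0, 1)), normalized to probabilities."""
    m = np.zeros((levels, levels))
    for i, j in zip(gray[:, :-1].ravel(), gray[:, 1:].ravel()):
        m[i, j] += 1
    return m / m.sum()

def glcm_stats(p):
    """Two of the statistics named in the text: contrast and energy (ASM)."""
    i, j = np.indices(p.shape)
    contrast = ((i - j) ** 2 * p).sum()
    energy = (p ** 2).sum()
    return contrast, energy

gray = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 3, 3]])
p = glcm(gray)
contrast, energy = glcm_stats(p)
print(p.shape)  # → (4, 4): one row/column per gray level
```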
For the frequency features, we utilized GIST [18], which is characterized by imitating human perception. GIST divides the image into small 4 × 4 blocks and extracts the structure of the entire image using filters with different frequencies and scales.
Classification of the mosquito species was conducted using SVM and the aforementioned features. SVM considered a soft margin, and multiclass classification was performed using a one-versus-rest method.
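The one-versus-rest scheme can be sketched as follows. Note that this toy example substitutes a simple perceptron update for the soft-margin SVM solver, purely to illustrate how one binary scorer per class is trained against the rest and how the scorers are combined by argmax at prediction time; the paper itself uses an SVM.

```python
import numpy as np

def train_ovr(X, y, classes, epochs=20, lr=0.1):
    """One-versus-rest training: one binary linear scorer per class,
    fitted with a perceptron update (a stand-in for the soft-margin SVM)."""
    W = np.zeros((len(classes), X.shape[1] + 1))
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias term
    for k, c in enumerate(classes):
        t = np.where(y == c, 1.0, -1.0)        # class c vs. the rest
        for _ in range(epochs):
            for xi, ti in zip(Xb, t):
                if ti * (W[k] @ xi) <= 0:      # misclassified: nudge the scorer
                    W[k] += lr * ti * xi
    return W

def predict_ovr(W, X):
    """Assign each sample to the class whose scorer responds most strongly."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W.T).argmax(axis=1)

# Three linearly separable toy clusters standing in for the three species.
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9], [0.0, 2.0], [0.2, 2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
W = train_ovr(X, y, classes=[0, 1, 2])
print(predict_ovr(W, X))  # → [0 0 1 1 2 2]
```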

Deep Classification
In this section, we give the details of the CNN-based deep classification. Three types of architectures, including AlexNet [19], Visual Geometry Group Network (VGGNet) [20], and Deep residual network (ResNet) [21], were adopted for classification.
These architectures are often used for object recognition. AlexNet was used as originally proposed. Note that VGGNet uses a 16-layer network and that ResNet uses an 18-layer network structure.
CNN training was conducted using stochastic gradient descent, and the learning rate schedule was obtained using simulated annealing. The ImageNet pre-trained weights [22] were used as initial weights to accelerate the convergence of learning.
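The paper does not give the exact schedule, but a simulated-annealing-style learning rate that "cools" smoothly from a hot initial value can be sketched, for instance, with cosine annealing (the parameter values here are illustrative, not the authors'):

```python
import math

def annealed_lr(step, total_steps, lr_max=0.01, lr_min=1e-5):
    """Annealing-style schedule: the learning rate cools down smoothly
    from lr_max to lr_min over training (cosine annealing shown here)."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

print(annealed_lr(0, 100))    # → 0.01: start hot
print(annealed_lr(100, 100))  # → 1e-05: fully cooled
```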
As explained in Section 2, we captured mosquito images against a simple white background. Such low-noise, simple backgrounds can cause the model to overfit the training images and lose generality.
To suppress overfitting, the training images were transformed by the following data augmentation steps.

• Rotation: Rotate the image by angle θ. The angle θ varies randomly in the range from −θr to θr; the limit value θr is set to 45°.
• Brightness change: Randomly change the brightness of the image. The change rate α is chosen in the range from −αr to αr; the limit value αr is set to 40%.
• Contrast change: Randomly change the contrast of the image. The change rate α is chosen in the range from −αr to αr; the limit value αr is set to 40%.
• Saturation change: Randomly change the saturation of the image. The change rate α is chosen in the range from −αr to αr; the limit value αr is set to 40%.
• Hue change: Randomly change the hue of the image. The change rate α is chosen in the range from −αr to αr; the limit value αr is set to 25%.
The limit values, θr and αr, were determined a priori by reference to the default values used in the iNaturalist Competition 2018 training code. Note that the number of training images remained constant, since each transformed image replaced its original.
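A minimal sketch of the brightness and contrast steps with the 40% limit follows (our own function, operating on a float image in [0, 1]; rotation and the remaining color steps are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(image, alpha_r=0.40):
    """Apply the brightness and contrast steps with a change rate drawn
    uniformly from [-alpha_r, alpha_r] (the 40% limit used in the text)."""
    out = image.astype(float)
    # Brightness: shift all pixel values by a random fraction of the range.
    out = out + rng.uniform(-alpha_r, alpha_r)
    # Contrast: scale deviations from the mean by (1 + alpha).
    out = out.mean() + (out - out.mean()) * (1 + rng.uniform(-alpha_r, alpha_r))
    return np.clip(out, 0.0, 1.0)

img = rng.uniform(size=(4, 4))   # stand-in for a training image
aug = jitter(img)
print(aug.shape)  # → (4, 4): same shape, randomly re-lit image
```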
An example of the data augmentation is depicted in Figure 6. The variation of the training images increases in terms of color, brightness, and rotation based on the aforementioned processing.

Experimental Conditions
The mosquitoes were classified using the dataset presented in Section 2. We used Aedes albopictus, Anopheles stephensi, and Culex pipiens pallens as the species classification targets. The images captured using a single-lens reflex camera were used for training, and the images captured using a smartphone were used for testing. We used the dataset with three types of mosquitoes, consisting of 12,000 images for training and 1500 images for testing. For the deep classification, the training and testing phases were performed five times, and the average classification accuracy was calculated. Tables 2 and 3 show the accuracies aggregated over the three species for the conventional and deep classifications, respectively.

Experimental Results
In the conventional classification, the parameters for the features were set to the default values of the OpenCV libraries for Python. The hyperparameters of the SVM were determined through preliminary experiments: the error-term penalty parameter C was set to one, and a linear kernel was used. The SVM classifier with these parameters achieved the highest accuracy in a grid search.
The classification accuracies with SIFT and SURF are high. Given that the classification with dense SIFT exhibits low accuracy, an algorithm that both detects feature points and describes local features appears to be effective.
The deep classification has a lower classification accuracy than the conventional classification without data augmentation. With data augmentation, the deep classification achieves a higher accuracy than the conventional classification; therefore, data augmentation is effective. ResNet attains the highest classification accuracy of 95.5%, indicating that deep classification is effective for mosquito species classification. Figure 7 shows the confusion matrices. Figure 7a shows the matrix from the conventional classification with SIFT, which achieved the highest accuracy in Table 2. Compared with the deep classification result, the species are more frequently misclassified as each other.

Table 2. Result of conventional classification (features and their accuracies). Speeded-up robust features (SURF); histogram of oriented gradients (HOG); co-occurrence HOG (CoHOG); extended CoHOG (ECoHOG); local binary pattern (LBP); red, green, blue (RGB); hue, saturation, value (HSV); gray level co-occurrence matrix (GLCM); GIST; L*a*b*.
Figure 8 shows a visualization of the target region obtained with gradient-weighted class activation mapping (Grad-CAM), a visualization method for deep classification [23]. The heatmap covers the body of the mosquito.
Aedes albopictus and Culex pipiens pallens have a similar shape, but their body colors differ. Nevertheless, the visualization result shows that both are classified based on similar body parts. Images captured with a smartphone do not reveal the color differences when the shooting environment is dark, because a smartphone camera cannot offer the same image quality as a single-lens reflex camera. These facts lead to misclassification.

Figure 7b shows the matrix from the deep classification with ResNet, which achieved the highest accuracy in Table 3. This matrix was calculated from one of the five training and testing runs. Aedes albopictus and Culex pipiens pallens are relatively often misclassified as each other, whereas Anopheles stephensi is classified appropriately.
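The confusion matrices in Figure 7 tabulate, for each true species, how the test images were predicted. A minimal sketch of such a matrix (with toy labels, not the paper's data):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true species, columns are predicted species; entry (i, j)
    counts test images of species i classified as species j."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

# Toy labels: 0 = Aedes albopictus, 1 = Anopheles stephensi, 2 = Culex pipiens pallens.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 2, 1, 1, 0, 2]
print(confusion_matrix(y_true, y_pred))
# → [[1 0 1]
#    [0 2 0]
#    [1 0 1]]
```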

Effective Data Augmentation Investigation Experiment
We showed that the deep classification achieves a higher classification accuracy by data augmentation in Section 4. In this section, as further validation, we investigated which transformation step of the data augmentation is effective for deep classification.

Experimental Conditions
As explained in Section 3.2, we applied transformation steps to the training images for data augmentation. However, it is not clear which step contributes the most to the classification accuracy. In addition, the limit values, θr and αr, were determined a priori. Using the data augmentation described in Section 3.2, we verified which step contributes the most to the classification accuracy.
In this experiment, we chose one transformation step from those explained in Section 3.2 and applied the chosen transformation separately to the training images.


• Rotation: Rotate the image by angle θ. The angle θ varies randomly in the range from −θr to θr.
• Brightness change: Randomly change the brightness of the image. The change rate α is chosen in the range from −αr to αr.
• Contrast change: Randomly change the contrast of the image. The change rate α is chosen in the range from −αr to αr.
• Saturation change: Randomly change the saturation of the image. The change rate α is chosen in the range from −αr to αr.
• Hue change: Randomly change the hue of the image. The change rate α is chosen in the range from −αr to αr.
We utilized the dataset with three types of mosquitoes, comprising 12,000 images for training and 900 images for validation. Training and validation were performed five times for each condition, and the average classification accuracy was calculated. In total, 270 trials (54 conditions × five trials) of training and validation were performed. For the brightness, contrast, saturation, and hue changes, the limit value αr varied in steps of five from 0% to 50%; therefore, 11 × 4 conditions were prepared. Conditions A to K denote the ranges of the change rate α: [0 (no change, αr = 0)], [−5 to 5 (αr = 5)], …, [−50 to 50 (αr = 50)], respectively.

Experimental Results
Figure 9 shows the experimental results; the error bars show the standard deviation. Regarding rotation, the classification accuracy is almost the same under any condition. As mentioned in Section 2, each clipped training image was rotated by 90°, and all the rotated images were added to the dataset. The rotation variation is therefore already present in the dataset; hence, rotation has little effect on the accuracy in this experiment.

Regarding the brightness, saturation, and hue changes, the accuracy improves up to condition E (αr = 25) or G (αr = 30) and changes little afterward. Regarding the contrast change, the accuracy improves as the change rate increases. Under condition K (αr = 50) for contrast, the highest classification accuracy of 89.1% is achieved. Accordingly, the fluctuation of contrast offers the largest improvement in the accuracy of mosquito species classification.


Conclusions
This study compared a conventional classification method based on handcrafted features with a CNN-based deep classification method. We constructed a dataset for classifying mosquito species. For conventional classification, shape, color, texture, and frequency features were adopted for handcrafted feature extraction, and the SVM method was adopted for classification. For deep classification, three types of architectures were adopted and compared.
The deep classification had a lower classification accuracy than the conventional classification without data augmentation. With data augmentation, however, the deep classification achieved a higher accuracy than the conventional classification. ResNet achieved the highest classification accuracy of 95.5%, indicating that deep classification is effective for mosquito species classification. Furthermore, we verified that data augmentation with contrast fluctuation contributes the most to this improvement.