Method for Training Convolutional Neural Networks for In Situ Plankton Image Recognition and Classification Based on the Mechanisms of the Human Eye

In this study, we propose a method for training convolutional neural networks to identify and classify images with higher classification accuracy. By describing images in both the Cartesian and polar coordinate systems, we discuss the recognition and classification of plankton images and construct optimized classification and recognition networks for in situ plankton images, exploiting the advantages of both coordinate systems during network training. The two resulting feature vectors are fused and used as the input to a conventional machine learning classifier; support vector machines (SVMs) are selected to combine the two feature vectors derived from the different coordinate descriptions. On in situ plankton image data, the accuracy of the proposed model was markedly higher than that of the initial classical convolutional neural networks, with classification accuracy and recall rate increasing by 5.3% and 5.1%, respectively. In addition, the proposed training method considerably improves classification performance on the public CIFAR-10 dataset.


Introduction
Plankton, the tiny organisms of the marine realm, play a critical role in marine research and are highly influenced by their environmental conditions [1,2]. Plankton can be observed in images captured by underwater imaging systems [3]. Automatic and accurate identification of these tiny organisms is essential for real-time monitoring of marine ecology as well as for advanced assessment of water quality and the marine environment [4].
To perform ecological monitoring using an imaging system, an image acquisition technique must be applied. However, owing to the presence of organic matter and suspended particles, underwater visibility is highly limited. The ideal visibility, that is, the visibility in clear seawater, is typically approximately 20 m, whereas the visibility under in situ imaging conditions may be only a few meters in turbid seawater [5]. Visibility is limited by the attenuation of light as it propagates through seawater [6]. Light attenuation blurs the background of the captured image, and the living organisms and suspended matter found in complex underwater environments may further degrade the imaging quality.
To address this problem, we propose a new method for image classification and network training that combines translational and rotational features, with attention to in situ plankton images. A polar coordinate representation based on the mechanisms of the human eye is used to describe the rotational features of the target and to transform them into translational ones for input to the neural network. During the learning process, the rotational features of the target are learned indirectly, which is expected to improve the generalization capability and classification accuracy of the network. In the proposed method, the features of the target before and after rotation are extracted and combined to improve the classification accuracy and recall rate. Using this method, we achieve satisfactory results on the Bering Sea plankton image dataset (acquired and constructed in the laboratory [23]) and on the CIFAR-10 dataset [24].
The rest of this paper is structured as follows: Section 2 discusses the proposed methods, including problem analysis, model building, and the solution; Section 3 describes the proposed training method, which combines a neural network with polar mechanisms similar to those of the human eye, along with the results of tests conducted to evaluate the performance of the proposed model; and Section 4 summarizes the primary conclusions of the study and directions for future work.


Existing Method
In Euclidean geometry, a translation is a geometric transformation that moves every point of an image, or of a given space, by the same distance in the same direction. In an image classification task, for example, the target should keep the same tag irrespective of where it moves within the image. However, it is difficult to maintain translational invariance in image processing and classification [22]. Convolutional neural networks can partially solve the problem of translation invariance by introducing local connections and weight sharing [22]. The convolution process is defined as feature detection at a specific location [14]. This implies that, regardless of where the target appears in the image, the system should detect the same features and output the same response. If the target is moved to a different position in the image, the convolution kernel will not detect the features of the target until the kernel reaches the new target location [22], and the kernel itself is unchanged during this movement. This function of the kernel, together with that of the max-pooling layer, ensures the translational invariance of the target in the convolutional neural network, as demonstrated in Figure 2. However, previous research has shown that the convolution process is susceptible to rotation of the image [22]. As a result, the neural network learns the features of rotated images only partially. Thus, it is necessary to accumulate a large amount of target data, perform rotation and translation enhancements on these samples, and use larger datasets for network training [17,25]. This requirement is often difficult to satisfy when working with a small dataset (e.g., in situ images), as in this study, in which in situ-obtained marine plankton images are used. This is because the images must be acquired in the field, and extensive testing in the ocean would be required to obtain a sufficiently large dataset. Hence, it is essential to improve classification accuracy using smaller datasets.
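The translation-invariance argument above can be illustrated with a minimal numpy sketch (our own toy example, not the paper's code): a convolution kernel matched to a small target pattern produces the same globally max-pooled response wherever the pattern appears in the image.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Valid-mode 2-D cross-correlation (the 'convolution' used in CNNs)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def global_max_pool(fmap):
    return fmap.max()

# A small "target" pattern and a kernel tuned to detect it.
pattern = np.array([[1., 2.], [3., 4.]])
kernel = pattern.copy()

# Place the same pattern at two different positions in a blank image.
img_a = np.zeros((8, 8)); img_a[1:3, 1:3] = pattern
img_b = np.zeros((8, 8)); img_b[5:7, 4:6] = pattern

resp_a = global_max_pool(conv2d_valid(img_a, kernel))
resp_b = global_max_pool(conv2d_valid(img_b, kernel))
print(resp_a, resp_b)  # identical responses: translation is absorbed by pooling
```

Rotating the pattern instead of translating it would change the window products and thus the response, which is exactly the weakness discussed above.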


The Proposed Method
To solve the problem of the rotational invariance of image features, we sought to mimic the imaging mechanisms of the human eye. Figure 3 shows the structure of an eyeball [26]. Studies have shown that external light first enters the eyeball through the cornea at the pupil. It then passes through the lens and vitreous body to reach the retina [27]. In the retina, the light signal is converted into an electrical one and subjected to initial processing [27,28]. The process of recognizing, perceiving, and understanding external signals is completed in the visual cortex. The retinal imaging mechanism of the human eye is a nonuniform sampling process, which is best described by a transformation into logarithmic polar coordinates [29]. This transformation maps an image from the Cartesian coordinate system into the logarithmic polar coordinate system according to a set of transformation laws [30].
The logarithmic polar coordinates (ρ, θ) of a Cartesian point (x, y) relative to a center point (x_c, y_c) are given by

ρ = log_a √((x − x_c)² + (y − y_c)²), θ = arctan((y − y_c)/(x − x_c)),

where a is the base of the logarithm (typically 10 or e), ρ denotes the logarithmic radial distance from the center point (x_c, y_c), and θ denotes the angle. However, log-polar coordinate systems tend to emphasize mainly the central section of an image: if the point (x, y) is far from the center (x_c, y_c), the increment Δρ of ρ varies very slowly and nonlinearly. Given that we wished to obtain rotationally invariant images that retain the image features, we opted instead to describe the images in a plain polar coordinate system,

ρ = √((x − x_c)² + (y − y_c)²), θ = arctan((y − y_c)/(x − x_c)),

where ρ denotes the radial distance from the center (x_c, y_c). Taking ρ and θ along the vertical and horizontal axes, respectively, we obtain the coordinate system shown in Figure 4. Note that the origin in both coordinate systems is taken to be in the upper left corner. The benefit of this new coordinate space is that a simple scale or rotation change may be induced by directly modifying the (ρ, θ) data [31]. If an image in the Cartesian coordinate system is rotated by φ degrees, it is translated by φ horizontally in polar coordinates. Likewise, scaling corresponds to vertical translation, as shown in Figure 5.
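The rotation-to-translation property can be checked numerically. The following sketch (an illustration under our own sampling choices, not the paper's implementation) resamples an image onto a (ρ, θ) grid with nearest-neighbour interpolation and verifies that a 90-degree rotation of the Cartesian image becomes a circular shift along the θ axis of the polar image.

```python
import numpy as np

def to_polar(img, n_rho=7, n_theta=16):
    """Resample a (2c+1)x(2c+1) image onto a polar (rho, theta) grid
    with nearest-neighbour interpolation; theta runs along the horizontal axis."""
    c = (img.shape[0] - 1) // 2
    out = np.zeros((n_rho, n_theta), dtype=img.dtype)
    for k in range(n_rho):
        rho = k + 1
        for t in range(n_theta):
            theta = 2 * np.pi * t / n_theta
            y = int(np.rint(c + rho * np.sin(theta)))
            x = int(np.rint(c + rho * np.cos(theta)))
            out[k, t] = img[y, x]
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 255, size=(15, 15))

polar = to_polar(img)
polar_rot = to_polar(np.rot90(img))  # Cartesian image rotated by 90 degrees

# A 90-degree rotation in Cartesian space is a circular shift of
# one quarter of the theta axis in polar space.
print(np.array_equal(polar_rot, np.roll(polar, -4, axis=1)))  # True
```

With arbitrary angles and denser grids the correspondence is approximate rather than exact, because of resampling, but the shift structure is the same.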
Considering that feature learning and extraction in convolutional neural networks have worked well [14], convolutional neural networks were applied to extract feature vectors from the images described in both Cartesian and polar coordinates. To fuse the two types of vectors and use them as inputs for a conventional machine learning model, a classifier is needed that can integrate the two feature vectors coming from the different image coordinate descriptions. Support vector machines (SVMs) were selected as the classifiers, as they have been proven to perform well on two-class and multiclass classification tasks [23,32] and have been used successfully in the classification of plankton images [33]. An SVM finds the optimal hyperplane in the feature space, yielding the greatest separation between the positive and negative samples in the training set [34]. In addition, a certain amount of fault tolerance can be retained by using the soft-margin formulation, which improves the robustness and classification accuracy of SVMs. Considering that morphological differences exist between conspecific plankton populations from different marine regions, classification should be performed using a model with higher fault tolerance [35]. Thus, a multiclass SVM model developed by our research group is applied in this method [36]. The SVM classifier is used to classify the dataset based on a combination of the two kinds of features described in the Cartesian and polar coordinate systems.
For the features in the Cartesian and polar coordinate systems, we separately input the Cartesian coordinate images I_c and the polar coordinate images I_p into two convolutional neural networks with the same structure but different training parameters. The two networks are trained separately without interference. As the fully connected layer describes the features learned by the entire network model, we use the output of the fully connected layer of the classical convolutional neural network as the eigenvector V_c obtained from image I_c, and likewise the eigenvector V_p obtained from image I_p. By splicing the features, we obtain the fused eigenvector V = [V_c, V_p], which contains both the Cartesian coordinate features and the polar coordinate features. As shown in Figure 6, after the feature extraction process, the eigenvector V is input into the multiclass SVM to realize a global optimization. In this case, with two different kinds of features contained, the parameters of the SVM classifier are optimized separately. The process of the proposed research is shown in Figure 7. To illustrate the application of the proposed method, sample in situ images captured by the undersea imaging device and the process of region of interest (ROI) extraction and classification are shown in Figure 8.
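As a generic illustration of this fusion step (using a minimal linear soft-margin SVM trained by subgradient descent on the hinge loss, rather than the multiclass SVM of [36]), the sketch below concatenates two synthetic stand-ins for V_c and V_p and trains a classifier on the fused vector; all data and dimensions here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the two CNN eigenvectors: V_c from the
# Cartesian-coordinate network and V_p from the polar-coordinate network.
n, d = 200, 8
labels = rng.integers(0, 2, size=n) * 2 - 1              # -1 / +1
V_c = rng.normal(size=(n, d)) + 0.8 * labels[:, None]
V_p = rng.normal(size=(n, d)) + 0.8 * labels[:, None]

# Feature fusion: splice the two eigenvectors into V = [V_c, V_p].
V = np.hstack([V_c, V_p])                                # shape (200, 16)

# Minimal linear soft-margin SVM: subgradient descent on the hinge loss.
w, b, lam, lr = np.zeros(V.shape[1]), 0.0, 1e-3, 0.05
for _ in range(500):
    margins = labels * (V @ w + b)
    viol = margins < 1                                   # margin violators
    grad_w = lam * w - (labels[viol][:, None] * V[viol]).sum(axis=0) / n
    grad_b = -labels[viol].sum() / n
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = (np.sign(V @ w + b) == labels).mean()
print(f"training accuracy on fused features: {accuracy:.2f}")
```

In the paper's pipeline the inputs to this stage are the fully connected layer outputs of the two trained networks rather than random vectors; the concatenation and the SVM decision rule are the parts this sketch is meant to show.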

Dataset Used
To further examine the robustness of the algorithm, we used two datasets: the in situ plankton dataset and the CIFAR-10 dataset. The images in the plankton dataset were acquired in the Bering Sea using PlanktonScope, an in situ underwater imager [37]. The plankton images used for training and testing were extracted from the original image dataset using an ROI extraction program [38]. The samples in the training dataset included those corresponding to six planktonic taxa (arrow worm, copepod, fish larvae, jellyfish, krill, and pteropod) as well as negative samples. After obtaining the initial ROI samples, the sample count was increased using mirroring and rotation operations. The plankton dataset thus comprised seven classes. The training set comprised 2048 examples from each category, while the testing set comprised 512 examples from each category; in total, the training set contained 14,336 samples and the validation set contained 3584 samples.
For generalization, the CIFAR-10 dataset was used to measure the performance of the convolutional neural networks [24]. This dataset comprised 60,000 color images which were divided into 10 categories of 6000 images each. For this dataset, we used 50,000 images for training and 10,000 images for testing.



Experimental Procedure
To evaluate the classification accuracy and efficiency of the human-eye-based neural network training method proposed in this study, we performed several sets of comparative tests (the tests reported in Tables 1 and 2 were performed using MATLAB 2018b on a Core i3 7100 CPU with 32 GB RAM running Ubuntu 16.04). The fully connected layer is the structural layer of a neural network that can comprehensively describe the characteristics of the network samples. However, multiple fully connected layers may exist in an actual network structure [14]. Therefore, to ensure better classification accuracy, we consistently used the output of the first fully connected layer of the classical convolutional neural network as the input for the multiclass SVM model. To verify the effect of the proposed model in extracting rotation features, we also rotated the original images in steps of 30 degrees and flipped them vertically and horizontally; the resulting augmented dataset is 14 times the size of the original. Thereafter, a series of comparative tests was performed to calculate the classification accuracy, recall rate, and run time of each model using the in situ plankton image dataset (Table 1). In Models 1-5, we fine-tuned the classical convolutional neural networks using the training set, and then determined the classification accuracy and run time using the testing set. In Models 6-10, we fine-tuned the same classical convolutional neural networks corresponding to Models 1-5 using the augmented training set. In Models 11-15, we extracted feature maps from the corresponding Models 1-5 after the first fully connected layer; we then trained the multiclass SVM model using the training set and determined the classification accuracy and run time using the testing set.
In Models 16-20, we combined the features extracted from the dataset with those of the corresponding Models 11-15 after performing the polar coordinate transformation, and then calculated the classification accuracy and run time using the testing set. To validate the performance of each algorithm, we also performed the same tests using the CIFAR-10 dataset (Table 2).
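The augmentation factor of 14 follows from twelve 30-degree rotations (including the identity) plus the two flips. A small numpy sketch (our own nearest-neighbour rotation, not the code used in the study) makes the count concrete:

```python
import numpy as np

def rotate_nn(img, deg):
    """Nearest-neighbour rotation about the image centre (inverse mapping)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    rad = np.deg2rad(deg)
    out = np.zeros_like(img)
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse-rotate each output pixel back into the source image
    sx = np.rint(cx + (xs - cx) * np.cos(rad) + (ys - cy) * np.sin(rad)).astype(int)
    sy = np.rint(cy - (xs - cx) * np.sin(rad) + (ys - cy) * np.cos(rad)).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out[valid] = img[sy[valid], sx[valid]]
    return out

img = np.arange(64.).reshape(8, 8)  # stand-in for one ROI sample

augmented = [rotate_nn(img, deg) for deg in range(0, 360, 30)]  # 12 rotations (incl. 0)
augmented += [np.flipud(img), np.fliplr(img)]                   # 2 flips
print(len(augmented))  # 14 variants per original sample
```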

Analysis of Evaluation Results
In this study, we analyzed the performance and pertinent parameters of four different classification methods on the CIFAR-10 and in situ plankton image datasets. The results are shown in Figure 9. The methods were: a classical convolutional neural network (Method I); a classical convolutional neural network trained on the augmented dataset (Method II); a classical convolutional neural network with ordinary features combined with an SVM classifier (Method III); and a classical convolutional neural network with both ordinary and polar coordinate features combined with an SVM classifier (Method IV). The feasibility of the proposed method is demonstrated by the results obtained on the public CIFAR-10 dataset, on which Method IV performed best among all the methods: its classification accuracy and recall rate were the highest. Method III exhibited the next highest performance, followed by Method II and Method I. Moreover, the classification accuracy and recall rate could be improved by partially replacing the fully connected layer and Softmax classification layer of the convolutional neural network with the multiclass SVM model. Furthermore, the use of polar coordinates together with the multiclass SVM classifier significantly improved the classification accuracy and recall rate; the improvement was in the range of approximately 2-5%, which is significant for classification methods. The polar coordinate features allow the conversion of rotational characteristics into translational ones. Both the rotated and the ordinary features were then learned by the optimized convolutional neural network, increasing the generalization ability of the proposed model.
Compared with Method I, Method II uses an augmented dataset to enhance generalization under rotation; however, the improvement is limited, and although the augmented dataset contains 14 times as much data as the original, the training time also increases greatly. For the proposed method, the improvement is much more significant, demonstrating its effectiveness in feature extraction and combination. In general, the same conclusion holds for both the open dataset (CIFAR-10) and the plankton dataset, which shows the method's robustness across datasets.
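For reference, the two metrics compared above can be computed from a confusion matrix as follows (a generic sketch with toy labels; "recall rate" is taken here as macro-averaged recall, which is an assumption on our part):

```python
import numpy as np

def accuracy_and_macro_recall(y_true, y_pred, n_classes):
    """Accuracy and macro-averaged recall from true/predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                  # rows: true, cols: predicted
    accuracy = np.trace(cm) / cm.sum()
    recall_per_class = np.diag(cm) / cm.sum(axis=1)    # TP / (TP + FN) per class
    return accuracy, recall_per_class.mean()

# toy example with 3 classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 2, 0, 2]
acc, rec = accuracy_and_macro_recall(y_true, y_pred, 3)
print(f"accuracy={acc:.2f}  macro recall={rec:.2f}")  # accuracy=0.80  macro recall=0.77
```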


Discussion
The introduction of polar coordinate features allowed the conversion of rotational features into translational features. As both the rotated features and the ordinary features were then learned by the convolutional neural network, the generalization ability of the model increased. Moreover, the model is suitable for multi-pose and multi-angle images, which are very common in natural environments. The images in the plankton dataset were collected in the ocean with no fixed camera angle, so the creatures in the images appear in a variety of postures and orientations. Consequently, more rotational features are needed to describe such images well. By using polar coordinates to combine common features and rotational features, better results can be achieved for images without specific orientations. Thus, the proposed method is applicable to the undersea imager receiving optical signals of tiny creatures in the undersea observation network (shown in Figure 10). It may also perform well on images captured by self-driving vehicles and unmanned aerial vehicles (UAVs); related work can be carried out in the future.

Summary and Outlook
We developed a training method for convolutional neural networks designed for the recognition and classification of plankton by applying the mechanism of the human eye, that is, the polar coordinate system. For the classification and identification of in situ plankton images, each image was described in the polar coordinate system as well as the Cartesian coordinate system. This enabled the construction of an optimized classification and recognition network for in situ plankton images, built by exploiting the advantages of both coordinate systems and automatically adjusting the weights of the two eigenvectors during network training. The model was trained using 14,336 ROIs and tested using 3584 ROIs, both from the in situ plankton images. The DenseNet201 + Polar + SVM model exhibited the highest classification accuracy (97.989%) and recall rate (97.986%) in the comparative tests. The accuracy of the proposed model was markedly higher than that of the initial classical convolutional neural networks on the in situ plankton image data, with classification accuracy and recall rate increasing by 5.3% and 5.1%, respectively. In addition, the proposed training method considerably improves classification performance on the public CIFAR-10 dataset, which consists of 10 categories with 50,000 training samples and 10,000 test samples. In this case, the DenseNet201 + Polar + SVM model showed the highest classification accuracy (94.91%) as well as the highest recall rate (94.76%).
One shortcoming of the proposed method is that it requires an extra neural network, specifically for the polar-transformed images; thus, it requires extra time in network training. However, as the structures of these two networks are similar, their hyper-parameters such as learning rate and their training epoch are the same. This will extract more features, and increases the accuracy as well as recall rate, and possesses a controllable efficient model. Another point to be noted is the resizing operation in the model. The images used as the network inputs were standardized to ensure that they had a uniform size. Thereafter, they are classified and recognized, using the convolutional neural

Summary and Outlook
We developed a training method for convolutional neural networks designed for the recognition and classification of plankton by applying the mechanism of the human eye, namely the polar coordinate system. For the classification and identification of in situ plankton images, each image was described in the polar coordinate system as well as the Cartesian coordinate system. This enabled the construction of an optimized classification and recognition network for in situ plankton images, built by exploiting the advantages of both coordinate systems and automatically adjusting the weights of the two eigenvectors during network training. The model was trained on 14,336 ROIs and tested on 3,584 ROIs, both taken from the in situ plankton images. The DenseNet201 + Polar + SVM model exhibited the highest classification accuracy (97.989%) and recall rate (97.986%) in the comparative tests. The accuracy of the proposed model was markedly higher than those of the initial classical convolutional neural networks on the in situ plankton image data, with the increases in classification accuracy and recall rate being 5.3% and 5.1%, respectively. In addition, the proposed training method improves classification performance considerably on the public CIFAR-10 dataset, which consists of 10 categories with 50,000 training samples and 10,000 test samples. In this case, the DenseNet201 + Polar + SVM model also showed the highest classification accuracy (94.91%) and recall rate (94.76%).
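The polar description underlying this approach can be illustrated as a resampling of the image onto an (r, θ) grid centred on the target, so that a rotation of the input becomes a circular shift along the θ axis, which a translation-equivariant CNN can then exploit. The following is a minimal nearest-neighbour sketch for this purpose, not the paper's implementation; the grid sizes `n_r` and `n_theta` are illustrative choices.

```python
import numpy as np

def to_polar(img, n_r=64, n_theta=64):
    """Resample a square grayscale image onto a polar (r, theta) grid.

    Nearest-neighbour sampling about the image centre; a rotation of the
    input appears (up to sampling effects) as a circular shift of the
    output along the theta axis.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.linspace(0.0, min(cx, cy), n_r)
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    ys = np.clip(np.rint(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    xs = np.clip(np.rint(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    return img[ys, xs]

# A 90-degree rotation of the input corresponds to a shift of a quarter
# of the theta bins in the polar image.
img = np.arange(33 * 33, dtype=float).reshape(33, 33)
polar = to_polar(img)
polar_rot = to_polar(np.rot90(img))
```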
One shortcoming of the proposed method is that it requires an extra neural network specifically for the polar-transformed images, and thus extra time for network training. However, as the structures of the two networks are similar, their hyper-parameters, such as the learning rate and the number of training epochs, are the same. The second network extracts additional features, which increases the accuracy and recall rate while keeping the model's efficiency controllable. Another point to note is the resizing operation in the model. The images used as network inputs were standardized to a uniform size before being classified and recognized by the convolutional neural network. This operation inevitably affects the morphological characteristics of the targets in the images. In future studies, a more optimal combination of the two networks should be investigated to form a unified training and testing structure; concretely, we will focus on extracting the two types of features with a single convolutional neural network. In addition, the feature fusion step currently has to be completed with an SVM. We hope to simplify this by moving the feature fusion into the convolutional neural network through a specialized module. Finally, we will embed the proposed model into an undersea imager or a UAV imager to achieve real-time image acquisition and classification.
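The fusion-and-classification step discussed above, concatenating the two eigenvectors and feeding them to an SVM, can be sketched as follows. The feature matrices here are random stand-ins for the DenseNet201 embeddings of the Cartesian and polar images (the shapes and the artificial class separation are purely illustrative, not the paper's data).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two CNN embeddings: in the actual
# pipeline these come from DenseNet201 applied to the Cartesian and
# polar-transformed images of the same ROIs.
cartesian_feats = rng.normal(size=(200, 32))
polar_feats = rng.normal(size=(200, 32))
labels = (rng.random(200) > 0.5).astype(int)
# Shift one class so this toy problem is learnable.
cartesian_feats[labels == 1] += 3.0

# Fuse the two feature vectors by concatenation, then classify with an
# SVM, as in the "DenseNet201 + Polar + SVM" model.
fused = np.concatenate([cartesian_feats, polar_feats], axis=1)
clf = SVC(kernel="rbf").fit(fused, labels)
train_acc = clf.score(fused, labels)
```

A learned weighting of the two eigenvectors, as described above, would replace the plain concatenation here; the SVM then operates on the weighted, fused vector in the same way.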