LPNet: Retina Inspired Neural Network for Object Detection and Recognition

The detection of rotated objects is a meaningful and challenging research problem. Although state-of-the-art deep learning models, especially convolutional neural networks (CNNs), exhibit feature invariance, their architectures were not specifically designed for rotation invariance; they only slightly compensate for it through pooling layers. In this study, we propose a novel network, named LPNet, to address the problem of object rotation. LPNet improves detection accuracy by incorporating a retina-like log-polar transformation. Furthermore, LPNet is a plug-and-play architecture for object detection and recognition. It consists of two parts, which we name the encoder and the decoder. The encoder extracts image features in log-polar coordinates, while the decoder suppresses image noise in cartesian coordinates. Moreover, according to the movement of the center point, LPNet has a stable mode and a sliding mode. LPNet takes the single-shot multibox detector (SSD) network as the baseline network and the visual geometry group (VGG16) network as the feature extraction backbone. The experimental results show that, compared with the conventional SSD network, the mean average precision (mAP) of LPNet increases by 3.4% for regular objects and by 17.6% for rotated objects.


Introduction
In recent years, deep learning has played an important role in many areas, such as image processing [1][2][3], object detection [4][5][6], optical imaging [7][8][9], and speech recognition [10,11]. In object detection and recognition especially, the accuracy of deep learning models has become increasingly important [12]. However, because of their dependence on datasets, deep learning networks have limitations for special objects, such as rotated objects. Rotation-invariant features are one of the keys to handling targets with variable orientation. Remote sensing images, for example, make it difficult for conventional convolutional neural network models to achieve robust detection.
For rotated objects, feature mismatching is a challenging problem [13] that reduces the final recognition accuracy. A large number of deep learning methods address such problems [14,15], and some works solve them with a retina-like mechanism [16][17][18]. In the human visual system, the photoreceptor cells of the retina become more sparsely distributed as the distance from the fovea increases. This structural characteristic gives the retina high imaging resolution in the vicinity of the fovea and low imaging resolution in the peripheral area [19]. Between the retina and the visual cortex, a log-polar coordinate mapping relationship is established [20], which brings the advantages of robustness to rotation and scale transformation [21][22][23][24].
Based on these characteristics of human retinal imaging, the retina-like mechanism is used in various fields [25][26][27]. For example, in the task of object search and detection in large-field-of-view, high-resolution scenes, the data are compressed using variable-resolution imaging in log-polar coordinates. The retina-like mechanism is also widely used in high-speed rotation navigation and guidance [28]. Similarly, numerous studies use the retina-like log-polar transformation in deep learning classification tasks [29][30][31]. However, for object detection and recognition, few works combine the log-polar transformation with convolutional neural networks (CNNs) [32]. Building on these methods, we propose a log-polar coordinate feature extraction network, named LPNet. Inspired by the principle of human retinal imaging, we introduce the log-polar transformation, which overcomes the rotation problem, into the CNN. We also incorporate the variable-resolution characteristics of the human eye into LPNet, which effectively increases the robustness of the network.
Several works address spatial invariance in image processing using CNNs [33][34][35]. In [36], the log-polar transformation was brought into deep learning through the polar transformer network (PTN), which realizes invariance to translation and equivariance to rotation and dilation in the polar coordinate system. However, PTN only handles global deformation. In [37], the results show that CNNs with log-polar operations fit several rotation angles better on all tested datasets; however, only rotation transformations are performed, and the experiments are carried out under different rotations. In [38], the log-polar transformation is combined with an attention mechanism, achieving several improvements over the original basis; however, large-scale experimental verification is lacking, and the work focuses only on classification tasks. In [39], a log-polar transformation is applied as a pre-processing step to a classification CNN, reducing the required image size and improving the handling of image rotation and scaling perturbations; however, this method is evaluated only on the MNIST classification dataset, and object detection and recognition on other datasets are not discussed. These works show that the log-polar transformation is gradually establishing its role in deep learning, but it is rarely applied to object detection and recognition networks.
The rest of this paper is organized as follows. Section 2 introduces the preliminary background knowledge. Section 3 describes the proposed method in detail. Section 4 presents the experimental settings. Finally, Section 5 reports the general conclusions and suggests future research directions.

Log-Polar Transformation
For an image, if we select the coordinate origin O(0, 0), the position of a sampled pixel can be expressed both in cartesian coordinates (x, y) and in log-polar coordinates (r, θ) [40][41][42][43]. Equations (1) and (2) give the conversion between the two coordinate systems:

r = √(x² + y²), (1)

θ = arctan(y/x). (2)
In the cartesian coordinate system, a pixel position can be written as a complex number z. We define the log-polar coordinate w as shown in Equation (3):

w = ln z. (3)

The cartesian coordinate is represented by the value z = x + iy and the log-polar coordinate by the value w = ξ + iη, where i is the imaginary unit:

z = x + iy = r(cos θ + i sin θ) = re^(iθ), (4)

w = ln z = ln r + iθ. (5)

The log-polar coordinates ξ and η are therefore given by Equations (6) and (7):

ξ = ln r, (6)

η = θ. (7)
According to the above equations, scaling and rotation of an object act on the log-polar coordinate w as simple translations. If the object is magnified by a factor τ about the coordinate origin and rotated by an angle α, its polar coordinate changes from (r, θ) to (τr, θ + α). After the log-polar transformation, the mapping is given by Equations (8)-(10):

ξ' = ln(τr) = ξ + ln τ, (8)

η' = θ + α = η + α, (9)

w' = w + ln τ + iα. (10)
Thus, if the target changes in scale in cartesian coordinates, the target in log-polar coordinates is displaced along the radial axis [37]. A rotation of the target in cartesian coordinates corresponds to a displacement along the angular axis in log-polar space, with 2π as the period of the displacement. The log-polar transformation therefore has scale and rotation invariance [44], which holds when the center of the transformed target coincides with the coordinate origin [45].
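The scale-and-rotation-to-translation property above can be checked numerically. The following is a minimal sketch (plain Python; the function names are ours, not from the paper):

```python
import math

def to_log_polar(x, y):
    """Map a cartesian point (x, y) to log-polar (xi, eta) = (ln r, theta)."""
    r = math.hypot(x, y)
    return math.log(r), math.atan2(y, x)

def from_log_polar(xi, eta):
    """Inverse map: (ln r, theta) back to cartesian (x, y)."""
    r = math.exp(xi)
    return r * math.cos(eta), r * math.sin(eta)

# Scaling by tau and rotating by alpha is a pure translation in log-polar
# space: (xi, eta) -> (xi + ln tau, eta + alpha).
tau, alpha = 2.0, math.pi / 2          # magnify 2x, rotate 90 degrees
xi0, eta0 = to_log_polar(1.0, 0.0)     # original point (1, 0)
xi1, eta1 = to_log_polar(tau * math.cos(alpha), tau * math.sin(alpha))
```

Here xi1 - xi0 recovers ln τ and eta1 - eta0 recovers α, i.e. scaling and rotation have become translations along the two log-polar axes.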
According to these characteristics, we introduce the log-polar transformation into an object detection and recognition network to alleviate the influence of rotated targets on accuracy. When an image is rotated or transformed, the same object produces different feature maps after CNN extraction, so the image becomes inconsistent with the results of conventional network training. To solve this loss of accuracy caused by image rotation, we apply the log-polar transformation to the feature maps during the encoding process. After the convolution operation, an inverse log-polar transformation is applied, which we call the decoding process of the CNN.
As shown in Figure 1a-d, when the target in the image is rotated by 90°, the features of the encoded image remain unchanged. Comparing Figure 1b,c, the encoded feature maps show that the rotation only produces a translation; the distribution of the feature maps remains unchanged. For CNNs, this slight change does not cause a significant drop in accuracy.

SSD300
The baseline network of the proposed LPNet is the single-shot multibox detector (SSD). SSD is an object detection network proposed by Liu et al. [46] and is one of the popular detection frameworks. Figure 2 shows the architecture of SSD300. Compared with the faster region-based convolutional network (faster R-CNN) [47], the SSD network offers a considerable speed improvement; compared with the you only look once (YOLO) network [48], SSD has a clear advantage in accuracy. SSD uses VGG16 as the backbone network, and the resulting multi-scale feature maps are used for object detection. When locating objects, large feature maps are used for relatively small objects, while small feature maps are used for large ones. SSD draws on YOLO but differs from it: it uses convolution to detect on feature maps of different scales and, at the same time, borrows the idea of anchors from faster R-CNN.
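As a reference point for this multi-scale design, the SSD paper spaces the default-box scales of its m detection feature maps linearly between a minimum and a maximum scale. A sketch (the defaults s_min = 0.2 and s_max = 0.9 follow the common SSD setting; the function name is ours):

```python
def ssd_scales(m, s_min=0.2, s_max=0.9):
    """Default-box scale for each of the m detection feature maps,
    linearly spaced between s_min and s_max as in the SSD paper."""
    if m < 2:
        raise ValueError("need at least two feature maps")
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
```

For the six detection maps of SSD300, ssd_scales(6) gives scales rising in equal steps from 0.2 (smallest objects, largest feature map) to 0.9 (largest objects, smallest feature map).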
The speed advantage of SSD lies in the fact that the algorithm is implemented as a feedforward architecture; all calculations run in an end-to-end single pass. For a single input image, SSD generates multiple fixed-size bounding boxes with scores for each object category, and a non-maximum suppression (NMS) step then produces the final prediction, so the detection speed is significantly improved. The first half of the network is the basic network, mainly used for image classification. The second half is a set of multi-scale convolutional layers whose sizes are reduced layer by layer, mainly used for the extraction and detection of object features at multiple scales.
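The NMS step mentioned above can be sketched as a greedy loop over score-sorted boxes; this is an illustration of the standard algorithm, not SSD's actual implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    every remaining box that overlaps it by at least iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Each kept box suppresses its strongly overlapping, lower-scoring duplicates, which is how the many fixed-size candidate boxes are reduced to one prediction per object.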

Architecture Overview
In this section, we extract feature maps from the backbone network and transform them to log-polar coordinates. The baseline framework is the SSD300 network and the feature extraction backbone network is VGG16. Figure 3 shows the structure of the proposed LP layer.
We regard the module that combines convolution and the log-polar transformation in feature extraction as the encoder, and the module that combines the convolution unit and the inverse log-polar transformation as the decoder. The fraction of feature maps in each layer that undergo a log-polar transformation is defined as the observation factor (OF), which affects the overall network performance; we discuss this in Section 4.
In the encoder module, the usual cartesian coordinates are converted to log-polar coordinates. This transformation helps the network extract the feature information of the feature map in log-polar coordinates, so that when an object is rotated or scaled, the network error decreases effectively. Equations (11) and (12) show the calculations of the conventional CNN and the encoder module, respectively. However, this transformation also raises a problem: after the encoder module, all information extracted by the CNN is in log-polar coordinates, and the feature information of the image itself cannot be extracted further. To solve this, we introduce the inverse log-polar module. In the encoder module, the input feature maps have a normal, uniform pixel distribution. After the log-polar transform, the feature maps are resampled with a non-uniform distribution. Through the inverse log-polar transform, we obtain the "inverse log-polar feature map", whose pixels are sampled like the human eye: high resolution in the center and low resolution at the periphery.
g(x_o, y_o) = Σ_x Σ_y f(x, y) h(x_o - x, y_o - y), (11)

g(r_o, θ_o) = Σ_r Σ_θ f(r, θ) h(r_o - r, θ_o - θ), (12)

where g(x_o, y_o) is the output feature map in cartesian coordinates, g(r_o, θ_o) is the output feature map in log-polar coordinates, f(·) is the input feature map, and h(·) is the convolution kernel.
In this module, we convert the feature map from cartesian coordinates to log-polar coordinates. Feature maps in log-polar coordinates have characteristics that cartesian ones lack, such as rotation invariance. We show this transformation in Figure 4.
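The encoder transformation can be sketched as resampling the feature map onto a log-polar grid. The following NumPy sketch uses nearest-neighbour sampling and a log-spaced radius; the paper does not give implementation details, so the grid spacing and sampling scheme here are our assumptions:

```python
import numpy as np

def log_polar_encode(fmap, n_r, n_theta):
    """Resample a 2D feature map onto an (n_r, n_theta) log-polar grid
    centred on the map. Nearest-neighbour sampling keeps the sketch short;
    bilinear sampling would be the usual choice in a real layer."""
    h, w = fmap.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = np.hypot(cy, cx)
    # log-spaced radii (0 .. r_max) and uniformly spaced angles (0 .. 2*pi)
    rs = np.exp(np.linspace(0.0, np.log(r_max + 1.0), n_r)) - 1.0
    ts = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(rs, ts, indexing="ij")
    ys = np.clip(np.round(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    return fmap[ys, xs]
```

Rotating the input by 90° shifts the encoded map along the angular axis by a quarter of the θ samples, which is the translation behaviour illustrated in Figure 1.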
In the decoder module, we add an inverse transformation of the feature map after the log-polar transformation. The log-polar coordinate information is converted into the image information in cartesian coordinates to facilitate the subsequent feature extraction of the CNN.
This operation is the opposite of the transformation in Figure 4. We rearrange the pixels in log-polar coordinates on a new feature map. While the feature points of log-polar coordinates and the feature points under the cartesian coordinates are not in a one-to-one correspondence, we dropped some points on the original feature map. After the decoding, the points of interest in the feature map show high resolution while other regions appear in low resolution. This result is also similar to the observation characteristic of the human eye, which is non-uniform imaging. This characteristic also alleviates the influence of noise points during the convolution operation. Thus, the interference of redundant features on the network accuracy is suppressed.
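The decoder can be sketched as the reverse lookup: every cartesian output pixel reads its nearest (ln r, θ) bin, so central pixels get fine bins while peripheral pixels share coarse ones, giving the retina-like non-uniform resolution described above. As with the encoder, the log-spaced radius grid is our assumption:

```python
import numpy as np

def log_polar_decode(lp_map, h, w):
    """Map an (n_r, n_theta) log-polar feature map back onto an h x w
    cartesian grid by nearest-bin lookup."""
    n_r, n_theta = lp_map.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = np.hypot(cy, cx)
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot(ys - cy, xs - cx)
    t = np.arctan2(ys - cy, xs - cx) % (2 * np.pi)
    # invert the log-spaced radius grid assumed for the encoder
    ri = np.round(np.log(r + 1.0) / np.log(r_max + 1.0) * (n_r - 1)).astype(int)
    ri = np.clip(ri, 0, n_r - 1)
    ti = np.round(t / (2 * np.pi) * n_theta).astype(int) % n_theta
    return lp_map[ri, ti]
```

Because several peripheral pixels map to the same coarse bin, fine detail far from the center is dropped, which is exactly the noise-suppressing, fovea-centred behaviour described in the text.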

Sliding LPNet
Considering that the human eye extracts features by scanning, the starting point of the high-resolution field in stable LPNet is the center of the image. However, the area of interest is not always in the center, and objects may be scattered anywhere in an image. Therefore, in this section, we propose a scanning method for selecting the gaze center of the feature map.
We divide the feature map into small cells according to a size ratio. Before the log-polar transformation is performed on the feature maps of each layer, the center of a small cell is used as the center of the log-polar transformation of the entire feature map. The position of the center point relative to the entire feature map is selected sequentially, in a sliding-window fashion. Figure 5 shows the splitting of the feature map into small cells, whose size determines the center coordinate of each cell. The selected center point is used as the basis for the next feature map encoding. The feature map is cut into λ × λ grids, each small grid of size h × h, where the size of the original image is H × W. In Figure 5, we take H = W = 300 and h = 100 as an example and take the center point of each small grid as the center point of each non-uniform sampling. According to the setting of the OF value, the number of feature maps that need to be encoded and decoded is calculated as L. Center points are then selected for the L feature maps in turn, calculated from the coordinates of the cut cells. Table 1 shows the obtained coordinates.
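The cell centers used as sliding gaze points can be computed directly. A minimal sketch (the helper name is ours); for H = W = 300 and h = 100, it yields the λ × λ = 3 × 3 = 9 centers in sliding-window order:

```python
def sliding_centers(H, W, h):
    """Centres of the h x h cells tiling an H x W feature map, in row-major
    (sliding-window) order; each centre serves in turn as the log-polar
    origin for one encoded feature map."""
    return [(top + h // 2, left + h // 2)
            for top in range(0, H, h)
            for left in range(0, W, h)]
```

sliding_centers(300, 300, 100) starts at (50, 50) and ends at (250, 250); each of the L encoded feature maps can then take the next center in turn, e.g. centers[i % 9].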

Experimental Results and Analysis
In this section, we take the SSD network as the baseline framework. According to the experimental results, the accuracy improvements should also carry over to other networks.

Experiment Setup
In the object detection and recognition experiments, the default hyper-parameters are as follows: the number of training steps is 120,000; the training and testing batch sizes are 32 and 64, respectively; a polynomial-decay learning rate schedule is adopted with an initial learning rate of 0.1; the warm-up step count is 1000; and the momentum and weight decay are 0.9 and 0.005, respectively. All of our LPNet experiments use the same hyper-parameters as the default setting. All experiments are trained on two 2080Ti GPUs.
We trained LPNet on the pattern analysis, statistical modelling, and computational learning (PASCAL) VOC 2007 and PASCAL VOC 2012 datasets. The main goal of the PASCAL VOC datasets is to recognize objects in images. The datasets contain 20 types of objects, and each image is labeled. The labeled objects include people, animals (such as cats, dogs, birds, etc.), vehicles (such as cars, boats, airplanes, etc.), and furniture (such as chairs, tables, sofas, etc.). Each image has 2~4 objects on average. All the annotated images have the labels needed for object detection and recognition, but only some have segmentation labels. The VOC2007 dataset contains 9963 annotated images, divided into train, val, and test splits, with a total of 24,640 labeled objects. The VOC2012 dataset is an upgraded version of the VOC2007 dataset, with a total of 11,530 images. For object detection, the trainval/test splits of VOC2012 contain all the corresponding images from 2008 to 2011; the trainval split has 11,540 images and a total of 27,450 objects. For the segmentation task, the trainval split of VOC2012 contains all the corresponding images from 2007 to 2011, and the test split only contains images from 2008 to 2011; this trainval split has 2913 images with 6929 objects. The VOC2012 dataset is divided into 20 categories, or 21 including the background class.

Results
We carried out experiments on the PASCAL VOC2007 and PASCAL VOC2012 datasets and divided the log-polar transformation (LPT) mode into two types, sliding and stable. During network training, VOC2007 + VOC2012 is used as the training set, and testing is done on the VOC2007 test set. Table 2 shows the test results. In this experiment, the performance indicators are mainly compared with SSD and similar networks; compared with higher-performance networks that appeared after SSD, the experimental results support the same inference. Table 2 shows that, on the VOC2007 and VOC2012 datasets, both modes of LPNet achieve higher accuracy than SSD300, faster R-CNN, and fast R-CNN [49].
Compared with the fast R-CNN algorithm, the mAP of LPNet improves by 7.6%. This improvement comes from the fact that LPNet effectively reduces the influence of noise on the input feature map. Moreover, the detection and recognition accuracy of LPNet improves for objects of different sizes. Consistent with the input image size of SSD300, both modes of LPNet use 300×300 as the network input size. Figure 6 shows the loss and mAP of the two modes during network training. Figure 7 shows that sliding-mode LPNet trains more steadily than stable-mode LPNet and that its mAP is improved.

OF Impact
The OF value represents the intensity of the retina-like mechanism applied to the CNN. This parameter also determines the number of feature maps that receive focus. To study the influence of OF on overall network performance, we select a range of OF values and, for the two modes of LPNet, perform the mAP test on the VOC2007 + VOC2012 dataset. Table 3 shows the final model accuracy. As the value of OF increases, the network noise is relatively suppressed, which leads to a certain degree of improvement in accuracy: over the chosen range of OF values, the network accuracy increases accordingly. When OF = 1, which we call the full observation state, the corresponding network accuracy is the highest.
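The role of OF can be sketched as a per-layer channel split. The paper only fixes the ratio; which particular feature maps are routed through the log-polar branch is our assumption here:

```python
def split_channels(n_channels, of):
    """Split a layer's feature maps by the observation factor OF: the first
    round(OF * n_channels) maps go through the log-polar encode/decode
    branch, the rest stay in cartesian coordinates."""
    if not 0.0 <= of <= 1.0:
        raise ValueError("OF must lie in [0, 1]")
    n_lp = round(n_channels * of)
    return list(range(n_lp)), list(range(n_lp, n_channels))
```

With OF = 1 (the full observation state), every feature map passes through the encode/decode branch, which matches the highest-accuracy setting in Table 3.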

Dataset Rotation
Figure 1 shows the effect of the object at different rotation angles: the arrangement of the feature maps corresponding to differently rotated targets is the same, which effectively reduces the influence of rotation on network accuracy. When trained on the encoded feature maps, the network has the characteristics not only of the cartesian coordinates but also of the log-polar coordinates, which enhances its robustness.
In this section, an overall rotation transformation is applied to the VOC2007 and VOC2012 datasets to test the anti-rotation performance of LPNet. For these two datasets, training and testing are carried out with rotations of 45°, 90°, and 180°. The original SSD network is also tested under the same experimental conditions. Figure 7 shows the final experimental results. As the images in the dataset are rotated by a certain angle, the accuracy of target detection decreases; however, LPNet reduces the impact of this rotation on accuracy. We compare the accuracy in Figure 8a-d under the condition OF = 1. As the image rotates, the mAP of SSD300 (the blue straight line) decreases by about 13% at most. Compared with SSD300, the mAP of LPNet increases by up to 15.6%. Although rotation also reduces the mAP of LPNet, the drop is no more than 2%. Comparing Figure 8e-h, we can also see that, compared with stable LPNet, sliding LPNet is more suitable for rotated object detection. Figure 8 shows that, on the normal dataset (without rotation processing), the SSD network reaches an accuracy of 74.3%; under a certain rotation angle, however, its adaptability decreases and the highest accuracy becomes 61.32%. LPNet, with the log-polar transformation, maintains the original accuracy without a decrease, and the network shows strong robustness. On the normal dataset, the accuracy of stable LPNet is slightly higher than that of sliding LPNet; for rotated targets, however, sliding LPNet is more adaptable. Because its center point is constantly moving, sliding LPNet captures the characteristics of the various forms of a rotating target better than stable LPNet. Figure 9 shows some example images, rotated by several angles and detected by LPNet.

Conclusions and Future Works
In this study, we introduce the log-polar transformation into a deep learning object detection and recognition network, which effectively reduces the influence of object rotation on detection accuracy. However, further research is still necessary. This study selects relatively reasonable parameters, but we did not experiment with the selection of the step size λ, which must be studied and discussed in future work. We also did not conduct training and testing on large datasets, such as COCO. The network recall and other metrics also need testing in combination with different network frameworks.
The log-polar transformation is a method with a retina-like mechanism. This study uses LPNet to alleviate the influence of object rotation and enhance the robustness of CNNs. However, this mechanism could also be used more widely in other aspects of deep learning object detection and recognition models. The human eye has the advantages of variable resolution and scale transformation, which correspond to the attention mechanisms of object detection and recognition networks. This trend may make CNNs simpler and more human-like in the future.