Deep Learning Scene Recognition Method Based on Localization Enhancement

With the rapid development of indoor localization in recent years; signals of opportunity have become a reliable and convenient source for indoor localization. The mobile device cannot only capture images of the indoor environment in real-time, but can also obtain one or more different types of signals of opportunity as well. Based on this, we design a convolutional neural network (CNN) model that concatenates features of image data and signals of opportunity for localization by using indoor scene datasets and simulating the situation of indoor location probability. Using the method of transfer learning on the Inception V3 network model feature information is added to assist in scene recognition. The experimental result shows that, for two different experiment sceneries, the accuracies of the prediction results are 97.0% and 96.6% using the proposed model, compared to 69.0% and 81.2% by the method of overlapping positioning information and the base map, and compared to 73.3% and 77.7% by using the fine-tuned Inception V3 model. The accuracy of indoor scene recognition is improved; in particular, the error rate at the spatial connection of different scenes is decreased, and the recognition rate of similar scenes is increased.


Introduction
Innovative positioning technologies and wireless networks promote the development of the indoor positioning system [1,2]. At the same time, with the improvement of hardware and developments in behavior calculation, more and more intelligent devices are being used in scientific research and real-world scenery [1]. The acquisition of real-time location information on mobile devices has become an indispensable element of intelligent devices [3]. Since the mobile terminal contains built-in physical devices, such as vision and multi-sensors, it has become a new type of scene perception [4] and communication [5] platform for image acquisition and decision-making. Mobile terminals can provide basic communication information and perception for positioning and scene recognition technology [6]. In indoor positioning, which uses the mobile terminal as the development platform, Signals of Opportunity (SoOP) [7], such as Wi-Fi or Bluetooth signals, are not specifically built to be positioning signals [8], but these signals have become a widely-researched topic of indoor positioning, because they are easy to build up and are cost-effective. Chen et al. [3] achieved positioning accuracy of up to 2-5 m by using the fusion signal of the built-in sensor, Wi-Fi, and magnetic field. This indicates that SoOP can work in combination with other positioning sources to provide accurate, reliable, and real-time positioning.
However, current research is mainly focused on the aspects of implementation or auxiliary positioning [9]. There is insufficient usage of the relationship between positioning information and the scenery semantic, for example, when users are in an indoor environment holding their mobile

Related Work
There is a large number of studies related to scene recognition and to deal with this problem, large datasets have emerged. In order to understand the scene-centered way of human recognition, reference applied the gist of a scene for improving scene image recognition by the mean of global image features [20]. A popular way of bridging the semantic gap in scene recognition is semantic modeling, in [21], a manifold regularized deep architecture is designed to get good performance. In fact, data is also an important direction in scene recognition tasks. In the case of sufficient data, even the simplest model algorithm can be used and get a pretty good classification result. There is a diverse set of 67 indoor scenes for indoor scene classification, which contains 15,620 images [12]. Based on deep learning, the places [22] provide tremendous indoor and outdoor scene images and the reference trained models for learning deep features of scene-centered images. The Places dataset contains seven million tagged images, a total of 476 categories, are the largest image dataset including number of scenes and locations. Additionally, the most famous database named ImageNet [23], hierarchical and diversified, is widely used, which can be applied to object recognition, image classification, and automatic target clustering, etc. Due to these large databases, ImageNet-trained deep features can actually have success in different types of work [24]. Likewise, our work is to transfer learning the well pre-trained model based on large Imagenet datasets and then feed the model with our own dataset for the purpose of tackling the problems of our work.
Currently, some work of computer vision has realized the importance of utilizing geographic information. One focus is on the research of visual location. In [25], the image geolocalization can be greatly extended by the proposed cross-view feature translation method, which means to enrich attributes of geo-tagged imagery. In addition to the image itself, the methods [26] consider the digital elevation models with target mountain terrain to exploit the visual location. However, the other focus, putting geographic information into visual tasks, like image classification, is more similar to our study. In [27], to estimate priors of spatio-temporal classes at irregular and biased locations significantly improves performance. In [28], the encoding of GPS coordinates are extracted and merged into a convolutional neural network for image classification and it achieves improvement in the mean average precision. In [29], a geographical information system (GIS) assists in object detection, which computes GIS priors of the visible objects in the image to detect the final bounding boxes. In [30], the problem of object recognition is tracked with image appearance and geo-services positioning information on mobile devices. In [31], through picture metadata, the method uses the nonvisual context information (season and location) to achieve scene understanding in customer photos. In [32], they deeply study three spatial contexts to improve image classification.
Compared with previous approaches, this paper explicitly exploits the combination of positioning information and image information to improve scene recognition. We are interested in the performance of two types' combination. By transfer learning, we learn and extract image features; by simulative positioning error, we design two positioning features representing the scene attribute. We assess two scene places' recognition accuracy in different categories under four schemes. This can provide different situations which improve the scene recognition.

Feature Fusion Algorithm
Indoor scenes are different from outdoor scenes, which have a wide range of background elements with relatively stable features. Based on the convolutional neural network [16], to obtain more accurate scene prediction results, we only need to extract the most prominent and typical background information from the outdoor scenery. The indoor scene has a large number of foreground objects, and even for the same kind of scenery, due to the differences in interior decoration, style, and other factors, the features are very diverse. Therefore, it is difficult to obtain high recognition accuracy by using the convolutional neural network with limited typical indoor scenery image for training. When considering real-time images captured by mobile devices, due to the high randomness and disturbance in the shooting angle, focus, and exposure (as Figure 1 shows), the captured image will have a large amount of random noise, and its features are much weaker than those of a typical indoor image, which makes it difficult for convolutional neural networks trained with typical indoor scene images to extract prominent scenery features from this type of image. those of a typical indoor image, which makes it difficult for convolutional neural networks trained with typical indoor scene images to extract prominent scenery features from this type of image. Compared with image information, positioning information in the indoor scene cannot directly offer scene features, but the spatial relationship with each scene can be reflected through certain calculations, along with the indoor map. In this way, we can also judge the category of the scene at the current location. However, as indoor positioning accuracy is difficult to guarantee in situations of weak positioning signals or few signal sources, there will be a high risk of inaccuracy in the scenery prediction by using this information directly, especially in the area of multi-scene junctions and scene boundary areas. Thus, it is difficult to guarantee a high accuracy of indoor scene recognition when indoor image information or positioning information is provided alone.
Based on the above analysis, this paper puts forward a deep learning scene recognition method based on localization enhancement. Here, we extract image and positioning features from the scenery image and positioning data from positioning points, then use the method of deep learning to train the optimal feature fusion strategy for the fusion operation of these two features; finally, the fusion feature is used to obtain the scenery prediction results. The algorithm is shown in Figure 2. We extract the scene level feature vector of the scene image (labeled as ) by using a convolutional neural network that is fine-tuned to the scene studied, and extract the positioning feature vector by overlaying the operation using the positioning information of the location site with the estimated error and the indoor scene base map. Further, the image feature vector and the positioning feature vector are combined into one feature, denoted as follows: The combined feature vector is transformed by the feature fusion fully connected layer, then the output, called the fusion vector, is processed by the final decision fully connected layer and the output is the final prediction result.  Compared with image information, positioning information in the indoor scene cannot directly offer scene features, but the spatial relationship with each scene can be reflected through certain calculations, along with the indoor map. In this way, we can also judge the category of the scene at the current location. However, as indoor positioning accuracy is difficult to guarantee in situations of weak positioning signals or few signal sources, there will be a high risk of inaccuracy in the scenery prediction by using this information directly, especially in the area of multi-scene junctions and scene boundary areas. Thus, it is difficult to guarantee a high accuracy of indoor scene recognition when indoor image information or positioning information is provided alone.
Based on the above analysis, this paper puts forward a deep learning scene recognition method based on localization enhancement. Here, we extract image and positioning features from the scenery image and positioning data from positioning points, then use the method of deep learning to train the optimal feature fusion strategy for the fusion operation of these two features; finally, the fusion feature is used to obtain the scenery prediction results. The algorithm is shown in Figure 2. We extract the scene level feature vector of the scene image (labeled as V image ) by using a convolutional neural network that is fine-tuned to the scene studied, and extract the positioning feature vector by overlaying the operation using the positioning information of the location site with the estimated error and the indoor scene base map. Further, the image feature vector and the positioning feature vector are combined into one feature, denoted as follows: those of a typical indoor image, which makes it difficult for convolutional neural networks trained with typical indoor scene images to extract prominent scenery features from this type of image. Compared with image information, positioning information in the indoor scene cannot directly offer scene features, but the spatial relationship with each scene can be reflected through certain calculations, along with the indoor map. In this way, we can also judge the category of the scene at the current location. However, as indoor positioning accuracy is difficult to guarantee in situations of weak positioning signals or few signal sources, there will be a high risk of inaccuracy in the scenery prediction by using this information directly, especially in the area of multi-scene junctions and scene boundary areas. Thus, it is difficult to guarantee a high accuracy of indoor scene recognition when indoor image information or positioning information is provided alone.
Based on the above analysis, this paper puts forward a deep learning scene recognition method based on localization enhancement. Here, we extract image and positioning features from the scenery image and positioning data from positioning points, then use the method of deep learning to train the optimal feature fusion strategy for the fusion operation of these two features; finally, the fusion feature is used to obtain the scenery prediction results. The algorithm is shown in Figure 2. We extract the scene level feature vector of the scene image (labeled as ) by using a convolutional neural network that is fine-tuned to the scene studied, and extract the positioning feature vector by overlaying the operation using the positioning information of the location site with the estimated error and the indoor scene base map. Further, the image feature vector and the positioning feature vector are combined into one feature, denoted as follows: The combined feature vector is transformed by the feature fusion fully connected layer, then the output, called the fusion vector, is processed by the final decision fully connected layer and the output is the final prediction result.   The combined feature vector V f use is transformed by the feature fusion fully connected layer, then the output, called the fusion vector, is processed by the final decision fully connected layer and the output is the final prediction result.

Image Feature Vector
GoogleNet [33] improves the prediction result of the neural network by increasing the depth and width of the network. The depth of the network represents the number of layers the network has, and the width of the network represents the number of neurons in every layer. In order to solve the problem of overfitting of the network and the cost surging of model training, GoogleNet is added with the Inception structure (as shown in Figure 3). In the Inception V1 [33] version, a 1 × 1 convolution kernel is introduced into the structure to make the feature map thinner; for Inception V2 [34], on the one hand, the BN [35] layer is introduced to make the output parameters of every layer normalized into the Gaussian distribution of N(0, 1), in order to lower the risk of overfitting, and on the other hand, the method of replacing the 5 × 5 convolution kernel with two 3 × 3 convolution kernels, which is used in VGG [36] net, is introduced into the Inception structure for further reduction of the number of parameters; the Inception V3 [17] version introduced the concept of convolution decomposition, that is, decomposing the n × n convolution kernel into two one-dimensional 1 × n and n × 1 kernels, to speed up computing of the network and increase the depth of the network.

Image Feature Vector
GoogleNet [33] improves the prediction result of the neural network by increasing the depth and width of the network. The depth of the network represents the number of layers the network has, and the width of the network represents the number of neurons in every layer. In order to solve the problem of overfitting of the network and the cost surging of model training, GoogleNet is added with the Inception structure (as shown in Figure 3). In the Inception V1 [33] version, a 1 × 1 convolution kernel is introduced into the structure to make the feature map thinner; for Inception V2 [34], on the one hand, the BN [35] layer is introduced to make the output parameters of every layer normalized into the Gaussian distribution of N(0,1), in order to lower the risk of overfitting, and on the other hand, the method of replacing the 5 × 5 convolution kernel with two 3 × 3 convolution kernels, which is used in VGG [36] net, is introduced into the Inception structure for further reduction of the number of parameters; the Inception V3 [17] version introduced the concept of convolution decomposition, that is, decomposing the n × n convolution kernel into two one-dimensional 1 × n and n × 1 kernels, to speed up computing of the network and increase the depth of the network. There are foreground objects of different scale, size, and scope in an indoor scene image, making it difficult for a single type of convolution kernel to fully perceive these different scene features. However, the model construction method of the Inception module with different size of the paralleled convolution cores is suitable for solving the feature extraction task of such complex scene images. Therefore, we select the Googlenet model with Inception V3 structure as the image feature vector extraction module. By inputting the scene image obtained by the mobile device into the module, we can extract the scene level image feature vector, which is used to reflect the image feature of the current scene of the mobile device. The image feature is involved in the subsequent feature fusion layer, and conducts the fusion operation in combination with the positioning feature vector.
In order to avoid overfitting due to the module learning to the random noise from mobile device images, we will use a small number of typical scene images in the fine-tuning [37] process of the Inception V3 module. The typical scene images should have prominent scene features, significant difference between categories, rich features in every category, and a nearly even quantity of images in every category. After several layers of convolution, pooling and other operations on the image data, the Inception V3 module will output a bottleneck value tensor of [1,2048] size, which reflects the underlying features in the output of the convolution operation on the image. The process of fine-tuning can be seen as the process of transforming the bottleneck value feature into the scene level image feature, and its essence is to adjust the shape and parameter of the fully connected layer behind the convolution operation, so as to make the final output results of the model adapt to the problem studied.
According to , which represents the number of scene categories of the studied area in this paper, all parameters connected to the activation of the last convolution layer of Inception V3 are replaced with the fully connected layer with 2048 × + parameters, and the new layer outputs the final result through Softmax operation. The parameters of the fully connected layer are trained by using the typical image dataset collected from the studied scene in this paper, while the original parameters of all convolution layers are maintained. After the training is completed, the parameters of the fully connected layer are fixed and saved with the original convolution layer parameters as a new Inception network model. There are foreground objects of different scale, size, and scope in an indoor scene image, making it difficult for a single type of convolution kernel to fully perceive these different scene features. However, the model construction method of the Inception module with different size of the paralleled convolution cores is suitable for solving the feature extraction task of such complex scene images. Therefore, we select the Googlenet model with Inception V3 structure as the image feature vector extraction module. By inputting the scene image obtained by the mobile device into the module, we can extract the scene level image feature vector, which is used to reflect the image feature of the current scene of the mobile device. The image feature is involved in the subsequent feature fusion layer, and conducts the fusion operation in combination with the positioning feature vector.
In order to avoid overfitting due to the module learning to the random noise from mobile device images, we will use a small number of typical scene images in the fine-tuning [37] process of the Inception V3 module. The typical scene images should have prominent scene features, significant difference between categories, rich features in every category, and a nearly even quantity of images in every category. After several layers of convolution, pooling and other operations on the image data, the Inception V3 module will output a bottleneck value tensor of [1,2048] size, which reflects the underlying features in the output of the convolution operation on the image. The process of fine-tuning can be seen as the process of transforming the bottleneck value feature into the scene level image feature, and its essence is to adjust the shape and parameter of the fully connected layer behind the convolution operation, so as to make the final output results of the model adapt to the problem studied.
According to N category , which represents the number of scene categories of the studied area in this paper, all parameters connected to the activation of the last convolution layer of Inception V3 are replaced with the fully connected layer with (2048 × N category ) + N category parameters, and the new layer outputs the final result through Softmax operation. The parameters of the fully connected layer are trained by using the typical image dataset collected from the studied scene in this paper, while the original parameters of all convolution layers are maintained. After the training is completed, the parameters of the fully connected layer are fixed and saved with the original convolution layer parameters as a new Inception network model.

Positioning Feature Vector
As the positioning points in the scene are affected by the signal intensity and number of signal sources, the positioning result often has some uncertainty, which is manifested as the random distribution of the positioning points within a certain error range. Therefore, the scene prediction result obtained by the positioning information is only a probability value instead of a certain prediction result. To conduct the prediction by making full use of the geometric location information of the positioning points in the scene and the feature information of the relationship with the surrounding scene, in this paper, we will use the location features and the features of the relationship with all surrounding scenes to describe the positioning feature of the positioning points.
The location scene feature mainly represents the semantic representation of the plane coordinate (x, y) of points in the scene, that is, the scene category where the positioning point is located. Considering that the positioning point may be located outside the scene boundary due to an error, it is necessary to see this situation as a specific scene category. This feature can be expressed in the form of a one-hot encoding in {0, 1} distribution, with the shape of [1, N category + 1], where N category represents the number of scene categories studied. By overlaying the positioning points with the scene base map, we assigned 1 to the element whose index corresponds to the category index number of the scene that the point falls into, and set the remaining elements in the one-hot vector to 0. If the positioning point is outside the scene, the first element is set to 1 and the rest to 0, as follows: where i and k represent the scene category index, and Equation 2a represents the situation where the point belongs to no category, and Equation 2b represents the situation when the point is located in the kth type of scene category. The algorithm of overlaying the positioning points and scene base map can be equivalent to the method of judging whether the point is inside a polygon. There are several popular algorithms to solve this, including discriminating by the sum of area, discriminating by sum of angle, and the leading ray method. In this paper, we use the sum of area for discrimination. By saving the scene base map in the format of GeoJSON, we can use the type of polygon to save the boundary of scenes, and the points encoded around the outer boundary of the scene are anti-clockwise. A positioning point P o can form vector → P o P i and → P o P i+1 along with any two sequential points P i and P i+1 , and we can obtain the angle between these two vectors using the formula θ = cos −1 . Sum up all the angles formed by the positioning point and all the possible combinations of sequential points; the point is located inside the scene if the sum equals 2π, otherwise the point is located outside the scene. In addition, we need to consider the situation where some scenes have inner boundaries, and in this situation if the sum of angles equals 2π, the point is located outside the boundary, as shown in Figure 4. The algorithm's flow chart is shown in Algorithm 1.
we can obtain the angle between these two vectors using the formula θ = cos ⃗ ⃗ . Sum up all the angles formed by the positioning point and all the possible combinations of sequential points; the point is located inside the scene if the sum equals 2π, otherwise the point is located outside the scene. In addition, we need to consider the situation where some scenes have inner boundaries, and in this situation if the sum of angles equals 2π, the point is located outside the boundary, as shown in Figure 4. The algorithm's flow chart is shown in Algorithm 1.   For each polygon T i = [(x 1 , y 1 ), (x 2 , y 2 ), · · · , (x n , y n ), (x 1 , y 1 )] in D i 4: For each pair of sequential points P i , P i+1 in T i

5:
calculate the angle θ i between → P o P i and → P o P i+1 6: sum up angles θ Sum ← θ Sum + θ i 7: End For 8: If T i is the outer boundary and θ Sum equals 2π 9: element V k ← 1 in feature vector V location , where k represents the category of scene D i 10: Else If T i is the inner boundary and θ Sum equals 2π 11: element V k ← 0 in feature vector V location , where k is the category of scene D i 12: End If 13: End The feature of the relationship between the positioning point and the surrounding scenes is related to the current positioning error. The larger the error is, the larger the region of scenes related to the positioning point, and the more complicated the scene category involved. In particular, when the point is near the junction of multiple scenes, the uncertainty of scene prediction is greater, so the relationship between the positioning with error and surrounding scenes needs to be expressed in some way. The positioning error can be estimated by using parameters such as the intensity of the positioning signal and the number of available signals while, in the experiment in this paper, the positioning error in the indoor scene is simulated by using the preset parameters. By overlaying various scene boundaries with the error circle whose center is the positioning point and radius equals the positioning error, we can measure the area of intersection between scenes and the error circle. The value of the area can reflect how significant the relationship between the current positioning point and the scene is. The more significant the relationship is, the more likely this positioning point is located inside this scene. By organizing the value of each area into a [1, N category ] sized tensor and normalizing the element values to the range of [0, 1], the significance of the relationship with the scene is expressed by the size of the probability values, such as: where S i represents the area of intersection of the ith scene category and the error circle. The calculation of the intersection area of the multi-boundary scene and the error circle can be divided into the calculation of the intersection area of the triangle and circle. As shown in Figure 5, every positioning point, say P o , can form a triangle ∆P o P i P i+1 with any two of the sequential points in the boundary, such as P i and P i+1 . In this way, we can calculate the intersection areas of the error circle with each triangle corresponding to each border line; finally, the intersection area of the error circle and the scene boundary is the sum of the areas above. It should be noticed that, according to the encoding rules of GeoJSON, the outer boundary of polygons is encoded counterclockwise, while the inner boundary is encoded clockwise, so the value of area calculated by the outer boundary is positive in the method of the vector's cross product, and by the inner boundary, the result is negative. The detailed algorithm flow chart is shown in Algorithm 2. For each pair of sequential points P 、P in T 5: If segment P P is inside the error circle 6: sum up the area S ← S + S ∆ 7: Else If segment P P is outside the error circle 8: respectively calculate the intersection P and P of P P , P P and the error circle 9: sum up the area S ← S + S

10:
Else 11: calculate the intersection P of P P and the error circle 12: calculate the intersection P (or P ) of P P (or P P ) and the error circle 13: sum up the area S ← S + S ∆ + S

14:
End If 15: End  For each polygon T i = [(x 1 , y 1 ), (x 2 , y 2 ), · · · , (x n , y n ), (x 1 , y 1 )] in scene D i 4: For each pair of sequential points P i ,P i+1 in T i 5: If segment P i P i+1 is inside the error circle 6: sum up the area S Sum ← S Sum + S ∆P o P i P i+1 7: Else If segment P i P i+1 is outside the error circle 8: respectively calculate the intersection P i and P i+1 of P o P i , P o P i+1 and the error circle 9: sum up the area S Sum ← S Sum + S sector P i P o P i+1 10: Else 11: calculate the intersection P c of P i P i+1 and the error circle 12: calculate the intersection P i (or P i+1 ) of P o P i (or P o P i+1 ) and the error circle 13: sum up the area S Sum ← S Sum + S ∆P o P c P i+1 + S segment P c P o P i 14: End The location scene feature vector and the relation vector between the positioning point and the surrounding scene are merged to form the positioning feature vector, sized as 1, 2 * N category + 1 , as follows:

Neural Network Model Design
The feature fusion model designed in this paper is mainly composed of the Inception V3 convolution processing image feature extraction module, and the feature fusion and decision module (as shown in Figure 6). The Inception V3 convolution based image feature extraction module is obtained through fine-tuning with typical scene images, as described in Section 3.1. The feature fusion and decision module is composed of a feature fusion fully connected layer and final prediction fully connected layer, and the parameters are obtained through the simulative localization and image training set.  The neural network requests two types of input data, one is the image data of the corresponding positioning point and the other is the positioning feature vector , which is obtained through the localization data process in Section 3.2. The image will be processed into a one-dimension image feature vector sized as 1, by the Inception V3 module's convolution process, and the image feature vector will be combined with the positioning feature vector as a one-dimension vector , sized as 1,3 * + 1 , which will be the input in the latter process module. Through training a fully connected layer with 3 × + 1 + 3 × + 1 parameters using the ReLU function as the activation function, we can turn the feature vector combined with the image feature vector and the positioning feature vector into the fused vector of image information and positioning information. Finally, we train a fully connected layer with 3 × + 1 × + parameters using a Softmax operation to output the final scene prediction result.

Model Training
The training process of the model mainly focuses on two major modules, namely, the image feature extraction module and the feature fusion and decision module. The processes are as follows:


Fine-tune the Inception V3 model: We used the typical scene image set to fine-tune the pre-trained Inception V3 model and replaced the fully connected layer behind the previous convolution layer with a new fully connected layer whose input tensor has the dimensions  The neural network requests two types of input data, one is the image data of the corresponding positioning point and the other is the positioning feature vector V positioning , which is obtained through the localization data process in Section 3.2. The image will be processed into a one-dimension image feature vector sized as 1, N category by the Inception V3 module's convolution process, and the image feature vector will be combined with the positioning feature vector V positioning as a one-dimension vector V f use , sized as 1, 3 * N category + 1 , which will be the input in the latter process module. Through training a fully connected layer with 3 × N category + 1 2 + 3 × N category + 1 parameters using the ReLU function as the activation function, we can turn the feature vector combined with the image feature vector and the positioning feature vector into the fused vector of image information and positioning information. Finally, we train a fully connected layer with 3 × N category + 1 × N category + N category parameters using a Softmax operation to output the final scene prediction result.

Model Training
The training process of the model mainly focuses on two major modules, namely, the image feature extraction module and the feature fusion and decision module. The processes are as follows: • Fine-tune the Inception V3 model: We used the typical scene image set to fine-tune the pre-trained Inception V3 model and replaced the fully connected layer behind the previous convolution layer with a new fully connected layer whose input tensor has the dimensions [1,2048] and output tensor with 1, N category dimensions. The original parameters of the convolution layers are retained and the parameters of the new fully connected layer are updated via training.
• Training the feature fusion and decision layers: Using the processed positioning feature vector and the image feature vector from the fine-tuned Inception V3 module as the input data for the fully connected feature fusion layer, the output tensor propagates forward to the decision fully connected layer. Together with the ground-truth data, we can calculate the loss of training by the prediction result from the fully connected decision layer, and we use the training loss to update the parameters of the previous two fully connected layers.
In this paper, we built the feature fusion model base on the TensorFlow [38] framework, which is an open source artificial intelligence learning system developed by Google and is widely used in machine learning. The following hardware was used for training: CUP: 2.50 GHz Intel i5-7300HQ and GPU: GTX 1050 4 GB. The main parameters set in the training process are listed in Table 1:

Model Training and Validation Set
In this study, we prepared different training data sets for fine-tuning the Inception V3 model and training the feature fusion and decision layers. We divide the datasets into train dataset and evaluation dataset (85%, 15% respectively). For fine-tuning the Inception V3 model, we collected five classes in the 2nd floor of the laboratory, and each class has more than 600 images; eight classes in the railway station, and each class has more than 800 images, except the indoor channel with 600 images. For our proposed model, we collected 3000 images of the same classes in the 2nd floor of laboratory and 3500 images of the same classes in the railway station, according to the number of simulative positioning points. We adopted a typical indoor image set for fine-tuning Inception V3. The dataset must have prominent scene features, significant difference between categories, rich features in every category, and the number of images in every category must be approximately equal. The dataset is used to train the Inception V3 model to function as the image feature extraction module. The aforementioned features of the dataset can ensure that the fine-tuned model is be sensitive to the important scenic features of the image captured using a mobile device and reduce the effect of noise in the image.
We used simulative positioning and the image dataset to train the feature fusion and decision layers. The dataset consists of two parts, the simulated point data and the corresponding image data. Together, they can simulate a given state of a mobile device in an indoor environment. The simulated point dataset is generated by adding random noise (in this experiment, the noise range is 5 m) to the appropriate positioning points that are chosen along the main roads of the scene. Each simulated point corresponds to the scene image that was collected from the exact point in the scene that it belongs to. The image must have strong randomness, which simulates the situation when the mobile device moves indoors. This randomness can be implemented in different ways, such as shooting angle, focus, and exposure.
In this paper, we chose the following two kinds of typical indoor scenes as the target for experiment: The second floor of the laboratory and the railway station. The laboratory scenery is classified into five categories, which includes a machine room, office, corridor, lobby, meeting room. The railway station scenery is classified into eight scene categories, namely, indoor channel, ticket office, service counter, waiting hall, stairs, foot court, shop, and security check channel. The scene distribution is shown in Figure 7.

Validation Set for Generalization Ability
In order to compare the generalization ability of the fine-tuned Inception V3 model and the feature fusion model, we selected the third floor of the laboratory with a category of scenery category similar to that of the second floor as the dataset for validation of the generalization. We prepared the simulative positioning and image set with the same range of noise as the experiment on the second floor, and we collected images from the third floor scenery including the machine room, office, corridor, and lobby. The distribution of scenery is shown in Figure 8.

Experimental Design
To compare the performance of the feature fusion model in this paper and the traditional convolution neural network in the indoor scene identification task, in the experiment was conducted

Validation Set for Generalization Ability
In order to compare the generalization ability of the fine-tuned Inception V3 model and the feature fusion model, we selected the third floor of the laboratory with a category of scenery category similar to that of the second floor as the dataset for validation of the generalization. We prepared the simulative positioning and image set with the same range of noise as the experiment on the second floor, and we collected images from the third floor scenery including the machine room, office, corridor, and lobby. The distribution of scenery is shown in Figure 8.

Validation Set for Generalization Ability
In order to compare the generalization ability of the fine-tuned Inception V3 model and the feature fusion model, we selected the third floor of the laboratory with a category of scenery category similar to that of the second floor as the dataset for validation of the generalization. We prepared the simulative positioning and image set with the same range of noise as the experiment on the second floor, and we collected images from the third floor scenery including the machine room, office, corridor, and lobby. The distribution of scenery is shown in Figure 8.

Experimental Design
To compare the performance of the feature fusion model in this paper and the traditional convolution neural network in the indoor scene identification task, in the experiment was conducted to obtain the following four levels of comparison as shown in Table 2: Plan 1, in which the dataset of

Experimental Design
To compare the performance of the feature fusion model in this paper and the traditional convolution neural network in the indoor scene identification task, in the experiment was conducted to obtain the following four levels of comparison as shown in Table 2: Plan 1, in which the dataset of the second floor of the laboratory was used to fine-tune the Inception V3 model and to train the model designed in this paper. A subset of the data was used as a test set to test the identification performance of the two models. Plan 2 is similar to plan 1. We tested the identification performance of models with the same method in the railway station scenery. Plan 3, to determine the performance of this model with positioning points at different positioning accuracies, we constructed different positioning datasets with different positioning errors in the second floor of the laboratory scene while retaining the image information. We used the datasets of different positioning accuracy to test the performance of the models. In plan 4, to compare the generalization ability of the Inception V3 and the model in this paper, we selected the third floor of the laboratory as the experimental site. The third floor has scenery similar to that of the second floor. We then constructed the test set based on the procedure of construction for a simulative localization and image set and we tested the model trained based on the dataset of the second floor.

The Result of Laboratory Scene
For the laboratory scene, the fine-tuning procedure on the Inception V3 model achieves convergence after 2000 iterations, and the training of feature fusion and decision module achieves convergence after 10,000 iterations. The test set is 15% of the simulative localization and image data. Figure 9 shows the result of the overlapping of the positioning point of the test sample set and scene boundary (black spots are prediction error points). The percentage of the scene in the positioning point that is consistent with the real value is 69.0%. The error points are mainly concentrated near the scene boundary, especially at the intersection of the machine room, corridor, and lobby.
The fine-tuned Inception V3 model is fed on the image dataset corresponding to the positioning points. The prediction results are as shown in Figure 10a. The prediction results of the feature fusion model are shown in Figure 10b.
The prediction accuracy of the fine-tuned Inception V3 model is 73.3%. The distribution of error points vary between scenes. For example, in the lobby scenes, the error points are mainly concentrated in the lower part of the scene map. In addition, Inception V3 yields good recognition results in scenes with relatively simple and singular layouts (such as the corridor). Since there are fewer foreground objects and a higher similarity in terms of background features, the image obtained by the mobile device is subjected to lesser disturbance in terms of shooting angle, focus, and exposure. Additionally, even the noise is relatively weak.
The accuracy obtained for the second floor of the laboratory using the feature fusion model is 97.0%. The error points near the boundary of the scene have generally been corrected compared with results of overlapping way as Figure 9. In particular, the prediction results for locations beyond the scene showed the most significant improvement. In contrast to the fine-tuned Inception V3 model, the feature fusion model that combines positioning and image features, is less-dependent on image features and extends the Eigenspace of positioning points. When the image quality is poor, results cannot always accurately reflect the scene feature; however, accurate prediction results can still be yielded. As a result, there is no significant difference in the distribution of error points in terms of the scene category.  As for the prediction result in individual categories, we can refer to the confusion matrix of the experimental result (as shown in Figure 11). The prediction of Inception V3 is highly efficient for the office and corridor categories. This is mainly because the features of these scenes are relatively simple. For instance, the corridor contains relatively fewer foreground objects. However, this model fails to capture prominent features due to the confusion in corridor and other categories. Images from other categories are easily misclassified as belonging to the office or corridor category. The feature fusion model yields a relatively average accuracy for each category. Given that the lobby is adjacent to several other types of scenes, a higher amount of test data is misclassified into the lobby  As for the prediction result in individual categories, we can refer to the confusion matrix of the experimental result (as shown in Figure 11). The prediction of Inception V3 is highly efficient for the office and corridor categories. This is mainly because the features of these scenes are relatively simple. For instance, the corridor contains relatively fewer foreground objects. However, this model fails to capture prominent features due to the confusion in corridor and other categories. Images from other categories are easily misclassified as belonging to the office or corridor category. The feature fusion model yields a relatively average accuracy for each category. Given that the lobby is adjacent to several other types of scenes, a higher amount of test data is misclassified into the lobby category. The results of the feature fusion model show that error points are focused in the junction As for the prediction result in individual categories, we can refer to the confusion matrix of the experimental result (as shown in Figure 11). The prediction of Inception V3 is highly efficient for the office and corridor categories. This is mainly because the features of these scenes are relatively simple. For instance, the corridor contains relatively fewer foreground objects. However, this model fails to capture prominent features due to the confusion in corridor and other categories. Images from other categories are easily misclassified as belonging to the office or corridor category. The feature fusion model yields a relatively average accuracy for each category. Given that the lobby is adjacent to several other types of scenes, a higher amount of test data is misclassified into the lobby category. The results of the feature fusion model show that error points are focused in the junction area of scenes with transparent surfaces (because of glass doors and so on.), such as the boundary between the corridor and the machine room or lobby. At these transparent junctions, significant errors might occur in image information. This might lead to the situation where the location and image information are biased and classified into a false category.

The Result of Railway Station Scene
For the railway station scene, the fine-tuned Inception V3 model achieves convergence after 2000 iterations. The proposed model achieves convergence after 15,000 iterations. The test set is 15% of the simulative localization and image set. The results of the overlapping of the positioning point of the test sample set and scene boundary are shown in Figure 12. As the area of railway station scene is wider than that of the scene on the second floor of the laboratory, the effect of the positioning error is reduced. The proportion of the scene categories in the positioning points that are consistent with the real value is 81.2% and the error points are still mainly concentrated near the scene boundary.
The accuracy of the results using the fine-tuned Inception V3 model is 77.7%, and the distribution of the prediction results is shown in Figure 13a. In contrast to smaller indoor scenes (such as the laboratory), the proportion of the background in the railway station image has increased. Thus, a smaller scene area (such as the service counter and the security check channel) is easily affected by background information from scenes with larger areas at the railway station. The features of the images are biased towards background information of the larger area scene. Thus, the number of error points increase.

The Result of Railway Station Scene
For the railway station scene, the fine-tuned Inception V3 model achieves convergence after 2000 iterations. The proposed model achieves convergence after 15,000 iterations. The test set is 15% of the simulative localization and image set. The results of the overlapping of the positioning point of the test sample set and scene boundary are shown in Figure 12. As the area of railway station scene is wider than that of the scene on the second floor of the laboratory, the effect of the positioning error is reduced. The proportion of the scene categories in the positioning points that are consistent with the real value is 81.2% and the error points are still mainly concentrated near the scene boundary.
The accuracy of the results using the fine-tuned Inception V3 model is 77.7%, and the distribution of the prediction results is shown in Figure 13a. In contrast to smaller indoor scenes (such as the laboratory), the proportion of the background in the railway station image has increased. Thus, a smaller scene area (such as the service counter and the security check channel) is easily affected by background information from scenes with larger areas at the railway station. The features of the images are biased towards background information of the larger area scene. Thus, the number of error points increase. The accuracy of the railway station scenery using the feature fusion model is 96.6%. The distribution of predicted result is shown in Figure 13b. Compared with the fine-tuned Inception V3, owing to the combination of image features and positioning features, the description information of the scenes with smaller regions is more detailed, and the influence of the background information of a larger scene area is reduced. Hence, the occurrence of the situation in which a smaller area scene is incorrectly predicted is relatively less frequent.   Figure 14 shows the confusion table for the eight categories, namely indoor channel, ticket office, service counter, waiting hall, stairs, food court, shop, and security check channel. The results of the fine-tuned Inception V3 model show that the features of scenes with smaller areas are easily affected by scenes with larger areas (such as waiting hall and indoor channel) and rich background information. Therefore, more images are classified to the waiting hall and indoor channel categories. Meanwhile, the prediction for smaller area scenes (such as service counter, stairs, and check channel) is generally not good as those of the larger area scenes. Owing to the fusion of image features and positioning features, the feature fusion model in this paper has more detailed description information for scenes with small areas and weaker confusion effects from the background information of scenes of larger areas. Thus, the number of erroneous classifications for small areas is reduced. Furthermore, the results show that scenes of larger areas are predicted more accurately. The accuracy of the railway station scenery using the feature fusion model is 96.6%. The distribution of predicted result is shown in Figure 13b. Compared with the fine-tuned Inception V3, owing to the combination of image features and positioning features, the description information of the scenes with smaller regions is more detailed, and the influence of the background information of a larger scene area is reduced. Hence, the occurrence of the situation in which a smaller area scene is incorrectly predicted is relatively less frequent. The accuracy of the railway station scenery using the feature fusion model is 96.6%. The distribution of predicted result is shown in Figure 13b. Compared with the fine-tuned Inception V3, owing to the combination of image features and positioning features, the description information of the scenes with smaller regions is more detailed, and the influence of the background information of a larger scene area is reduced. Hence, the occurrence of the situation in which a smaller area scene is incorrectly predicted is relatively less frequent.   Figure 14 shows the confusion table for the eight categories, namely indoor channel, ticket office, service counter, waiting hall, stairs, food court, shop, and security check channel. The results of the fine-tuned Inception V3 model show that the features of scenes with smaller areas are easily affected by scenes with larger areas (such as waiting hall and indoor channel) and rich background information. Therefore, more images are classified to the waiting hall and indoor channel categories. Meanwhile, the prediction for smaller area scenes (such as service counter, stairs, and check channel) is generally not good as those of the larger area scenes. Owing to the fusion of image features and positioning features, the feature fusion model in this paper has more detailed description information for scenes with small areas and weaker confusion effects from the background information of scenes of larger areas. Thus, the number of erroneous classifications for small areas is reduced. Furthermore, the results show that scenes of larger areas are predicted more accurately.  Figure 14 shows the confusion table for the eight categories, namely indoor channel, ticket office, service counter, waiting hall, stairs, food court, shop, and security check channel. The results of the fine-tuned Inception V3 model show that the features of scenes with smaller areas are easily affected by scenes with larger areas (such as waiting hall and indoor channel) and rich background information. Therefore, more images are classified to the waiting hall and indoor channel categories. Meanwhile, the prediction for smaller area scenes (such as service counter, stairs, and check channel) is generally not good as those of the larger area scenes. Owing to the fusion of image features and positioning features, the feature fusion model in this paper has more detailed description information for scenes with small areas and weaker confusion effects from the background information of scenes of larger areas. Thus, the number of erroneous classifications for small areas is reduced. Furthermore, the results show that scenes of larger areas are predicted more accurately. Categories such as service counters and stairs, which is surrounded by larger scenes such as waiting rooms and indoor channels, are partly misclassified to larger area scene but it is reduced compared to the prediction result of the fine-tuned Inception V3 model.

Generalization Experiments
In order to verify the generalization ability of the feature fusion model in the condition with positioning data of different positioning accuracy, we change the positioning errors of the simulative positioning and image dataset from 1 m to 10 m with the corresponding images remained the same. The prediction result is shown in Figure 15.
As shown in the prediction results, by increasing the positioning error, the prediction accuracy based on positioning information decreases significantly to 40% when the error comes to 10 m. As for the feature fusion model, the image information can be used to compensate for the low accuracy of the positioning information. In the broken line of accuracy, as shown in Figure 15, it can be seen that the line representing the feature fusion model gets increasingly close to the line of the Inception V3 as the positioning error is increased. This means that greater weight is given to the image feature.
The comparison of the results shows that, although the increase of error range causes a dramatic decrease in the accuracy of prediction solely with positioning information, the proposed model's prediction accuracy decreases within a relatively small range. It can be proved that the feature-linking neural layers in the proposed model plays a role in the combined decision made using image and positioning information. Even though the positioning is made in indoor conditions, such as conditions when the signals of opportunity received by mobile devices are weak or under interference, the proposed model can increase prediction accuracy by applying positioning

Generalization Experiments
In order to verify the generalization ability of the feature fusion model in the condition with positioning data of different positioning accuracy, we change the positioning errors of the simulative positioning and image dataset from 1 m to 10 m with the corresponding images remained the same. The prediction result is shown in Figure 15.
As shown in the prediction results, by increasing the positioning error, the prediction accuracy based on positioning information decreases significantly to 40% when the error comes to 10 m. As for the feature fusion model, the image information can be used to compensate for the low accuracy of the positioning information. In the broken line of accuracy, as shown in Figure 15, it can be seen that the line representing the feature fusion model gets increasingly close to the line of the Inception V3 as the positioning error is increased. This means that greater weight is given to the image feature. The comparison of the results shows that, although the increase of error range causes a dramatic decrease in the accuracy of prediction solely with positioning information, the proposed model's prediction accuracy decreases within a relatively small range. It can be proved that the feature-linking neural layers in the proposed model plays a role in the combined decision made using image and positioning information. Even though the positioning is made in indoor conditions, such as conditions when the signals of opportunity received by mobile devices are weak or under interference, the proposed model can increase prediction accuracy by applying positioning information as a supplementary feature to the image feature. Thus, the prediction has a stable accuracy and the model has appreciable generalization when confronted with positioning data of varying accuracy.  To validate the generalization of the model with images from different scenes of similar categories, we chose the third floor of the laboratory, which has scenic similarities to the second floor of the laboratory, as the place to collect the validation set. This validation set contains 500 positioning points. We used the validation set to test the proposed model that was trained using the training set from the scenery of the second floor. The results show that the accuracy of the fine-tuned Inception V3 model, which was also trained using the dataset from the second floor (see Figure 16a), is 69.4%. Additionally, the accuracy of the feature fusion model (see Figure 16b) is 91.4%. The feature fusion model achieves better prediction results in recognizing other dataset apart from the training set, compared with the fine-tuned Inception V3 model.  Table 3) shows that the accuracy of the feature fusion model has increased significantly compared to that of the fine-tuned Inception V3 model, especially with regard to scenery with wide areas, simple scenery features, or simple layouts. However, regarding the office scenery, owing to the differences between the office scene in the To validate the generalization of the model with images from different scenes of similar categories, we chose the third floor of the laboratory, which has scenic similarities to the second floor of the laboratory, as the place to collect the validation set. This validation set contains 500 positioning points. We used the validation set to test the proposed model that was trained using the training set from the scenery of the second floor. The results show that the accuracy of the fine-tuned Inception V3 model, which was also trained using the dataset from the second floor (see Figure 16a), is 69.4%. Additionally, the accuracy of the feature fusion model (see Figure 16b) is 91.4%. The feature fusion model achieves better prediction results in recognizing other dataset apart from the training set, compared with the fine-tuned Inception V3 model.  To validate the generalization of the model with images from different scenes of similar categories, we chose the third floor of the laboratory, which has scenic similarities to the second floor of the laboratory, as the place to collect the validation set. This validation set contains 500 positioning points. We used the validation set to test the proposed model that was trained using the training set from the scenery of the second floor. The results show that the accuracy of the fine-tuned Inception V3 model, which was also trained using the dataset from the second floor (see Figure 16a), is 69.4%. Additionally, the accuracy of the feature fusion model (see Figure 16b) is 91.4%. The feature fusion model achieves better prediction results in recognizing other dataset apart from the training set, compared with the fine-tuned Inception V3 model.  Table 3) shows that the accuracy of the feature fusion model has increased significantly compared to that of the fine-tuned Inception V3 model, especially with regard to scenery with wide areas, simple scenery features, or simple layouts. However, regarding the office scenery, owing to the differences between the office scene in the  Table 3) shows that the accuracy of the feature fusion model has increased significantly compared to that of the fine-tuned Inception V3 model, especially with regard to scenery with wide areas, simple scenery features, or simple layouts. However, regarding the office scenery, owing to the differences between the office scene in the generalization test set and the office scene in the pre-training set, and the similarity of office scenery and other scenery in the training sets, the validation result is still not ideal. Therefore, the different functionality of similar scenes must be considered when constructing the categories for the scenery. Furthermore, the prediction accuracy of the feature fusion model is greater at the junction areas of scenes where the scenes can be seen through the junction boundary; the same as that of Inception V3 model, the junction area with transparent borders is still the main distribution area of the error points.

Conclusions
This paper proposed a deep learning scene recognition method based on localization enhancement. We use the method of deep learning to enhance the scenery features by fusing the scene level image feature and the positioning feature. The image feature is extracted from image data by a fine-tuned convolutional neural network, while the positioning feature is extracted from positioning information, in the form of the combination of the scene location feature and feature of the relationship with surrounding scenes. Experiments show that the proposed model has better accuracy in the environment with complex indoor scenes. Compared with the traditional convolution neural network, the proposed model has better generalization ability when dealing with scene image data of the same kind, but from different sources, and also works well in positioning data of different error ranges.