Research on Feature Extraction Method of Indoor Visual Positioning Image Based on Area Division of Foreground and Background

In indoor visual positioning and navigation, difficulties often arise in corridors, stairwells, and other scenes that contain large areas of white walls, a strongly consistent background, and sparse feature points. In the real physical world, walls with sparse feature points are difficult to fill with pictures. To address positioning and navigation in such scenes, this paper designs a feature extraction method, ARAC (Adaptive Region Adjustment based on Consistency), using Free and Open-Source Software and tools. It divides the image into foreground and background and extracts their features separately, which not only retains positioning information but also focuses more effort on the foreground area, which is favourable for navigation. In the test phase, under combined illumination, scale and affine changes, the feature matching maps produced by the proposed feature extraction algorithm are compared with those produced by SIFT and SURF. Experiments show that ARAC obtains more correctly matched feature pairs than SIFT and SURF, and that its feature extraction and matching time is comparable to that of SURF, which verifies the accuracy and efficiency of the ARAC feature extraction method.


Introduction
Indoor positioning cannot use the GNSS (Global Navigation Satellite System) due to the obstruction of the GPS (Global Positioning System) signal by the building [1]. The field of indoor positioning and navigation is developing rapidly [2]. Many experts and scholars at home and abroad have conducted a lot of research on indoor positioning methods [3] and technologies [4][5][6]. The positioning methods such as RFID (Radio Frequency IDentification) and Bluetooth will reduce the positioning accuracy due to various problems such as building materials, multipath fading, and noise interference, so their applicability is limited. With the continuous advancement of AI (Artificial Intelligence), machine vision positioning has developed rapidly. Compared with other indoor positioning methods, visual positioning has unique advantages because it does not require additional equipment. The feature extraction [7] and matching [8] of images in the process of visual positioning are important research contents, which are also the core technical issues in the field of computer vision. They are of high application value in scene recognition, image stitching, positioning navigation, intelligent vision diagnosis, robot vision, security monitoring, automatic driving and other engineering fields [9]. Image feature extraction is a key step in image recognition. The effect of feature extraction directly determines the effect of image recognition, thereby affecting subsequent positioning and navigation. How to extract image features with highly representative characteristics from original images is a research hotspot in intelligent image processing, and it is a prerequisite for feature matching, scene recognition, and indoor positioning and navigation.
Common feature point extraction methods include SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), and ORB (Oriented FAST and Rotated BRIEF) methods [10]. The SIFT proposed by Lowe in 1999 is one of the most common local feature representation methods. This algorithm is being used to detect and describe local interest points each of which is accompanied by a corresponding size factor on objects. It is invariant to geometric characteristics and noise, and it is also stable to viewpoint changes [11]. However, its feature vector is 128-dimensional, so it has a large amount of calculation and a slower calculation speed [12,13], which is not suitable for navigation and other situations with high real-time requirements. Researchers have proposed many improvement methods on its basis. The SURF algorithm [14] proposed by Bay et al. in 2006 reduces the dimensions of the SIFT descriptive features, making it superior to the SIFT algorithm in performance and speed. The algorithm comparison experiment in literature [15] shows that the SURF algorithm is the most robust local feature algorithm. Based on this advantage, the SURF algorithm is widely used in image matching and related fields. The FAST (Features from Accelerated Segment Test) method has a faster calculation speed and has been applied in the ORB (Oriented FAST and Rotated BRIEF) description method. The traditional ORB method uses the BRIEF (Binary Robust Independent Elementary Feature) descriptor with directional information to calculate the grayscale of random point pairs in the neighbourhood of feature points. However, the feature descriptor is more sensitive to noise, so the matching effect is not ideal [16,17].
Whether SIFT, SURF or their improved algorithms are used, they all extract many features indiscriminately, without screening. In wide-area outdoor or indoor conditions where feature points are particularly numerous, the features extracted by SIFT, SURF, and their improved methods are very good. However, in the process of positioning and navigation, we often encounter corridors, stairwells, hallways, partitions in shopping malls and supermarkets, libraries, computer rooms and other places with large areas of white walls. The grayscale of the background of such scenes changes smoothly and the feature points are relatively sparse. The number of feature points provided for feature matching is limited, which increases the difficulty of positioning and navigation. In this type of scene, when SIFT and SURF are used for feature extraction, the large number of feature points extracted without screening increases the false matching rate and greatly extends the feature matching time under conditions such as differing illumination and affine transformation. What is more, in existing research, especially on indoor positioning and navigation, researchers often post a large number of markers, images and logos on corridor walls themselves to improve accuracy. Some objects are also placed on tables to increase the number of feature points to be extracted and matched and so assist positioning [18], but this is not the real situation. In the actual physical world, large consistent backgrounds are difficult to fill with logos and pictures, which motivates the study of image feature extraction and matching in such sparse-feature-point scenes to ensure the accuracy of indoor positioning and navigation. Therefore, this work specifically targets scenes with many white walls, strong background consistency, and sparse feature points, in order to serve indoor navigation.
This paper proposes a feature extraction method, ARAC (Adaptive Region Adjustment based on Consistency), to support high-precision positioning and navigation in response to changes in the image caused by different conditions such as shooting position, angle and illumination. The method is developed with the free, cross-platform text editor Visual Studio Code and the cross-platform computer vision library OpenCV (Open Source Computer Vision Library), issued under the BSD (Berkeley Software Distribution) license, and is programmed in Python to achieve feature extraction in a specific environment. Firstly, the Grab Cut method is used to segment the indoor scene images into foreground areas with abundant scene information and distinct features and background areas with gentle changes. Then features of the foreground and background areas are extracted separately, so that the position information in the background of the image is not lost while more effort is focused on the foreground areas with salient features. This improves the speed and accuracy of feature extraction to suit real-time navigation. In addition, faster extraction and matching can be achieved if foreground and background are processed in parallel.
The roadmap of the research is shown in Figure 1.
ISPRS Int. J. Geo-Inf. 2021, 10, 402
In the data analysis, blue, red, yellow, and green represent SIFT, SURF, ARAC, and ARAC under parallel processing, respectively.

Features of the Image
The features of an image are the essential characteristics that can be distinguished from other types of images. The geometric characteristics of these features can be obtained from the extracted data through measurement or processing. Some features are natural features that can be felt intuitively, such as brightness, edges, textures and colours; some are obtained through transformation or calculations, such as matrices, histograms, scale-invariant features and principal components [19].
Image recognition is actually a classification process. To identify the category to which an image belongs, we need to distinguish it from other images of different categories. This requires that the selected features not only describe images well but more importantly, they must be able to distinguish images of different categories. We hope to select those image features with small differences between images of the same type and large differences between images of different categories, which we call the most discriminative features.

Features Matching
Image matching refers to the method of finding similar content in two or more images through a certain algorithm. In the research of digital image processing, image feature extraction and matching have always been key issues, which play vital roles in image registration, target detection, pattern recognition, computer vision and other fields.

FLANN
Feature matching records the feature points of the target image and the image to be matched, and constructs a descriptor according to the feature point set, compares and filters this feature descriptor, and finally obtains a mapping set of matching points. We can also measure the matching degree of two pictures according to the size of this set.
Image feature point matching is commonly performed with the BF Matcher (Brute Force Matcher) or with FLANN matching [20]. The difference between the two is that the BF Matcher tries all possible matches to find the best match, whereas FLANN matching is an approximation method. It is faster because it searches for an approximate nearest neighbour instead of testing every candidate. The matching accuracy can be improved by adjusting the parameters.
In 2009, Muja and Lowe proposed the FLANN algorithm, which is a collection of nearest neighbour search algorithms for large data sets and high-dimensional features, and the algorithm is not affected by locality-sensitive hashing. FLANN is mainly implemented based on a k-d tree or k-means tree. The effective search type and retrieval parameters are given by the distribution characteristics of the known data set and the required space resource consumption. The feature space required by this algorithm is usually an n-dimensional real vector space $\mathbb{R}^n$. The key point of this algorithm is to find the nearest neighbour point according to the Euclidean distance. For two feature vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the mathematical definition of the Euclidean distance is shown as Equation (1):
$$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (1)$$
A smaller value of D means that the two feature points are "closer", that is, the feature point pair in the images is more similar.
We use FLANN based on KNN (K Nearest Neighbour search) method to search and filter the search results [15]. The KNN search method is adopted when K = 2, that is, for each feature point in the search image, the KNN two-nearest neighbour search method is used in the training image to find its nearest neighbour and second nearest neighbour. We need to compare the distance between the feature point and two neighbouring points, denoted by D 1 and D 2 respectively. Only when the distance between the feature point and the nearest neighbour point is much smaller than the distance between the feature point and the next neighbour point, we consider that the feature point and the nearest neighbour point are correct matching pairs. Otherwise, when the two distances are relatively close, we can consider that the feature point and the two neighbouring points are not correctly matching pairs, which should be eliminated. Alternatively, we could think that the feature point can be correctly matched with the two neighbouring points (for example, there are several identical objects in the scene), but we should also eliminate them to reduce the subsequent impact of this matching relationship. In the finally realized system, when the ratio of the distance between the feature point and the nearest neighbour to the distance between the feature point and the next neighbour is less than 0.75, the corresponding matching pair is retained, otherwise, it is eliminated.
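The two-nearest-neighbour ratio test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names `euclidean` and `ratio_test_matches` are hypothetical, the descriptors are toy two-dimensional vectors, and the neighbours are found by exhaustive search where FLANN would use an approximate k-d tree or k-means tree search.

```python
import math

def euclidean(a, b):
    """Euclidean distance D between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_matches(query, train, ratio=0.75):
    """Return (query_idx, train_idx) pairs that pass the 2-NN ratio test."""
    matches = []
    for qi, q in enumerate(query):
        # Exhaustive 2-NN search; FLANN would approximate this step.
        ranked = sorted((euclidean(q, t), ti) for ti, t in enumerate(train))
        if len(ranked) < 2:
            continue
        (d1, best), (d2, _) = ranked[0], ranked[1]
        # Keep the pair only if the nearest neighbour is clearly closer
        # than the second nearest neighbour (D1 / D2 < ratio).
        if d1 < ratio * d2:
            matches.append((qi, best))
    return matches

# Toy descriptors: query 0 has one clear neighbour, query 1 is ambiguous.
query = [[0.0, 0.0], [5.0, 5.0]]
train = [[0.1, 0.0], [4.0, 4.0], [6.0, 6.0]]
matches = ratio_test_matches(query, train)
```

The second query point lies equally far from two training points, so its ratio is 1 and the pair is eliminated, which mirrors the case of several identical objects in the scene.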

RANSAC
In the FLANN pre-matching process, the accuracy of the obtained matching points is not ideal, and the accuracy of target recognition will be affected to a certain extent, so the wrong matching points need to be deleted. We delete false matching points based on RANSAC. This algorithm can handle matching pairs with an error rate of more than 50%. It is one of the more robust feature matching screening algorithms and will greatly reduce false matching points.
RANSAC is the abbreviation of Random Sample Consensus. Given a sample data set that contains abnormal data, it calculates the mathematical model parameters of the data and obtains the effective sample data. Fischler and Bolles proposed this algorithm in 1981, and it is widely used in computer vision.
The basic assumption of the RANSAC algorithm is that the sample contains inliers, which can be described by the model, and outliers, which deviate far from the normal range and cannot be fitted by the mathematical model. In other words, the data set contains noise. The mismatched feature pairs in this paper are the outliers.
RANSAC is an iterative process. In each iteration of the original RANSAC, the characteristic points are randomly sampled as initial values, and a model is fitted based on the inliers. This model is adapted to the assumed inliers, and all unknown parameters can be calculated from the assumed inliers. The currently obtained model is used to test all other data. If a certain point is suitable for the estimated model and the error is less than the set threshold, it is considered to be an inlier. If enough points are classified as hypothetical inliers, then the estimated model is reasonable enough. Then all the assumed inliers could be used to update the model estimated based on the inliers of the initial hypothesis. We need to iterate continuously, and finally, evaluate the model by estimating the error rate of the inliers.
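The iteration described above can be illustrated with a toy problem. The sketch below fits a straight line to 2D points instead of a geometric transformation between matched feature pairs, so it is only an analogy for the screening step in this paper; the function names, the threshold and the iteration count are all assumptions chosen for the example.

```python
import random

def fit_line(p, q):
    """Line y = a*x + b through two points (assumes p and q differ in x)."""
    a = (q[1] - p[1]) / (q[0] - p[0])
    return a, p[1] - a * p[0]

def ransac_line(points, threshold=0.5, iterations=200, seed=0):
    """Return the (a, b) model with the most inliers, and those inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1. Randomly sample the minimum number of points for the model.
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # a vertical pair cannot be written as y = a*x + b
        a, b = fit_line(p, q)
        # 2. Test every point against the model and the error threshold.
        inliers = [pt for pt in points
                   if abs(pt[1] - (a * pt[0] + b)) < threshold]
        # 3. Keep the model that is supported by the most inliers.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

points = [(float(x), 2.0 * x + 1.0) for x in range(10)]  # inliers on y = 2x + 1
points += [(1.0, 9.0), (4.0, -3.0), (7.0, 30.0)]          # gross outliers
model, inliers = ransac_line(points)
```

A final refit on all inliers, as mentioned in the text, is omitted here to keep the sketch short.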

Open Source Computer Vision Library
In the field of GIS (Geographical information system) research, Free and Open-Source Software and tools have been widely used. In the research of indoor positioning and navigation, more and more researchers are developing and exploring based on Open Source Software and tools such as OpenCV. It is a cross-platform computer vision and machine learning software library released under the BSD license. This manuscript uses the free tool OpenCV for programming to improve the replicability, reusability and accessibility, which is conducive to other researchers for subsequent verification and development.

Feature Extraction Model Construction
When using visual methods for indoor navigation, the quality of the feature points extracted from the images and the effect of feature matching are closely related to the results of indoor navigation. The more correct matching pairs, the higher the probability of determining the same situation, the better effect of positioning and navigation. In addition, for indoor images, due to different shooting positions, angles, and lighting conditions, the corresponding scale and brightness changes will affect the detection results. Therefore, in the process of feature extraction, we must also consider the interference caused by factors such as image angle, spatial location, and illumination. Designing an effective feature extraction method will help the subsequent feature matching and navigation in the indoor positioning process.
Aiming at the problems in the engineering context mentioned in the introduction, this paper proposes a feature extraction method, ARAC, which can adaptively divide the foreground and background regions. The feature extraction model is shown in Figure 2.
1. This feature extraction model takes the corridor on the seventh floor of the experimental building of Heilongjiang University as an example. The first step is to read the image whose features are to be extracted in Visual Studio Code;
2. Grab Cut in OpenCV is used to adaptively divide the input image into foreground and background. Then different areas are processed separately;
3. In the foreground area, the method proposed in this paper firstly builds a scale-space, and secondly compares the response value of the candidate feature points with the Hessian matrix threshold to determine the feature points. Then it calculates the Haar response value to determine the main direction, and finally generates the 64-dimensional feature descriptors of the foreground area;
4. In the background area, ARAC firstly preprocesses the image to limit the size of the image and divides the background area into grids of equal size. It extracts features in each grid, and finally generates 128-dimensional feature descriptors of the background area;
5. This completes the feature extraction of the Heilongjiang University badge in the foreground area of the corridor and the wall and lamp in the background.

Adaptive Region Division
For the input indoor image, the foreground and the background area are divided adaptively. This manuscript uses the Grab Cut method to process the image. This method is an image segmentation algorithm based on graph cut, which uses a bounding box specified by the user as the location of the segmentation target to achieve segmentation of the foreground object and the background image.
This method first loads the image to be processed, creates a mask with the same size as the input image and fills it with zeros. The user then defines a rectangle around the target. The area outside the rectangle is automatically recognized as the background, and its data are used to distinguish the background and foreground areas inside the rectangular frame. Then a GMM (Gaussian Mixture Model) is used to model the background and foreground, and the undefined pixels are marked as possible foreground or possible background. Each pixel in the image is considered to be connected with its surrounding pixels through virtual edges, and each edge has a probability of belonging to the background or foreground based on the colour similarity between the pixel and its neighbours. Each pixel is also connected to a foreground or background node. After the nodes are connected, if the edge between two nodes joins different terminals, that is, one node belongs to the foreground and the other to the background, the edge between them is cut, which separates the parts of the image. This method is used to segment the indoor images into foreground and background. Each segmentation result retains either the foreground or the background and fills the rest with black, as shown in Figure 3.
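The last step, keeping one region and filling the rest with black, can be sketched as follows. This is a minimal illustration that assumes a mask has already been produced by a GrabCut call such as OpenCV's `cv2.grabCut`; the segmentation itself is not run here. The four label values are those of OpenCV's `cv2.GC_BGD`, `cv2.GC_FGD`, `cv2.GC_PR_BGD` and `cv2.GC_PR_FGD` constants, while the function name `split_by_mask` and the tiny greyscale image are hypothetical.

```python
# Label values used in an OpenCV GrabCut mask:
# definite background, definite foreground, probable background, probable foreground.
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3

def split_by_mask(image, mask):
    """Return (foreground, background) copies of a greyscale image.

    Pixels that do not belong to the retained region are filled with 0 (black).
    """
    fg_labels = {GC_FGD, GC_PR_FGD}
    foreground = [[px if m in fg_labels else 0 for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    background = [[0 if m in fg_labels else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    return foreground, background

# Toy 2x3 image and a mask as GrabCut might label it.
image = [[10, 20, 30],
         [40, 50, 60]]
mask = [[0, 3, 1],
        [2, 3, 0]]
foreground, background = split_by_mask(image, mask)
```

Treating both definite and probable foreground labels as foreground is a common convention when post-processing a GrabCut mask; whether this paper uses exactly that rule is an assumption.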

Feature Extraction in Foreground Area
In feature extraction of the foreground region, the Hessian matrix and an integral calculation template are introduced, and the Gaussian filtering process in SIFT is replaced by several addition and subtraction operations. The dimension of the feature descriptor is also reduced compared with the 128 dimensions of SIFT, which greatly improves the efficiency of feature description. In addition, the complexity of the segmented image is reduced, which leads to lower computational complexity. Moreover, the foreground feature extraction method only works in the local foreground area, so it need not traverse the whole image like SIFT and SURF, which greatly improves the operation speed.

Introduction of Calculation Template
To speed up feature extraction and detection, this paper introduces a determinant approximation image of the Hessian matrix. The Hessian matrix was proposed by the German mathematician Ludwig Otto Hesse in the 19th century. It is a square matrix composed of the second-order partial derivatives of a multivariate function, which describes the local curvature of the function. For an image function f(x,y) it is shown in Equation (2), and the corresponding Hessian matrix can be calculated for each pixel.
$$H(f(x,y)) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial x \partial y} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix} \quad (2)$$
Through the determinant of the Hessian matrix shown in Equation (3), it can be judged whether a point is an extreme point.
$$\det(H) = \frac{\partial^2 f}{\partial x^2} \frac{\partial^2 f}{\partial y^2} - \left( \frac{\partial^2 f}{\partial x \partial y} \right)^2 \quad (3)$$
When the determinant of the Hessian matrix reaches a local maximum, the current point is determined to be brighter or darker than the other points in its surrounding neighbourhood, which locates the position of the key point.
Before constructing the Hessian matrix, Gaussian filtering is required to keep the features scale-invariant. In discrete space, because the Gaussian kernel can construct response images at different scales, this paper uses the second-order standard Gaussian kernel function to convolve the image to obtain the four elements of the Hessian matrix. At scale σ, at point f (x,y), the corresponding Hessian matrix is shown in Equation (4):
$$H(x, \sigma) = \begin{bmatrix} L_{xx}(x, \sigma) & L_{xy}(x, \sigma) \\ L_{xy}(x, \sigma) & L_{yy}(x, \sigma) \end{bmatrix} \quad (4)$$
Among them, according to the LoG (Laplacian of Gaussian), the derivative of the smoothed image equals the convolution of the image with the derivative of the Gaussian function. That is, L xx (x,σ) is the convolution of the second-order partial derivative of the Gaussian function g(x,y,σ) with the image at the point f (x,y), as shown in Equations (5) and (6):
$$L_{xx}(x, \sigma) = \frac{\partial^2 g(x, y, \sigma)}{\partial x^2} * f(x, y) \quad (5)$$
$$g(x, y, \sigma) = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2 \sigma^2}} \quad (6)$$
The calculation method of L xy (x,σ) and L yy (x,σ) is the same.
To speed up the calculation of the foreground area, this paper operates on the integral image to achieve acceleration. The box filter is used instead of the Gaussian filter to approximate the Gaussian second-order derivative template. It only takes a few addition and subtraction operations to calculate the Hessian matrix of each pixel, and the amount of calculation is independent of the template size, so the scale pyramid of ARAC can be quickly constructed. The value of each pixel in the integral image, calculated according to Equation (7), is the sum of all elements above and to the left of the corresponding position on the original image I:
$$I_{\Sigma}(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} I(i, j) \quad (7)$$
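The reason the cost of a box filter does not depend on its size can be seen in a small sketch. The code below is an illustrative pure-Python version under stated assumptions: the names `integral_image` and `box_sum` are hypothetical, and the table is padded with a zero top row and left column, a common convention that avoids boundary checks.

```python
def integral_image(img):
    """Summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above plus the running sum of this row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, height, width):
    """Sum over a rectangle of the original image using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)         # whole image
centre_block = box_sum(ii, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
```

Whatever the rectangle size, `box_sum` performs three additions or subtractions, which is why larger box-filter templates cost no more than small ones.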
A single template, shown in Figure 4, can replace the two steps of Gaussian filtering and finding the second derivative. The numbers in Figure 4 indicate the weights of the corresponding colour areas. The weight of the grey area is zero. The results of convolving the box filters with the image are denoted as Dxx, Dyy and Dxy, which are approximations to Lxx, Lyy and Lxy respectively. The template is widely used because of its fast calculation speed on the integral image. However, the template is only an approximation of the original Hessian matrix, and a derivation shows that a weighted formula is closer to the true value. When we use Gaussian second-order partial derivatives with σ = 1.2, the template size is 9 × 9, which is the smallest scale-space value for image filtering and spot detection. The determinant of the Hessian matrix can be simplified as shown in Equation (8). At the same time, according to Equation (9), the Hessian matrix determinant is further modified to calculate the weight of each expression. The constant C does not affect the comparison of extreme points, so the final simplified formula is shown in Equation (10):
$$\det(H_{approx}) = D_{xx} D_{yy} - (w D_{xy})^2 \quad (10)$$
The Hessian matrix response value of each pixel is calculated by Equation (10) as the response image at the scale σ.
From this, we can get an approximation of the determinant of the Hessian matrix. In addition, to balance the error caused by using the box filter approximation, the response value should be normalized according to the filter size to ensure that the F-norm (Frobenius norm) of a filter of any size is uniform. The corresponding relative weight w of the filter is 0.9. Theoretically, the size of the template corresponding to different σ is not the same, and the value of w is different. For the sake of simplicity, it is considered to be the same constant.
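For reference, the value 0.9 agrees with the weight given in the original SURF formulation by Bay et al. for the smallest 9 × 9 template at σ = 1.2. The expression below is taken from that formulation, on the assumption that ARAC follows it; $|\cdot|_F$ denotes the Frobenius norm.

```latex
w = \frac{|L_{xy}(1.2)|_F \, |D_{xx}(9)|_F}{|L_{xx}(1.2)|_F \, |D_{xy}(9)|_F}
  \approx 0.912 \approx 0.9
```

The ratio compares the energy of the exact Gaussian derivative kernels with that of their box-filter approximations, so w compensates for the different approximation error of the mixed derivative D xy relative to D xx and D yy.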

Construction of Scale-Space
The purpose of constructing the scale space is to find the extreme points in the spatial domain and the scale domain as preliminary feature points. The traditional method of constructing scale space is to construct a Gaussian pyramid with the original image as the bottom layer, and then perform Gaussian blur and downsampling on the image as the next layer of the image. Next, it continues to iterate until meeting the condition. In the Gaussian pyramid, the size of the original image is constantly changing, and the size of the Gaussian template remains unchanged. The establishment of each layer can only be processed after the construction of the previous layer is completed. The dependence is very strong, which causes the speed to be very slow.
The method used in this paper to construct the scale pyramid is the opposite. The original image size remains unchanged, but the template size is changed, that is, the original image is filtered by box templates of changing size to construct the scale space. Parallel operations are used to process each layer of the pyramid simultaneously. The response images of the Hessian matrix are generated by convolving box filter templates of gradually increasing size with the integral image, and 3D non-maximum suppression is applied to the response images to obtain spots of different scales. The down-sampling process in traditional scale pyramid construction is omitted, thereby improving the processing speed. The pyramid image constructed by ARAC is shown in Figure 5. ARAC uses the box filter of 9 × 9 as the initial filter, and the response image obtained is taken as the bottom image. Then the size of the filter gradually increases, which can be calculated by Equation (11), and the original image continues to be filtered.
$$\text{FilterSize} = 3 \times \left( 2^{octave} \times interval + 1 \right) \quad (11)$$
Both octave and interval start from one in Equation (11), that is, for the 0th interval of the 0th octave, octave = 1 and interval = 1 in Equation (11).
The reasons for choosing this method to define the filter size are as follows. The pyramid image consists of a number of fixed layers. Since the integral image is discretized, the smallest scale change between two layers is determined by the Gaussian second-order derivative filter. It is determined by the length l_0 of the positive and negative spot response, whose size is 1/3 of the box filter template. For the box filter of 9 × 9, l_0 is 3. The response length of the next layer should be increased by at least 2 pixels to ensure one pixel on each side, that is, l_0 = 5, so that the size of the template is 15 × 15. Analogously, we can get a sequence of templates with gradually increasing sizes. Their sizes are 9 × 9, 15 × 15, 21 × 21 and 27 × 27 respectively, and the length of the black and white areas increases by an even number of pixels to ensure the existence of a central pixel. According to this template sequence, a scale-space is constructed.
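The template sequence can be checked numerically. The sketch below assumes that Equation (11) has the form 3 × (2^octave × interval + 1), which is the form used in common open-source SURF implementations and which reproduces the sizes 9, 15, 21 and 27 listed above; the function name `filter_size` and the second-octave values are illustrative and not taken from this paper.

```python
def filter_size(octave, interval):
    """Box-filter side length; octave and interval both start at 1."""
    return 3 * (2 ** octave * interval + 1)

# First octave: each step adds 6 pixels, i.e. 2 pixels per lobe.
first_octave = [filter_size(1, i) for i in range(1, 5)]
# Second octave under the same assumed formula: each step adds 12 pixels.
second_octave = [filter_size(2, i) for i in range(1, 5)]
# Lobe length l_0 is one third of the template side.
lobe_lengths = [size // 3 for size in first_octave]
```

Every size is odd, so each template has a central pixel, and the lobe length grows by 2 pixels per layer in the first octave, as the text requires.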

Determination of Foreground Feature Points
In the selection of feature points, a 3 × 3 × 3 neighbourhood is taken as an example for locating points of interest at different scales. The response value of each pixel on each layer (excluding the first and last layers) is compared with the values in its spatial and scale neighbourhood: 8 neighbouring pixels on the same layer and 2 × 9 = 18 pixels on the two adjacent scale layers, 26 values in total. The local maxima and minima are selected as candidate feature points. If the response value of a candidate is less than the threshold of the Hessian determinant, it is discarded. Increasing the threshold reduces the number of detected feature points, so that eventually only a few of the strongest points are detected. As Figure 6 shows, if the response of the pixel marked "x" is greater than that of all surrounding pixels, the point is taken as the feature point of the area.
Figure 6. Determination of feature points. The pixel marked "x" in the middle layer is the feature point of the area, because its value is greater than those of the green points.
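The 26-neighbour comparison can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the function name `local_maxima_3d` is hypothetical, the response stack is assumed to be a NumPy array of shape (layers, height, width), and only maxima are checked.

```python
import numpy as np


def local_maxima_3d(responses, threshold):
    """Return (layer, row, col) of responses that exceed all 26 neighbours.

    The first and last layers are skipped because they lack a scale
    neighbour on one side. Candidates below `threshold` are discarded.
    """
    responses = np.asarray(responses, dtype=float)
    n_layers, height, width = responses.shape
    keypoints = []
    for s in range(1, n_layers - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                value = responses[s, y, x]
                if value < threshold:
                    continue
                cube = responses[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                # The centre must be the unique maximum of the 3 x 3 x 3 cube.
                if value >= cube.max() and np.count_nonzero(cube == value) == 1:
                    keypoints.append((s, y, x))
    return keypoints


stack = np.zeros((3, 5, 5))
stack[1, 2, 2] = 5.0
print(local_maxima_3d(stack, threshold=1.0))  # [(1, 2, 2)]
```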
To ensure rotation invariance, a main direction must be assigned to each feature point. Taking the feature point as the centre, the Haar wavelet responses of all pixels within a circle of radius 6S are computed. The results of the image convolution shown in Figure 7 are adopted as the horizontal and vertical Haar wavelet responses of each pixel. The size of the Haar wavelet is 4S, as shown in Equation (12). The scale value S of the feature point is calculated from the size of the current template.
After the horizontal and vertical Haar wavelet responses are calculated, the two values are Gaussian weighted with a factor of 2S, and the weighted values represent the directional components in the horizontal and vertical directions, respectively. The Haar feature value reflects the change of the image grayscale, so the main direction describes the areas where the grayscale changes particularly sharply. The main direction is obtained by taking the feature point as the centre and sliding a sector with an opening angle of 60°, as shown in Figure 8. The horizontal and vertical Haar wavelet responses within the window are then accumulated according to Equations (13) and (14).
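The sliding-sector search can be sketched as below. This is an illustration under stated assumptions, not the paper's code: the name `dominant_orientation` and the 5° step are hypothetical, and the inputs `dx` and `dy` are assumed to be the already Gaussian-weighted Haar responses of the samples inside the 6S circle.

```python
import numpy as np


def dominant_orientation(dx, dy, window=np.pi / 3, step=np.pi / 36):
    """Angle (radians) of the longest summed response vector in a 60° sector."""
    dx = np.asarray(dx, dtype=float)
    dy = np.asarray(dy, dtype=float)
    angles = np.arctan2(dy, dx)
    best_length, best_angle = -1.0, 0.0
    for start in np.arange(-np.pi, np.pi, step):
        # Angular offset of each response from the sector start, in [0, 2*pi).
        offset = (angles - start) % (2 * np.pi)
        inside = offset < window
        sum_x, sum_y = dx[inside].sum(), dy[inside].sum()
        length = sum_x ** 2 + sum_y ** 2
        if length > best_length:
            best_length, best_angle = length, np.arctan2(sum_y, sum_x)
    return best_angle


# Three responses point roughly along +x, one points roughly along +y.
angle = dominant_orientation([1.0, 1.0, 1.0, 0.1], [0.1, 0.0, -0.1, 1.0])
```

In this example the three responses near the x axis fall into one sector and outweigh the single response near the y axis, so the returned angle is close to zero.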

Figure 8. Determination of the main direction of the ARAC feature point. Take the feature point as the centre and slide the sector with an opening angle of 60°: (a) general vector length; (b) minimum vector length; (c) the main direction, whose vector is the largest.

Foreground Feature Point Descriptor
When generating feature descriptors, ARAC selects a square frame with a side length of 20S around the feature point and aligns the square with the main direction determined in the previous step, as shown in Figure 9. The square frame is divided into 4 × 4 sub-regions, each with a size of 5S × 5S pixels. The horizontal and vertical Haar wavelet features of 25 pixels are then counted in each sub-region. Each Haar wavelet feature contains 4 values: the sum of the horizontal responses Σdx, the sum of their absolute values Σ|dx|, the sum of the vertical responses Σdy and the sum of their absolute values Σ|dy|, where the horizontal and vertical directions are relative to the main direction. These four values are taken as the feature vector of each sub-region, so a 4 × 4 × 4 = 64-dimensional vector is used as the feature descriptor of the foreground region of ARAC. Compared with the 128-dimensional SIFT descriptor, the feature dimension is halved, which speeds up scene recognition. Finally, the feature vector is normalized according to Equation (15) to reduce the influence of illumination and contrast.
The ARAC algorithm is used to extract features of the foreground area. It can be seen from Figure 10 that the ARAC feature descriptor can densely and effectively describe the characters and patterns in the school badge of Heilongjiang University, which is fundamental to the determination of the scene and provides effective evidence for indoor positioning and navigation.
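The assembly of the 64-dimensional vector can be sketched as follows. This is a minimal illustration with assumptions: the name `foreground_descriptor` is hypothetical, the inputs are assumed to be 20 × 20 grids of Haar responses already rotated to the main direction, and L2 normalization is assumed in place of Equation (15), which is not reproduced here.

```python
import numpy as np


def foreground_descriptor(dx, dy):
    """Build a 4 x 4 x 4 = 64-dimensional descriptor from 20 x 20 responses."""
    dx = np.asarray(dx, dtype=float)
    dy = np.asarray(dy, dtype=float)
    if dx.shape != (20, 20) or dy.shape != (20, 20):
        raise ValueError("expected 20 x 20 response grids")
    values = []
    for row in range(4):
        for col in range(4):
            # Each sub-region holds 5 x 5 = 25 samples.
            block_x = dx[5 * row:5 * row + 5, 5 * col:5 * col + 5]
            block_y = dy[5 * row:5 * row + 5, 5 * col:5 * col + 5]
            values.extend([block_x.sum(), np.abs(block_x).sum(),
                           block_y.sum(), np.abs(block_y).sum()])
    vector = np.array(values)
    norm = np.linalg.norm(vector)  # assumed L2 normalization
    return vector / norm if norm > 0 else vector


descriptor = foreground_descriptor(np.ones((20, 20)), -np.ones((20, 20)))
print(descriptor.shape)  # (64,)
```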

Feature Extraction in Background Area
In processing the strongly consistent background area, the image is first preprocessed to reduce the amount of calculation in subsequent feature extraction. When detecting feature points in the background area, ARAC does not construct a scale space as SIFT and SURF do but adopts a uniform sampling method, which greatly improves the processing speed. If the background is processed in parallel with the foreground, the running time of the proposed method is reduced further.
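One way to run the two branches concurrently is sketched below. It is only an illustration: `extract_in_parallel` and its two callable arguments are hypothetical names, and a thread pool is assumed to be sufficient, which holds only if the extractors spend their time in native code that releases the interpreter lock; otherwise a process pool would be the alternative.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_in_parallel(image, extract_foreground, extract_background):
    """Run the foreground and background extractors on the same image at once."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        foreground = pool.submit(extract_foreground, image)
        background = pool.submit(extract_background, image)
        # result() blocks until each branch has finished.
        return foreground.result(), background.result()


# Stand-in callables are used here in place of real extractors.
print(extract_in_parallel([1, 2, 3], sum, len))  # (6, 3)
```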

Image Background Preprocessing
The background feature extraction process of ARAC adopts a fixed-step, fixed-grid sliding window to scan the background of the entire image, and an ARAC feature is extracted in each grid. Therefore, the higher the image resolution, the more local feature points are extracted. For an image containing large areas of smooth background, the extracted features often include a large number of similar feature points. These feature points are not very helpful for classification and positioning, and they also increase the amount of calculation in subsequent processing. To solve this problem, this paper proposes an image preprocessing method that greatly reduces the resolution of the image while preserving its details.
For the original image $I \in \mathbb{R}^{a \times b}$, assume that the conversion matrix is $R_m \in \mathbb{R}^{c \times d}$ and the preprocessed image is $I_n \in \mathbb{R}^{c \times d}$, where a and b are the dimensions of the original image and c and d are the dimensions of the converted image. The relationship between $I$, $R_m$ and $I_n$ is expressed by the mathematical model in Equations (16)-(18). In Equation (16), $c = \gamma \times a$ and $d = \gamma \times b$, and $I_\gamma$ denotes the way in which the original image pixels are sampled. $R_m$ is a block matrix in which each block is a matrix of size $\gamma^{-1} \times 1$. In this paper, the area is divided into 2 × 2 non-overlapping grids, and the elements of $R_m$ are shown in Equation (19).
The preprocessed image is shown in Figure 11. If the original image is 600 × 450, using Equations (16) and (18), the image can be expressed compactly as 300 × 225. As Figure 11 shows, most of the details of the image are retained after preprocessing. To reduce the image resolution further, the same operation can be applied again to I_n. With this preprocessing operation, the resolution of all images is limited to within 300 × 300 when the features of the background area are extracted.
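The 2 × 2 grid reduction can be sketched as follows. This is an illustration only: Equations (16)-(19) are not reproduced in this section, so the sketch assumes that each 2 × 2 block is replaced by its mean; the names `downsample_2x2` and `limit_resolution` are hypothetical.

```python
import numpy as np


def downsample_2x2(image):
    """Halve the resolution by averaging non-overlapping 2 x 2 blocks."""
    image = np.asarray(image, dtype=float)
    height, width = image.shape[:2]
    height, width = height - height % 2, width - width % 2  # drop an odd edge
    blocks = image[:height, :width].reshape(
        height // 2, 2, width // 2, 2, *image.shape[2:])
    return blocks.mean(axis=(1, 3))


def limit_resolution(image, max_side=300):
    """Repeat the reduction until both sides are at most `max_side` pixels."""
    image = np.asarray(image, dtype=float)
    while max(image.shape[:2]) > max_side:
        image = downsample_2x2(image)
    return image


# A 600 x 450 image is stored as an array with 450 rows and 600 columns.
small = limit_resolution(np.zeros((450, 600)))
print(small.shape)  # (225, 300)
```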

Determination of Background Feature Points
In the process of feature extraction of the background area of the indoor image, the background area of the image is divided into grids of equal size, and the features of the image are extracted in each grid. This process can be regarded as a uniform sampling of the image. It can extract the local features of the whole image completely. Even in areas where the texture and colour change smoothly, the local features of the image can also be expressed.
The ARAC background area descriptor does not acquire feature points in the Gaussian scale space. Instead, it samples the image uniformly with a sampling window of a specified size and treats each sampled pixel as a key point. The coordinates of the sampling points and their feature descriptors are then obtained with the sliding window.
A point in the image is taken as an example in Figure 12. A patch of custom size slides over the image with a certain step, and the ARAC features of each patch are calculated from left to right and top to bottom. This window is the sample area of the descriptor. The window size is 4 × 4 bins, and the bin size can be specified by the user. The bin mentioned here corresponds to the sub-region in the ARAC feature extraction. The bounding box in the figure is the range of the ARAC feature point.
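The uniform sampling can be sketched as below. The function name `dense_grid_centres` and the default bin size and step are hypothetical values chosen for illustration; only the 4 × 4 bin window layout comes from the text.

```python
def dense_grid_centres(height, width, bin_size=4, step=8):
    """Centres (row, col) of 4 x 4 bin windows that fit inside the image.

    Windows are visited from left to right and top to bottom.
    """
    patch = 4 * bin_size          # window side length in pixels
    half = patch // 2
    return [(row, col)
            for row in range(half, height - half + 1, step)
            for col in range(half, width - half + 1, step)]


centres = dense_grid_centres(32, 32, bin_size=4, step=8)
print(len(centres), centres[0], centres[-1])  # 9 (8, 8) (24, 24)
```

A smaller step yields more, increasingly overlapping windows, which is why the preprocessing step limits the image resolution first.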

Background Feature Point Descriptor
The algorithm divides the rectangular area representing the target into small blocks of the same size, calculates the characteristics of each small block, and then samples the ARAC features of each small block at its centre position. As shown in Figure 13, the black dot at the centre is the selected pixel, and the outermost frame represents the selected image block.
The algorithm calculates the gradient of each pixel, accumulates the gradient histogram of the pixels in each bin over 8 directions, and takes the peak of the histogram as the main direction of the block. The 8-bin histograms of the 4 × 4 small blocks in each sampling window are concatenated to form a 4 × 4 × 8 = 128-dimensional ARAC feature descriptor. The result of feature extraction on the background with ARAC is shown in Figure 14.
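The histogram assembly can be sketched as follows. This is an illustration with assumptions, not the paper's implementation: the name `background_descriptor` is hypothetical, the patch is assumed to be square with a side divisible by 4, votes are weighted by gradient magnitude, and the result is L2 normalized.

```python
import numpy as np


def background_descriptor(patch, n_cells=4, n_orient=8):
    """Concatenate 8-bin gradient histograms of 4 x 4 cells into 128 values."""
    patch = np.asarray(patch, dtype=float)
    grad_y, grad_x = np.gradient(patch)           # derivatives along rows, cols
    magnitude = np.hypot(grad_x, grad_y)
    angle = np.arctan2(grad_y, grad_x) % (2 * np.pi)
    orient = np.minimum((angle / (2 * np.pi) * n_orient).astype(int),
                        n_orient - 1)
    cell = patch.shape[0] // n_cells
    histograms = np.zeros((n_cells, n_cells, n_orient))
    for row in range(n_cells):
        for col in range(n_cells):
            window = (slice(row * cell, (row + 1) * cell),
                      slice(col * cell, (col + 1) * cell))
            # Each pixel votes for its direction bin with its gradient magnitude.
            histograms[row, col] = np.bincount(
                orient[window].ravel(),
                weights=magnitude[window].ravel(),
                minlength=n_orient)
    vector = histograms.ravel()
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector


# A horizontal intensity ramp has all of its gradient in the first direction bin.
ramp = np.tile(np.arange(16.0), (16, 1))
print(background_descriptor(ramp).shape)  # (128,)
```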

Match Condition Construction
When students attend classes, take large exams, listen to lectures or participate in conferences, indoor navigation with high real-time performance is required. In the actual navigation process, we often encounter scenes with strong background consistency, such as corridors and stairwells, or indoor environments with many rooms, such as teaching buildings, hospitals, science parks and libraries. Such scenes have fewer feature points and higher similarity, so feature extraction and matching are more difficult. Therefore, the test images were shot in the Huiwen Building of Heilongjiang University, whose construction meets the standards for large-scale examinations. The appearance of the teaching building and several indoor panoramic pictures are shown in Figure 15. Nine floors of the building are the main teaching areas, with 80 classrooms on each floor. The building also contains four large lecture theatres, where meetings and lectures are often held.
To verify the effectiveness and accuracy of the proposed ARAC, as well as its robustness under different illumination conditions and shooting angles, the SIFT, SURF and ARAC algorithms are used to extract image features. Under combined illumination, affine and scale variations, FLANN is adopted to match the extracted features, and RANSAC is used to filter the matching pairs. The image matching results are then compared and the data are analysed.
In the experiment, the PC is configured with an Intel(R) Core(TM) i5-5200U CPU @ 2.2 GHz and 4 GB of memory. The integrated development environment is Visual Studio Code, and the development language is Python 3.7. The image features are extracted and matched with the open-source tool OpenCV. The specific matching flowchart is shown in Figure 16.

Test Results of Each Feature Point Detection Method
To compare the actual detection results of the feature extraction methods, eight indoor images are used as experimental objects. They include corridors with logos and signposts, as well as classrooms and other real scenes with few background feature points and high similarity. SIFT, SURF and the ARAC feature extraction method proposed in this paper are used to extract features from the images shown in Figure 17a.
From the detection results in Figure 17, it can be seen that the feature detection methods differ in their ability to extract features. In the SIFT results, the feature points are drawn at different sizes in the original image because they are detected at different scales. SIFT extracts feature points from the letters and Chinese characters on the school badge of Heilongjiang University, the corners of the oven, the cutting board, and even the shadow of the trash can, but the number of feature points it extracts is relatively small. SURF detects a large number of feature points, and the selected neighbourhood range is larger. As shown in the fifth group of Figure 17, it extracts a large number of feature points not only in the foreground area but also in the distant light and shadow. The comparison of the extraction results shows that the feature points extracted by SIFT and SURF are concentrated in the foreground area with obvious feature changes. The ARAC feature detection method in this paper divides the image into foreground and background areas and then extracts features from each with its own method. It not only detects the overall and local features of the oven, school badge and signs in the foreground area but also extracts features from the gently changing white walls, cabinets and desktops in the background area, so that positioning information is not lost.

Feature Extraction Performance Analysis
To compare the feature extraction methods in more detail, each method is run 100 times on each image and the average time is reported in Table 1. The results in Table 1 show that SURF detects four to eight times as many features as SIFT, and its detection time is also longer than that of SIFT. Nevertheless, for the same number of extracted features, SURF needs less time: calculation shows that the feature extraction efficiency of SURF is about three times that of SIFT. The extraction time of ARAC is longer than that of SIFT but similar to that of SURF, because ARAC extracts a large number of features from both the foreground and the background. If the foreground and background are processed in parallel, ARAC takes less time, as shown in the ARAC(PP) entries of Table 1. The data in Table 1 indicate that, for the images of groups (vi) to (viii), feature extraction takes longer with ARAC than with SURF because the foreground areas of these three groups are relatively large. Moreover, when the image is segmented, part of the background area is mistakenly assigned to the foreground, which results in a slightly longer extraction time. For every group of images, however, when the foreground and background areas are processed in parallel, the time consumed is less than that of SURF.
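The averaging procedure can be sketched as follows. The helper names `mean_runtime` and `efficiency` are hypothetical, and efficiency is assumed here to mean the number of features extracted per second, which the text does not define explicitly.

```python
import time


def mean_runtime(extract, image, repeats=100):
    """Average wall-clock time of `extract(image)` over `repeats` runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        result = extract(image)
    elapsed = time.perf_counter() - start
    return elapsed / repeats, result


def efficiency(n_features, seconds):
    """Assumed definition: features extracted per second."""
    return n_features / seconds


# A stand-in callable is used here in place of a real extractor.
average, features = mean_runtime(len, [1, 2, 3], repeats=5)
print(features)  # 3
```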

Feature Matching Effect Comparison
The previous section compares the speed and the number of features extracted by ARAC and the other feature detection methods, but feature extraction alone is not a sufficient measure. The repeatability and distinctiveness of features are the criteria for judging the quality of a feature extraction method. Feature detection determines the repeatability of features, and feature descriptors determine their distinctiveness. The features applied to scene recognition need to cope with various changes, and their repeatability and distinctiveness are reflected in the matching of images. Eight groups of indoor images taken under different shooting conditions are used as the matching objects, as shown in Figure 18. The first two groups are regular scenes with desktops and cabinets as the background. The 3rd and 4th groups are scenes with school badges and logos in the corridor. The 5th to 8th groups are all scenes inside the teaching building. The foreground of the 5th group is a signpost, and the 6th to 8th groups contain very similar classroom doors and elevator doors.

The size of each image to be matched in Figure 18 is 600 × 450 pixels. Taking the first image in each group of scenes as the reference, the second image differs from it in illumination and affine transformation, and the third image adds a scale change to the conditions of the second. In a corridor environment with limited space, a scale change alters the number of objects contained in the two images to be matched and the range of pixels they occupy. Figure 18i contains more characters than Figure 18g,h. Similarly, Figure 18r has more posters on the wall than Figure 18p,q. Although the indoor corridor space is limited, changes such as the increase in the number of objects caused by scale changes affect the efficiency of feature extraction. Features are extracted from all images with the three methods. The features of the first image are then matched against those of the other two images of the same scene in Figure 18 using FLANN. After screening with RANSAC, the average time over 100 matches, the number of feature points in the two images and the number of correct matching pairs are recorded to evaluate the methods.

Matching Effect under Affine and Illumination Conditions
The eight groups of pictures are matched under illumination changes and affine transformation, as shown in Figure 18. The data are recorded in Table 2.
It can be seen from Table 2 that, for a small-angle affine transformation with illumination changes, SURF extracts the largest number of feature points in groups (ii) to (viii), followed by ARAC and SIFT. Apart from the two foreground objects in Figure 18a,b, the trash can and the lamp, most of the image is background with gentle grayscale changes. In this region, SIFT and SURF lose their feature extraction and matching capabilities. Because the ARAC background feature extraction method extracts a large number of feature points in this type of area, ARAC obtains more feature points than SURF on the images of Figure 19a,b.
After the matched feature pairs are filtered by the strict conditions of RANSAC, the matching result of SIFT is the worst and that of ARAC is the best. On groups (v) and (viii), SIFT obtained only 28 and 27 correct matches, respectively. SURF obtained 42 and 38 correct matching pairs, while ARAC obtained the largest numbers of correct matching pairs, 95 and 92, respectively.
In terms of feature extraction and matching time, SURF takes the longest and SIFT the shortest; taking the number of feature points into account, however, the feature extraction efficiency of SURF is about three times that of SIFT. SIFT extracts the fewest feature points, so its matching time is the shortest. The ARAC proposed in this paper takes longer than SIFT for feature extraction and matching, but less time than SURF. If the feature extraction of the foreground and background is processed in parallel, the total time becomes shorter. Considering the number of correct matching pairs together with the feature extraction and matching time, this result shows that the method of extracting features separately from the foreground and background has good accuracy and effectiveness.

Matching Effect under Large-Angle Affine and Illumination
The eight groups of images are matched under the conditions of large-angle affine and illumination as shown in Figure 20. The data are recorded in Table 3 The eight groups of images are matched under the conditions of large-angle affine and illumination as shown in Figure 20. The data are recorded in Table 3. According to Figure 20, it can be seen that the positions of the correct matching pairs using SIFT and SURF are concentrated on the foreground objects changed significantly, such as the characters on the school badge and the signpost, the corners and edges of the oven, the lamp, and the trash can. The correct matching pairs after using ARAC appear not only on the objects with distinct features but also on the walls, cabinets and desktops as shown in the group (iv) in Figure 20.    According to Figure 20, it can be seen that the positions of the correct matching pairs using SIFT and SURF are concentrated on the foreground objects changed significantly, such as the characters on the school badge and the signpost, the corners and edges of the oven, the lamp, and the trash can. The correct matching pairs after using ARAC appear not only on the objects with distinct features but also on the walls, cabinets and desktops as shown in the group (iv) in Figure 20.
From the results in Table 3, it is obvious that under such a large angle change, SURF extracts the most feature points, followed by ARAC, and SIFT does the least. In the experiments of groups (i), (v) and (vii), SIFT only got 21, 21 and 14 correct matching pairs, while ARAC got the most correct matching pairs, 91, 71 and 74, respectively. The ability of feature extraction and matching time of ARAC are comparable to SURF and even slightly better than SURF. Similarly, the feature extraction and matching time of ARAC using parallel processing are shorter.
The number of correct matching pairs and the total time of extraction and matching obtained by the three methods for feature matching in the two cases are compared as shown in Figures 21 and 22.
According to Figure 21, the ARAC algorithm identified by the yellow bar has the largest number of correct matching pairs in any case, which is better than SIFT and SURF. It verifies the accuracy of the method proposed in this paper. It is found from Figure 21, the extraction and matching time of SIFT is the shortest, but whose ability to extract and match features in such scenes is particularly poor shown in Figure 20. Even if its time of extraction is short, it cannot meet positioning and navigation demand. According to the yellow bar and the red bar, it can be observed that the feature extraction and matching speed of ARAC is slightly better than SURF. And the ARAC with parallel processing shown in the green column with the legend of ARAC(PP) takes less time. Combined that the number of correct matching pairs of ARAC is the largest, it verifies the efficiency of ARAC and meets the requirements of real-time indoor visual positioning and navigation. From the results in Table 3, it is obvious that under such a large angle change, SURF extracts the most feature points, followed by ARAC, and SIFT does the least. In the experiments of groups (i), (v) and (vii), SIFT only got 21, 21 and 14 correct matching pairs, while ARAC got the most correct matching pairs, 91, 71 and 74, respectively. The ability of feature extraction and matching time of ARAC are comparable to SURF and even slightly better than SURF. Similarly, the feature extraction and matching time of ARAC using parallel processing are shorter.

Discussion
The scenes used in this manuscript are corridors, stairwells, libraries, computer rooms and similar spaces. Their foreground and background differ markedly, with clear outlines and obvious colour divisions. This paper therefore uses Grab Cut to segment the regions when separating the foreground from the background. Given a determined background area, this method finds the boundary between the foreground and the background within the candidate area. In shopping malls, supermarkets and other colour-rich areas crowded with goods, however, I found that the foreground and the background differ little in blur, and the excess colour information hinders their discrimination. In such scenes, the advantages of the feature extraction method in this paper are not obvious. Distinguishing the foreground from the background more reliably would provide more effective information for indoor positioning and scene recognition, and how to do so more accurately is one of my follow-up research directions.

Conclusions
This paper focuses on feature extraction for corridors, stairwells, libraries, computer rooms and other indoor navigation scenes that contain large areas of white walls and cabinets, with strong background consistency and sparse feature points. Such scenes are the difficult cases of indoor visual positioning and navigation. Existing methods such as SIFT and SURF are suited to indoor and wide-area outdoor scenes with rich feature points, and it is not realistic for researchers to post pictures and logos on walls themselves to increase the number of feature points and assist positioning. This paper therefore designs the feature extraction method ARAC. The main work is as follows. First, an ARAC feature extraction model is built. Second, Grab Cut in OpenCV is used, in Visual Studio Code, to divide the indoor image into foreground and background areas. Then, the ARAC feature extraction method is applied to extract the features of the foreground and background areas separately. Next, images under the combined conditions of illumination, scale and affine transformation are matched using the SIFT and SURF algorithms in OpenCV. Finally, the matching results of ARAC are compared with those of SIFT and SURF.
According to the matching data, SIFT is the fastest algorithm in indoor environments where various interference conditions change randomly, but it yields too few matching pairs to meet the requirements of indoor positioning and navigation. Although SURF extracts the largest number of feature points, the criterion for judging a feature extraction and matching method is the number of correct matching pairs. ARAC, the feature extraction method designed in this paper, obtains the largest number of correct matching pairs under the combined conditions of illumination, scale and affine transformation, and its feature extraction and matching time is comparable to that of SURF. Moreover, ARAC with parallel processing takes even less time. The ARAC algorithm is therefore more adaptable: it remains stable under changes in illumination and viewing angle, and it is more robust and better suited to real-time use. It meets the need for fast, efficient and stable indoor navigation addressed in this paper.