This section focuses on the characterization of data from GNSS, cellular network systems, optical sensors, magnetometers, and Wi-Fi sensors, all of which are commonly used in smartphones and are less affected by variations in terminal devices. These five sensors were chosen due to their widespread use and reliability. The temporal, spatial, and mathematical/statistical features of the data vary depending on whether the environment is indoor or outdoor, and thus, this paper emphasizes data mining for these five sensors to account for these differences.
  2.1. Satellite Information Data Mining
Satellite azimuth and altitude angles indicate the position of a satellite in space. The azimuth angle is measured relative to true north, ranging from 0° to 360°, while the altitude angle ranges from 0° to 90°. Based on the distribution of visible satellite azimuth and altitude angles, it is possible to roughly infer areas of obstruction at the current location. 
Figure 2 illustrates this, showing the satellite zenith maps at various locations.
At the indoor west-side window, 90% of the visible satellites are distributed between 150° and 345°. Similarly, at the indoor south window, over 90% of visible satellites are located between 75° and 210°. In deeper indoor environments, only four visible satellites are detected, while in outdoor open areas, the satellites are more uniformly distributed. The altitude angle can also serve as a feature for scene classification.
Based on the above analysis, we divide the azimuth angle into 24 regions, each spanning 15 degrees, and construct a 24-dimensional feature vector to represent the satellite distribution. For each region, a value of 0 indicates no visible satellites, while a value of 1 indicates the presence of visible satellites. Additionally, the percentage of regions containing visible satellites relative to the total number of regions is used as a feature input, which is represented by the following equation:
$$R_{vis} = \frac{1}{24}\sum_{k=1}^{24} s_k$$
where $s_k \in \{0, 1\}$ is the value assigned to the k-th azimuth region.
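The sector-occupancy feature described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `azimuth_occupancy` and the sample azimuths are hypothetical, and the sketch assumes azimuths are given in degrees.

```python
from typing import List, Sequence, Tuple

NUM_SECTORS = 24           # 360 degrees / 15 degrees per sector
SECTOR_WIDTH_DEG = 15.0


def azimuth_occupancy(azimuths_deg: Sequence[float]) -> Tuple[List[int], float]:
    """Return the 24-dimensional 0/1 occupancy vector and the occupied-sector ratio."""
    occupancy = [0] * NUM_SECTORS
    for az in azimuths_deg:
        # Wrap into [0, 360) and guard against floating-point edge cases.
        sector = min(int((az % 360.0) // SECTOR_WIDTH_DEG), NUM_SECTORS - 1)
        occupancy[sector] = 1
    ratio = sum(occupancy) / NUM_SECTORS
    return occupancy, ratio


# Hypothetical azimuths (illustrative values only, not measured data).
vector, ratio = azimuth_occupancy([152.0, 160.5, 201.0, 275.0, 340.0])
```

Two satellites falling in the same 15° sector set only one entry, so the ratio measures angular spread rather than satellite count.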
For precision analysis, we use the dilution of precision (DOP) to assess how the spatial geometric distribution of the observed satellites affects positioning accuracy. DOP is an indicator of position quality: higher DOP values indicate poor satellite geometry and lower precision, while smaller DOP values suggest a better satellite distribution and higher potential precision.
The Position Dilution of Precision (PDOP) represents the three-dimensional position accuracy, the Horizontal Dilution of Precision (HDOP) measures the accuracy in the horizontal plane, and the Vertical Dilution of Precision (VDOP) focuses on vertical accuracy [21]. These factors describe the positioning accuracy in their respective dimensions, and their relationship is expressed as follows:
$$PDOP^2 = HDOP^2 + VDOP^2$$
We analyzed PDOP, HDOP, and VDOP statistics across various scenarios and derived their probability distributions, as shown in Figure 3. Figure 3a shows how the accuracy factors change from an open outdoor environment to a semi-outdoor environment. In open outdoor settings, the accuracy factors fluctuate little, but once the environment becomes semi-outdoor, with epoch 1127 marking the transition, the fluctuations become more pronounced.
Figure 3b shows the changes in accuracy factors as the environment shifts from semi-outdoor to semi-indoor and then to a deep indoor environment. At epoch 352, the environment transitions to an indoor state, where the terminal remains in tracking mode, so the accuracy factors do not change immediately. However, sudden large fluctuations occur indoors, which is likely due to weak satellite signals in certain indoor areas.
Table 1, Table 2, and Table 3 compare the PDOP, HDOP, and VDOP values in the open outdoor environment (epochs 0–1127) with those in the semi-outdoor environment (epochs 1128–3000). The minimum values in the outdoor and semi-outdoor environments are approximately the same, indicating that the minimum accuracy factor does not significantly differentiate these scenarios. However, the variance, peak, and mean values show clear differences and can be used as statistical features to distinguish between scenes.
In particular, the variance shows the most noticeable difference. In outdoor environments, the variance of the accuracy factors is generally low, with values of 0.0060, 0.0015, and 0.0057 for PDOP, HDOP, and VDOP, respectively. In contrast, in semi-outdoor environments, where objects like buildings cause obstructions, the variance is significantly higher, with values of 0.2775, 0.1318, and 0.1644.
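The statistical features named above (minimum, peak, mean, and variance of a DOP series) can be computed per window as in the following sketch. The function name `dop_window_features` is hypothetical, and the sketch assumes the population variance; the text does not state which variance estimator was used.

```python
from statistics import fmean, pvariance
from typing import Dict, Sequence


def dop_window_features(dop: Sequence[float]) -> Dict[str, float]:
    """Summarize one window of PDOP, HDOP, or VDOP samples."""
    if not dop:
        raise ValueError("window must contain at least one sample")
    return {
        "min": min(dop),
        "peak": max(dop),
        "mean": fmean(dop),
        "variance": pvariance(dop),  # assumption: population variance
    }


# Illustrative values only: a steady window versus a fluctuating one.
steady = dop_window_features([1.10, 1.12, 1.11, 1.09])
fluctuating = dop_window_features([1.1, 2.4, 1.3, 3.0])
```

Applied separately to PDOP, HDOP, and VDOP, this yields twelve values per window, of which the variance is the one the analysis above identifies as most discriminative.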
The number of visible satellites can also serve as an important marker for scene classification. Changes in the number of visible satellites occur continuously, containing rich temporal information. Even at the same location, the number of visible satellites may vary under different weather conditions, as shown in Figure 4a. The figure depicts tests conducted under three weather conditions as the environment transitions from outdoor to indoor, then to semi-indoor, and back to outdoor.
On rainy days, the overall number of visible satellites is lower, likely due to the thicker cloud layers, which affect satellite signal reflection and refraction. In outdoor environments on rainy days, the number of visible satellites mostly ranges between 20 and 25, but it fluctuates significantly when switching from outdoor to indoor. In deep indoor environments, the number of visible satellites drops to 2 or fewer. On cloudy days, the number of visible satellites mostly ranges from 25 to 30 and shows the greatest overall fluctuation, which is likely due to cloud movement. On sunny days, the number of visible satellites is the highest, typically ranging between 30 and 35.
The transitions between indoor and outdoor environments under the three weather conditions reveal that the number of visible satellites does not decrease immediately upon moving indoors from outdoors. This is because the terminal keeps tracking the satellite signals until signal lock is lost, which occurs after a variable lag. Similarly, when moving outdoors from indoors, the number of visible satellites does not increase immediately. This delay is due to the time required for the terminal to capture, track, and process satellite signals, which involves decoding the ephemeris. This process is influenced by both the terminal’s performance and environmental factors.
To minimize the influence of weather, environment, and equipment variability on the parameters, the number of visible satellites is processed using a time-series differencing method. The differencing formula is shown in the following equation:
$$D\_num\_sate_N(t) = num\_sate(t) - num\_sate(t - N)$$
where $D\_num\_sate_N$ represents the differenced feature of the number of satellites, $num\_sate(t)$ indicates the number of visible satellites at moment $t$, and $N$ refers to the number of epochs spanned by the difference, which also determines the size of the sliding window.
When processing time-series data, the selection of the time window plays a critical role, as the features in the time series depend on the window size. As shown in Figure 4b, a small time window captures finer, localized features, while a larger time window reveals broader, global changes. This approach helps provide more useful information for scene classification by capturing multi-scale data. Consequently, time-series information is typically processed using multiple time windows to capture both small- and large-scale features, which are then used for scene classification.
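As a minimal sketch of multi-window differencing, the code below differences a satellite-count series at several window sizes. It assumes the difference is taken between the current epoch and the epoch N steps earlier; the function names, the window sizes (1 and 3), and the sample counts are hypothetical.

```python
from typing import Dict, List, Sequence


def windowed_difference(counts: Sequence[int], window: int) -> List[int]:
    """Return counts[t] - counts[t - window] for every t >= window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [counts[t] - counts[t - window] for t in range(window, len(counts))]


def multi_scale_differences(
    counts: Sequence[int], windows: Sequence[int] = (1, 3)
) -> Dict[int, List[int]]:
    """Difference the same series at several window sizes."""
    return {w: windowed_difference(counts, w) for w in windows}


# Illustrative outdoor-to-indoor walk (not measured data).
counts = [30, 30, 29, 25, 18, 10, 4, 2]
features = multi_scale_differences(counts)
```

The one-epoch difference reacts to each step of the drop, while the three-epoch difference accumulates the decline and makes the overall transition easier to separate from noise.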
        
To evaluate the quality of a satellite signal, the most commonly used indicator is the carrier-to-noise ratio (CNR). The CNR is the ratio of the carrier signal power to the noise power spectral density, which is typically expressed in dB-Hz. This measure is crucial for assessing signal quality and the performance of the receiver. The CNR varies significantly across scenarios such as indoor and outdoor environments, in both its level and its pattern of change.
Figure 5 illustrates the variation in carrier-to-noise ratio (CNR) and the difference in CNR (DCNR) of three satellites during the indoor–outdoor state switching process. In the figure, T1, T2, and T3 represent the outdoor, semi-indoor, and indoor states, respectively. Figure 5a shows that when transitioning from the outdoor state to the indoor state, the satellite CNR responds with a noticeable time delay and then follows a consistent downward trend over a certain period.
Figure 5b further indicates that during the transition from outdoor to indoor, the DCNR consists primarily of negative values for an extended duration. Conversely, in the transition from indoor to outdoor shown in Figure 5a, there is also a noticeable hysteresis effect, followed by a steep increase in the CNR. This change results in several peaks in the DCNR, ranging from 30 to 40, as illustrated in Figure 5b.
Since different satellites exhibit varying trends, amplitudes, and timings in their CNR, it is essential to minimize the impact of these inconsistencies on scene classification. To address this, the algorithm analyzes the CNR and DCNR of all visible satellites to establish a comprehensive trend. By examining the peaks in DCNR changes, the algorithm can more accurately assess the transitions between indoor and outdoor states. Additionally, it considers the influence of sliding window size on this judgment. The definitions of the relevant formulas are as follows:
$$DCNR_i^N(x) = CNR_i(x) - CNR_i(x - N)$$
where $CNR_i(x)$ denotes the CNR of the i-th satellite at moment $x$ and $N$ denotes the sliding window size. In addition, the percentages of ascending, descending, and flat features within the sliding window are computed as statistics. The formula can be expressed as
$$P_{up} = \frac{N_{up}}{N_{up} + N_{down} + N_{flat}}, \quad P_{down} = \frac{N_{down}}{N_{up} + N_{down} + N_{flat}}, \quad P_{flat} = \frac{N_{flat}}{N_{up} + N_{down} + N_{flat}}$$
where $N_{up}$, $N_{down}$, and $N_{flat}$ are the numbers of ascending, descending, and flat features within the sliding window.
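A minimal sketch of these window statistics follows. It rests on assumptions not stated in the text: DCNR is taken as the N-step difference of each satellite's CNR, the values of all satellites are pooled, and a hypothetical threshold `eps` (in dB-Hz) separates flat from ascending or descending changes. The satellite labels and CNR values are illustrative.

```python
from typing import Dict, List, Sequence, Tuple


def dcnr_series(cnr: Sequence[float], window: int) -> List[float]:
    """N-step CNR difference for one satellite."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [cnr[x] - cnr[x - window] for x in range(window, len(cnr))]


def trend_percentages(
    cnr_by_satellite: Dict[str, Sequence[float]],
    window: int,
    eps: float = 1.0,  # assumption: changes within +/- eps count as flat
) -> Tuple[float, float, float]:
    """Return (ascending, descending, flat) fractions over all satellites."""
    pooled = [
        d for cnr in cnr_by_satellite.values() for d in dcnr_series(cnr, window)
    ]
    if not pooled:
        return (0.0, 0.0, 0.0)
    up = sum(1 for d in pooled if d > eps)
    down = sum(1 for d in pooled if d < -eps)
    flat = len(pooled) - up - down
    n = len(pooled)
    return (up / n, down / n, flat / n)


# Illustrative outdoor-to-indoor CNR traces for two satellites.
sample = {"G01": [45.0, 44.0, 40.0, 33.0], "G07": [42.0, 42.0, 43.0, 30.0]}
p_up, p_down, p_flat = trend_percentages(sample, window=1)
```

Pooling across satellites is one way to obtain the "comprehensive trend" mentioned above; a per-satellite vote would be an equally plausible reading.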
        
  2.3. Wi-Fi AP Node Data Mining
Nowadays, Wi-Fi nodes are ubiquitous in society, and the AP (access point) nodes of Wi-Fi refer to wireless access points [24,25]. Generally, the number of AP nodes available outdoors is greater than that indoors, especially in dense urban areas, commercial streets, office buildings, campuses, and parks. In contrast, residential areas tend to have fewer AP nodes. When entering an urban indoor space, the presence of multiple walls can obstruct signals, leading to a generally lower number of accessible AP nodes compared to outdoor locations.
Figure 7a and Figure 7b depict the spectrum maps of indoor and outdoor Wi-Fi channels, respectively. The frequency points are primarily distributed in the 2.4 GHz and 5 GHz bands and in the 6 GHz band introduced by Wi-Fi 6E. It can be observed that nodes operating in the 2.4 GHz band typically exhibit stronger signal strength, whereas the signal strengths in the other frequency bands are generally lower. The 2.4 GHz band offers wider coverage and better penetration but is more susceptible to interference due to its narrower bandwidth.
The 5 GHz band provides wider bandwidth and more channels, resulting in higher data rates and reduced interference; however, it has smaller coverage and poorer penetration. The 6 GHz band offers even more channels and less interference, making it suitable for environments with high data demands, although its signal strength and coverage may be slightly reduced.
Figure 8a presents the number of Wi-Fi AP nodes. A total of 200 experiments were conducted, comprising 100 groups of indoor scanning results and 100 groups of outdoor scanning results across various environments, including office buildings, campuses, playgrounds, residential buildings, and shopping malls. The graph indicates that the number of indoor AP nodes typically ranges from 0 to 30, while the number of outdoor AP nodes generally falls between 25 and 90, demonstrating that outdoor AP nodes are typically more numerous than indoor nodes.
Figure 8b illustrates the Wi-Fi signal strength. Indoors, it tends to fluctuate significantly, with overall signal strengths ranging from −88 to −50 dBm. In contrast, outdoor signal strengths are lower, generally falling between −95 and −70 dBm. This difference arises because an indoor terminal is usually closer to the transmitting point and the signal experiences less obstruction. Conversely, an outdoor terminal is generally farther from the transmitting points but can receive signals from a greater number of them. Therefore, the number of Wi-Fi AP nodes and the average signal strength can also serve as important factors in scene classification.
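The two Wi-Fi features named above can be derived from a single scan as sketched below. The function name `wifi_scan_features` and the sample readings are hypothetical; the sketch assumes a scan is available as a list of per-AP RSSI values in dBm.

```python
from statistics import fmean
from typing import Optional, Sequence, Tuple


def wifi_scan_features(rssi_dbm: Sequence[float]) -> Tuple[int, Optional[float]]:
    """Return (number of AP nodes, mean RSSI in dBm) for one scan.

    The mean is None when no AP is visible, since an average is undefined.
    """
    count = len(rssi_dbm)
    mean_rssi = fmean(rssi_dbm) if count else None
    return count, mean_rssi


# Illustrative scans (not measured data).
indoor = wifi_scan_features([-52.0, -61.0, -75.0])
outdoor = wifi_scan_features([-78.0, -83.0, -90.0, -91.0, -88.0, -72.0])
```

Averaging in dBm is a simplification; averaging in linear power (mW) and converting back would weight strong APs more heavily.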
   2.6. Complex Indoor and Outdoor Scene Classification Algorithm Based on Spatio-Temporal Features
In this paper, we propose a complex indoor and outdoor scene classification model based on spatio-temporal features. This model leverages data collected from cellular networks, satellite signals, Wi-Fi signals, light sensors, and inertial sensors found in smartphones. By performing spatio-temporal characterization and feature extraction on the collected information, we develop a network that integrates a multi-scale convolutional neural network (CNN) with a bidirectional long short-term memory (BiLSTM) network. The multi-scale CNN captures spatial features, while the BiLSTM further extracts spatio-temporal features from time series data.
The overall architecture of the algorithmic model is illustrated in Figure 10. In this model, features from the analyzed data are extracted through convolution, resulting in multiple feature sets that are then flattened. These features are passed through the BiLSTM network, and scene classification and identification are performed using a fully connected layer followed by a softmax activation function. Additionally, an intelligent optimization algorithm known as the whale optimization algorithm (WOA) is employed to determine the optimal parameter values. The data feature extraction process has been thoroughly described and standardized in the preceding sections.
To analyze the relationship between various sensor features and environmental characteristics, we introduced Pearson’s correlation coefficient to quantify the degree of linear correlation between two sensors [29]. This coefficient is expressed as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
In the equation, $x_i$ and $y_i$ denote the i-th samples of the first and second sensor variables, respectively, $\bar{x}$ and $\bar{y}$ are their means, and $n$ represents the number of samples. The calculation yields the correlation diagram of sensors and scene features, as shown in Figure 11. The correlation coefficients indicate that the correlations between the sensors are generally low, with the GNSS sensors and light sensors exhibiting a stronger correlation. Furthermore, the GNSS sensors show the highest correlation with the scene, whereas the IMU sensors demonstrate the weakest correlation.
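For reference, Pearson's coefficient can be computed directly from its definition, as in this standard-library sketch. The function name `pearson` and the sample series are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sample series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(ss_x * ss_y)
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / denom


# Illustrative series: satellite count and light level falling together.
r = pearson([30, 28, 20, 8, 3], [900.0, 850.0, 400.0, 120.0, 80.0])
```

Because the coefficient only captures linear dependence, a low value between two sensors does not rule out a nonlinear relationship.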
  2.6.1. Dual-Scale CNN Convolutional Networks
Convolutional networks are deep learning models designed for processing data with a grid topology. They automatically extract and classify features from data using convolutional, pooling, and fully connected layers. These networks have gained widespread application in fields such as image processing and computer vision. In recent years, researchers have achieved significant advancements in enhancing model efficiency, lightweighting, and task adaptability.
For instance, Bello et al. proposed the ResNet-RS network, which optimizes the classical ResNet architecture. This enhancement improves performance on large-scale image classification tasks through better data augmentation, optimization techniques, and refined training processes [30]. Similarly, Liu et al. introduced ConvNeXt, which is a modernized version of the classical CNN. ConvNeXt adapts the original structure to close the performance gap with Vision Transformers while maintaining the efficiency characteristic of CNNs [31]. Additionally, Liu et al. developed MS-Net, which is a deep learning model aimed at improving prostate segmentation in MRI scans [32].
In this paper, we employ a dual-scale convolutional network, as illustrated in Figure 12. This dual-scale neural network captures detailed information at different scales. Compared to traditional single-scale convolutional networks, the dual-scale approach allows for the extraction of feature information changes that may not be evident at a single scale. By integrating both local and global information, this method enhances the model’s generalization ability and supports more comprehensive decision making.
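The idea of combining two receptive-field sizes can be illustrated with a small standard-library sketch. It is not the network in Figure 12: the kernel sizes (3 and 5), the fixed averaging kernels, and the function names are assumptions chosen for illustration, whereas a trained network learns its kernels.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D cross-correlation with zero padding; output length equals input length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd for symmetric padding")
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def dual_scale_features(
    signal: Sequence[float],
    small_kernel: Sequence[float],
    large_kernel: Sequence[float],
) -> List[float]:
    """Concatenate the activations of a small-scale and a large-scale branch."""
    local = relu(conv1d_same(signal, small_kernel))
    broad = relu(conv1d_same(signal, large_kernel))
    return local + broad


# Illustrative input: a single spike seen at two scales.
features = dual_scale_features([0.0, 0.0, 1.0, 0.0, 0.0], [1 / 3] * 3, [1 / 5] * 5)
```

The small kernel keeps the spike localized to three positions, while the large kernel spreads it over the whole window, which is the local-versus-global contrast the paragraph describes.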
  2.6.2. BiLSTM Network
Although multi-scale neural networks effectively extract spatial features, their ability to capture temporal features is less satisfactory. To address this limitation, we incorporate a BiLSTM (bidirectional long short-term memory) network into the model for temporal feature extraction. LSTMs, or long short-term memory networks, consist of forget gates, input gates, and output gates, which work together as a memory unit. These gates dynamically regulate the storage and output of memory information at each time step.
In recent years, various LSTM variants have been proposed to enhance their capabilities [33]. For example, Qodim et al. introduced a spatio-temporal attention mechanism to improve video classification models. This approach utilizes an attention mechanism to weight important regions and time periods within video frames, enhancing the model’s ability to manage complex spatio-temporal dependencies in sequences. Additionally, they proposed a multi-scale LSTM model, which addresses sequential data across different time scales. This model enables LSTMs to capture features over varying time spans by integrating a multi-scale mechanism, making it suitable for processing temporal data with multiple cycles or hierarchical dependencies [34].
Moreover, Zhou et al. combined BiLSTM with graph neural networks (GNNs) for text classification tasks [35]. In this setup, the BiLSTM handles serialized features of the text, while the GNN further processes the graph structure, allowing for the better modeling of dependencies within the text.
In this paper, we utilize a BiLSTM network, and its structure is illustrated in Figure 13. This architecture significantly enhances the recognition of long-distance dependent patterns by simultaneously capturing the forward and backward temporal dependencies of data sequences. In this structure, the data first enter the bidirectional network, where the information flow to the hidden layer is precisely controlled through the coordinated tuning of two sets of parameters. Finally, the softmax layer is employed to effectively classify the extracted features. The theoretical expression for this process is as follows:
where $\overrightarrow{h}_t$ is the BiLSTM forward layer output, $\overleftarrow{h}_t$ is the BiLSTM backward layer output, and $h_t$ is the hidden layer output. Since one-way LSTMs can only consider the influence of previous sequence data on the current data, they cannot incorporate feedback from later data to influence earlier judgments. This limitation prevents the integration of front and back sequences for comprehensive learning. In contrast, the model proposed in this paper can utilize contextual information, allowing it to make informed judgments by linking both past and future sequences.
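The bidirectional pass can be sketched with a scalar LSTM cell in plain Python. This is a toy illustration under stated assumptions: one hidden unit, hand-picked weights, and forward and backward outputs simply paired per time step. How the paper combines the two directions is not recoverable from the text, so the pairing here is an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple

Gate = Tuple[float, float, float]  # (input weight, recurrent weight, bias)


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_scan(xs: Sequence[float], w: Dict[str, Gate]) -> List[float]:
    """Run a one-unit LSTM over xs and return the hidden state at each step."""
    h, c = 0.0, 0.0
    hidden = []
    for x in xs:
        pre = {name: wx * x + wh * h + b for name, (wx, wh, b) in w.items()}
        f = _sigmoid(pre["f"])   # forget gate
        i = _sigmoid(pre["i"])   # input gate
        o = _sigmoid(pre["o"])   # output gate
        g = math.tanh(pre["g"])  # candidate cell value
        c = f * c + i * g
        h = o * math.tanh(c)
        hidden.append(h)
    return hidden


def bilstm(
    xs: Sequence[float], w_fwd: Dict[str, Gate], w_bwd: Dict[str, Gate]
) -> List[Tuple[float, float]]:
    """Pair the forward and backward hidden states for each time step."""
    forward = lstm_scan(xs, w_fwd)
    backward = lstm_scan(list(reversed(xs)), w_bwd)[::-1]
    return list(zip(forward, backward))


# Hypothetical weights and inputs, for illustration only.
weights = {"f": (0.5, 0.1, 0.0), "i": (0.6, -0.2, 0.1),
           "o": (0.4, 0.3, 0.0), "g": (1.0, 0.5, 0.0)}
states = bilstm([0.2, -0.1, 0.7, 0.3], weights, weights)
```

At step t the forward state has seen inputs up to t and the backward state has seen inputs from t onward, so each pair summarizes both past and future context.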
  2.6.3. WOA Optimization Algorithm
The whale optimization algorithm (WOA) is an emerging meta-heuristic algorithm that explores the solution space by simulating the group behavior of humpback whales. It is known for its simplicity in implementation and strong global search capability [36]. The algorithm draws inspiration from the hunting techniques of humpback whales, which create bubble nets around their prey. During this hunting process, the whales swim in a spiral trajectory, gradually tightening their encirclement around the prey.
WOA models this hunting behavior through two primary mechanisms: encircling the prey and spiral motion.
In this context, we assume that the optimal solution corresponds to the location of the prey. As such, the whale will progressively approach the optimal solution. The algorithm formulates this behavior with the following equation:
$$\vec{D} = \left|\vec{C}\cdot\vec{X}^*(t) - \vec{X}(t)\right|, \qquad \vec{X}(t+1) = \vec{X}^*(t) - \vec{A}\cdot\vec{D}$$
$$\vec{A} = 2a\,\vec{r} - a, \qquad \vec{C} = 2\,\vec{r}$$
In the whale optimization algorithm, the position vector of the whale at the t-th generation is denoted as $\vec{X}(t)$, while $\vec{X}^*(t)$ represents the current global optimal solution. Two dynamic vector coefficients, $\vec{A}$ and $\vec{C}$, are utilized in the algorithm, where $\vec{r}$ is a random vector in [0, 1]. The control parameter $a$ decreases linearly from 2 to 0 as the number of iterations progresses, so the range of $\vec{A}$ shrinks accordingly.
The value of $\vec{A}$ plays a crucial role in determining the whale’s movement dynamics. Specifically, when |A| < 1, the whale approaches the prey, while |A| > 1 prompts the whale to randomly select a new location, enhancing the diversity of the search process.
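To simulate the spiral hunting behavior of the whale, a spiral motion model is introduced. This model allows the whale’s spiral trajectory to converge toward the prey, and the process is expressed as follows:
$$\vec{X}(t+1) = \vec{D}'\, e^{bl}\cos(2\pi l) + \vec{X}^*(t), \qquad \vec{D}' = \left|\vec{X}^*(t) - \vec{X}(t)\right|$$
where b is a constant defining the shape of the spiral and l is a random number in [−1, 1]. The encircling mode and the spiral mode are selected alternately with equal probability, and the updating mode control can be expressed by the following equation:
$$\vec{X}(t+1) = \begin{cases} \vec{X}^*(t) - \vec{A}\cdot\vec{D}, & p < 0.5 \\ \vec{D}'\, e^{bl}\cos(2\pi l) + \vec{X}^*(t), & p \ge 0.5 \end{cases}$$
where p is a random number in [0, 1].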
To reduce the chances of the algorithm getting stuck in a local optimum, an exploration mechanism is introduced in the early stages of iteration. During this exploration phase, the search is conducted far from the current optimal solution to enhance global search capabilities. The formula for the exploration stage is as follows:
$$\vec{D} = \left|\vec{C}\cdot\vec{X}_{rand} - \vec{X}(t)\right|, \qquad \vec{X}(t+1) = \vec{X}_{rand} - \vec{A}\cdot\vec{D}$$
where $\vec{X}_{rand}$ is a random solution in the current population. We use the WOA algorithm to optimize the initial learning rate, the regularization parameter, and the number of BiLSTMs to improve the classification accuracy.
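The three update rules above (encircling, spiral motion, and random exploration) can be combined into a compact optimizer sketch. Several simplifications are assumptions: A and C are drawn as scalars per whale rather than as vectors, positions are clipped to the search bounds, and a toy `sphere` objective stands in for the validation loss that would be minimized when tuning network hyperparameters. All names and default values are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def sphere(x: Sequence[float]) -> float:
    """Toy objective (sum of squares), used only for illustration."""
    return sum(v * v for v in x)


def woa_minimize(
    objective: Callable[[Sequence[float]], float],
    dim: int,
    bounds: Tuple[float, float],
    n_whales: int = 20,
    n_iters: int = 100,
    b: float = 1.0,
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Return (best position, best value, best-so-far value per iteration)."""
    rng = random.Random(seed)
    lo, hi = bounds
    whales = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_whales)]
    best = min(whales, key=objective)[:]
    best_val = objective(best)
    history = [best_val]

    for t in range(n_iters):
        a = 2.0 - 2.0 * t / n_iters  # control parameter, 2 -> 0
        for k in range(n_whales):
            x = whales[k]
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # |A| < 1: encircle the best solution; otherwise explore
                # around a randomly chosen whale.
                ref = best if abs(A) < 1.0 else whales[rng.randrange(n_whales)]
                new = [r - A * abs(C * r - xi) for r, xi in zip(ref, x)]
            else:
                # Spiral move toward the best solution.
                l = rng.uniform(-1.0, 1.0)
                spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
                new = [abs(r - xi) * spiral + r for r, xi in zip(best, x)]
            whales[k] = [min(max(v, lo), hi) for v in new]  # clip to bounds
        candidate = min(whales, key=objective)
        candidate_val = objective(candidate)
        if candidate_val < best_val:
            best, best_val = candidate[:], candidate_val
        history.append(best_val)
    return best, best_val, history


best, best_val, history = woa_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
```

For hyperparameter tuning, each position vector would encode the candidate settings, and the objective would train and validate the network for that setting, which makes every evaluation far more expensive than in this toy case.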
  2.6.4. Assessment of Indicators
We used four metrics (accuracy, precision, F1 score, and recall) to evaluate the classification results of spatio-temporal scene recognition. The true positive (TP), false negative (FN), false positive (FP), and true negative (TN) are parameters obtained from the confusion matrix [37]. The accuracy represents the proportion of all samples that are correctly predicted, and the equation is as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision measures the proportion of samples predicted as positive that are actually positive, and the equation is as follows:
$$Precision = \frac{TP}{TP + FP}$$
Recall measures a classification model’s ability to recognize samples in the positive class (the class of interest). It is defined as the proportion of samples correctly identified as positive by the model out of all samples that are actually positive.
$$Recall = \frac{TP}{TP + FN}$$
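F1 combines two metrics, precision and recall, in the following equation:
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
The average value of F1 for all categories is called Macro-F1, and the equation is as follows:
$$Macro\text{-}F1 = \frac{1}{n}\sum_{i=1}^{n} F1_i$$
where $n$ is the number of categories and $F1_i$ is the F1 score of the i-th category.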
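These metrics can be computed from a multi-class confusion matrix as in the sketch below. The function name `classification_metrics`, the row-is-true-class convention, and the sample matrix are assumptions made for illustration; undefined ratios (zero denominators) are reported as 0.0.

```python
from typing import Dict, List, Sequence


def classification_metrics(cm: Sequence[Sequence[int]]) -> Dict[str, object]:
    """Compute accuracy, per-class precision/recall/F1, and Macro-F1.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    precision: List[float] = []
    recall: List[float] = []
    f1: List[float] = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_f1": sum(f1) / n if n else 0.0,
    }


# Illustrative two-class matrix (rows: true indoor/outdoor; not reported results).
metrics = classification_metrics([[5, 1], [2, 4]])
```

Macro-F1 weights every class equally, so a rarely occurring scene such as semi-indoor influences the score as much as a frequent one.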