Article

Passive Localization in GPS-Denied Environments via Acoustic Side Channels: Harnessing Smartphone Microphones to Infer Wireless Signal Strength Using MFCC Features

by Khalid A. Darabkh 1,*, Oswa M. Amro 2,3 and Feras B. Al-Qatanani 1

1 Department of Computer Engineering, School of Engineering, The University of Jordan, Amman 11942, Jordan
2 Department of Cybersecurity, Faculty of Artificial Intelligence, Al-Balqa Applied University, Salt 19117, Jordan
3 Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, Kanpur 208016, India
* Author to whom correspondence should be addressed.
J. Sens. Actuator Netw. 2025, 14(6), 119; https://doi.org/10.3390/jsan14060119
Submission received: 4 October 2025 / Revised: 19 November 2025 / Accepted: 25 November 2025 / Published: 16 December 2025

Abstract

Location provenance based on the Global Positioning System (GPS) and the Received Signal Strength Indicator (RSSI) often fails in obstructed, noisy, or densely populated urban environments. This study proposes a passive location provenance method that addresses these limitations by exploiting the location's acoustics together with the device's acoustic side channel. Using the smartphone's internal microphone, we capture the subtle vibrations produced by the capacitors within the voltage-regulating circuit during wireless transmissions, and we extract key features from the resulting audio signals. Meanwhile, we record the RSSI values of the WiFi access points received by the smartphone at the exact location of the audio recordings. Our analysis reveals a strong correlation between acoustic features and RSSI values, indicating that passive acoustic emissions can effectively represent the strength of WiFi signals; hence, the audio recordings can serve as proxies for Radio-Frequency (RF)-based location signals. We propose a location-provenance framework that relies on sound features alone, particularly the Mel-Frequency Cepstral Coefficients (MFCCs), achieving coarse localization within approximately four kilometers. This method requires no specialized hardware, works in signal-degraded environments, and exposes a previously overlooked privacy concern: internal device sounds can unintentionally leak spatial information. Our findings highlight a novel passive side channel with implications for both privacy and security in mobile systems.


1. Introduction

Localization technologies such as the Global Positioning System (GPS) and Received Signal Strength Indicator (RSSI)-based systems are commonly used to track outdoor devices. However, they often struggle in real-world scenarios, especially under low connectivity or when signals are obstructed [1]. GPS accuracy decreases in dense urban environments primarily because tall buildings obstruct and reflect satellite signals, causing the receiver to pick up delayed or multipath signals [2]. This multipath effect leads to errors in calculating the exact distance to satellites. Additionally, high-rise structures reduce the number of satellites visible to the receiver, resulting in poor satellite geometry and lower positioning precision [2]. Furthermore, electromagnetic (EM) interference from sources such as cellular networks and WiFi, which is common in urban areas, can further degrade signal quality, reducing the overall accuracy of GPS positioning in such settings [3].
WiFi RSSI can be more promising than RSSI from other network types, and even GPS, especially in urban and indoor settings. WiFi access points are scattered everywhere, including inside buildings where GPS signals often struggle or fail [4]. Unlike GPS, which depends on satellites and loses accuracy when walls or tall structures block signals, WiFi signals remain strong and accessible in these environments. Other networks, such as cellular, have more extensive coverage but fewer unique signal sources nearby, which limits their ability to locate a device precisely. WiFi RSSI data can be collected easily using standard devices without special equipment [1]. Although WiFi signals can also be affected by interference and reflections, their availability and density often make WiFi RSSI a more reliable choice for positioning where GPS falls short.

1.1. Summary of Prior Works, Methodology and Major Contributions

Active sound-based localization offers a valuable alternative or complement to GPS and wireless signal methods, especially when radio signals are weak or unreliable [5]. By analyzing how sound waves travel, reflect, and change across space, devices can estimate their position relative to known sound sources or through ambient noise patterns [6]. Sound-based localization techniques do not rely on satellites; they can work indoors or underground where GPS signals cannot reach. Compared to WiFi or cellular signals, acoustic signals can provide finer spatial details by capturing subtle differences in time delays, frequency shifts, or signal intensity [7].
Notably, the literature has explored a range of techniques for smartphone localization using audio and sensor data. Some systems, such as Echo-ID [8] and ELF-SLAM [9], utilize actively emitted sounds to recognize indoor regions or construct maps of the environment using deep learning. Others, such as RATBILS [10], rely on signal processing techniques to estimate real-time coordinates without using Machine Learning (ML). On the passive side, methods like Noise Signature Localization [11] and the work by Khan et al. [12] use Mel-Frequency Cepstral Coefficients (MFCCs) to classify environments, but they do not estimate coordinates or support real-time prediction. Despite these advancements, very few approaches rely solely on passive audio features, without the help of GPS, WiFi, or actively generated signals, to estimate location. Our work addresses this gap by using only MFCC features from internal and ambient audio to predict coarse coordinates in indoor and outdoor settings.
This research presents a novel localization method that operates without GPS, network connectivity, or active audio and wireless scanning. Instead, we explore the potential of using passive audio signals recorded by the smartphone's built-in microphone. The key idea is to exploit an acoustic side channel generated by the device. When wireless modules such as WiFi or cellular are active, the device draws fluctuating current through its onboard voltage-regulating circuits. This dynamic load can cause high-frequency components, especially ceramic capacitors, to emit faint mechanical vibrations, a phenomenon known as "singing capacitors" [13]. Although these sounds are typically inaudible to the human ear, they can be captured and analyzed by the smartphone's built-in microphone [14].
At the start of our testing, we noticed a clear connection between the device’s WiFi RSSI and specific patterns in the recorded audio. These patterns were especially noticeable when we analyzed the audio using MFCC features in different places. This made us curious to see if we could estimate the device’s location using only these audio features. After further experiments, we found it possible to predict the location based solely on recorded audio, with an average error of about four kilometers. Although this approach is not highly accurate, it demonstrates that passive localization can still be effective.
This paper makes several contributions that can be enumerated as follows:
1.
We describe an acoustic side channel resulting from the power behavior of wireless components in mobile devices.
2.
We show a strong correlation between RSSI values and the MFCC features of audio captured during wireless activity.
3.
We present a working localization model that operates solely on recorded audio signals.
4.
We discuss the broader implications of this approach, including its potential applications in low-connectivity settings and the associated privacy risks of acoustic leakage.

1.2. Work Structure

The structure of this paper is as follows. Related work is discussed in Section 2. The theoretical background of the problem we are solving is presented in Section 3. Then, Section 4 describes the origin of the audio signals and the hardware behavior that produces them. Section 5 describes the threat model. Section 6 explains the methods for processing and analyzing the audio. Section 7 outlines our data collection and recording process. In Section 8, we present the experimental results. Section 9 and Section 10 focus on the limitations of our method and discuss its privacy implications. Furthermore, Section 11 gives a detailed analysis of the main experiment. We conclude in Section 12 with final reflections and suggestions for future research.

2. Related Work

As compared in Table 1, many traditional localization systems depend on GPS and RSSI fingerprinting. GPS remains the most widely used option for outdoor positioning, as it performs well in open spaces and offers global coverage [1]. However, its accuracy drops significantly in cities, tunnels, or areas with dense tree cover, where satellite signals are blocked or reflected [2]. To solve this problem, some systems utilize RSSI-based fingerprinting, which associates Wi-Fi or mobile signal strength with specific locations [4]. This approach can be used indoors or in areas with weak GPS signals, but its reliability is not guaranteed. The results often change due to noise, interference, or environmental fluctuations, and the fingerprints require regular updates to remain accurate [1].
Researchers have also explored methods that utilize sound for positioning. Some systems utilize background noise to estimate a device's location, such as within a room or building [5]. Others use voice signals or special sound beacons to measure distances, based on how fast the sound travels or how it moves in space [6]. While these methods are effective in controlled environments, they typically require participants or known sound sources, meaning they are not fully passive [7].
At the same time, research has shown that devices can leak information through unexpected signals. For example, some researchers have found that factors such as power usage, EM signals, or even subtle sounds from electronic components can reveal secret data, including encryption keys or system behavior [13]. This led to further research on how devices can inadvertently emit useful information, even though they were not designed for that purpose. In fact, some earlier studies found that power use in small devices can cause tiny sounds, but these sounds were widely assumed to be too weak or hard to detect. Recently, more researchers have begun investigating how mobile and small devices can generate sounds from their internal power components. Elements like voltage regulators and capacitors can create small vibrations during wireless activities, and sometimes these sounds can be picked up by sensitive microphones [14]. However, little research has been conducted to determine whether these sounds can aid in localization, particularly when using regular phones in real-world scenarios. This study examines this approach; while most earlier research has focused on using active sound sources or signal-based methods, we introduce a new method to estimate location by utilizing the sounds naturally created by the device's wireless power-use patterns.
Several studies have demonstrated that non-traditional sensing can unintentionally expose location information without the use of GPS. Jeon et al.’s work, “I’m Listening to Your Location!”, showed that electrical network frequency (ENF) embedded in ambient audio can be matched to a reference database to infer location, but it required prior environmental profiling and known network infrastructure [16]. Nagaraja and Shah introduced VOIPLoc, which utilized echo patterns in VoIP call audio to fingerprint indoor environments. Their method was passive and network-based, yet still depended on infrastructure access [17]. More recently, Amro et al. [18] investigated acoustic leakage from internal smartphone components during wireless activity, highlighting a new side channel; however, their study relied partially on fingerprinting across controlled environments. Notably, their approach focused on detecting a change in location between consecutive call recordings rather than estimating the precise location or reconstructing a full location provenance. Other researchers have achieved indoor localization by fusing sensor data from smartphones, including accelerometers, magnetometers, and barometers, but these methods typically require calibration or prior knowledge of the map. In contrast, our work utilizes only standard microphone input to capture the subtle hum and capacitor emissions associated with wireless communications. We do not fuse sensors, require prior maps, or interact with networks: we simply listen. This highlights a powerful privacy concern: the phone’s internal sounds, imperceptible to humans, can still be used to locate the phone in the same way an RF-based localization method can work, without requiring network access, by analyzing the patterned behavior of the sound in relation to RSSI patterns.

3. Background

RSSI-based localization has been employed to overcome situations where GPS fails [1,2]. This technique maps wireless signal strength readings at known locations so they can later be used for localization [4]. Nevertheless, this method is highly susceptible to environmental dynamics and noise, requiring frequent recalibration [1].
More recent research has explored unconventional approaches to localization, including the use of acoustic signals. These methods analyze ambient noise, speech, or specially emitted audio signals to determine spatial context [5,6]. However, such approaches often require active user participation or known sound sources, limiting their applicability in passive scenarios.
In parallel, side-channel research has revealed how unintended physical emissions, such as sound, EM radiation, or power fluctuations, can leak valuable information about system operations. Genkin et al. demonstrated that cryptographic keys can be inferred using acoustic side channels [13], while Narain et al. later explored how subtle audio emissions during power delivery may correlate with wireless activity [14].
This study builds on these ideas and introduces a novel passive localization technique that harnesses internal acoustic emissions generated by voltage regulators and other power delivery components within smartphones. These emissions, although often imperceptible to users, can be captured by built-in microphones and analyzed using audio signal processing techniques to extract features such as MFCCs.

3.1. Mathematical Formulation of the Problem

Let $x_i \in \mathbb{R}^d$ be the feature vector of MFCCs extracted from audio recorded during wireless communication for the $i$-th observation, and let $y_i \in \mathbb{R}^2$ represent the ground-truth coordinates of the device at that observation.
The training dataset is defined in Equation (1):

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n} \quad (1)$$

We aim to learn a regression function $f: \mathbb{R}^d \to \mathbb{R}^2$ that minimizes the expected error expressed in Equation (2):

$$\min_f \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[ L\big(y, f(x)\big) \right] \quad (2)$$

where $L(\cdot)$ is the loss function; we use the Mean Squared Error (MSE) loss expressed in Equation (3):

$$L(y, \hat{y}) = \| y - \hat{y} \|_2^2 \quad (3)$$

Thus, the empirical training objective can be expressed as in Equation (4):

$$\min_f \; \frac{1}{n} \sum_{i=1}^{n} \| y_i - f(x_i) \|^2 \quad (4)$$
This method enables us to train ML models to associate the sound patterns generated by the device with their corresponding possible physical locations.
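For concreteness, the sketch below fits a regressor $f$ on the empirical objective of Equation (4) using scikit-learn; the feature matrix and coordinate targets are random placeholders rather than our dataset, and the model choice is illustrative only.

```python
# A minimal sketch of the empirical objective in Equation (4), assuming
# placeholder data: X holds n x d MFCC feature vectors, Y holds n x 2
# (latitude, longitude) targets. The regressor choice is illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))   # placeholder MFCC features (d = 13)
Y = rng.normal(size=(200, 2))    # placeholder coordinates

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# f : R^d -> R^2, fitted by minimizing the empirical squared error
f = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, Y_tr)
print("Test MSE:", mean_squared_error(Y_te, f.predict(X_te)))
```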

3.2. Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs are a standard audio feature set that models human auditory perception by emphasizing lower frequencies via the nonlinear Mel scale [19]. As illustrated in Figure 1, extraction involves pre-emphasis filtering, framing, windowing, and Fourier transformation, followed by mapping the magnitude spectrum through Mel-scaled triangular filters. The logarithm (i.e., $\log(\cdot)$ in Figure 1) of the filter energies is then decorrelated via a Discrete Cosine Transform (DCT), yielding coefficients that represent the spectral envelope:

$$c_m = \sum_{k=1}^{K} \log(E_k) \cos\!\left[\frac{m\pi}{K}\left(k - \frac{1}{2}\right)\right], \quad m = 1, \ldots, M$$

where $E_k$ denotes the energy of the $k$-th Mel filter bank.
MFCCs efficiently capture perceptually relevant spectral characteristics and are widely used in speech and acoustic scene analysis. In this study, they serve to characterize faint internal acoustic emissions from smartphone hardware induced by wireless activity. The spectral patterns reflected in the MFCCs correlate with the RSSI, enabling passive localization without the need for GPS or active network sensing. Extraction can be performed as in Algorithm 1. The frame length and step were chosen to balance time resolution and frequency resolution for MFCC extraction, based on common audio-processing practice and empirical tuning on our dataset.
The pseudocode in Algorithm 1 uses the librosa library to load an audio file and segments it into short, overlapping frames. Each frame undergoes a Fast Fourier Transform (FFT) to move into the frequency domain. A Mel-scaled filter bank simulates human ear frequency perception, and the logarithm of the filter bank energies is calculated to compress the dynamic range. Finally, the DCT is applied to decorrelate the features, retaining the most relevant 13 coefficients as compact representations of the audio's spectral characteristics. This entire process is illustrated in Figure 1.
Our research shows that MFCCs effectively capture acoustic emissions from smartphone components during wireless communication. By correlating these MFCCs with RSSI values, we gain insight into the wireless signal environment. Hence, we infer the location without the use of GPS or active network scanning. We emphasize that instantaneous RSSI is not the physical cause itself; rather, RSSI patterns correlate with transmission behavior, which in turn links to the sound of the device.
Algorithm 1: Manual MFCC Extraction from Audio Signal
1. Load audio signal y and sampling rate sr from file
2. Set frame length frame_len = 0.025 s and frame step frame_step = 0.010 s
3. Set FFT size n_fft = 512, number of Mel filters n_mels = 26, and number of MFCCs n_mfcc = 13
4. Convert the frame length and step to samples: frame_len_samples = frame_len × sr and frame_step_samples = frame_step × sr
5. Segment the signal into overlapping frames with the given length and step
6. Apply a Hamming window to each frame
7. Compute the magnitude spectrum of each frame via FFT
8. Calculate the power spectrum by squaring the magnitudes and normalizing
9. Create the Mel filterbank matrix based on the sampling rate, FFT size, and number of filters
10. Apply the Mel filterbank to the power spectra to obtain Mel energies
11. Replace zeros in the Mel energies with a small epsilon for numerical stability
12. Take the natural logarithm of the Mel energies
13. Apply the Discrete Cosine Transform (DCT) to the log Mel energies
14. Select the first n_mfcc coefficients as the MFCC features
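The following Python sketch mirrors the steps of Algorithm 1, using librosa for loading and for the Mel filterbank, and SciPy for the DCT; the file path is hypothetical, and minor details (e.g., the normalization in step 8) may differ from our exact implementation.

```python
# A sketch of Algorithm 1, assuming a mono audio file; librosa and scipy
# provide the loading, Mel filterbank, and DCT steps described in the text.
import numpy as np
import librosa
from scipy.fftpack import dct

y, sr = librosa.load("recording.wav", sr=None)          # step 1 (hypothetical file)
frame_len, frame_step = 0.025, 0.010                    # step 2 (seconds)
n_fft, n_mels, n_mfcc = 512, 26, 13                     # step 3

frame_len_s = int(frame_len * sr)                       # step 4: lengths in samples
frame_step_s = int(frame_step * sr)

n_frames = 1 + (len(y) - frame_len_s) // frame_step_s   # step 5: overlapping frames
idx = np.arange(frame_len_s)[None, :] + frame_step_s * np.arange(n_frames)[:, None]
frames = y[idx] * np.hamming(frame_len_s)               # step 6: Hamming window

mag = np.abs(np.fft.rfft(frames, n_fft))                # step 7: magnitude spectrum
power = (mag ** 2) / n_fft                              # step 8: power spectrum

mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # step 9
mel_energies = power @ mel_fb.T                         # step 10
mel_energies = np.maximum(mel_energies, 1e-10)          # step 11: epsilon floor
log_mel = np.log(mel_energies)                          # step 12

mfcc = dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]  # steps 13-14
print(mfcc.shape)  # (n_frames, 13)
```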

3.3. Description of Machine Learning (ML) Models for Localization

Linear models, such as Ridge Regression and Elastic Net, are easy to understand and efficient. Ridge Regression employs a single regularization penalty to mitigate overfitting, whereas Elastic Net combines penalties to strike a balance between sparsity and robustness. They work best with clean, linearly separable data [20,21].
Tree-based models like Decision Trees and Random Forests are popular for their ability to capture non-linear patterns. Decision Trees split data by features, while Random Forests aggregate predictions from multiple trees to reduce variance. Both handle irrelevant features and outliers well [22].
Boosting algorithms such as Gradient Boosting and AdaBoost enhance accuracy by training models sequentially, with each new model targeting the previous one’s errors [23,24]. HistGradientBoosting enhances this method by using histogram-based binning to increase computational efficiency on large datasets.
Probabilistic boosting models, like NGBoost (Natural Gradient Boosting), extend traditional boosting methods by modeling the full probability distribution of the output instead of just providing point estimates. This allows the model to give uncertainty estimates with predictions, which is valuable in high-noise environments [25].
Ensemble techniques, such as Stacking and GenCast, improve performance by merging predictions from various base models to leverage their strengths. Explainable Boosting Machines (EBMs) use Generalized Additive Models (GAMs) and boosting to build interpretable models while ensuring competitive accuracy [26].
Support Vector Regression (SVR) is a non-linear model that uses kernel functions to project data into higher dimensions, allowing it to model complex relationships effectively while maintaining good generalization properties [27].
These models are relevant for applications using RSSI and MFCC features, which tend to be noisy and non-linear. Our dataset, gathered from indoor localization experiments that combine RF signal strength and acoustic features, requires models that can effectively capture non-linear patterns and handle uncertainty.
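As a hedged illustration of how such regressors can be compared on multi-output (latitude, longitude) targets, the sketch below evaluates a few of the named families with cross-validated MSE; X and Y are random placeholders, and the hyperparameters are defaults rather than our tuned values.

```python
# Hedged comparison sketch for a few of the model families above on
# multi-output (lat, lon) regression; X and Y are random placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(150, 13)), rng.normal(size=(150, 2))

models = {
    "Ridge": Ridge(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=1),
    # GradientBoostingRegressor is single-output, so fit one per coordinate
    "GradientBoosting": MultiOutputRegressor(GradientBoostingRegressor(random_state=1)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, Y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: cross-validated MSE = {-scores.mean():.3f}")
```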

4. Where the Sound Comes From: The Physical Basis of the Side Channel

The audio signal used in this study is not environmental noise, speech, or any other external sound. Instead, it comes from inside the smartphone itself. Specifically, it results from small but detectable acoustic emissions that occur due to changes in power consumption within the device’s wireless communication systems [13,14,18,28]. Modern smartphones manage power differently when operating on Wi-Fi and cellular networks, which can cause unintended sounds. Power consumption depends on signal strength, background data activity, and open apps, affecting components such as voltage regulators and capacitors [28]. As these components adjust to the changing power demand, they can physically vibrate. Much like how the internal parts of a speaker vibrate to create sound, these small vibrations also produce sound, though at a much lower volume and within a limited frequency range [14]. While these acoustic signals are far too quiet to be heard by the user, the phone’s built-in microphone is sensitive enough to detect them.
In Figure 2, we explain this side channel. What makes these findings interesting is that the pattern of the sounds changes based on the signal strength patterns in the environment. For example, when the signal is weak, the wireless chip tends to draw more power to maintain connectivity, which in turn increases vibration activity. When the signal is strong, the power draw is reduced, and the acoustic emissions shift accordingly [13]. Our experiments recorded this internal audio while measuring the device’s RSSI. After extracting features from the recordings using MFCCs, we found a clear correlation between the sound characteristics and the RSSI patterns. These results suggest that the sound indirectly conveys information about the quality of the wireless signal. Encouraged by this correlation, we explored whether location information could be inferred using only these audio features. Since signal strength tends to vary with physical location, and these variations influence the audio emissions, we found that MFCCs capture spatial differences, creating a passive side channel for localization without the need for GPS or continuous active measuring of Wi-Fi RSSI values. This sound source is linked to the device’s hardware and is not a typical security vulnerability. However, it can unintentionally reveal contextual information.

4.1. Experimental Proof That MFCCs Capture Smartphone Component Emissions

Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7 show representative MFCCs for device-emitted signals versus background, the ranked features, and a summary of the quantitative separation (classifier Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) and visual patterns), supporting the claim that MFCCs effectively capture component acoustic leakage. In particular, MFCC2 captures the difference between the device's sound during wireless activity and its sound with airplane mode on, i.e., "Device" versus "Background". This ranking is obtained by ordering all sound features of the sample clips with the Maximum Relevance-Minimum Redundancy (MRMR) algorithm. Although MFCC2 was the strongest candidate for separating device vs. background activity, subsequent ROC-based predictor importance analysis revealed that the mean value of MFCC8 provides higher discriminative strength for localization (Figure 5). This difference is visualized in Figure 4c,d.
We conducted a controlled analysis using two continuous 5-min recordings made with an iPhone 11 (device active and background), segmented into 8-s clips with a 4-s overlap, yielding 74 clips per class. MFCCs were extracted using a 25-ms window and 10-ms hop (13 coefficients) and aggregated per clip (mean, standard deviation, and delta) to form feature vectors. A bagged-tree classifier (Random Forest style, 200 trees) was trained with a time-aware holdout (the final 20% of clips per recording held out). The model achieves a test AUC of 0.880 and a test accuracy of 0.633 on the temporal holdout; the 5-fold cross-validated AUC on the training set is 0.859 ± 0.026. These results indicate that MFCC-based features contain discriminative information that differentiates device-emitted acoustic leakage from background recordings, supporting the claim that MFCCs effectively capture component emissions.
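A minimal sketch of this evaluation protocol is given below, assuming per-clip MFCC aggregate vectors (here random placeholders with 39 dimensions for the mean, standard deviation, and delta of 13 coefficients); the time-aware split keeps clip order and holds out the final 20% of each recording.

```python
# Illustrative time-aware holdout for the device-vs-background experiment;
# feature matrices are placeholders for the per-clip MFCC aggregates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
device = rng.normal(0.3, 1.0, size=(74, 39))      # 74 clips x (mean/std/delta of 13 MFCCs)
background = rng.normal(0.0, 1.0, size=(74, 39))

def time_split(X, frac=0.8):
    """Keep clip order; earlier clips train, the final 20% test."""
    cut = int(len(X) * frac)
    return X[:cut], X[cut:]

Xd_tr, Xd_te = time_split(device)
Xb_tr, Xb_te = time_split(background)
X_tr = np.vstack([Xd_tr, Xb_tr]); y_tr = np.r_[np.ones(len(Xd_tr)), np.zeros(len(Xb_tr))]
X_te = np.vstack([Xd_te, Xb_te]); y_te = np.r_[np.ones(len(Xd_te)), np.zeros(len(Xb_te))]

# Bagged-tree (Random Forest style) classifier, as described in the text
clf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X_tr, y_tr)
print("Holdout AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```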

4.2. Annotated Views of Power and Audio Components in iPhone 6

To complement our analysis, Figure 8 and Figure 9 present annotated views of key hardware components that play a crucial role in understanding how unintended acoustic emissions can arise from routine device operations.
The first image (Figure 8) highlights the upper section of the iPhone 6, focusing on two crucial areas. The microphone, located at the top back of the device, captures external audio signals, which include both intentional sounds and unintentional emissions from nearby hardware. We also highlight the wireless communication circuitry near this microphone. This region includes modules responsible for handling wireless signals, which are known to create dynamic power consumption and EM emanation patterns during operation. Its proximity to the microphone makes it highly relevant to our study, as the EM signals may affect the microphone and be captured as acoustic signals during recordings.
The second image (Figure 9) provides a close-up view of another critical area near the microphone. Here, clusters of surface-mount capacitors are visible. These capacitors play a crucial role in power delivery by stabilizing voltage and filtering electrical signals, particularly in circuits such as the wireless module. They can experience mechanical vibrations from high currents during rapid changes in electrical load. These vibrations produce high-frequency sounds that, while inaudible to humans, can be detected by a nearby microphone. The annotated image illustrates the close proximity of the capacitors to the microphone, suggesting a potential route for sound leakage.
By presenting these annotated images, we provide visual evidence of how the physical hardware arrangement within the device may enable unintended acoustic side-channel effects. The close placement of power delivery components, wireless circuitry, and audio capture hardware creates practical conditions for such a phenomenon.

5. Threat Model and Assumptions

This study examines the feasibility of passive localization under stringent limitations. As seen in Figure 10, the system, or a hypothetical observer, is assumed to operate with minimal access and without any interaction that would actively probe the environment or the device [29]. We assume that the only available input during testing is audio recorded through the device's microphone. This microphone access does not require elevated permissions and reflects a typical app-level capability [30,31]. There is no GPS access, Wi-Fi scanning, or communication with external infrastructure. The entire process is designed to remain passive and non-invasive. During the training phase, it is possible to use RSSI data to understand how wireless signal strength correlates with the acoustic characteristics captured in the audio [1]. Once this relationship is established, the trained model uses audio data, together with its correlations to the pre-established and saved RSSI probability density function (PDF) map, to estimate location. This passive setup highlights how seemingly insignificant side effects of normal device behavior, such as the faint sounds produced by internal hardware during wireless activity, can unintentionally reveal location information [13]. By relying solely on standard microphone input and avoiding all traditional localization signals, the approach highlights a subtle yet important privacy concern that may not be apparent to the user.

6. Analysis and Methodology

In reference to Figure 11, we start by examining how wireless signal strength correlates with the audio characteristics captured by the smartphone. Specifically, we focus on identifying whether MFCCs extracted from internal audio reflect changes in RSSI patterns [11,32]. We use multiple correlation methods, including the Pearson, Spearman, and Kendall tests, to establish this relationship [33]. All three consistently indicated a measurable correlation between MFCC features and signal strength, confirming that audio-based patterns carry significant signal-related information.
Afterward, we prepare the dataset for modeling. We extract MFCC features directly from the recorded audio and apply standardization to enhance consistency across sessions and mitigate the effects of noise. No dimensionality reduction techniques, such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), were employed, as the models were tested using both raw and normalized MFCCs to capture the most direct impact of acoustic variations on localization performance.
To test the feasibility of audio-only localization, we trained various ML models using MFCC features as input. These include commonly used regression and ensemble methods such as Gradient Boosting, Random Forest, Decision Tree, Ridge Regression, XGBoost, LightGBM, AdaBoost, and SVR [34]. We also explored neural-based and advanced methods such as Multi-Layer Perceptrons, TabNet, and Gaussian Process Regression. Some models, such as Gaussian Process and Kernel Ridge, exhibited high errors and were less effective. In contrast, NGBoost, Gradient Boosting, and Random Forest achieved lower mean location errors, with the best models averaging under five kilometers. The evaluation metric was the mean location error in meters, measuring the average distance between predicted locations and the ground truth. Only models with an error below twenty thousand meters were deemed viable and plotted for comparison. The top-performing model achieved an average location error of 4180 m, with most effective models operating within a five- to six-kilometer range. We employed cross-validation and grid search for hyperparameter tuning to enhance performance and mitigate overfitting, thereby ensuring improved generalization across diverse locations and conditions [35]. We demonstrate that passive audio features can be utilized to create a localization model that, although less accurate than GPS, is effective for coarse localization in low-connectivity environments.
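A minimal sketch of the correlation step described above is shown below, assuming aligned vectors of one aggregated MFCC feature and mean RSSI per recording; the values are synthetic placeholders rather than our measurements.

```python
# Sketch of the three correlation tests named above; mfcc_feat and rssi are
# synthetic stand-ins for one aggregated MFCC feature and mean RSSI (dBm).
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

rng = np.random.default_rng(3)
rssi = rng.uniform(-90, -30, size=60)
mfcc_feat = 0.05 * rssi + rng.normal(scale=0.5, size=60)

for name, test in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    stat, p = test(mfcc_feat, rssi)
    print(f"{name}: statistic = {stat:.3f}, p-value = {p:.3g}")
```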

7. Experimental Setup

The setup overview is shown in Figure 12. We recorded with three iPhone devices (iPhone 6, 11, and 13) to examine the link between internal acoustic emissions and wireless signal strength, as shown in the training phase of the same figure. The goal was to capture the subtle sounds generated during wireless activity and match them with the corresponding RSSI patterns. The audio was recorded under various real-world conditions, during the day and at night, in multiple environments. For the main dataset, only the iPhone 11 and 13 were used, and each recording session lasted approximately thirty seconds. All unnecessary background applications were closed before each session began to reduce external noise and interference. This helps ensure that the captured audio is primarily influenced by the phone's internal behavior rather than external sounds or processes. We collected RSSI data over Wi-Fi while the recordings were taking place. We matched each RSSI value to its corresponding session and physical location, aligning the wireless signal data with the audio samples. This alignment is crucial for training, as it enables us to examine the correlation between RSSI and the features extracted from the recorded sound. For feature extraction, we use MFCCs. Both filtering and normalization steps are applied to the audio to enhance the quality of the extracted features; detailed parameters, such as the number of coefficients, window size, and step size, were set to their default values in this version of the analysis. Our dataset comprises recordings from over ten different physical locations. Each location contains between one and fifteen recordings, and each recording typically includes five audio segments. This results in a diverse set of audio-RSSI pairs representing various signal strengths and acoustic responses. These recordings are analyzed as described in Section 6 and Section 11.
The dataset used in this study was collected from more than six different locations across Amman, Jordan, representing a wide range of real-world indoor environments, including residential apartments, private houses, parked and moving cars, and multi-floor buildings. The primary purpose of the dataset was to build and evaluate a localization system that integrates wireless RSSI and audio-based features to estimate the position of the recording device. Data was collected in diverse indoor contexts to ensure variability and enhance the generalizability of the localization model, including inside residential homes and apartments, inside vehicles, and inside buildings and stairwells. This diversity enabled exposure to different acoustic and wireless propagation conditions. A total of 20 recording sessions were conducted, with each session containing approximately 3 to 10 recordings depending on location accessibility, and each audio recording lasting between 30 and 40 s. Two mobile devices, an iPhone 13 and an iPhone 11, were used to capture raw audio samples and RSSI measurements obtained through external tools. Using two different phones introduced natural hardware variation, supporting stronger model robustness. The ground-truth location for each recording was obtained using GPS coordinates captured via Google Maps, specifically latitude and longitude, which serve as the target variables for the localization model. RSSI measurements were collected using the NetSpot application, with each recording containing approximately 80 RSSI values captured from multiple Wi-Fi access points within each environment, expressed in dBm as continuous numerical data, thereby providing a rich wireless fingerprint for each location. Audio recordings were collected under varying noise conditions to reflect realistic scenarios, ranging from quiet environments to those with background noise, including street sounds, indoor human activity, and background media. This variability creates a challenging and realistic dataset for audio-assisted localization. Each final dataset sample includes RSSI measurements (≈80 features), a 30–40 s audio segment, ground-truth GPS coordinates (latitude and longitude), and metadata describing the session and device characteristics.

8. Results

Spectrogram Analysis of Acoustic Emissions

To provide a qualitative perspective on the internal acoustic emissions captured during wireless activity, we generated spectrograms of two representative audio recordings taken in different indoor environments. These spectrograms (Figure 13 and Figure 14) visualize the distribution of frequency components over time, helping to identify patterns that may relate to wireless signal strength.
In both figures, specific frequency bands persist across time, suggesting the presence of consistent internal acoustic emissions rather than random environmental noise. These patterns align with the frequencies emphasized in the extracted MFCC features, supporting the hypothesis that internal hardware activity generates identifiable acoustic signatures during wireless operations. Such visual evidence strengthens the claim that passive localization using acoustic features is grounded in a measurable, repeatable phenomenon.
To evaluate the performance of our passive localization method based on internal audio features, we analyze the relationship between audio characteristics and Wi-Fi signal strength. As shown in Figure 15, we investigated the relationship between MFCC features and basic audio features, such as Zero-Crossing Rate (ZCR) and Spectral Centroid, as well as the RSSI values collected across different locations.
We use three widely accepted statistical methods to measure correlation: the Pearson, Spearman, and Kendall coefficients. The results from all three methods consistently show that certain higher-order MFCCs, especially MFCC_10, MFCC_11, and MFCC_13, had moderate to strong positive correlations with RSSI values. These findings confirm that variations in wireless signal strength can influence the audio signals captured by the smartphone's microphone [36]. Visual analysis through scatter plots and heatmaps further supported this connection, as these MFCC features clearly tracked changes in RSSI across different environments.
We observed an anomalous pattern for MFCC_9: unlike neighboring MFCCs, MFCC_9 displays a weak positive (or much weaker negative) correlation with RSSI. Inspection of the short-time spectra revealed that the frequency band represented by MFCC_9 overlaps a persistent device-specific harmonic and intermittent environmental noise (e.g., fan/motor hum) present in several recording locations. These narrowband contributions increase energy in that band independently of RSSI-driven changes, producing the observed deviation.
In addition to our main dataset, we conduct an aluminum foil shielding experiment to simulate severe signal degradation. The heatmap shown in Figure 16 illustrates the correlations during this test. Even under shielding, MFCC_10, MFCC_11, and MFCC_13 remained strongly correlated with RSSI. This suggests that these specific features are highly sensitive to changes in signal strength, even under harsh conditions. Such resilience makes them suitable candidates for passive localization in challenging environments where traditional signal-based methods often fail. The correlation heatmap was computed between basic audio descriptors (ZCR, Spectral Centroid), MFCC_1–MFCC_13, and measured RSSI (with latitude and longitude included for spatial context) collected during the aluminum foil shielding experiment. The heatmap highlights which audio features co-vary with RSSI and which are dominated by independent ambient or device-specific signals. Strong negative or positive off-diagonal cells indicate MFCC bands and basic audio descriptors that systematically change when RSSI changes; near-zero cells indicate features that are largely orthogonal to wireless-related variation.
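A sketch of how such a heatmap can be computed is given below; the DataFrame columns mirror the features named above, but the data are random placeholders and the plotting options are illustrative.

```python
# Sketch of the feature/RSSI correlation heatmap; column names mirror the
# text, but the values are random placeholders rather than our measurements.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
cols = (["ZCR", "Spectral_Centroid"]
        + [f"MFCC_{i}" for i in range(1, 14)]
        + ["RSSI", "Latitude", "Longitude"])
df = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)

corr = df.corr(method="pearson")                  # pairwise correlations
sns.heatmap(corr, cmap="coolwarm", center=0)      # off-diagonal cells of interest
plt.title("Audio feature vs. RSSI correlations")
plt.tight_layout()
plt.show()
```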
After confirming these correlations, we train multiple ML models to predict the device's location using only MFCC features. We test both traditional regression models and advanced ensemble techniques, with the mean location error (MLE) serving as the primary performance metric [22,35]. As shown in Figure 17, the NGBoost model achieves the best performance (i.e., the lowest MLE), with an average error of approximately 4180 m. Gradient Boosting and Random Forest also performed well, with average errors of 4349 and 4476 m, respectively. Other models, such as XGBoost, AdaBoost, and Ridge Regression, maintained an error range of five to six kilometers, demonstrating reasonable accuracy.
However, as seen in Figure 18, some models performed poorly. Gaussian Process Regression yielded extreme errors, reaching nearly one million meters in some cases. Similarly, TabNet and Kernel Ridge Regression produced errors over 200,000 m. These results suggest that more complex models do not necessarily work better with passive audio data. In fact, simpler ensemble models seemed to generalize better, likely due to their robustness in handling noisy or nonlinear relationships in the features [24].
We also compared model performance across different environments. Overall, models performed better in structured or semi-enclosed areas, such as indoor settings or covered outdoor spaces, where signal fluctuations were minimal. However, in busy urban locations with frequent interference and signal reflections, prediction errors increased by about 20 to 30 percent on average [34]. These results emphasize the importance of environmental stability for achieving reliable passive localization performance.
Additionally, in Section 11 of this paper, we compare the predicted locations to the actual recorded locations to assess the system's practical value. In quieter, stable environments, the predicted locations were generally within three to five kilometers of the true location. However, we observed some significant outliers. In cases where unexpected background noise or sudden signal shifts occurred, errors exceeded seven kilometers. These large errors were typically linked to distorted MFCC patterns that no longer reflected the underlying power consumption behavior.
Despite these limitations, our results strongly suggest that internal acoustic signals do carry meaningful spatial information. The method works best in calm environments with consistent wireless signals, where its error remained relatively low. While it does not yet match the precision of GPS or advanced radio-based localization systems, it offers a promising passive alternative for coarse localization. Such a system could be especially valuable in situations where GPS is unavailable, such as in underground areas, dense urban settings, or during power-saving operations in IoT devices [37,38].

9. Discussion

The results of this study demonstrate that internal acoustic emissions, though faint and unintended, can carry enough information to support coarse localization. The observed correlation between MFCC features and RSSI suggests that audio signals captured by the microphone can indirectly reflect the wireless communication environment. The practical implications of this approach are significant because it relies only on audio. This makes it suitable for privacy-sensitive applications, offline environments, or even as a backup in case of signal denial or satellite outage. Localization is achieved without active probing or specialized sensors and can be deployed using only software on existing smartphones. One major constraint is the method's sensitivity to environmental noise, as well as to WiFi data rates and RSSI levels. Although our recordings were conducted with background apps closed in relatively controlled settings, real-world deployments would likely encounter more variable conditions [39,40]. Ambient noise, human speech, and unexpected system sounds, if not well processed, can all interfere with the audio patterns on which this method depends. Additionally, the approach's effectiveness varies across smartphone models, as hardware design affects the strength of acoustic emissions. Another challenge is the relatively coarse accuracy of the localization results, as a three- to five-kilometer error may be unsuitable for applications requiring fine-grained location awareness. Moreover, performance drops noticeably in complex urban environments, where signal reflections and fluctuations make it harder for the model to interpret MFCC patterns.
Future work could incorporate real-time noise filtering and adaptive feature selection to reduce the impact of unpredictable sounds. Training the model on various devices and environmental conditions also helps improve its generalizability.
Compared to traditional localization techniques, this method offers a balance of low cost, privacy, and independence from external infrastructure, making it well suited to scenarios where passive observation is the only option. In this context, acoustic side-channel localization seeks to complement existing RSSI-based systems in constrained environments where traditional options are unavailable or blocked. We emphasize that our method supports coarse localization among a set of known locations and provides per-location RSSI-PDF estimation; it does not reliably generalize to unseen regions at city scale without retraining. Feasible use cases include: (1) a privacy-sensitive offline fallback, (2) coarse situational awareness, and (3) coarse map-matching where a prebuilt RSSI-PDF atlas exists. Unsuitable scenarios include: (1) fine-grained navigation and (2) meter-level asset tracking. We note that within-device and cross-device comparisons (training on device A and testing on device B) are also relevant. Results show substantial device dependence; generalizability must therefore account for profiling, or for building full PDFs for several device architectures, although the correlations between the PDF of the RSSI and the PDFs of the MFCCs persist.

9.1. Model Selection and Justification

Among various regression models, ensemble methods such as Random Forest, Gradient Boosting, and NGBoost have shown strong performance in modeling complex non-linear relationships between audio features and location data [35]. These models are particularly robust in the presence of noise and are capable of capturing hierarchical interactions between MFCC coefficients.
In our experiments, NGBoost achieved the best results due to its ability to model uncertainty through probabilistic prediction with the full PDF. Random Forest and Gradient Boosting also performed competitively, offering strong generalization and interpretability. In contrast, models such as Gaussian Process Regression and TabNet underperformed, due to the high dimensionality and variance in passive audio features, which require careful tuning and may not scale efficiently to large datasets.

9.2. Dataset Initial Analysis

Figure 19 shows the PDF of the RSSI, which follows a log-normal distribution. In Figure 20, on the other hand, the PDF analysis reveals the distinct distributions obtained from the two feature domains used in this work. The blue curve represents the distribution of the MFCC-based acoustic features, while the red dashed curve represents the distribution of the RSSI (Wi-Fi signal strength) values. The MFCC distribution exhibits a wider spread, indicating variability influenced by multiple environmental and hardware-related factors, whereas the RSSI distribution demonstrates a sharper peak, reflecting its relatively stable and concentrated range of values. The clear separation between the two distributions confirms that MFCC and RSSI capture fundamentally different information relevant to location identification. Based on the observed correlation between the PDFs of the RSSI and particular MFCC coefficients, we conclude that MFCC features can be used independently for location estimation, considering that RSSI alone has been well established in previous work.
To evaluate location grouping performance, density-based clustering (DBSCAN) was applied in Figure 21 using MFCC (coefficients 10 and 11) and RSSI features extracted via Kernel Density Estimation (KDE). DBSCAN was selected because it is effective for datasets where point density varies, does not require predefining the number of clusters, and can identify noise points that fall outside meaningful groupings. The resulting clusters reveal the formation of distinct spatial regions corresponding to different physical locations. In this plot, different colors represent cluster assignments, while gray x-marked points indicate noise rejected by DBSCAN due to insufficient neighborhood density. The plot in Figure 21 shows a clear separation with minimal noise. This result confirms that MFCC and RSSI, when combined, can form meaningful groupings of indoor locations. Additionally, separate K-Means clustering results using MFCC_4 alone, as shown in Figure 22, illustrate that MFCC coefficients are capable of reflecting spatial structure across various sessions in different locations.
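A brief sketch of this clustering step follows, assuming per-sample vectors of MFCC_10, MFCC_11, and a KDE-smoothed RSSI value; the DBSCAN parameters (eps, min_samples) are illustrative rather than our tuned settings.

```python
# Clustering sketch: DBSCAN on standardized [MFCC_10, MFCC_11, KDE(RSSI)]
# vectors, plus a K-Means view on MFCC_4 alone; all inputs are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
features = rng.normal(size=(120, 3))              # [MFCC_10, MFCC_11, KDE(RSSI)]
X = StandardScaler().fit_transform(features)

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)   # -1 marks noise points
print("clusters:", sorted(set(labels) - {-1}),
      "| noise points:", int(np.sum(labels == -1)))

mfcc_4 = rng.normal(size=(120, 1))                # single-coefficient view (Figure 22)
km_labels = KMeans(n_clusters=4, n_init=10, random_state=5).fit_predict(mfcc_4)
```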
To further evaluate the classification capability, Figure 23 shows the application of the K-Nearest Neighbors (KNN) algorithm using labeled location data. The input features consisted of MFCC and RSSI values, and classification was performed by assigning test samples to the most likely location cluster based on their nearest neighbors in feature space. The findings show that combining MFCC and RSSI yields higher classification performance than using either feature independently; RSSI performs more reliably in open regions with stable signal propagation, while MFCC features perform better in sound-distinctive indoor environments such as near escalators, ventilation systems, or crowded areas. These results reinforce the complementary nature of both modalities for indoor localization.
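The following sketch illustrates this KNN classification on combined MFCC + RSSI feature vectors with integer location labels; all values, and the choice of k = 5, are placeholders.

```python
# KNN location classification sketch on combined MFCC + RSSI vectors;
# the feature values and the six location labels are synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 14))        # 13 MFCCs + mean RSSI per sample
y = rng.integers(0, 6, size=200)      # placeholder location labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=6)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("location accuracy:", knn.score(X_te, y_te))
```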
Overall, the analysis of PDF behavior, DBSCAN clustering, and KNN classification highlights a strong correlation between MFCC acoustic features and Wi-Fi RSSI distributions in indoor environments. This initial dataset investigation revealed that MFCC features alone capture spatially distinctive patterns associated with device position and background context, which motivated us to further explore MFCC-only analysis for passive location inference. The core contribution of this work lies in demonstrating this underlying correlation and in establishing the feasibility of using passive sound emissions as an auxiliary modality for location extraction in GPS-degraded indoor settings.

10. Privacy and Security Implications

The findings of this study raise important privacy considerations that are often overlooked in discussions around localization technology. Unlike traditional positioning systems, which rely on explicit data such as GPS coordinates or Wi-Fi scanning, our approach demonstrates that even passive, seemingly harmless signals, like internal acoustic emissions, can be used to estimate a device’s location. This introduces a subtle but significant privacy risk: users may unknowingly share location-related information without ever granting explicit permission [14]. This type of passive localization operates through standard microphone access, which is commonly requested by a wide range of smartphone applications for legitimate purposes, such as voice recording or video calling. Users typically accept microphone permissions without fully understanding the potential for indirect data leakage. Our work highlights that under certain conditions, such audio access may expose contextual details about the user’s environment and movement, even in the absence of traditional location data [13]. As such, the concept of “least-privilege access” must be re-evaluated in light of evolving side-channel risks. To mitigate this emerging threat, several strategies could be considered. First, mobile operating systems might need to introduce finer-grained microphone permission categories, for example, distinguishing between access for voice input versus access for continuous ambient recording. Second, energy-efficient noise randomization techniques at the hardware level could be explored to reduce the consistency of acoustic side-channel signals without affecting normal device function [28]. Additionally, application review processes need to be updated to assess not only what data is accessed but also potential unintended uses of that data. Our work highlights the importance of user awareness regarding side-channels in devices.

11. Details of the Analysis

After preparing the data for each audio file by denoising and extracting simple sound features, as outlined in Algorithm 2, we provide a detailed analysis. It begins by extracting audio features, specifically 13 MFCCs, from each audio file, with a focus on the fourth coefficient and the average of all (as explained in Algorithm 3). These features are then matched with location data (RSSI, latitude, and longitude) from a CSV file, cleaned, and merged into a single dataset (as explained in Algorithm 4). The MFCC features are visualized on a map using GPS coordinates, where bubble size and color represent feature strength (as in Figure 24 and explained in Algorithm 5). The map in Figure 24 shows how the MFCC features vary across GPS locations. Each circle on the map represents a location where audio recordings were made. The size of the circle reflects the normalized value of MFCC_4, while the color encodes the MFCC mean. Warmer colors (yellow-orange) indicate higher MFCC mean values, and cooler tones (purple) indicate lower ones. This dual encoding allows us to visualize how acoustic features vary with location, which helps identify regions with specific audio patterns or characteristics. The color bar on the side helps interpret the MFCC mean scale numerically.
Next, various ML models, such as Random Forest and XGBoost, are trained to predict GPS locations from audio-RSSI features, with performance evaluated by MSE (as explained in Algorithm 6). For unsupervised exploration, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is applied to group similar feature patterns and compared with the real locations using metrics such as the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) (as explained in Algorithm 7). A new location can be predicted by feeding the best model a fresh input (e.g., [18.3, −14], as tested in Algorithm 8). Finally, the prediction accuracy is assessed using multiple distance error calculations (Haversine, Universal Transverse Mercator (UTM), and geodesic), and the results are visually compared by plotting true vs. predicted GPS positions with annotated distances (as seen in Figure 25, and explained in Algorithms 9 and 10). The plot in Figure 25 compares the actual GPS coordinates with the predicted ones, where true locations are represented by blue dots and model predictions by red "x" marks. A dashed line connects each pair, making the error spatially visible. Importantly, the color of each line reflects the prediction error in meters: darker lines indicate smaller errors, and lines shift toward yellow as the distance increases (as indicated by the color bar). This visualization helps interpret both the magnitude and direction of the prediction deviations across the dataset.
Algorithm 2: Audio Denoising and Feature Extraction
1. Mount Google Drive
2. Define the input directory for audio sessions
3. For each session folder:
4.    For each audio file in the session:
5.       Load the audio file
6.       If the audio is empty, continue
7.       Compute a noise sample (first 0.5 s)
8.       Compute the SNR before denoising
9.       If the SNR before denoising < 30:
10.         Apply noise reduction
11.         Save the cleaned audio to the output folder
12.         Compute the SNR after denoising
13.         Print the SNR improvement
14.      Else: print "Skipping file due to high SNR"
15.      Extract features: Spectral Centroid, RMS, ZCR
16.      Append the features to a list
17. Save all features to a CSV file
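An illustrative Python rendering of Algorithm 2's core loop for a single file is shown below, assuming the noisereduce and soundfile packages; the file paths follow the pseudocode but are hypothetical, the SNR estimator is a simple power ratio, and the 30 dB threshold matches the pseudocode while the exact denoising settings are assumptions.

```python
# Sketch of Algorithm 2 for one file: compute SNR from a leading noise
# sample, denoise if SNR < 30 dB, and save the cleaned audio.
import numpy as np
import librosa
import noisereduce as nr
import soundfile as sf

def snr_db(signal, noise):
    """Simple SNR estimate: signal power vs. noise-sample power, in dB."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

y, sr = librosa.load("session1/clip.wav", sr=None)   # hypothetical path
if len(y) > 0:
    noise = y[: int(0.5 * sr)]                       # first 0.5 s as noise sample
    before = snr_db(y, noise)
    if before < 30:
        cleaned = nr.reduce_noise(y=y, sr=sr, y_noise=noise)
        sf.write("cleaned/clip.wav", cleaned, sr)
        print(f"SNR improved: {before:.1f} -> {snr_db(cleaned, noise):.1f} dB")
    else:
        print("Skipping file due to high SNR")
```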
Algorithm 3: Extract MFCC Features from Audio
1:
For each label directory in the audio folder:
2:
      For each file in the label directory:
3:
           If the file is an audio format:
4:
                Load audio using Librosa
5:
                Extract 13 MFCCs
6:
                Compute MFCC_4 and mean MFCC
7:
                Append features and label to lists
8:
Convert features and labels to NumPy arrays
Algorithm 4: Merge MFCC Features with RSSI CSV
1: Load CSV file with RSSI, Latitude, Longitude
2: Clean CSV: remove NaNs, trim whitespace
3: Reset index of both DataFrames
4: Add MFCC features to CSV DataFrame
5: Drop rows with missing values in key columns
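A pandas sketch of the merge in Algorithm 4; the filename and column names are assumptions, and we assume the CSV rows and the MFCC feature arrays from Algorithm 3 are aligned one-to-one by recording order.

```python
import pandas as pd

# Hypothetical merge of MFCC features with the RSSI/GPS CSV (Algorithm 4).
df = pd.read_csv("rssi_locations.csv")                 # assumed filename
df.columns = df.columns.str.strip()                    # trim header whitespace
df = df.dropna(subset=["RSSI", "Latitude", "Longitude"]).reset_index(drop=True)
df["MFCC_4"] = mfcc4_values                            # arrays from Algorithm 3
df["MFCC_mean"] = mfcc_mean_values
df = df.dropna(subset=["MFCC_4", "MFCC_mean", "RSSI", "Latitude", "Longitude"])
```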
Algorithm 5: Visualize MFCC Features on GPS Map
1: Normalize MFCC_4 for bubble size
2: Plot latitude vs. longitude with MFCC mean as color
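The bubble map of Figure 24 can be reproduced with a few matplotlib calls; the colormap and the bubble-size range below are our guesses, since the exact figure styling is not specified.

```python
import matplotlib.pyplot as plt

# Sketch of Algorithm 5: bubble size = normalized MFCC_4, color = MFCC mean.
m4 = df["MFCC_4"]
sizes = 40 + 260 * (m4 - m4.min()) / (m4.max() - m4.min())   # normalize to a plot range
sc = plt.scatter(df["Longitude"], df["Latitude"],
                 s=sizes, c=df["MFCC_mean"], cmap="plasma", alpha=0.7)
plt.colorbar(sc, label="MFCC mean")
plt.xlabel("Longitude"); plt.ylabel("Latitude")
plt.show()
```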
Algorithm 6: Train and Evaluate Models
1: Split data into train/test sets
2: For each model in {RF, XGB, GB, SVR, MLP, CB, LGBM, NGB, TabNet}:
3:    Train model on training data
4:    Predict on test set
5:    Compute and print MSE
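A scikit-learn sketch of the training loop in Algorithm 6, shown for two of the listed models; hyperparameters are defaults, and single-output regressors such as gradient boosting are wrapped for the two-target (latitude, longitude) case. X and y denote the feature matrix and coordinate targets produced by Algorithm 4.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error

# X: audio-RSSI feature matrix; y: (latitude, longitude) targets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
models = {
    "RF": RandomForestRegressor(random_state=42),  # handles multi-output natively
    "GB": MultiOutputRegressor(GradientBoostingRegressor(random_state=42)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: MSE = {mean_squared_error(y_test, model.predict(X_test)):.6f}")
```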
Algorithm 7: Cluster and Evaluate with HDBSCAN
1: Standardize feature set
2: Apply HDBSCAN clustering
3: Encode true location labels
4: Compute ARI and NMI between clusters and true labels
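Algorithm 7 maps to a few lines with the hdbscan package and scikit-learn metrics; min_cluster_size is a placeholder, and location_names stands in for the true location labels.

```python
import hdbscan
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X_std = StandardScaler().fit_transform(X)                        # step 1: standardize
pred = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(X_std)    # step 2: cluster
true = LabelEncoder().fit_transform(location_names)              # step 3: encode labels
print("ARI:", adjusted_rand_score(true, pred))                   # step 4: compare
print("NMI:", normalized_mutual_info_score(true, pred))
```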
Algorithm 8: Predict New Location
1: Define new sample as [18.3, −14]
2: Predict latitude, longitude using best model
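Prediction on a fresh sample then reduces to a single call; interpreting [18.3, −14] as an (MFCC feature, RSSI in dBm) pair is our assumption.

```python
import numpy as np

# best_model: the lowest-MSE regressor selected in Algorithm 6.
new_sample = np.array([[18.3, -14.0]])       # assumed (MFCC feature, RSSI) pair
lat, lon = best_model.predict(new_sample)[0]
print(f"Predicted location: ({lat:.5f}, {lon:.5f})")
```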
Algorithm 9: Estimate Location Error
1: Compute Haversine error for each prediction
2: Compute UTM Euclidean error for each prediction
3: Compute Geodesic error for each prediction
4: Calculate mean error for each method
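The three distance metrics of Algorithm 9 can be sketched as follows, assuming the utm and geopy packages alongside a hand-rolled Haversine distance on a spherical Earth (R = 6371 km); true_points and pred_points denote lists of (latitude, longitude) tuples.

```python
import numpy as np
import utm
from geopy.distance import geodesic

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters on a spherical Earth
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi, dlmb = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * 6_371_000 * np.arcsin(np.sqrt(a))

def utm_m(lat1, lon1, lat2, lon2):
    # Euclidean distance in meters after projecting both points to UTM
    e1, n1, *_ = utm.from_latlon(lat1, lon1)
    e2, n2, *_ = utm.from_latlon(lat2, lon2)
    return float(np.hypot(e2 - e1, n2 - n1))

errors = np.array([(haversine_m(*t, *p), utm_m(*t, *p), geodesic(t, p).meters)
                   for t, p in zip(true_points, pred_points)])
print("Mean error (m) [Haversine, UTM, geodesic]:", errors.mean(axis=0))
```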
Algorithm 10: Visualize Prediction Accuracy
1: Plot true vs. predicted GPS coordinates
2: Connect predicted and true points with lines and distance annotations
3: Customize plot with labels and legends
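Finally, a matplotlib sketch of the error visualization in Algorithm 10, following the encoding described for Figure 25 (blue dots for true positions, red crosses for predictions, dashed connectors colored by error); the viridis colormap is an assumption, and errors_m stands for the per-sample geodesic errors from Algorithm 9.

```python
import matplotlib.pyplot as plt
from matplotlib import cm, colors

# errors_m: NumPy array of per-sample errors in meters (assumed precomputed).
norm = colors.Normalize(errors_m.min(), errors_m.max())
for (tlat, tlon), (plat, plon), e in zip(true_points, pred_points, errors_m):
    plt.plot([tlon, plon], [tlat, plat], "--", color=cm.viridis(norm(e)))
    plt.annotate(f"{e:.0f} m", ((tlon + plon) / 2, (tlat + plat) / 2), fontsize=7)
t_lat, t_lon = zip(*true_points)
p_lat, p_lon = zip(*pred_points)
plt.scatter(t_lon, t_lat, c="blue", label="True")
plt.scatter(p_lon, p_lat, marker="x", c="red", label="Predicted")
plt.colorbar(cm.ScalarMappable(norm=norm, cmap="viridis"), ax=plt.gca(), label="Error (m)")
plt.xlabel("Longitude"); plt.ylabel("Latitude"); plt.legend(); plt.show()
```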
When we compare the audio map (Figure 24) with the prediction accuracy map (Figure 25), patterns emerge: locations with high MFCC_4 values and highly variable MFCCs (a high MFCC mean) correspond to larger prediction errors, because the model finds it harder to localize against acoustically complex backgrounds. Conversely, more consistent environments (low MFCC mean and low MFCC_4) lead to better predictions. This insight is important because it demonstrates that even a smartphone microphone can gather meaningful audio data that accurately reflects the environment and significantly influences prediction performance.

12. Conclusions and Future Work

In this study, we introduced a new method for passively estimating location using the internal sounds recorded by a smartphone’s microphone. Our findings showed that these sounds, which are created during wireless activity, contain patterns that match WiFi RSSI patterns. By extracting MFCC features from the audio, we demonstrated that ML models can roughly estimate a device’s location without relying on GPS, continuous network connections, or active scanning.

The primary contribution of this work is the identification of an acoustic side channel that links the device’s power usage to its physical location. This adds a new direction to localization research, showing that location can be estimated even when the device is not actively sending or receiving signals. Although the average error of about three to five kilometers is not accurate enough for detailed navigation, it is sufficient for coarse location tracking in situations where other methods are not feasible, such as network failures, privacy-focused operations, or areas where GPS signals are blocked.

This study also raises significant privacy concerns. Our results show that apps with microphone access, a permission often granted for common tasks, could unknowingly record sounds that reveal location information. This suggests that app permission systems may need improvement and that users should be made more aware of the risks associated with such hidden side channels [13,14].

Looking forward, this work could be extended in several ways. First, testing the system with larger and more varied datasets, covering a wider range of locations, longer recordings, and different phone models, would make the results more reliable. Testing indoors, where GPS is typically less effective and sound patterns are more stable, could also improve accuracy. Finally, better filtering methods and real-time processing tools could help the system operate reliably in the presence of background noise, making it more practical for everyday use. Overall, this work demonstrates that internal audio can be a useful and privacy-friendly method for estimating location. It opens the door to a new type of passive localization system that requires no extra infrastructure, while also highlighting the growing challenges of protecting privacy in a world full of sensors.

Author Contributions

All authors have contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data Availability Statement

The data are unavailable due to privacy restrictions.

Acknowledgments

The authors would like to thank Maisam M. Gneimat, a graduate from the University of Jordan, for her assistance with this work (email: mys0192850@ju.edu.jo).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Liu, H.; Darabi, H.; Banerjee, P.; Liu, J. Survey of wireless indoor positioning techniques and systems. IEEE Trans. Syst. Man Cybern. Part C 2007, 37, 1067–1080.
2. Chen, X.; Dovis, F.; Peng, S.; Morton, Y. Comparative Studies of GPS Multipath Mitigation Methods Performance. IEEE Trans. Aerosp. Electron. Syst. 2013, 49, 1555–1568.
3. Wildemeersch, M.; Slump, C.H.; Rabbachin, A. Acquisition of GNSS signals in urban interference environment. IEEE Trans. Aerosp. Electron. Syst. 2014, 50, 1078–1091.
4. Bahl, P.; Padmanabhan, V.N. RADAR: An in-building RF-based user location and tracking system. In Proceedings of the IEEE INFOCOM, Tel Aviv, Israel, 26–30 March 2000.
5. Tarzia, S.P.; Dinda, P.A.; Dick, R.P.; Memik, G. Indoor localization without infrastructure using the acoustic background spectrum. In Proceedings of the 9th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), Washington, DC, USA, 28 June–1 July 2011.
6. Priyantha, N.B.; Chakraborty, A.; Balakrishnan, H. The cricket location-support system. In Proceedings of the Sixth Annual International Conference on Mobile Computing and Networking (MobiCom), Boston, MA, USA, 6–11 August 2000.
7. Chakraborty, R.; Nadeu, C. Sound-model-based acoustic source localization using distributed microphone arrays. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 1–5.
8. Narain, S.; Zhang, W.; Wang, G. Echo-ID: Smartphone-Based Acoustic Region Identification. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2020, 4, 1–27.
9. Luo, W.; Song, Q.; Yan, Z.; Tan, R.; Lin, G. Indoor Smartphone SLAM with Acoustic Echoes. IEEE Trans. Mob. Comput. 2024, 23, 6634–6649.
10. Cheng, B.; Huang, Y.; Zou, C. Robust Indoor Positioning with Smartphone by Utilizing Encoded Chirp Acoustic Signal. Sensors 2024, 24, 6332.
11. Zhang, Z.; Lee, M.; Choi, S. Deep-Learning-Based Wi-Fi Indoor Positioning System Using Continuous CSI of Trajectories. Sensors 2021, 21, 5776.
12. Khan, D.; Alonazi, M.; Abdelhaq, M.; Al Mudawi, N.; Algarni, A.; Jalal, A.; Liu, H. Robust human locomotion and localization activity recognition over multisensory. Front. Physiol. 2024, 15, 1344887.
13. Genkin, D.; Shamir, A.; Tromer, E. RSA key extraction via low-bandwidth acoustic cryptanalysis. In Proceedings of the Advances in Cryptology—CRYPTO 2014, Santa Barbara, CA, USA, 17–21 August 2014.
14. Adhin, V.S.; Maliekkal, A.; Mukilan, K.; Sanjay, K.; Chitra, R.; James, A.P. Acoustic Side Channel Attack for Device Identification using Deep Learning Models. In Proceedings of the 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Lansing, MI, USA, 9–11 August 2021; pp. 857–860.
15. Hailu, T.G.; Guo, X.; Si, H.; Li, L.; Zhang, Y. Theories and methods for indoor positioning systems: A comparative analysis, challenges, and prospective measures. Sensors 2024, 24, 6876.
16. Jeon, Y.; Kim, M.; Kim, H.; Yoon, J.W. I’m Listening to Your Location! Inferring User Location with Acoustic Side Channels. In Proceedings of the World Wide Web Conference (WWW), Lyon, France, 23–27 April 2018; pp. 339–348.
17. Nagaraja, S.; Shah, R. VoIPLoc: Passive VoIP Call Provenance via Acoustic Side-Channels. In Proceedings of the ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), Seoul, Republic of Korea, 27–29 May 2021; p. 12.
18. Amro, O.; Jaswanth, S.; Banoth, S.D.; Chatterjee, U. i-Know What You Do: Privacy Evaluation of Apple Smartphones with Remote Acoustic Side-Channels. In Proceedings of the International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 23–25 April 2025.
19. Stevens, S.S.; Volkmann, J.; Newman, E.B. A Scale for the Measurement of the Psychological Magnitude Pitch. J. Acoust. Soc. Am. 1937, 8, 185–190.
20. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67.
21. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
22. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
23. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
24. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
25. Duan, T.; Avati, A.; Ding, D.Y.; Thai, S.; Basu, S.O.; Ng, A.Y. NGBoost: Natural gradient boosting for probabilistic prediction. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020.
26. Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; Elhadad, N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 1721–1730.
27. Drucker, H.; Burges, C.J.; Kaufman, L.; Smola, A.; Vapnik, V. Support vector regression machines. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1997; Volume 9, pp. 155–161. Available online: https://www.researchgate.net/publication/309185766_Support_vector_regression_machines (accessed on 21 October 2025).
28. Yan, L.; Guo, Y.; Chen, X.; Mei, H. A Study on Power Side Channels on Mobile Devices. arXiv 2015, arXiv:1512.07972.
29. Wang, P.; Nagaraja, S.; Bourquard, A.; Gao, H.; Yan, J. SoK: Acoustic Side Channels. arXiv 2023, arXiv:2308.03806.
30. Arp, D.; Quiring, E.; Wressnegger, C.; Rieck, K. Privacy Threats through Ultrasonic Side Channels on Mobile Devices. In Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&P), Paris, France, 26–28 April 2017.
31. Genkin, D.; Nissan, N.; Schuster, R.; Tromer, E. Lend Me Your Ear: Passive Remote Physical Side Channels on PCs. In Proceedings of the USENIX Security Symposium, Boston, MA, USA, 10–12 August 2022. Available online: https://www.usenix.org/system/files/sec22-genkin.pdf (accessed on 21 October 2025).
32. Xu, H.; Fan, Z.; Liu, X. Application of personalized federated learning methods to environmental sound classification: A comparative study. Eng. Appl. Artif. Intell. 2024, 135, 108760.
33. Xu, W.; Hou, Y.; Hung, Y.S.; Zou, Y. A comparative analysis of Spearman’s rho and Kendall’s tau in Normal and Contaminated Normal Models. Signal Process. 2013, 93, 261–276.
34. Singh, N.; Sangho, C.; Rajiv, P. Machine learning based indoor localization using Wi-Fi RSSI fingerprints: An overview. IEEE Access 2021, 9, 127150–127174.
35. Liashchynskyi, P.; Liashchynskyi, M. Grid search, random search, genetic algorithm: A big comparison for NAS. In Proceedings of the 2019 IEEE Third International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 21–25 August 2019; pp. 457–460.
36. Mathur, S.; Reznik, A.; Ye, C.; Mukherjee, R.; Rahman, A.; Shah, Y.; Trappe, W.; Mandayam, N. Exploiting the physical layer for enhanced security [Security and Privacy in Emerging Wireless Networks]. IEEE Wirel. Commun. 2010, 17, 63–70.
37. Youssef, M.; Agrawala, A. The Horus WLAN location determination system. In Proceedings of the 3rd International Conference on Mobile Systems, Applications, and Services (MobiSys), Seattle, WA, USA, 6–8 June 2005; pp. 205–218.
38. Chen, Z.; Zou, H.; Jiang, H.; Zhu, Q.; Xie, L. Fusion of WiFi, Smartphone Sensors and Landmarks Using the Kalman Filter for Indoor Localization. Sensors 2015, 15, 715–732.
39. Jekaterynczuk, P.; Kozlowski, K.; Niezabitowski, M. A Survey of Sound Source Localization and Detection Methods and Their Applications. Sensors 2024, 24, 68.
40. Nogueira, A.F.R.; Oliveira, H.S.; Machado, J.J.; Tavares, J.M.R. Sound classification and processing of urban environments: A systematic literature review. Sensors 2022, 22, 8608.
Figure 1. Process of Extracting MFCCs.
Figure 2. Illustration of the “singing capacitor” phenomenon and its connection to passive localization. Wireless transmissions cause variable power draw, prompting the PMIC to adjust current, which mechanically excites capacitors/inductors. These oscillations generate faint internal sounds that can be recorded by the smartphone’s microphone and analyzed (using MFCC), revealing correlations with RSSI and enabling coarse-grained location inference.
Figure 3. Representative evidence that MFCCs capture device acoustic leakage: Comparative MFCC and spectral diagnostics between device and background recordings; Panels show (Top-left) example MFCC heatmap for a device, (Top-right) example MFCC heatmap for background, both have MFCC_2 index darker than others. (Bottom-left) mean ± SEM of MFCC coefficients for device vs. background, (Bottom-right) t-SNE embedding of all MFCC extracted features with class labels.
Figure 4. Features ranked for examples of device and background clips. MFCC_2 has several separable time series clips, whereas MFCC_8 has matching behavior for all clips.
Figure 5. Predictor feature importance for device and background clips.
Figure 6. Average power spectral density (PSD) across example clips. The power of frequencies below 7.5 kHz is slightly separable (device powers are higher in many cases).
Figure 7. ROC curve (test) for the device/background classifier with MFCC ranks shown in Figure 5 (AUC = 0.953).
Figure 8. Annotated view of the iPhone 6 motherboard showing the microphone and wireless communication circuitry.
Figure 9. Close-up showing surface-mount capacitors near the iPhone 6 microphone.
Figure 10. Threat Model. During the testing phase, there are no elevated permissions, and the system operates in a passive mode, with no interaction with external network infrastructure.
Figure 11. Workflow of audio-based localization analysis using MFCC features and machine learning.
Figure 12. Overview of the experimental setup: audio and RSSI data are collected using an iPhone under various conditions, processed, and aligned for training.
Figure 13. Spectrogram of Environment 1. Bands between 2 and 15 kHz exhibit high intensities, which are influenced by variations in environmental reverberation.
Figure 14. Spectrogram of recorded audio in Environment 2. A consistent low-frequency band is observed around 2–5 kHz, which corresponds to internal power activity during stable signal transmission in a quiet environment.
Figure 15. Correlation of audio features with RSSI using Pearson, Spearman, and Kendall coefficients. Most lower-order MFCCs show a negative correlation with RSSI. MFCC_9 deviates from this trend because its corresponding frequency band overlaps a persistent device-specific harmonic and environmental noise in our recordings.
Figure 16. Correlation heatmap of ZCR, Spectral Centroid, MFCC_1–MFCC_13, RSSI, latitude, and longitude from the aluminum foil shielding experiment; colors indicate Pearson correlation coefficients and reveal MFCC bands and basic audio descriptors that co-vary with wireless signal strength.
Figure 17. Mean location error (meters) for each evaluated model on the held-out test set. Bars show the mean error; lower values are better. Models are trained and evaluated with identical preprocessing and data splits.
Figure 18. High Mean Location Error (MLE) in Poorly Performing Models.
Figure 19. The PDF of RSSI values.
Figure 20. Clustering with DBSCAN of MFCC_11 and RSSI.
Figure 21. Location with RSSI and MFCC 10 and 11 as features.
Figure 22. Clustering with KMeans and MFCC_4 as a feature.
Figure 23. Clustering results showing identified location groups based on MFCC and RSSI features with KNN.
Figure 24. The strength of the fourth MFCC and the mean value of all 13 features compared to GPS locations.
Figure 25. The predicted and the true locations with the Light Gradient Boosting Machine model.
Table 1. Comparison of recent localization methods using audio and sensor data.

| Study/Project | Year | MFCC Features | Environment | Real-Time Prediction | Regression-Based Coordinates | Machine Learning Model | Dataset Size | Main Objective |
|---|---|---|---|---|---|---|---|---|
| Noise Signature Indoor Localization [15] | 2023 | Yes (among features) | Indoor (room-level) | No (post-processed offline) | No (classification of room) | Ensemble classifiers (J48 DT, KNN, MLP, SVM, etc.) | 19 distinct rooms/hallways; ~25 min ambient audio per room, segmented into 5 s samples (5700 samples total) | Classify the current room or corridor by its ambient noise “signature” using a smartphone, with no extra infrastructure |
| Echo-ID (Smartphone Echo Region ID) [8] | 2023 | No (uses STFT of echoes) | Indoor (region-level) | Partially (requires 17 s scan) | No (classification of region) | Deep learning classifier (CNN-based) on echo spectra | 5 region contexts (e.g., different rooms); user holds phone at 2 orthogonal angles for ~8.5 s each to record chirps per region | Identify which predefined region/room the phone is in by actively emitting chirps and analyzing the acoustic echoes on-device |
| RATBILS (Encoded Chirp Positioning) [10] | 2024 | No | Indoor | Yes (designed for real-time) | Yes (x–y coordinates via TDOA) | No explicit ML (signal processing; improved TOA/TDOA detection) | Two indoor test scenes with multiple acoustic base stations; evaluated regions within a lab setting (sub-1 m accuracy achieved in line-of-sight) | Provide high-accuracy real-time indoor positioning using multilateration of arrival times of coded acoustic chirps with standard smartphone hardware |
| ELF-SLAM (Smartphone Echo SLAM) [9] | 2023 | No | Indoor | Yes (SLAM runs on-device) | Yes (x–y trajectory mapping) | Contrastive deep learning (self-supervised); integrated with IMU for SLAM loop-closure | 3 diverse indoor environments (office, mall, etc.); ~128 reference spots in largest site; user walks with phone, echoes collected for mapping | Enable simultaneous localization and mapping using a phone’s own ultrasonic echoes; <1 m accuracy without prior fingerprint database |
| Khan et al. (Audio + IMU Localization) [12] | 2024 | Yes (MFCC) | Indoor & Outdoor | No (batch dataset analysis) | No (context classification) | CNN for environment classification; LSTM for motion/HAR | Two public datasets: Opportunity (indoor wearable data, 12 subjects) and Extrasensory (mixed indoor/outdoor smartphone data from 60 users) | Recognize user context and activities (e.g., locomotion, and whether indoors or outdoors) using smartphone sensors; MFCC features help distinguish environment type |
| Jeon et al. (“I’m Listening to Your Location!”) [16] | 2018 | No (uses ENF signatures) | Indoor | No (requires prior ENF map) | No (location classification) | Signal correlation with ENF database | 4 buildings, each with 3–5 floors; ENF audio recorded and aligned with reference data | Infer coarse location by matching ambient audio to regional ENF power grid noise; requires environmental profiling |
| VoIPLoc (Nagaraja and Shah) [17] | 2021 | No | Indoor (room-scale) | No (requires call recording) | No (differential inference of location) | Statistical analysis of echo fingerprinting | 40 recordings across rooms; used during real VoIP calls; focused on relative location change | Detect whether a user has changed rooms between VoIP calls by analyzing echo profile variations |
| Amro et al. (ISQED’25) [18] | 2025 | Yes (MFCC used on internal audio) | Indoor & Outdoor | No (offline change detection) | No (no absolute location estimation) | MFCC feature tracking and delta analysis with t-SNE and KNN | 10 locations; multiple smartphone sessions with calls triggered to induce wireless activity | Detect relative location changes between consecutive wireless events using internal acoustic emissions |
| Our work | 2025 | Yes (MFCC used exclusively) | Indoor & Outdoor (urban areas) | No (offline post-processing) | Yes (regression-based coarse coordinates) | Ensemble models (NGBoost, Gradient Boosting, Random Forest); also tested SVR, MLP, TabNet, etc. | >10 locations; recordings under varied conditions; ~30 s sessions; ~5 segments per recording | Passive location provenance using only audio from internal hardware vibrations (no GPS or WiFi scanning) |