Categorizing Touch-Input Locations from Touchscreen Device Interfaces via on-Board Mechano-Acoustic Transducers

Abstract: Many mobile electronic devices, including smartphones and tablets, require the user to interact physically with the device by tapping the touchscreen. Conveniently, these compact devices are also equipped with high-precision transducers such as accelerometers and microphones, integrated mechanically and designed on-board to support a range of user functionalities. However, unintended access to these transducer signals (bypassing normal on-board data access controls) may allow sensitive user interaction information to be detected and thereby exploited. In this study, we show that acoustic features extracted from the on-board microphone signals, supported with accelerometer and gyroscope signals, may be used together with machine learning techniques to successfully determine the user’s touch input location on a touchscreen: our ensemble model, namely the random forest model, predicts touch input location with up to 86% accuracy in a realistic scenario. Accordingly, we present the approach and techniques used and the performance of the model developed, and discuss limitations and possible mitigation methods to thwart exploitation of such unintended signal channels.


Introduction
The availability of high-precision sensors such as cameras, accelerometers and microphones on modern mobile devices affords users a wide range of functionality such as navigation, virtual assistants and even pedometers. While on-board sensors enable rich user experiences, these sensors can be exploited by malicious applications to monitor the user in unintended ways by tracking transducer signals emanating from the device, such as electrical, sound and vibration signals, which often contain information about the device processes, operation and user interactions. These "collateral" signals have significant implications in the field of cyber-security: they have been used to bypass cryptographic algorithms such as RSA (Rivest-Shamir-Adleman) [1], and acoustic information has been exploited to retrieve sensitive information such as user PIN codes [2] and passwords [3]. Touchscreens, the de facto user input interface, take up a significant portion of the device's physical surface. Consequently, user interactions with these touchscreens generate non-negligible signals that can be recorded by on-board sensors.
On-board motion sensors such as the gyroscope and accelerometer found on mobile devices are particularly sensitive to force and direction changes when a user interacts with the device by tapping the screen. By analysing the gyroscope and accelerometer sensor readings, off-the-shelf mobile applications (apps) such as TouchLogger are able to infer user text inputs on various mobile operating systems (iOS and Android) with different device form factors and physical designs. Extending this work on hardware sensors, ACCessory is another application built to evaluate text input using a predictive model to infer character sequences from accelerometer data with supervised learning techniques [4].
Earlier studies analysed acoustic excitation to retrieve user input on non-touchscreen devices such as physical computer keyboards. It was demonstrated that keystroke inference can be performed using multiple microphones with relatively high accuracy [5], even when facilitated via Voice-over-IP (VoIP) services such as Skype [6]. Consequently, it is conceivable that such unauthorized audio recording can be used to recover sensitive user information, using inter-keystroke timing or statistical analysis to recover typed text [7] or even ten-character passwords within 20 attempts [8]. While some practitioners may dismiss the use of acoustic signals as a possible security loophole on mobile devices [9], recent publications [10][11][12][13][14] show that acoustic techniques used to retrieve text input on physical keyboards, such as tracking the Doppler effect, supplying an external excitation signal and Time Difference of Arrival (TDOA), can be adapted to compromise mobile devices as well. One such system, SonarSnoop, utilises an active acoustic technique, emitting inaudible acoustic signals and recording the echo to profile user interaction and infer touchscreen unlock patterns [14]. Comparably, passive techniques like TDOA, which calculate the time difference between the reception of the signal by different transducers to infer input location, can be further enriched with acoustic frequency analysis to distinguish touch input [13]. In this study, we apply machine learning techniques to predict the touchscreen input location from touchscreen device interfaces via acoustic fingerprints collected from on-board mechano-acoustic transducers.
An acoustic fingerprint is a summary of acoustic features extracted from an acoustic signal that can identify similar acoustic events [15]. Acoustic fingerprinting is often combined with statistical methods to identify similar types of sounds quickly and has seen application in a broad range of arenas from identifying pop music [16] to determining volcanic eruptions which inject ash into the tropopause [17].
Leveraging insights from various studies on acoustic signals from keystroke clicks on physical keyboards [5], we explore the use of keystroke inference and acoustic fingerprinting techniques on touchscreens. Using the mobile device's on-board microphones, we surmise that acoustic signals arising from the interactions with the touchscreen can reveal the user's touch input location and can thus be used to eavesdrop on sensitive input information. Such a pathway may inadvertently allow user data input associated with screen input location to be inferred without users noticing [18]. We extract acoustic features from the on-board microphone signals, supported with accelerometer and gyroscope movement data, to separate and classify user touch input location on the touchscreen. Our contributions in this study are twofold. Firstly, we generate a dataset containing 2-channel (stereo) audio recordings and movement data of user touch input location on a touchscreen surface under both controlled and realistic conditions. Secondly, we compare the performance of acoustic features and movement data in categorizing touch input location using machine learning algorithms.
To address the acoustic side channel presented in this paper, one can consider mitigation techniques, which can be broadly classified into three categories: prevention, jamming and shielding. Prevention techniques limit access to device sensors and can be implemented both at the hardware level with physical switches and in software with user access control policies. Jamming typically involves saturating the sensor with noise or false information to mask the actual sound created by the touch input. Side-channel leakage can also be attenuated via shielding, using physical means such as altering or redistributing mass to guard against acoustic side channels.
Accordingly, this paper is organized as follows: Detection of different sources of touch input and the data collection approach used is described in Section 2. In Section 3, we describe experiments conducted regarding the extracted features and the classification process. Section 4 contains results produced as part of this investigation. We discuss the results of various sensors and possible mitigation measures in Section 5 and finally conclusions in Section 6.

Methodology
In this study, we record user touch input on a touchscreen and capture the corresponding physical emanations with on-board sensors. A customised Android application was adapted [19] to present different input layouts and capture data from the hardware motion sensors and audio input. Acoustic features are extracted from the audio recordings, and sensor data are categorised by touch input location. We apply machine learning techniques to the movement data and acoustic feature datasets to train separate models for each experiment. We evaluate the performance of each model and investigate the underlying phenomena that contribute to the accuracy of the model. To improve model accuracy, stereo microphone input is used and the relevant audio segments are extracted with peak detection techniques. Several experiments were conducted to explore the accuracy and robustness of the selected features. Building upon prior work in this domain, we examine the use of movement data as a predictor for touch input in a realistic scenario. In addition to motion sensors such as the accelerometers used in [9], we delve into the use of acoustic features for touch input classification under the same conditions and identify salient features contributing to touch input identification with a reduced feature set. Finally, we consider the relation between physical distance and separation efficiency by restricting the touch input locations to a reduced area of the touchscreen.
To investigate, we conducted a number of experiments using on-board sensors to capture the physical emanations of interactions with the touchscreen, labeled according to the corresponding touch input location.
The experiments were conducted on a Samsung Galaxy S7 Android mobile phone (SM-G930FD), first using the entire touchscreen interface (110 × 66 mm) and subsequently using a portion of the screen (43 × 66 mm). In both cases the area is divided equally into nine separate touch-input locations of three rows (Top, Mid, Bottom) and three columns (Left, Mid, Right), as shown in Figure 1. An Android application was adapted to capture data from hardware motion sensors and record acoustic signals from the on-board microphones. Sensor data corresponding to each of the nine locations are recorded in sequential sessions and categorised by touch-input location. As we expected to process multiple acoustic and sensor datasets, we opted to create an automated data processing pipeline to retrieve the recorded microphone signal and conduct feature extraction on it, ensuring consistency across different recording sessions.

Recording Movement Data
To evaluate the feasibility of user touch input detection with movement data, we collected various hardware motion sensor signals accessible using our Android application.
The Android platform provides applications with access to multiple environmental sensors available on the specific device model. For the purpose of our experiments, we extracted Linear Accelerometer L(x), L(y), L(z), Gyroscope G(x), G(y), G(z) and composite Rotation Vector R(x), R(y), R(z) signals, recorded at a 1 kHz sampling rate to capture changes in the physical state of the device.

Recording Audio Data
During the recording session, dual on-board microphones located near the top and bottom of the device (see Figure 2) are used to record the stereo microphone input, captured at a 44.1 kHz sampling rate in 16-bit pulse-code modulation (PCM) format. Interactions with the touchscreen create a mechano-acoustic response presenting as an impulse in the continuous microphone signal, as shown in Figure 3. We record and label touch inputs for each location (Top-Left, Mid-Right etc.) in separate sessions to create the corresponding training and test datasets. Figure 3 shows that the signals from the top and bottom microphones are out of phase, which suggests that the microphone membranes receiving the pulse have opposite orientations and which distinguishes the two signals. The difference between the amplitudes of the pulse hints at different degrees of mechanical coupling and internal gain of the two microphones, and could also be attributed to the proximity of one microphone to the source of the pulse compared to the other. We further note that the initial transient in the top signal contains a sharp spike absent from the bottom signal, while a larger difference is observed between the first maximum and minimum of the bottom signal. The bottom signal also has a longer tail with more oscillations after the pulse, while the top signal falls away quickly. With a single microphone signal, the pulse generated from a tap contains information about the touch input location based on how the sound waves propagate through the device, with different locations causing the phone to vibrate and respond uniquely (the phone's interior structure is not homogeneous). The fact that the pulses observed in the top and bottom signals do not simply mirror each other suggests that the signals received at the two microphones are in fact unique.
Furthermore, aftershocks with multiple peaks reflect the complex interactions between the finger and the mechanical response of the phone, which could include internal mechanical reflections. These repeated peaks may have unique magnitudes and time intervals associated with the location of the touch input as heard by two microphones positioned asymmetrically on the device. From these observations, we posit that the top and bottom microphones hear different acoustic signals in response to the same pulse, which supports our assumption of uniqueness of the two signals. Our approach hinges on the uniqueness of the response of each microphone, but it is the systematic difference between the two that allows our machine learning model to distinguish the location of the touch input. To identify and isolate the segments of the acoustic signal that correlate with touch input location, we apply smoothing (moving average of 11 samples) before peak detection on the acoustic recording, as seen in Figure 4, to ensure impulse-like signals are detected. A local peak detection algorithm is applied with a sliding window to select the local maximum element from the neighbourhood of the chosen frame. These detected peaks are compared against the entries from a local minimum detection algorithm applied on the original signal to exclude areas of elevated signal which are non-impulsive (plateaus) from our peak detection dataset. We apply an empirically determined peak window size of 5000 sample points (approximately 110 ms) and a peak intensity threshold of 20%.
Acoustic features commonly used in audio recognition and audio classification problems were extracted using pyAudioAnalysis, an open-source Python library for audio signal analysis [20]. The acoustic signals are divided into frames of 27 ms (1200 samples, empirically determined to encompass the duration of a typical touch input impulse, cf. Figure 3) and for every frame a number of 'short term' features are extracted. These features include mel frequency cepstral coefficients (MFCCs), chroma vectors, zero crossing rate (ZCR), energy, energy entropy, spectral entropy, spectral flux, spectral roll-off, spectral spread, spectral centroid and chroma deviation. Altogether, 34 acoustic features (13 MFCCs, 13 chroma features, five spectral features, energy, energy entropy and zero crossing rate) were extracted for every frame, totalling 68 features from both acoustic channels. The details regarding these features can be found in [20,21].
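The smoothing and peak-detection step described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the rectification of the signal before smoothing, the tie-breaking rule and the function names are assumptions made for illustration, and the additional local-minimum check used to reject plateaus is omitted. Only the 11-sample moving average, the 5000-sample window and the 20% threshold come from the text.

```python
def moving_average(signal, width=11):
    """Centred moving average; near the edges only available samples are used."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def detect_peaks(signal, window=5000, rel_threshold=0.2, smooth_width=11):
    """Return indices of impulse-like peaks in a mono signal.

    A sample is kept when its smoothed magnitude is the maximum of the
    surrounding window and exceeds rel_threshold times the global maximum.
    """
    # Assumption: smooth the rectified signal so that out-of-phase pulses
    # are treated alike.
    envelope = moving_average([abs(s) for s in signal], smooth_width)
    floor = rel_threshold * max(envelope, default=0.0)
    if floor <= 0:
        return []
    half = window // 2
    peaks = []
    for i, value in enumerate(envelope):
        if value < floor:
            continue
        lo, hi = max(0, i - half), min(len(envelope), i + half + 1)
        # Ties are resolved in favour of the earliest sample in the window.
        if all(value > e for e in envelope[lo:i]) and \
           all(value >= e for e in envelope[i + 1:hi]):
            peaks.append(i)
    return peaks
```

Each returned index marks one candidate tap; the segment around it would then be passed to feature extraction.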
Two noteworthy sets of acoustic features used in training are MFCCs and chroma vectors, both of which contribute significantly towards touch input categorization. MFCCs focus on the perceptually relevant aspects of the audio spectrum and are commonly used in speech and speaker recognition; they estimate the energy in various regions of the audio spectrum over a set of overlapping non-linear mel-filter banks. Chroma vectors characterise the energy distribution across cyclical frequency bins [22].
To facilitate feature extraction, we determine a time window around the detected impulse, distinguishing and excluding it from regions of silence or noise. Starting from the detected peak, an empirically determined time buffer is applied backwards to include the start of the impulse. ZCR across a sliding window is used to identify the start of the impulse: the start of an impulse is associated with a low ZCR value as shown in Figure 5, where we apply an empirically determined ZCR threshold of 0.015. To ensure we capture the whole impulse, we begin the window with an offset from the start of the ZCR index. An empirically determined window size of 1200 samples (27 ms) with an offset ratio of 30% of the sample number between the initial index and ZCR start (shown in Figure 6) is used to encompass the entire impulse, ensuring important signal features are captured. The procedure can be summarised as follows: (1) the peak is detected; (2) a sample buffer is applied backwards; (3) a drop in ZCR is detected, which defines the start of the tapping impulse; and (4) the window begins 30% from the start of the ZCR index to ensure the whole impulse event is captured.
Data normalization was performed on the extracted features to ensure that the contribution of each feature is equally weighted, which avoids bias that could occur across recording sessions.
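The windowing procedure above can be sketched in a few lines of Python. This is one possible reading of the description rather than the authors' code: the backward buffer of 600 samples and the ZCR frame length of 100 samples are hypothetical values (the paper only states that they were determined empirically), and the 30% offset is interpreted here as stepping back from the detected onset by 30% of the distance between the buffer start and the onset.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)


def impulse_window(signal, peak, buffer=600, zcr_frame=100,
                   zcr_threshold=0.015, offset_ratio=0.3, size=1200):
    """Return (start, end) sample indices of the window around one tap."""
    # (2) Step back from the detected peak by a fixed buffer (hypothetical size).
    initial = max(0, peak - buffer)
    # (3) The first frame whose ZCR falls below the threshold marks the onset.
    onset = peak
    for i in range(initial, peak):
        if zero_crossing_rate(signal[i:i + zcr_frame]) < zcr_threshold:
            onset = i
            break
    # (4) Assumed interpretation: start slightly before the onset so that
    # the beginning of the impulse is not clipped.
    start = onset - int(offset_ratio * (onset - initial))
    return start, min(start + size, len(signal))
```

Broadband background noise changes sign often and therefore has a high ZCR, whereas the lower-frequency tap response does not, which is why a drop in ZCR is a usable onset marker.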
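The paper does not state which normalization scheme was used. As an illustration only, the sketch below assumes per-feature standardisation (zero mean, unit variance) with statistics estimated on the training features and reused for the test features; the function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Compute per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant feature has zero spread; fall back to 1.0 to avoid division by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardise each feature with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

Fitting on the training session alone keeps information from the test session out of the model; whether the original pipeline did this is not stated.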

Experiments
Experiment recording conditions, parameters and constraints are kept consistent across the recording sessions to ensure the reliability and repeatability of the results. Before the start of the recording session, all external device notifications, haptic feedback and audio tones are disabled and the device screen is set to remain active during the experiment. We conduct the experiment in a quiet room with relatively low ambient noise (less than 50 dB). Both the audio samples and movement data are time synchronised and labelled with the touch input location. In total, 150 touch inputs were collected for each of the nine touch input locations, of which 50 were used to train the classifier with the remaining 100 used to test the model. The training dataset was deliberately limited to 50 touch inputs to avoid over-fitting and to demonstrate that a relatively accurate model can be created with limited inputs. Each session is repeated three times on different days in varying room conditions. This allows us to validate our methodology and ensure the results are consistent across sessions.
The movement data and acoustic features extracted from the audio samples are used to distinguish and classify the touch input originating from different locations. To identify the touch input location, a random forest classifier created in Python with scikit-learn [23] is trained on the set of features extracted from the touch input and the results are mapped to one of the nine locations. A random forest classifier with 100 estimators (trees) is selected for this experiment. The choice of the random forest classifier is motivated by the benefits of ensemble learning, which perturbs and combines a number of machine learning models to improve the performance of the classifier. Figure 7 shows the experiment methodology. Several experiments were conducted with the movement data and acoustic features extracted from the audio samples. In the first investigation, named Movement Data Experiment, movement data is partitioned into five sets with four sets used to train the classifier and one set reserved for tests. The process is repeated five times and the designated test set is rotated in each iteration. In the second investigation, named Device Orientation Experiment, the device is relocated and rotated 90 degrees (Figure 8) between the training and test sessions and the movement data is used to identify the touch input location. In the third investigation, named Acoustic Feature Experiment, a classifier is trained to identify the touch input location associated with the audio sample. The full touchscreen surface is tested in the experiment and the device is relocated and rotated between training and testing sessions to simulate a realistic user-input scenario. Finally, in the fourth investigation, named Reduced Touch-Input Area Experiment, the touch input area is reduced and the classifier is trained to evaluate audio samples from the reduced touch input area under realistic conditions where the device is relocated and rotated between training and testing sessions.
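The classification setup can be sketched with scikit-learn as follows. The data here are synthetic stand-ins (random Gaussian clusters), not the recorded dataset, and the random seeds are arbitrary; only the nine classes, the 68 features, the 50/100 train/test split per location and the 100 trees are taken from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_LOCATIONS, N_FEATURES = 9, 68
rng = np.random.default_rng(0)

# Hypothetical class centres standing in for the per-location acoustic fingerprints.
centers = rng.normal(scale=2.0, size=(N_LOCATIONS, N_FEATURES))


def synthetic_session(n_per_location):
    """Draw n_per_location noisy feature vectors around each class centre."""
    X = np.vstack([c + rng.normal(scale=0.5, size=(n_per_location, N_FEATURES))
                   for c in centers])
    y = np.repeat(np.arange(N_LOCATIONS), n_per_location)
    return X, y


X_train, y_train = synthetic_session(50)    # 50 taps per location for training
X_test, y_test = synthetic_session(100)     # 100 taps per location for testing

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)        # fraction of correctly labelled taps
importances = clf.feature_importances_      # one weight per feature, summing to 1
```

With real recordings, `X_train` and `X_test` would come from separate sessions so that changes in device position and orientation are reflected in the reported accuracy.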

Movement Data Experiment
To investigate the use of movement data from an Android device to detect touch input, we perform a validation experiment. This experiment also replicates the earlier work [9] used to detect touch input and hence verifies our methodology. The device position and orientation are kept constant and the sessions are recorded sequentially to maintain the same device position and orientation across sessions. Touch inputs from the nine locations are collated into a single dataset and a random forest classifier is applied with 5-fold cross-validation. In each fold, 80% of the recorded dataset is allocated for training and the remaining 20% is used to predict the touch input location from the extracted movement data.
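The rotation of the held-out set can be made concrete with a short standard-library sketch. The shuffling, the seed and the function name are assumptions for illustration; the paper only specifies five partitions with one reserved for testing in each iteration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists, rotating the held-out fold k times."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k roughly equal partitions
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test
```

For 150 taps at each of nine locations (1350 samples), every iteration trains on 1080 samples and tests on the remaining 270, and each sample is tested exactly once.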

Device Orientation Experiment
In departure from [9] we now include the effects of device position and orientation on the performance and robustness of our model using movement data, under realistic user-input conditions. Training and test datasets are recorded with the device re-orientated between the sessions as seen in Figure 8. Movement data is recorded and feature extraction is performed for all nine touch input locations. A new model is created by applying the random forest classifier to the training data and evaluated against the test dataset.

Acoustic Feature Experiment
To simulate everyday smartphone usage, we evaluate the efficacy of acoustic features in detecting touch input under field recording conditions. Audio samples for the training dataset are first recorded with the device in a horizontal orientation (Figure 8). The device is relocated, rotated and oriented vertically before recording the test dataset. Segments in the acoustic signal corresponding to touch inputs in the nine locations are isolated and acoustic features described in Section 2.2 are extracted for training and evaluation. A new model is created by applying the random forest classifier to evaluate the performance of acoustic data in identifying touch input.

Reduced Touch-Input Area Experiment
To better depict everyday usage of mobile phones in vertical orientation, where the keypad area is reduced to the bottom third of the screen, we now evaluate the performance of acoustic features in distinguishing different touch inputs within a restricted area. Applying experimental parameters similar to those of the Acoustic Feature Experiment, touch input is now restricted to 43 × 66 mm (as opposed to the original 110 × 66 mm fullscreen area), the effective size of the default number pad input. Audio samples for each touch input location are similarly recorded with the on-board microphones, with device position and orientation varied between the training and test sessions. A new model is trained by applying the random forest classifier to the acoustic features extracted from the audio samples in the training dataset. The model is then evaluated against the test dataset to determine the effects of the touch input area on prediction accuracy.

Results

Movement Data Experiment
The validation experiment successfully identified the touch input location with high accuracy (99%) using the cross-validated movement dataset, similar to the results reported by [9]. The confusion matrix is shown in Figure 9 (correct classification and misclassification indicated in percentage).
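A confusion matrix expressed in percentages, as used in the figures, can be computed from true and predicted labels as in the sketch below. The function names and the two-label example are illustrative assumptions; each row is normalised by the number of samples of that true class.

```python
def confusion_matrix_percent(true_labels, predicted_labels, classes):
    """Row-normalised confusion matrix: entry [i][j] is the percentage of
    samples of class i that were predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    percent = []
    for row in counts:
        total = sum(row)
        percent.append([100.0 * c / total if total else 0.0 for c in row])
    return percent


def accuracy(true_labels, predicted_labels):
    """Overall fraction of correctly classified samples."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

The diagonal holds the per-location correct classification rate, and the off-diagonal entries show which locations are confused with one another.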
To better understand the contribution of the various movement data to the performance of the model, the relative importance of each feature was identified. The weights of the features contributing to touch input location classification are listed in Table 1, with the rotation vector being the most important. The rotation vector can be understood intuitively as the changes in pitch, roll and yaw of the device when a force is applied at different touch input locations on the touchscreen. The rotation vector is represented by the angle of rotation around an axis that would bring the device from the reference orientation to its current orientation. A force applied near the edges of the device changes its orientation, which is recorded as an angle along each axis relative to the reference device orientation. The rotation along all three axes is mapped to a specific region, which can then be used alongside the other sensor features to accurately predict touch input location.

Device Orientation Experiment
When varying the device position and orientation, we are unable to replicate the findings of the Movement Data Experiment and the results reported in [9] with our training and test datasets. In contrast with [9], our results show a drastic degradation of prediction accuracy from 99% to 15%, as seen in Figure 10, when the device position and orientation are altered between the training and test sessions. This suggests that device position and orientation are a key component in the movement data [10], which reduces the generalisability of the model to other orientations.
Analysing movement data for the training and test sessions in Figure 11, we observe a separation in the feature space between readings from different sessions for the rotation vector. This suggests that the classification model derived from one session cannot be used to predict touch input location from another session, as the data from each session occupy a different region of the feature space. The rotation vector is a composite of various environment sensors, including the magnetometer; the magnetometer measures the direction, strength, or relative change of the earth's magnetic field at a particular location and is dependent on the absolute (global) position of the device [24]. Slight changes to the device position will result in a large deviation in the sensor readings. The feature space corresponding to the testing and training sessions is shown in Figure 11. Readings from the linear acceleration in each axis, shown in Figure 11a-c, overlap for both training and test sessions. Likewise, gyroscope sensor readings along each axis from the training and test sessions, seen in Figure 11g-i, are distributed similarly across the entire range. In contrast, a clear separation between the training and testing sensor readings in all three axes is observed in Figure 11d-f for the rotation vector.

Acoustic Feature Experiment
The trained classifier successfully predicted whether an audio sample originated from a particular touch input location with an average accuracy of 86.2% (Figure 12). The model was able to predict the touch input location with varying degrees of accuracy, achieving the best performance near the corners of the touchscreen located near the on-board microphones. This fits with our understanding that using multiple microphones located at different positions allows for increased distinctiveness in acoustic signal attributes. Furthermore, the relative position of a particular touch input location to the two microphones remains the same, which maintains the robustness of the acoustic features used for classification irrespective of changes in device position or orientation. The relative importance of the audio recognition features from both channels was computed over several experiments and the top common features are listed in Table 2. A new classifier trained on this subset of acoustic features showed a slight decrease in accuracy, predicting touch inputs with an accuracy of up to 82.2%. In the earlier experiment (Section 4.2), device position and orientation clearly mattered to the success of the classifier, but Section 4.3 shows that acoustic features offer resistance to changes in device position and orientation, with most of the distinguishing information found in chroma deviation and specific spectral features. The confusion matrix for the classifier trained using the subset of acoustic features, shown in Figure 13, reflects the decrease in accuracy and a higher rate of misclassification in neighbouring touch input locations compared to Figure 12. Acoustic features were further analysed using t-SNE [25]. The resulting t-SNE map, which reveals the inter-region heterogeneity and intra-region homogeneity, is shown in Figure 14.
In this figure, acoustic features extracted for a given touch input location are colored according to the location label. The t-SNE scatter-plot reveals clear dimensional separations based on acoustic characteristics. Figure 14 also highlights the chances of misclassification of touch input samples in each location with physically adjacent regions having a higher probability of misclassification. Note also a few stray points of red and blue near the cluster of black: this may reflect the possibility of misclassification attributed to chance similarity of the acoustic signals collected.
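The paper does not describe how the "top common features" of Table 2 were selected from the per-experiment importance rankings. One plausible approach, sketched here purely as an assumption, is to keep the features that appear among the k most important in every run and order them by mean importance. The feature names in the usage example are hypothetical.

```python
def top_common_features(importance_runs, k=10):
    """Features ranked in the top k of every run, ordered by mean importance.

    importance_runs: list of dicts mapping feature name -> importance,
    one dict per experiment; all dicts are assumed to share the same keys.
    """
    top_sets = [set(sorted(run, key=run.get, reverse=True)[:k])
                for run in importance_runs]
    common = set.intersection(*top_sets)
    mean_importance = {
        name: sum(run[name] for run in importance_runs) / len(importance_runs)
        for name in common
    }
    return sorted(common, key=mean_importance.get, reverse=True)


# Usage with two hypothetical runs:
runs = [
    {"mfcc_1": 0.40, "chroma_3": 0.30, "zcr": 0.20, "energy": 0.10},
    {"mfcc_1": 0.35, "chroma_3": 0.15, "zcr": 0.10, "energy": 0.40},
]
selected = top_common_features(runs, k=3)
```

Retraining the classifier on only the selected columns would then give the reduced-feature model whose accuracy is compared against the full 68-feature model.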

Reduced Touch-Input Area Experiment
Unsurprisingly, a reduction in the effective touch input area sees a decrease in accuracy with the trained classifier. However, the model is still able to predict the touch input locations with an accuracy of 78.8%, a decrease of just 7.4 percentage points from the previous experiment. Touch input locations are now constrained within a smaller area, thereby increasing intra-group variance while decreasing inter-group variability due to increased touch input proximity. Additionally, touch input locations are now further from the upper microphone and hence suffer from a poorer signal-to-noise ratio, which reduces the distinguishing acoustic features and hence performance, as seen from the somewhat higher misclassification rate (47%) of the middle row of touch inputs (cf. Figure 1b) in Figure 15.

Discussion
We observe that a cross-validated approach using movement data as reported in [9] resulted in good classification accuracy when distinguishing touch input without changing position and orientation. However, its performance severely degrades when device position and orientation are changed (see Section 4.2), conditions which were not explored by [9] and was in fact overlooked in their investigation. As expected, this limits the efficacy of movement data in locating touch input and motivates seeking pathways which generate sensor data that is resistant to changes in device position and orientation.
Such a pathway which is more robust can be derived by exploiting multiple microphone signals on the devices. The non-similarity of the upper and lower microphone responses to touch input location shown in Figure 3 and fixed positions relative to the device increase the quality of information extracted, while also ensure less sensitivity to changes in position and orientation. Applying audio processing techniques and analysing the acoustic features therein, we can identify the touchscreen location tapped. This approach will yield increased accuracy with additional microphones, as seen in devices such as the recent iPhone models which now include up to four on-board microphones. Previously, techniques such as Time Difference of Arrival (TDoA) analysis have seen application on standard keyboards [7] and touchscreen surfaces [26] to identify possible sets of keys within a restricted area (virtual keyboard) with up to 90% accuracy [27]. In line with the previous studies, acoustic features used in our setup is able to identify touch input location across the entire touchscreen an average accuracy of 86.2%, offering almost comparable performance. These studies reinforce the fact that physical signals arising from user interaction may offer unintended pathways of data compromise.
As introduced in Section 2, MFCC and chroma vectors features contribute over 80% to touch input classification and can used in combination to create an acoustic fingerprint of different sections of the screen to categorize touch input location, by distilling and removing redundancies in the raw signal and allow analysis to be concentrated on salient attributes of the acoustic event. Comparatively, spectral and energy entropy features contribute only up to 20% to categorise user touch input. Filtering out these peripheral features can improve the generalisability of our model with only minor degradation in accuracy when applying machine learning techniques to determine user touch input location by identifying acoustic markers.
The use of acoustic features maintains good discrimination of touch input classes across varying positions and orientations. However, it is subject to the maximum distance available between touch input locations: with a smaller touch input area (number pad input or simply a smaller device), the acoustic features extracted may be less distinct due to the increased proximity of user touch input locations, which reduces accuracy from 86.2% to 78.8%. Even under such limiting conditions, owing to the unique signal response and position of each on-board microphone, acoustic features still provide performance comparable with detection techniques that rely on accelerometer data to identify touch input over a larger touch input area, whose accuracy ranges from 78.0% on smartphones [4] to 84.6% on smartwatches [28].
These results confirm that touch input location on touchscreen devices can be retrieved from physical signals associated with the device. They also provide the basis and motivation to further investigate the links between the physical characteristics of touch input location and acoustic excitation, and how the physical device and signals interact.
To address the threat of unauthorised data access, techniques to limit fine-grained sensor readings can be employed to complicate keystroke inference. However, this approach may pose problems for legitimate mobile apps and introduce usability issues as well. A more tempered approach would be to disable access to sensors when users are required to provide input in a sensitive application, or to include a physical kill-switch. Where an attacker does gain access to the microphone sensor on the target device, such acoustic-based incursions may nevertheless be minimized by changing the keyboard layout each time, emitting masking sounds which can interfere with touch input location identification, or enabling haptic feedback to alter the acoustic response used in touch input classification and thus lower the rate of touch input detection. Unintended data exposure can also be attenuated by changing the mechano-acoustic response of the device to frustrate or invalidate the acoustic fingerprint registered by the trained classifier, such as by fitting a heavy rubber phone cover or dynamically altering the distribution of mass or mechanical coupling on the device. These techniques and countermeasures against acoustic incursions are summarised in Table 3.
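One of the mitigations above, changing the keyboard layout each time, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name, the ten-digit key set and the three-column grid are hypothetical choices, and a real on-screen keyboard would also need to handle rendering and accessibility.

```python
import secrets


def shuffled_keypad(keys="0123456789", columns=3):
    """Return the keys in a fresh random order, arranged as rows of a grid.

    Uses the operating system's random source so that the layout cannot be
    predicted from earlier layouts.
    """
    pool = list(keys)
    secrets.SystemRandom().shuffle(pool)
    return [pool[i:i + columns] for i in range(0, len(pool), columns)]


# Generate a new layout for each sensitive input prompt.
layout = shuffled_keypad()
for row in layout:
    print(" ".join(row))
```

Because a tapped screen location no longer maps to a fixed key, a classifier that recovers the location alone would not directly recover the digit entered.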

Conclusions and Future Work
We show that it is possible to determine the touch input location on a touchscreen via acoustic and movement information extracted from a mobile device. Acoustic features proved to be more effective under realistic usage conditions than movement data alone; user touch input location can indeed be determined from audio recordings using on-board microphone sensors. The ensemble machine learning algorithm chosen for this investigation, a random forest, is effective in classifying touch input location with an accuracy of 86.2%. This has wide-ranging implications for user input privacy on mobile communication devices equipped with on-board mechano-acoustic sensors.
In future work, further investigation of the acoustic signal and physical characteristics of the device will allow us to determine whether other acoustic features, or features extracted from pre-trained networks, can further improve the sensitivity of the model. Neural networks may also be employed to automatically detect touch input. A larger dataset including swipe and long-press input across a variety of implements (e.g., stylus), with the device held at different orientations and inclinations, may also be considered to evaluate the generalisability of the acoustic model. Model transferability across users and devices may be evaluated with the use of pre-trained models created with inputs from different users. This can be further extended with a larger dataset to address dynamic virtual keyboard layouts on different mobile devices such as smartphones and tablets, and by collecting additional data with external microphones. This dataset can be further extended using augmentation techniques. Furthermore, as touchscreen interfaces are adopted by industrial appliances, the approach presented could be used to analyse these devices for potential vulnerability to acoustic side-channels. Finally, the success of incursion mitigation techniques must also be investigated to determine the most effective and practical approaches which can be implemented by users seeking increased data privacy.