MetaEar: Imperceptible Acoustic Side Channel Continuous Authentication Based on ERTF

With the development of ubiquitous mobile devices, biometric authentication has received much attention from researchers. For immersive experiences in AR (augmented reality), convenient continuous biometric authentication technologies are required to secure electronic assets and transactions made through head-mounted devices. Existing fingerprint and face authentication methods are vulnerable to spoofing and replay attacks. In this paper, we propose MetaEar, which harnesses head-mounted devices to send FMCW (Frequency-Modulated Continuous Wave) ultrasonic signals for continuous biometric authentication of the human ear. Leveraging channel estimation theory, MetaEar uses the CIR (channel impulse response) to model the physiological structure of the human ear, which we call the Ear Related Transfer Function (ERTF). The ERTF extracts unique representations of the human ear's intrinsic and extrinsic biometric features. To overcome the data dependency of Deep Learning and improve deployability on mobile devices, we use a lightweight learning approach for classification and authentication. Our implementation and evaluation show that the average accuracy reaches about 96% in different scenarios with small amounts of data. MetaEar enables immersive, deployable authentication and is more resistant to replay and impersonation attacks.


Introduction
Ubiquitous smart devices, such as smart phones, smart headphones, and smart watches, with their rich built-in sensors, have become an important access carrier for the Cyber-Physical Human System (CPHS) [1]. On the one hand, the requirements of work and entertainment during the mobile process make wearing headphones a daily behavior, which spawns a new computing model, earable computing [2]. On the other hand, taking mobility, ubiquity, and human-centricity into account, CPHS and even the Metaverse have higher requirements for continuous identity authentication [3,4]. The global biometric authentication market size is projected to reach 30.5 billion U.S. dollars by 2026 [5].
Digital assets and commercial transactions require highly secure and reliable continuous authentication to protect user property. These transactions require ongoing multi-factor authentication to ensure their security and validity. In addition, continuous, safe and efficient biometric authentication can also be applied to scenarios such as telemedicine, disabled services, and virtual production.
Biometric authentication has the advantages of stochastic variation across different people and no dependence on shared secrets. The existing face, fingerprint, iris, and voiceprint authentication methods are based on extrinsic biometrics. Authentication based on face, fingerprint, and iris raises privacy concerns [6] and is vulnerable to presentation attacks, in which an adversary uses a photo [7], video [8], mask [9], or silicone fingertip [10] to impersonate a victim. In addition, voiceprint-based authentication is vulnerable to replay attacks [11,12] that use voice recordings. So, can we design a continuous and transparent biometric authentication method to ensure the system's security?
At present, ultrasonic authentication through the human ear is a novel solution. The head-mounted device sends ultrasonic sound to the human ear and performs continuous identity authentication based on the ear's response. By collecting the FMCW (Frequency-Modulated Continuous Wave) ultrasonic signals reflected from the human ear, the system uses the signal channel characteristics to describe the ear's anatomy and extract unique biological features [13] for authentication. The inaudible ultrasound enables continuous passive authentication without affecting human hearing or interfering with the user's immersive experience, releasing the user from frequent authentication interactions. Furthermore, unlike existing face and fingerprint authentication, ultrasonic authentication can resist counterfeiting and replay attacks, thus ensuring the security of transactions in the Metaverse. On this basis, we aim to build an explainable acoustic authentication model that can be applied in daily scenarios with zero effort and high accuracy.
However, we face three major technical challenges in achieving a one-size-fits-all model. First, the features originally used in acoustic action recognition, such as Doppler shift [14] and phase [15], target dynamic behaviors and cannot describe the features of static objects. Given the uniqueness of the biological structures of auricles and ear canals, how do we design a model that extracts the unique biological geometric features of static auricles and ear canals? Second, the FMCW acoustic signals received by two microphones in real time are not synchronized. How can the multi-microphone signals be effectively synchronized? Third, ultrasound is a vibration wave, and for a monostatic sensing device with co-located transceivers, vibration causes self-interference in the received signal. How do we remove the self-interference and extract high-resolution characteristic signals?
To overcome these challenges, we propose MetaEar, a continuous authentication system based on imperceptible acoustic fingerprints, which uses FMCW ultrasound to model the unique biometrics of the human ear and conduct authentication, as shown in Figure 1. The key component of MetaEar is the ERTF (Ear Related Transfer Function), which identifies and extracts the unique biometrics of the human ear. Our observation is that each person's auricle and ear structure respond differently to FMCW ultrasound in terms of delay and magnitude. The auricle and ear canal modulate the sound signal to produce different delays. The eardrum and the cochlea convert mechanical sound waves of different magnitudes into neural electrical signals. The channel impulse response of the human ear can therefore be modeled to extract the ear's unique features and authenticate the user. The speaker on the headset sends inaudible FMCW ultrasonic waves, and the co-located microphone receives the reflected sound signal. After band-pass filtering, the signals are aligned using the phase slope change, because peak detection by cross-correlation is ambiguous. MetaEar employs dual-microphone differential denoising to eliminate the self-interference caused by the co-located physical transmission of the vibration wave. Finally, the channel impulse response of the signal is calculated by modeling the ERTF, which yields the unique biometric vector of the human ear. This process is equivalent to a model-based feature embedding of the human ear anatomy. For better application deployment, instead of using a Deep Learning model that requires large amounts of data, we utilize a traditional SVM (Support Vector Machine) to perform one-class authentication. Because MetaEar models the complex structure of the human ear, and because ultrasonic signals propagate relatively slowly and at low power, the system prevents co-located signal collection and resists counterfeiting and replay attacks.

In a nutshell, our core contributions are three-fold.
• We propose the MetaEar system, which uses the ultrasonic reflection signals of the auricle and ear canal to continuously authenticate users with ERTF, and which effectively prevents replay attacks and counterfeiting attacks.
• We design a characteristic function that represents the auricle and ear canal biometric features through the impulse response principle of channel estimation, which effectively expresses the unique biological characteristics of the human auricle and ear canal.
• We build a prototype of MetaEar with commercial off-the-shelf smartphones and evaluate its effectiveness and security in different settings. Extensive experiments show that it achieves an authentication accuracy of over 96.8%.
The rest of this paper is organized as follows. In Section 2, we review the related works. Section 3 introduces the adversary model. Section 4 gives the mathematical derivation of how ERTF models the unique features of the human ear and presents the feasibility study. Section 5 presents the MetaEar architecture design. Section 6 describes the implementation setup. Section 7 shows the evaluation results, followed by a conclusion in Section 8.

Earable Computing
The era of earable computing is coming, and "earable" [2,16] refers to wearable devices around the ears, such as mobile phones, headsets, headphones, and smart glasses. These devices can utilize acoustic signals to interact with people, such as listening to music, sensing oral activities [14,17,18], and even using the ear canal for authentication [19][20][21].

Acoustic Authentication
Acoustic waves have been used for a range of sensing tasks, from temperature measurement [22], tracking [23][24][25], gesture sensing [26,27], and activity recognition [28] to breath monitoring [15]. Authentication based on sound mainly uses the acoustic signal to extract corresponding unique biometric features. Some methods use audible sounds, such as voiceprints [29], for authentication. Some use the FMCW ultrasonic signal to extract the features of teeth actions for authentication [20,30], and some use the sound signal to extract the characteristics of throat movement [31] for authentication. The FMCW ultrasonic signal is also used for lip motion [32] authentication or face liveness [33] detection.

Continuous Authentication
As augmented reality evolves, continuous authentication ensures system security without interfering with the user's immersive experience. Some works use WiFi signals for continuous authentication [34,35], but the sensitivity of RF signals to location and orientation seriously reduces generalization. Existing works using heart rate [36][37][38] and respiration [39,40] require the user to remain motionless, are prone to severe interference in multipath environments, and cannot be applied to multi-person scenarios. Behaviors [41] can also be used for continuous authentication but are vulnerable to impersonation attacks. Continuous authentication based on eye movement [42,43] is sensitive to ambient light and requires costly equipment. There is also work using the ear canal [19,21], but the devices are all earbuds, which can not only cause irreversible damage to the user's hearing but also place stricter requirements on the depth and position of the earplugs in the ear canal.

Threat Model
We assume that attackers cannot compromise the headset hardware. An attack would trivially succeed if the attacker could implant malware in the headset and obtain any private data he or she wants, so we exclude this case. In other words, the software and hardware resources of the augmented reality device are assumed to be secure. Based on these assumptions, two types of adversarial behaviors are considered below.
Replay attack: The adversary is physically near the victim and his/her enrolled handset, for example in a crowded campus cafe or public vehicle, and collects the acoustic signals. The collected signals can then be replayed to pose as the victim when entering the system, invading the victim's Metaverse property and privacy [44].
Impersonation attack: The adversary sends and receives sound wave signals from his or her own ears in different modes, impersonating the victim in order to deceive the authentication system. Alternatively, silicone bionic materials can be used to imitate biological structures of the human body, for example, imitating faces or fingerprints with silicone; both faces and fingerprints are explicit biometrics that can be easily forged. In a counterfeiting attack against the ear, the adversary uses a forged 3D-printed artificial dummy ear made of anthropomorphic materials to complete the intrusion, deceiving the system in order to destroy or take the owner's virtual property [45,46].

Ear Structure
The human ear is an essential auditory organ of the human body and has unique biological characteristics that differ from person to person [47]. The human ear structure is divided into three parts, as shown in Figure 2, in which the auricle and the ear canal collect and transmit sound. The human ear canal is about 25∼35 mm long and about 8 mm in diameter. The eardrum and cochlea perceive sound. Different people have different geometric shapes of the external auricle and ear canal, and the response of the eardrum and cochlea to sound is also unique to each person. Therefore, continuous authentication can be realized by modeling the characteristics of the ear's intrinsic physiological structure and extrinsic geometric shape.

Figure 2. The physiological structure of the human ear (labeled parts: auricle, ear canal, eardrum, cochlea).

ERTF
In Metaverse, in addition to vision, sound immersion is also essential, allowing users to maintain the immersion of sound and perceive spatial positioning. Individualized or generic HRTFs (Head Related Transfer Functions) are usually employed to render spatialized sounds within an AR headset.
An HRTF describes how an ear and head receive sound from a sound source. The ERTF is very similar but describes the response of the ear profile to the FMCW acoustic signal. Most notably, the shape of the pinna, the length and diameter of the ear canal, and subtle differences in biological properties influence the incoming ultrasonic signal by altering some delays and phases. When the reflected sound signal reaches the microphone, the intrinsic unique biometric features of the human ear can be collected. The ERTF is thus the change in the sound's response profile caused by the unique characteristics of the user's ear. It is mainly produced by the pinna, ear canal, and cochlea, and we define it as:
$$\mathrm{ERTF} = \mathrm{IFFT}\big(H(f)\big) \tag{1}$$
where $H(f)$ is the channel frequency response produced by the disparate anatomy of the ear, and IFFT is the Inverse Fast Fourier Transform. We treat the whole acoustic sensing process as analogous to communication through an RF signal. The acoustic sensing channel can be modeled as a linear time-invariant system, which effectively models propagation delay and signal attenuation along multiple propagation paths. All the channel parameter changes caused by the user's organic textures are therefore captured by the channel state estimation. In this way, we can easily extract the unique features of the subject's ear.
To achieve that, we borrow the idea of channel estimation, which is used to determine the fading and path loss of a wireless channel. The received signal can be mathematically represented as $r(t) = h(t) * s(t)$, where $*$ denotes convolution, $h(t)$ represents the CIR (Channel Impulse Response) of the acoustic channel, and $r(t)$ and $s(t)$ represent the received signal and transmitted signal, respectively. The linear frequency of the FMCW acoustic signal is expressed as:
$$f(t) = f_c + \frac{B}{T}\,t, \quad 0 \le t \le T \tag{2}$$
where $T$ is the FMCW chirp duration, $B$ is the chirp bandwidth, and $f_c$ is the initial frequency. The phase $u(t)$ is derived as:
$$u(t) = 2\pi \int_0^t f(t')\,dt' = 2\pi\left(f_c t + \frac{B}{2T}\,t^2\right) \tag{3}$$
So the transmitted signal is:
$$s(t) = \cos\big(u(t)\big) = \cos\left(2\pi f_c t + \frac{\pi B}{T}\,t^2\right) \tag{4}$$
Then the frequency-domain expression is:
$$S(f) = \mathcal{F}[s(t)] = \int_0^T s(t)\,e^{-j 2\pi f t}\,dt \tag{5}$$
Ultrasonic signals propagate through the ear canal and inner ear in a multipath environment and are reflected back to the microphone. Assuming there are $P$ paths and the delay of path $i$ is $\tau_i$, $i \in [1, P]$, the received multipath signal is:
$$r(t) = \sum_{i=1}^{P} \theta_i\, s(t - \tau_i) \tag{6}$$
where $\theta_i$ is the attenuation coefficient. The frequency-domain formula is:
$$R_k = S_k \sum_{i=1}^{P} \theta_i\, e^{-j 2\pi f_k \tau_i} \tag{7}$$
where $S_k$ and $R_k$ represent the transmitted and received acoustic signals at frequency bin $k$ with frequency $f_k$. The CIR-based ERTF is:
$$\mathrm{ERTF} = \mathrm{IFFT}\left(\frac{R_k}{S_k}\right) = \mathrm{IFFT}\left(\sum_{i=1}^{P} \theta_i\, e^{-j 2\pi f_k \tau_i}\right) \tag{8}$$
In practice, a CIR measurement is represented by a set of complex values. Each complex value describes the channel within a specific propagation delay range, giving the corresponding magnitude and phase of the CIR. It can be seen from Equation (8) that the ERTF depends on the attenuation $\theta_i$ and the time delay $\tau_i$. $\theta_i$ expresses the magnitude attenuation caused by the intrinsic biological structure as it converts mechanical sound waves into electrical signals, and $\tau_i$ expresses the multipath delay caused by the extrinsic geometric structure. That is why the ERTF can express the unique physiological and physical characteristics of the human ear.
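The multipath model and the frequency-domain route to the CIR can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes integer sample delays, a real cosine chirp with the parameters used later in the paper (48 kHz sampling, 18 kHz start, 4 kHz bandwidth, 10 ms duration), and a small regularization term `eps` that is introduced here only to keep the spectral division stable outside the chirp band. The function names are hypothetical.

```python
import numpy as np

FS = 48_000    # sample rate (Hz)
F_C = 18_000   # chirp start frequency (Hz)
B = 4_000      # chirp bandwidth (Hz)
T = 0.010      # chirp duration (s)

def fmcw_chirp(fs=FS, f_c=F_C, bandwidth=B, duration=T):
    """Real linear chirp s(t) = cos(2*pi*(f_c*t + B*t^2 / (2T)))."""
    t = np.arange(int(round(fs * duration))) / fs
    return np.cos(2 * np.pi * (f_c * t + bandwidth * t**2 / (2 * duration)))

def multipath(s, delays, gains, n_out):
    """r[n] = sum_i theta_i * s[n - tau_i], with integer sample delays tau_i.

    Assumes every delayed copy fits inside n_out samples.
    """
    r = np.zeros(n_out)
    for d, g in zip(delays, gains):
        r[d:d + len(s)] += g * s
    return r

def estimate_cir(r, s, eps=1e-3):
    """Regularized spectral division followed by an IFFT.

    H = R * conj(S) / (|S|^2 + eps * max|S|^2) approximates R / S inside the
    chirp band and suppresses bins where the chirp carries almost no energy.
    """
    n = len(r)
    S = np.fft.fft(s, n)
    R = np.fft.fft(r, n)
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps * np.max(np.abs(S)) ** 2)
    return np.fft.ifft(H)
```

For example, `estimate_cir(multipath(s, [10, 40], [1.0, 0.5], len(s) + 64), s)` with `s = fmcw_chirp()` returns a complex tap vector whose magnitude is concentrated around taps 10 and 40. Because the chirp only occupies 4 kHz, each path appears as a band-limited pulse rather than a single sharp tap.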

Feasibility Analysis
In a quiet hall, we used the Samsung NEXUS 6 mobile phone to collect FMCW sound signal data. The frequency band of the ultrasonic wave is 18∼22 kHz. By calculating the PSD (Power Spectral Density) $P_r(\omega)$ of the received signal for different subjects, as shown in Figure 3, we can see that the features are distinguishable.
In channel estimation with an autoregressive time series model, the PSD and the ERTF amplitude are related as in Equation (9), so the PSD can be used to show the unique features of different subjects.
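The link between the PSD and the ERTF amplitude can be made concrete with the standard input-output relation of a linear time-invariant channel. The form below is a sketch under that assumption and is not necessarily the exact expression of Equation (9); $P_s(\omega)$, the PSD of the transmitted chirp, is a symbol introduced here for illustration.

```latex
P_r(\omega) = \lvert H(\omega) \rvert^{2}\, P_s(\omega)
\quad\Longrightarrow\quad
\lvert H(\omega) \rvert = \sqrt{\frac{P_r(\omega)}{P_s(\omega)}}
```

Since every subject is probed with the same chirp, $P_s(\omega)$ is common to all measurements, so differences in $P_r(\omega)$ between subjects reflect differences in $\lvert H(\omega) \rvert$.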

Overview
MetaEar is a system for continuous biometric-based authentication using FMCW ultrasound. As shown in Figure 4, the system consists of four modules: data collection, signal processing, feature extraction, and authentication. The registration process is the same as the authentication process. First, the speaker facing the human ear sends out FMCW ultrasonic chirps separated by time slots, and the two microphones receive the reflected chirp signals.
MetaEar contains two primary parts: the signal processing and feature extraction modules. Since the device's microphones have a hardware startup time, the empty window at the beginning of the recording caused by the hardware delay is removed first. Then, for the co-located transmitting and receiving devices, self-interference cancellation is performed using the differential technique of the dual microphones. After denoising, it is critical to align each chirp before feature extraction. Here we use a method based on the phase slope so that the two received signals can be satisfactorily aligned. We only keep the signal within the corresponding frequency band and apply a Hanning filter to remove the noise outside that band. There is a time slot between every two chirps, so the actual signal segment of each chirp needs to be extracted. The segmentation is implemented through a power spectrum envelope. Subsequently, the chirp segments are input directly into the feature extraction module. The feature extraction module operates on the proposed ultrasonic chirp of a specific frequency band. First, the channel frequency response is calculated. Second, the channel frequency response is converted into the channel impulse response through the IFFT. The uniqueness of the human ear's geometric structure and intrinsic structure can be represented by the channel impulse parameters of the sound transmission channel. Then PCA (Principal Component Analysis) extracts the principal components to form feature vectors. These feature vectors are fed to the SVM of the authentication module for one-class classification to achieve efficient continuous authentication.

Acoustic Design
The ultrasonic signal must not produce audible noise that interferes with the user. Due to hardware limitations, the maximum sample rate of prevalent mobile phones is 48 kHz, so the highest sound frequency they can handle is 24 kHz. To increase the bandwidth of the sound signal while remaining inaudible, we employ ultrasonic signals above 18 kHz. The FMCW ultrasonic signal sweeps 18∼22 kHz, with $f_c$ = 18 kHz and $B$ = 4 kHz, which corresponds to wavelengths of roughly 16∼19 mm at a sound speed of about 343 m/s. The ear canal is about 25∼35 mm long. Combined with the reflection distance, the overall path length lies in the range of 1∼2 sound wavelengths, which naturally satisfies the near-field requirements.
As shown in Figure 5, the chirp duration of FMCW ultrasound is 10 ms because the smaller the FMCW period, the higher the range resolution. To avoid ISI (Inter Symbol Interference) between each chirp, we added two 10 ms idle time slots between every two chirps. The time delay between the transmitted signal and the received signal is τ.
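The transmit pattern described here can be sketched as follows. This is an illustrative reconstruction rather than the app's actual code: it assumes a real cosine up-chirp and models the two 10 ms idle slots as 20 ms of zeros after each chirp. The function names and defaults are hypothetical.

```python
import numpy as np

def fmcw_frame(fs=48_000, f_c=18_000, bandwidth=4_000,
               chirp_s=0.010, guard_s=0.020):
    """One transmit frame: a linear up-chirp followed by idle guard samples."""
    n_chirp = int(round(fs * chirp_s))
    n_guard = int(round(fs * guard_s))
    t = np.arange(n_chirp) / fs
    # Instantaneous frequency f(t) = f_c + (B / T) * t, integrated to a phase.
    phase = 2 * np.pi * (f_c * t + bandwidth * t**2 / (2 * chirp_s))
    return np.concatenate([np.cos(phase), np.zeros(n_guard)])

def fmcw_train(n_frames, **kwargs):
    """Repeat the frame n_frames times to form the playback buffer."""
    return np.tile(fmcw_frame(**kwargs), n_frames)
```

At 48 kHz, each frame is 480 chirp samples plus 960 guard samples, so a buffer of `n_frames` frames lasts `n_frames * 30` ms.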

Denoise
Hardware delay elimination. Because the transceivers of mobile phones have a hardware run-up time, the FMCW sound signal with its 10 ms period is sensitive to this short startup time. Therefore, we need to remove the empty sampling points at the beginning of the acquired signal. Empirically, this startup period lasts about 500 samples, which at a 48 kHz sample rate is 500/48,000 s ≈ 10.42 ms.
Ambient noise. Typically, common noise in daily circumstances declines sharply above 8 kHz and lies below 18 kHz [48]. Therefore, we use a Hanning-window band-pass filter and keep only the signal within 18∼22 kHz, as shown in Figure 6, to remove the high- and low-frequency noise in the environment and reduce frequency leakage.
Dynamic interference. Because the ultrasonic wave propagates slowly, is transmitted at low power, and attenuates quickly, its propagation distance is very short, and the transmission and reception area is confined to the vicinity of the ear. There is therefore almost no need to remove interference from distant moving objects. For very close moving objects, such as other people moving randomly near the user's ears, the motion frequency is generally far below 18 kHz, so the above Hanning band-pass filter can remove this part of the noise.
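One common way to realize a Hanning-window band-pass filter is a windowed-sinc FIR design. The sketch below assumes that design; the paper does not state the filter order or implementation, so the tap count of 257 and the function names are illustrative choices.

```python
import numpy as np

def hann_bandpass_taps(fs, f_lo, f_hi, n_taps=257):
    """Windowed-sinc band-pass FIR taps.

    The ideal band-pass response is the difference of two ideal low-pass
    responses (cutoffs f_hi and f_lo), tapered with a Hann window.
    """
    m = np.arange(n_taps) - (n_taps - 1) / 2
    ideal = ((2 * f_hi / fs) * np.sinc(2 * f_hi / fs * m)
             - (2 * f_lo / fs) * np.sinc(2 * f_lo / fs * m))
    return ideal * np.hanning(n_taps)

def bandpass(x, fs=48_000, f_lo=18_000, f_hi=22_000, n_taps=257):
    """Keep only the f_lo..f_hi band; output has the same length as x."""
    taps = hann_bandpass_taps(fs, f_lo, f_hi, n_taps)
    return np.convolve(x, taps, mode="same")
```

The taps are symmetric, so the filter has linear phase and `mode="same"` compensates its constant group delay. The first and last `n_taps // 2` output samples are affected by edge effects and are best discarded.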

Synchronization
The two microphone channels of sound signals have a time difference due to the distance between the two microphones, so time alignment and synchronization are required. Time alignment is obtained by transforming the signals into the frequency domain and applying the linear phase shifts corresponding to the time delay.
Existing work [49] in RF sensing uses two transmit antennas to cancel the direct path signal at the receiver. However, this approach does not work for acoustic-based ERTF. The reasons are two-fold. First, a smartphone typically has two pairs of co-located speakers and microphones. Playing sounds with any speaker may saturate the corresponding microphone by directly transferring the acoustic signal, making it infeasible to sense reflections. Second, the speakers on the smartphone are designed for different usages (e.g., communication, playing sound) and thus are highly heterogeneous. It is hard to perform equalization on a commercial audio system for FMCW chirp signals.
Instead, we leverage the two microphones available on smartphones to achieve self-interference cancellation. Specifically, suppose one speaker plays the FMCW signal, and two microphones receive $r_1(t)$ and $r_2(t)$, respectively. MetaEar then estimates the subsample delay $\varepsilon_t$ from the phase slope in the frequency domain and further calculates the correlation between the two aligned signals [37].
where $\mathcal{F}[\cdot]$ denotes the Fourier transform. Since the direct signal from the speaker to the microphones is the strongest component, we can approximate the estimate above as the delay and amplitude ratio of the direct acoustic path between the two microphones.
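The phase-slope idea can be sketched as follows. A pure delay of $\tau$ seconds multiplies the spectrum by $e^{-j 2\pi f \tau}$, so the phase of the cross-spectrum of the two channels is a straight line in $f$ with slope $-2\pi\tau$. The code fits that line over the chirp band. It is an illustrative sketch rather than the paper's exact estimator, and the function names are hypothetical.

```python
import numpy as np

def phase_slope_delay(r1, r2, fs, f_lo=18_000, f_hi=22_000):
    """Estimate how much r2 lags r1, in seconds, from the cross-spectrum phase."""
    n = len(r1)
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    cross = np.fft.rfft(r2) * np.conj(np.fft.rfft(r1))
    band = (freqs >= f_lo) & (freqs <= f_hi)
    phase = np.unwrap(np.angle(cross[band]))
    # phase(f) = -2*pi*f*tau for a pure delay, so tau = -slope / (2*pi).
    slope = np.polyfit(freqs[band], phase, 1)[0]
    return -slope / (2 * np.pi)

def fractional_shift(x, delay_s, fs):
    """Delay x by delay_s seconds (sub-sample allowed) with a linear phase ramp.

    The shift is circular, so x should be zero-padded at both ends.
    """
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    ramp = np.exp(-2j * np.pi * freqs * delay_s)
    return np.fft.irfft(np.fft.rfft(x) * ramp, n)
```

To align the channels, the estimated lag can be undone with `fractional_shift(r2, -phase_slope_delay(r1, r2, fs), fs)`. Restricting the fit to 18∼22 kHz matters because outside the chirp band the cross-spectrum is dominated by noise and its phase is unreliable.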

Dual-Mic Subtraction
After alignment, we want to remove the self-interference signal that the microphone receives through direct physical propagation. Since this part of the signal propagates through the solid hardware, where the propagation speed is about 15 times that in air, the first received signal component is the self-interference signal. The most commonly used method utilizes autocorrelation [39,50] and eliminates the self-interference through a tap peak search. However, autocorrelation has the disadvantage that the tap peak is ambiguous: the direct-path peak is too close to the peak of the regular reflection signal, so the two are too blurred to distinguish. We therefore use a dual-microphone cancellation method. Subtracting the aligned signals of the two microphones eliminates the self-interference signal and retains the relevant dynamic information about the ERTF. Ambient noise is received by both microphones, so the dual-microphone differential method can also remove it. From the above, we know that: Thus, MetaEar scales $r_2^{\mathrm{shift}}(t)$ with $\rho$ and subtracts it from $r_1(t)$, giving $r_1(t) - \rho\, r_2^{\mathrm{shift}}(t)$. Then the signals from the two microphones are time-synchronized. The synthesized signal in the time domain is formulated below.
where ∆t ε t indicates the time delay.
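The scale-and-subtract step can be illustrated with a short sketch. How $\rho$ is obtained is an assumption here: it is fitted by least squares on an initial window that is assumed to contain only the direct speaker-to-microphone component, which arrives first. The function name and the `direct_len` parameter are hypothetical.

```python
import numpy as np

def dual_mic_subtract(r1, r2_shifted, direct_len):
    """Return (r1 - rho * r2_shifted, rho).

    r2_shifted is the second channel after time alignment. rho is the
    least-squares amplitude ratio over the first direct_len samples, assumed
    to be dominated by the direct (self-interference) path.
    """
    a = r2_shifted[:direct_len]
    b = r1[:direct_len]
    rho = float(np.dot(a, b) / np.dot(a, a))
    return r1 - rho * r2_shifted, rho
```

If the direct component in `r1` is exactly `rho` times the one in `r2_shifted`, it cancels completely, and what remains is the difference of the reflected components seen by the two microphones.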

Chirp Segment
After the above procedures, the FMCW chirp transmitted in each cycle must be segmented and extracted. Because the guard slot between chirps carries no information, it is only necessary to calculate the ERTF for the transmitted and received chirps in each cycle. After trying several segmentation methods, such as VAD (Voice Activity Detection), variance, and power accumulation, we eventually chose the signal envelope, which is more computationally efficient. First, we find all peaks in the signal and keep the local maxima that are separated by at least N points; we then use spline interpolation to obtain the peak envelope of the signal, as formulated in Equation (14).
After the power spectrum envelope is obtained, we set the threshold to the overall mean (expectation) of the envelope and detect the start and end points of the chirp signal in each cycle. The procedure is given in Algorithm 1. The result is shown in Figure 7, where each chirp is segmented cleanly.
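The envelope-based segmentation can be sketched with SciPy. This is an illustration of the described steps (peaks at least N samples apart, spline envelope, mean threshold), not a transcription of Algorithm 1. The defaults `min_sep=24` and `min_len=240`, the clamping of the spline at the signal edges, and the removal of very short segments are assumptions added for robustness.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import find_peaks

def peak_envelope(x, min_sep):
    """Spline through local maxima of |x| that are at least min_sep apart."""
    mag = np.abs(x)
    peaks, _ = find_peaks(mag, distance=min_sep)
    if len(peaks) < 2:
        return mag
    # Hold the edge values instead of extrapolating beyond the outer peaks.
    idx = np.clip(np.arange(len(x)), peaks[0], peaks[-1])
    return np.clip(CubicSpline(peaks, mag[peaks])(idx), 0.0, None)

def segment_chirps(x, min_sep=24, min_len=240):
    """Return (start, end) sample pairs where the envelope exceeds its mean."""
    env = peak_envelope(x, min_sep)
    active = env > env.mean()
    edges = np.diff(active.astype(int))
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    if active[0]:
        starts = np.r_[0, starts]
    if active[-1]:
        ends = np.r_[ends, len(x)]
    # Drop spurious short bursts that cannot be a full chirp.
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]
```

Because the envelope is interpolated between peaks, the detected boundaries can deviate from the true chirp edges by up to a few tens of samples, so a small margin around each segment is advisable before computing the ERTF.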

ERTF
Previous works [51,52] have demonstrated that the microphone's frequency response (especially in the ultrasonic band) is a stable and feasible feature over time, sensitive enough even to distinguish among millions of smartphones of the same brand and model. Viewed from another angle, besides the frequency response, attenuation and time delay still express the geometric structure and intrinsic biological features of the auricle and ear canal. In this expression, the biologically unique characteristics play the dominant role.
However, time-domain equalization has high computational complexity, and its result is not accurate enough. In addition, computing the CIR in the time domain requires a complex cross-correlation. Moreover, most acoustic signal processing is implemented in the frequency domain. To avoid multiple conversions between the frequency and time domains and to obtain the CIR efficiently using the FFT (Fast Fourier Transform), we first calculate the CFR (Channel Frequency Response) in the frequency domain and then use the computationally efficient IFFT to obtain CIR = IFFT(CFR) = IFFT($H(f)$), where $H(f)$ represents the acoustic frequency-domain response.
The CIRs obtained from a set of chirp signals are stacked to form a matrix, which is the echo matrix of the signal generated by a sound cycle sequence.
To this end, the system applies PCA (Principal Component Analysis) to compress the twenty-channel matrix into one channel. Each of these channels is input as an observation to the PCA algorithm. We use the covariance analysis of PCA to decorrelate the data and project it onto the direction with the largest variance, which serves as the main feature and achieves dimensionality reduction. The component with the highest eigenvalue carries the most critical information: the user's unique biometric vector embedding.
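The compression step can be sketched with a plain SVD-based PCA. The sketch assumes the echo matrix has one row per chirp (twenty rows in the paper) and one column per CIR tap, and that real-valued CIR magnitudes are used; both the layout and the use of magnitudes are assumptions, since the text does not fix them. The function name is hypothetical.

```python
import numpy as np

def first_principal_component(cir_stack):
    """Return (direction, explained_ratio) for a (n_chirps, n_taps) matrix.

    direction is the unit-norm vector in tap space along which the rows
    vary most; explained_ratio is its share of the total variance.
    """
    x = np.asarray(cir_stack, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the eigenvectors of the sample covariance matrix,
    # ordered by decreasing singular value (and hence eigenvalue).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt[0], float(explained[0])
```

The sign of a principal direction is arbitrary, so downstream code that compares embeddings across sessions may want to fix a sign convention, for example by making the largest-magnitude entry positive.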

Authentication
Finally, the feature vector obtained by PCA is saved as the unique feature embedding of the human ear. Since authentication is essentially a classification problem, we adopt an SVM. We do not choose a data-driven deep neural network for three reasons. First, an SVM ensures the efficiency and applicability of MetaEar. Second, it avoids the heavy work of collecting large amounts of user data. Third, users can quickly sign up without retraining or fine-tuning a network. Transfer learning and similar Deep Learning techniques are therefore impractical to apply in this setting. We input the feature vectors into the SVM for training, and it eventually outputs an authentication result of 1 (legitimate user) or 0 (illegal user). MetaEar uses LibSVM [53] with an RBF (Radial Basis Function) kernel in one-class mode. The dataset is divided into a training set and a test set, which account for 75% and 25%, respectively; we then use 10-fold cross-validation to train the model and finally obtain the authentication accuracy on the test set.
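A one-class RBF SVM of this kind can be sketched with scikit-learn, whose `OneClassSVM` is built on LIBSVM. The sketch is not the authors' configuration: the feature standardization step, `nu=0.05`, `gamma="scale"`, and the function names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def enroll(features, nu=0.05, gamma="scale"):
    """Fit a one-class RBF SVM on the legitimate user's feature vectors only.

    nu upper-bounds the fraction of enrollment samples treated as outliers.
    """
    model = make_pipeline(
        StandardScaler(),
        OneClassSVM(kernel="rbf", nu=nu, gamma=gamma),
    )
    model.fit(features)
    return model

def authenticate(model, features):
    """Map OneClassSVM's +1 / -1 predictions to 1 (legitimate) / 0 (rejected)."""
    return (model.predict(features) == 1).astype(int)
```

A typical flow is `model = enroll(user_vectors)` at registration and `authenticate(model, new_vectors)` during use. Because only the legitimate user's data is needed for fitting, a new user can be enrolled without touching any other user's model.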

Mobile APP
We developed an Android app to send and collect FMCW acoustic signals within 18∼22 kHz, as shown in Figure 8. The app sets the sample rate to 48 kHz and the highest frequency to 20 kHz. The lowest frequency is 18 kHz, the chirp period is 0.01 s, and the upper speaker is employed to send the FMCW acoustic signal. Before data collection starts, the preparation time is 1000 ms, and the total number of samples to be collected is 1. As shown in Figure 8, we chose three smartphones: Huawei Nexus 6P, Samsung Galaxy S6, and OnePlus 8. They offer more device diversity than headsets and can collect data at different angles, distances, and postures, so the ERTF model can be effectively verified in multiple dimensions. Unlike earbuds, smartphones do not suffer significant deviation due to insertion depth or angle. They also avoid the harmful effect of long-term in-ear headphone use on the physiological function of the ear. In the experiment, we processed the data offline on a desktop computer with an AMD Ryzen 3 2200G and 16 GB of RAM, which ensures that the data is processed thoroughly and efficiently.

Environment
We collected FMCW sound signals with mobile phones in three scenarios: hall, laboratory, and street, as shown in Figure 9. The laboratory is relatively clean, with less noise and few people walking around. In the hall, there is human noise as well as interference from moving objects. The street is the most complicated environment, with both the largest amount of interference from moving objects and the strongest noise.

Dataset
We recruited 17 people aged 22 to 25, eleven males and six females. Each person used three mobile phones to collect data at different distances, angles, and behaviors in each scene. Each person performed the collection 780 times, for a total of 780 × 17 = 13,260 samples. The dataset is sufficient to train the SVM for classification. During SVM training, the data of the current legitimate user are labeled as positive, while the data of other users are labeled as negative. The negative samples are randomly selected from all negatively labeled users and are equal in number to the positive ones. We divide the current dataset into three parts: 70% for training, 15% for validation, and 15% for testing.

Overall Accuracy
Because the system aims to authenticate users, the task can be treated as a one-class classification problem and solved with a traditional OC-SVM (One-Class SVM). The overall confusion matrix is shown in Figure 10. It is worth noting that this confusion matrix is different from the traditional one. The x-axis represents different users, while the y-axis represents different individual models. Each row in the figure denotes the average authentication accuracy produced after inputting different users' data into the models. For each legitimate user of the system, a corresponding individual model is trained, and together these models constitute a model library. The minimum authentication accuracy is 92.8%, the highest accuracy is 100%, and the average accuracy is 96.48%.

Impact of Environment
We collected data in three different scenarios, as shown in Figure 9: the hall, the laboratory, and the noisy street. Each environment has different noise and disturbances. The indoor environments are relatively less noisy, but there is reflection interference from close-range moving objects. For the outdoor environment, we selected the noisiest and most complex street setting in order to test whether the system can work normally under the most adverse conditions. These three environments basically cover daily work and life scenarios. As shown in Figure 11, in every environment, our system can denoise sufficiently and perform secure authentication. In most cases, the authentication accuracy in the hall and lab is better than in the street environment. In the hall and the laboratory, people often walk around, and there are human voices as well as keyboard and mouse tapping noises in the laboratory. The accuracy in the hall is the highest, with an average of 96.19%. The accuracy on the noisy street is slightly lower, with an average of 92.02%. The street contains not only high-decibel motor vehicle noise but also human voices. For user7 and user9, the authentication accuracy in the street environment is also relatively good due to uncontrollable factors in the environmental noise; for example, there were few vehicles and little noise during data collection.

Impact of Angle
Different angles of the mobile phone have different effects on the authentication accuracy. We measured different angles with a semicircular protractor in the hall and collected data when the mobile phone was at four different angles, as shown in Figure 12.
The coordinate system is set in the auricle plane, with the x-axis pointing from back to front and the y-axis from top to bottom. When the mobile phone coincides with the y-axis, the angle is 0 degrees. Rotating the phone toward the x-axis gives 30 degrees, 60 degrees, and 90 degrees, so four angles are tested in total. As can be seen from the figure, most users achieve the highest authentication accuracy at 30 degrees and 60 degrees, where the average accuracy is about 93.54%. The reason is that, at these angles, the FMCW ultrasonic signal captures the characteristic expression of the pinna and ear canal most clearly. In addition, the microphone of the mobile phone receives the highest signal strength and the SNR (Signal to Noise Ratio) is optimal, which yields the best authentication performance. However, some users have the highest accuracy at 90 degrees, primarily because different users hold their mobile phones in different postures. At 90 degrees, the mobile phone is not very close to the face, so the human ear biometrics are still expressed accurately.

Impact of Distance
We evaluated how sensitive authentication is to the distance between the human ear and the transceiver. The authentication accuracy was measured with the mobile phone 1 cm, 3 cm, and 5 cm away from the ear, as shown in Figure 13. The figure shows that the accuracy at 1 cm is the highest, with an average of 92.65%. At 5 cm the accuracy is lower, with an average of 92.3%. This is mainly because, when the device is farther from the ear, the SNR decreases and the characteristic signals fed back by the ear canal and auricle cannot be received well. Furthermore, hair and earrings have a stronger negative influence at larger distances, which further lowers the accuracy. We did not test longer distances because the selective fading of the signal increases sharply with distance, so the SNR decreases and the accuracy drops severely.

Impact of Different Behavior
During the authentication process, users may be sitting motionless or making other body movements, as in daily life. We tested the most likely activities, namely remaining static, shaking the head, and walking, as shown in Figure 14. The authentication accuracy is 96.89% in the static state and 96.15% in the head-shaking state, while the average accuracy during walking is 91.60%. In the static state, the authentication performance is the best. Gentle head shaking changes the position of the mobile phone, which slightly reduces the accuracy. During walking, the vibration of the bones and muscles of the human body affects the feedback of the signal, so the accuracy declines further.

Impact of Different Devices
To measure the impact of different hardware on authentication, we used three mobile phones from different brands for experimental verification: OnePlus 8, Samsung Galaxy S6, and Huawei Nexus 6P. Because devices differ in their microphone ADC (Analog-to-Digital Converter) circuits, nonlinear processing methods, and numerical precision, the data they collect differ, which can cause differences in accuracy. We used an AMD Ryzen PC to run the same data processing for all devices. As shown in Figure 15, the average accuracy of the three devices is almost the same, showing that our system adapts to different hardware. Overall, the OnePlus 8 performs slightly better, with an accuracy of 98.44%, principally because this smartphone was produced in 2020 and has improved hardware and performance. However, the result of user10 on the OnePlus 8 is low, which is mainly caused by heavy construction noise in the environment while this user's data were collected.

Impact of Data Quantity
Machine learning is data-driven. Although OC-SVM needs much less training data than a DNN, its authentication results still vary with the amount of data. We trained the model with different amounts of data and then measured its authentication accuracy on the test data, as shown in Figure 16. As the volume of data increases, the accuracy steadily increases. With 30 training samples, the accuracy reaches 86%; with 90 training samples, the test accuracy reaches 95.56%, which can meet daily requirements. Ultimately, with 590 training samples, the accuracy reaches 98.98%.

Efficiency of Attack Defense
We conducted an attack defense test on MetaEar. In an impersonation attack, adversaries use their own biological signal to imitate the signal of the legitimate user, intending to deceive the system. We take one person as the legitimate user, and 16 other people imitate this person's habits and behavioral characteristics while their data are collected. In the hall, the voice data of all 17 people were collected at the same location and time. We then input the data of the 16 attackers into the model trained for the legitimate user; the output is shown in Figure 17. The median FAR (False Acceptance Rate) is below 4% for all attackers except user2, user9, user10, and user15. In particular, the FAR of user10 is 11%. The main reason for the deviation of these users is that the legitimate users did not follow the dictated actions when collecting data, which changed the distance between the device and the ear and significantly changed the extracted biometric characteristics of the ear, thereby increasing the probability of a successful attack. Nonetheless, attackers have a low probability of successfully executing impersonation attacks. As with face or fingerprint authentication, the registration process requires a fixed posture and procedure. If legitimate users develop fixed habits and follow the dictated actions when collecting data, the resistance of our system to attacks will be better.
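The per-attacker statistic discussed above can be illustrated with a short sketch. It assumes that each attacker makes several trials, that the FAR of a trial is the fraction of impostor attempts the legitimate user's model accepts, and that the median is taken over the trials of one attacker. The function names, the trial layout, and the decision data below are hypothetical and are not taken from the experiment.

```python
from statistics import median


def false_acceptance_rate(decisions):
    """Fraction of impostor attempts that were wrongly accepted."""
    if not decisions:
        raise ValueError("need at least one impostor attempt")
    return sum(1 for accepted in decisions if accepted) / len(decisions)


def median_far_by_attacker(trials_by_attacker):
    """Median FAR over the trials of each attacker."""
    return {
        attacker: median(false_acceptance_rate(t) for t in trials)
        for attacker, trials in trials_by_attacker.items()
    }


def flag_attackers(median_far, threshold):
    """Attackers whose median FAR exceeds the given threshold."""
    return sorted(a for a, far in median_far.items() if far > threshold)


# Made-up decisions: True means the impostor attempt was accepted.
trials = {
    "user3": [
        [False] * 9 + [True],
        [False] * 10,
        [False] * 10,
    ],
    "user10": [
        [True] + [False] * 9,
        [True, True] + [False] * 8,
        [True] + [False] * 9,
    ],
}

result = median_far_by_attacker(trials)
print(result)
print(flag_attackers(result, threshold=0.04))
```

With a 4% threshold, only attackers whose typical trial lets through more than one attempt in 25 are flagged, which mirrors how the outlying users are singled out in the text.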

Analysis of Time Efficiency
We measured the time efficiency of each system module, as shown in Table 1. The table is divided into two stages: the registration stage and the login stage. In the registration stage, the collected data must be used to train the SVM, so the times in the table are those for training on 90 samples. In the login stage, the times are those for a single sampling. 'pre' denotes data preprocessing, which includes data access and the removal of the hardware delay, and takes 0.364 s. 'Denoise' takes 0.414 s. The signal alignment, denoted 'Ali', takes 0.88 s. 'SIC' is self-interference cancellation, which takes 0.883 s. 'seg' is chirp segmentation, which takes 20.056 s; this is the total running time for segmenting the 90 samples. This stage takes noticeably longer than the other stages, but it is still much shorter than the time spent on fingerprint scanning during fingerprint authentication registration, so the system retains an overall advantage in time efficiency. 'FE' is feature extraction, which takes 0.544 s. The SVM training phase takes 1.335 s, which shows that SVM training is very efficient. In total, the registration stage takes about 24 s, which is less than the registration time of some existing fingerprint or face biometric authentication methods. When logging in, the total time consumption is 0.939 s. Moreover, in the MATLAB execution environment, the complexity of all programs is O(n). After program optimization, the processing time should be able to meet practical needs. Overall, MetaEar is an efficient and deployable system.
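As a quick check of the registration budget, the stage times quoted above can be added up directly. The snippet below only uses the numbers stated in the text; the dictionary keys and variable names are chosen for the example.

```python
# Registration-stage times in seconds, as quoted in the text (90 samples).
registration_stage_s = {
    "pre": 0.364,        # preprocessing and hardware delay removal
    "denoise": 0.414,
    "ali": 0.880,        # signal alignment
    "sic": 0.883,        # self-interference cancellation
    "seg": 20.056,       # chirp segmentation of all 90 samples
    "fe": 0.544,         # feature extraction
    "svm_train": 1.335,
}

total_s = sum(registration_stage_s.values())
seg_share = registration_stage_s["seg"] / total_s
seg_per_sample_s = registration_stage_s["seg"] / 90

print(f"total registration time: {total_s:.3f} s")       # 24.476 s
print(f"share spent in segmentation: {seg_share:.1%}")   # 81.9%
print(f"segmentation per sample: {seg_per_sample_s:.3f} s")  # 0.223 s
```

The sum agrees with the stated total of about 24 s and shows that chirp segmentation accounts for most of it, so this stage is the natural first target for the program optimization mentioned above.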

Comparison of Different Systems
Furthermore, we compared MetaEar with four other systems in three main aspects: device usage, biometric location, and average accuracy, as shown in Table 2. The average authentication accuracy of our system is the highest. We use not only the intrinsic biometrics of the ear canal but also the extrinsic biometrics of the auricle, which makes the biometric representation more distinctive. Moreover, the smartphones we use do not need to be modified, and the system relies on the SVM machine learning algorithm, so it can be efficiently integrated with existing systems and deployed in practical scenarios.

Conclusions
This paper proposes MetaEar, which models and authenticates human ear ERTF biometrics using FMCW ultrasonic signals. FMCW ultrasonic waves are sent to the ear, the dual microphones receive the feedback sound wave, features are extracted through the ERTF, and the features are then fed into the SVM for one-class authentication. Extensive experiments verify that the average authentication accuracy can reach 96.48%, which can effectively strengthen biometric authentication and resist replay attacks and impersonation attacks. However, the overall results show that neither the authentication accuracy nor the FAR in the attack resistance test has reached the optimal level. Therefore, the next step is to further improve the authentication accuracy and achieve practical deployment. First, we will use a combinatorial optimization method to extract features from the ultrasound more accurately and thereby improve the authentication accuracy; second, through multi-modal fusion biometric authentication, we will obtain more robust performance against attacks.