For the sake of repeatability, we note that an Intel i7 computer with 4 GB of RAM has been used for data preparation. Although the Sherlock dataset is considered, further data preparation is needed since each of the studied features requires a set of experimental files. In turn, an Intel i5 with 8 GB of RAM has been used to carry out the data stream mining procedure, using the Massive Online Analysis (MOA) tool in its version as of August 2017 [
25]. The experimental results are then processed with MATLAB R2016b in order to obtain
for each system setting. All the scripts developed for preparing the data, running the experiments in MOA and processing the results are publicly available in a GitHub repository [
26].
5.1. Accuracy Analysis
This section studies whether the considered data are useful for continuous authentication. For this purpose, the data stream mining algorithms introduced in
Section 4 have been applied to each data source, as well as to combinations of them. Essentially, if a given data source is unique per element, it is useful as an identifier. Therefore, this analysis focuses on determining whether audio and light measurements identify the environment, and whether battery and transmitted data identify the user.
In order to avoid any bias, this analysis is carried out in an all-vs.-all fashion, i.e., all sensorial data of all users are interleaved. This ensures the relevance of the results, since it yields a global accuracy measurement. For the sake of clarity, accuracy is computed as the ratio of successful user authentications to the total number of evaluations. More precisely, let
be the total number of records for each particular combination of data sources. Recall that this set contains data for all
subjects. Each record will be analyzed by the DSM technique at stake, leading to an authentication decision. Thus, at the end of the experiment, each user
has a number of correct authentications (succ
). The accuracy
is then computed as the ratio between the sum of these amounts over all
users and
, as shown in Equation (
4):
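The equation itself does not appear in this text. As a sketch under assumed notation (the symbols $N$ for the total number of records, $M$ for the number of subjects, $\mathrm{succ}_i$ for the correct authentications of the $i$-th user and $Acc$ for the accuracy are placeholders chosen here, not necessarily the original ones), the definition given above can be written as:

```latex
\begin{equation}
Acc = \frac{\sum_{i=1}^{M} \mathrm{succ}_{i}}{N}
\tag{4}
\end{equation}
```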
5.1.1. Data Preparation
In order to carry out this analysis, all records of each type of data (i.e., audio, light, battery and transmitted data) are gathered in a single file per type. These files are intended to support an independent analysis of each feature.
On the other hand, different features are combined as well. In this way, it is possible to determine whether the addition of features improves the identification capability. Moreover, it also enables achieving the goal of identifying a user-in-a-context. For this purpose, it is necessary to build records that gather information from both the user and the environment. This is done by merging the independent files described in the previous paragraph. For illustration purposes,
Figure 2 describes how the input data file is built for the combination of audio, light and battery sensors. Since environmental variables are sampled every 10 s, whereas user-related ones are sampled every 5 s (recall
Table 2), it is necessary to align both information sources. In particular, the merged records are built by taking every environment-related record and one out of every two user-related ones.
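The alignment described above can be sketched as follows. This is a minimal illustration, not the published scripts: the function name `merge_records`, the tuple layout and the toy values are assumptions, as is the choice of keeping the first (rather than the second) of every two user-related records.

```python
from itertools import islice


def merge_records(env_records, user_records, user_step=2):
    """Pair each environment record with every `user_step`-th user record."""
    # Assumption: the first of every two user records is the one that is kept.
    sampled_user = islice(user_records, 0, None, user_step)
    return [env + usr for env, usr in zip(env_records, sampled_user)]


# Toy data (made up): (audio, light) every 10 s and (battery,) every 5 s.
env = [(0.31, 120), (0.35, 118), (0.90, 40)]
usr = [(87,), (87,), (86,), (86,), (85,), (85,)]
merged = merge_records(env, usr)
print(merged)
```

Each merged record then carries audio, light and battery values that refer to approximately the same 10 s interval.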
5.1.2. Analysis Results
The results of this analysis are summarized in
Table 4. Concerning the environment, the best results are achieved when audio and light are considered together (68.29%). It must be noted that light alone offers a very limited identifying capability (43.31%). Although further research would be needed, this may be because light readings are more affected than audio readings when the device is carried in a bag or in a pocket: some noise can still be perceived by the device even if it is kept in a bag.
Concerning the user, battery readings offer a high capability as an identifier (97.05%). In contrast, transmitted data is a very weak identifier (18.51%). One of the main reasons is that mobile devices are continuously exchanging information with other entities, thus reducing the value space.
Considering both the user and environment together (i.e., user-in-a-context identifiability), the combination of audio, light and battery offers a remarkable accuracy (81.35%). Intuitively, it is reasonable that the achieved value is between the user and environment identification accuracies, since the data at stake is formed by both elements together.
These findings indicate to what extent these sensorial data sources are unique and distinguishable at a given point in time. In other words, the system is able to learn the values expected for each user and classify them accordingly. Importantly, this decision is not based on each reading alone, since previous values are also taken into consideration. That is, the order in which data is received is meaningful in itself. To confirm this, a side experiment was carried out. Taking the data stream with the highest accuracy (i.e., battery), a new data stream was built by randomly shuffling the readings. The method achieving the best results (i.e., KNN) was then applied to it, leading to an accuracy of 7.34%. This value confirms that the order of the readings has a direct impact on accuracy.
5.2. Immediacy, Usability and Readiness Assessment
The previous accuracy results are encouraging since they suggest that these data are useful to tell users (and their environments) apart. However, the case in which this matters in practice is when the device is stolen. In this case, the three remaining goals come into play: it is necessary to detect the theft as soon as possible (immediacy), to avoid rejecting the actual user (usability) and to require only a short learning period before the mechanism works properly (readiness).
This assessment is based on determining how
is affected by the parameters that are closely related to each goal. Therefore, the experiments for each goal consist of fixing values for its related parameters and considering all possible values for the remaining parameters. Hence, plots show the distribution of
for all these experiments. We opt for
since it serves as an indicator of the system effectiveness and has a direct real-world meaning. On the other hand, since the proposal looks for achieving a security-usability trade-off, all figures also show the impact of the usability parameter
. For the sake of clarity, the boxplot representation is adopted. Boxplots, proposed by Tukey in 1977, are relevant to show the distribution of data in a compact way [
27]. The edges of the box show the first and third quartiles (i.e., the 25th and 75th percentiles of the values), whereas the red line shows the median. Whiskers extend to the most extreme points that are not considered outliers. In our experiments, outliers are those values lying more than 1.5 times the Inter-Quartile Range (IQR), that is, the difference between the aforementioned quartiles, below the first quartile or above the third quartile. Thus, the lower whisker reaches the smallest value that is not below the first quartile minus 1.5 IQR, and the upper whisker reaches the largest value that is not above the third quartile plus 1.5 IQR. This is a typical choice for boxplots and the default configuration in MATLAB [
28].
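The whisker rule can be illustrated with a short sketch. The function name `whisker_limits` and the sample detection times are made up for illustration, and Python's `statistics.quantiles` may place the quartiles slightly differently from MATLAB, so the numbers are indicative only.

```python
import statistics


def whisker_limits(values, k=1.5):
    """Return (lower whisker, upper whisker, outliers) for a boxplot."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers reach the most extreme values that are still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return min(inside), max(inside), outliers


# Hypothetical detection times in seconds (not taken from the experiments).
times = [100, 120, 130, 150, 160, 180, 200, 950]
lower, upper, outliers = whisker_limits(times)
print(lower, upper, outliers)
```

In this toy sample the quartiles are 127.5 and 185, so the fences lie at 41.25 and 271.25 and only the value 950 is drawn as an outlier.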
5.2.1. Experimental Preparation: Data and Parameters
Based on the files prepared for the previous experiment (recall
Section 5.1.1), a new set of files is created for this analysis. In particular, the device needs to be stolen by
and measure
, that is, how fast it is possible to detect this issue.
We study user-vs.-attacker situations in order to achieve this goal, thus analysing when is in or that a robbery takes place from to . Robbery is simulated by creating files with data from the legitimate user followed by data from any other user in the dataset, who acts as the attacker. In particular, these files are created assuming that the attacker takes the device at a particular time t, and from that moment on the device is considered stolen and in the hands of the attacker. To avoid any bias, two measures are taken: (1) these experiments involve 10 randomly chosen subjects acting as legitimate users, whereas 10 other randomly chosen subjects act as attackers; and (2) a set of instances is introduced between the readings of the legitimate user and those of the attacker to simulate the act of robbery, so that there is a smooth transition between both sets of readings. Otherwise, the experiments would not be realistic since, for example, the device could move from a fully quiet environment to a crowded and noisy one in just a single reading (i.e., 5 s). In our experiments, we have introduced a set of 10 records for this purpose.
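The structure of such a file can be sketched as below. The text does not state how the transition records are generated or labelled, so the linear interpolation, the scalar readings, the label names and the function name `build_robbery_stream` are all assumptions made for illustration.

```python
def build_robbery_stream(user_readings, attacker_readings, n_transition=10):
    """Concatenate user data, a smooth transition and attacker data."""
    start, end = user_readings[-1], attacker_readings[0]
    # Assumption: transition values are linearly interpolated between the
    # last user reading and the first attacker reading.
    transition = [
        start + (end - start) * step / (n_transition + 1)
        for step in range(1, n_transition + 1)
    ]
    return (
        [(value, "user") for value in user_readings]
        + [(value, "transition") for value in transition]
        + [(value, "attacker") for value in attacker_readings]
    )


# Toy scalar readings (made up), e.g., a noise level for each subject.
stream = build_robbery_stream([10.0, 10.0], [21.0, 21.0])
print(len(stream), stream[2], stream[11])
```

With 10 transition records, the value moves from the user's level to the attacker's level in equal steps instead of jumping within a single reading.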
The design of these files has a direct impact on the type of analysis that can be carried out. Given that the DSM technique at stake first learns from the legitimate user and afterwards receives readings from the attacker, a typical accuracy analysis is not informative. Indeed, no false positives can occur as a consequence of this design. However, false negatives will happen, and this is precisely what the detection time measures. Therefore, this magnitude is a real-world illustration of the false negatives occurring until the system determines the attacker's presence.
In order to carry out this analysis, different combinations of sensorial features have been taken into account. In particular, for this analysis, the three best identifiers (recall
Section 5.1.2) have been considered: the combination of audio and light (AL) to identify the environment, the battery (B) to identify the user, and the combination of all three (ALB) to identify a user-in-a-context.
Apart from the sensorial features, different periods for learning from the legitimate user are considered. In particular, 500, 2000, 5000, 8000 and 10,000 readings are chosen. In this way, it is possible to determine whether the amount of available knowledge about the legitimate user leads to faster detection. In turn, the robbery period has been set to 5000 readings. Since this period is equivalent to around 7 h (in the case of battery), we believe that it should be enough to detect any robbery.
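The equivalence between readings and elapsed time follows from the sampling periods (5 s for B, 10 s for AL and ALB). A small sketch of the conversion, with the helper name `readings_to_hours` chosen here for illustration:

```python
# Sampling period in seconds per sensor combination, as stated in the text.
SAMPLING_PERIOD_S = {"B": 5, "AL": 10, "ALB": 10}


def readings_to_hours(n_readings, sensor):
    """Convert a number of readings into hours of elapsed time."""
    return n_readings * SAMPLING_PERIOD_S[sensor] / 3600


print(round(readings_to_hours(5000, "B"), 2))   # battery robbery period
print(round(readings_to_hours(5000, "AL"), 1))  # same period for AL
```

For the battery, 5000 readings correspond to 25,000 s, i.e., roughly 6.9 h, whereas the same number of AL or ALB readings spans about twice as long.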
Considering all these variables (i.e., amount of users, sensorial data and learning periods), the amount of files involved for these experiments is calculated as follows:
With respect to experimental parameters, KNN is the only algorithm considered as it is the one that achieved the best accuracy results (recall
Section 5.1). In terms of parameter
k of this algorithm, the values 3, 10 and 21 are considered. In turn, the number of instances stored in the device may be 1000, 5000 or 10,000. Concerning the value
(recall
Section 4.1), we take the values
, recalling that
means 5 s for B and 10 s for AL and ALB.
Apart from these general parameters, the immediacy assessment requires an additional setting. In particular, it is interesting to compare the effectiveness of the system in the user-vs.-attacker situation (i.e., the one described so far) with an artificial user-vs.-user one. In this artificial case, the same user plays the role of the attacker: some of his or her readings pertaining to another moment in time are included as if they came from the attacker. This setting (which does not have any real-world meaning) is useful to confirm the hypothesis that a real attacker should be detected sooner, as a natural consequence of the accuracy results. In other words, a user is expected to be more similar to himself or herself than to anyone else.
5.2.2. Immediacy Assessment
There are three factors that can have an impact on the immediacy, namely the difference between user-vs.-attacker and user-vs.-user settings, the sensor(s) at stake and the choice of k. Each issue is analysed in the following.
The results of
in the user-vs.-attacker and user-vs.-user situations are shown in
Figure 3a–c. As can be seen, the results for AL and ALB confirm expectations: regardless of the chosen
, the system detects robbery more easily when the robber is
instead of
, thus in a user-vs.-attacker setting. This difference is more important for higher
. For instance, for AL and
, the median decreases from 1000 s in a user-vs.-user situation, to 200 s in a user-vs.-attacker situation. For ALB and
, detection is performed in 450 s and 950 s in user-vs.-attacker and user-vs.-user situations, respectively. In contrast, in the case of B, both situations are almost equivalent. For instance, for
, detections are carried out at 141 s and 150 s in user-vs.-user and user-vs.-attacker situations, respectively. In practice, this means that the system is as good at distinguishing between
in
as between
in
at different times in the case of B, while, for AL and ALB, the system is better at spotting the attacker, i.e., distinguishing between
in
.
With respect to the type of sensor at stake,
Figure 4a–c shows the results considering different
values. In general, B works better and robberies are detected faster. When
, the median is around
s, while AL and ALB are close to
s. With the increase of
, the difference increases, and ALB outperforms AL. This means that B adds useful information to discriminate between users. Then, when
, using B, the device will detect the attacker after
s on median,
s in the case of ALB and after
s using AL. This difference is higher for
–using B robbery is detected in
s, and
= 750 s and
s for AL and ALB, respectively. In addition, it must be noted that quartile 3 (i.e., 75% boundary) for AL and ALB is beyond
s. Therefore, the use of the device (i.e., information from B sensor) is more decisive than the environmental conditions (i.e., information from AL sensors) to distinguish between users.
Last but not least, the impact of
k for
is shown in
Figure 5a–c. As can be seen, smaller values work better regardless of the sensor and
. Intuitively, this can be a direct consequence of the inner workings of KNN: greater values of
k lead to a higher probability of having readings from
as part of such
k points. If a majority is reached when values from
are evaluated, the incorrect category is chosen. Differences are higher when
increases. For instance, for AL when
, the difference between the medians of every
k is 300 s.
5.2.3. Usability Assessment
There are two issues that have an impact on usability. On the one hand, the value of the usability parameter itself, which determines how strictly protective measures such as self-blockage are applied. On the other hand, the amount of storage required for the mechanism to work properly. The first may cause the system to detect the attacker sooner at the expense of improper blockages when the legitimate user is holding the device. The second determines whether this mechanism can be applied in regular devices or requires greater storage capabilities. Both issues are studied below.
In order to analyse the effect of
, previous
Figure 3,
Figure 4 and
Figure 5 have to be revisited.
means that the first time the system concludes that the user is
, it will activate its protection. In practice, the best achievable
would be 5 s for B and 10 s for AL and ALB, respectively. As can be seen in these figures, this optimal value is never achieved. Increasing
to 20, the minimum detection time is
for B and
for AL and ALB. In practice,
is far beyond that value. However, this is expected to make improper self-blockages less likely: now, 20 consecutive errors (i.e.,
or
being recognized as
or
, depending on the type of sensor) should happen.
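The relation between the usability parameter and the detection time can be sketched as follows, assuming that protection is triggered once a given number of consecutive readings are classified as not belonging to the legitimate user. The function name `detection_time`, the label strings and the reset-on-match behaviour are assumptions for illustration; the exact rule is the one defined in Section 4.1.

```python
def detection_time(predictions, threshold, sampling_period_s):
    """Seconds until `threshold` consecutive readings are labelled 'attacker'.

    `predictions` lists the classifier outputs from the moment of the robbery.
    """
    consecutive = 0
    for index, label in enumerate(predictions, start=1):
        # Assumption: a single 'user' prediction resets the counter.
        consecutive = consecutive + 1 if label == "attacker" else 0
        if consecutive >= threshold:
            return index * sampling_period_s
    return None  # protection never triggered


# Best case: every reading after the robbery is recognised as the attacker.
print(detection_time(["attacker"] * 20, threshold=20, sampling_period_s=5))
print(detection_time(["attacker"] * 20, threshold=20, sampling_period_s=10))
```

Under this assumption, a threshold of 20 cannot yield a detection time below 20 × 5 = 100 s for B or 20 × 10 = 200 s for AL and ALB, and every misclassification after the robbery pushes the detection further back.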
Regarding the effect of the required storage, it must be noted that it is preferable to require a minimal amount of storage to ensure the suitability of the approach for regular user devices. Before turning to the experimental results, it is important to take into account some relevant details of the DSM technique at stake. In this adapted version of KNN, only a subset of readings is kept in memory: a sliding window is applied to limit the storage space. This window follows a first-in, first-out policy, so the oldest reading is discarded to make room for the freshest one. Therefore, the storage size can be regarded as a synonym for the window size. In turn, KNN considers the
k nearest neighbors in order to assign the label for the reading at stake (recall
Section 2). Therefore, the probability of assigning a label of one class (i.e., user or attacker) grows with the presence of readings of that class. With these issues in mind, let us consider
Figure 6a–c, in which this matter is analysed. As expected, the bigger the storage size (i.e., the window size), the longer it takes for the system to detect the attacker's presence. Indeed, with a storage of 1000 readings (i.e., around 1.5 h of battery consumption), promising detection rates can be achieved.
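The interplay between the window and the majority vote can be illustrated with a toy sliding-window KNN. This is a simplified sketch and not MOA's implementation: the class name `SlidingWindowKNN`, the scalar readings, the absolute-difference distance and the label names are assumptions made here.

```python
from collections import Counter, deque


class SlidingWindowKNN:
    """Toy KNN over a first-in, first-out window of labelled readings."""

    def __init__(self, k, window_size):
        self.k = k
        # deque(maxlen=...) drops the oldest reading when a new one arrives.
        self.window = deque(maxlen=window_size)

    def learn(self, value, label):
        self.window.append((value, label))

    def predict(self, value):
        # Assumes at least one reading has been learned.
        nearest = sorted(self.window, key=lambda item: abs(item[0] - value))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


knn = SlidingWindowKNN(k=3, window_size=4)
for value in (1.0, 1.1, 1.2):
    knn.learn(value, "user")
knn.learn(5.0, "attacker")
first = knn.predict(5.0)   # user readings still dominate the 3 neighbours
knn.learn(5.1, "attacker")
knn.learn(5.2, "attacker")
second = knn.predict(5.0)  # old user readings have been pushed out
print(first, second)
```

The first prediction is still "user" because two of the three nearest stored readings belong to the user. Only after enough attacker readings enter the window does the vote flip, which is consistent with larger windows and larger k delaying detection.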
Measuring the size in terms of readings offers little information when it comes to determining the suitability of this approach for current devices. To study this aspect,
Table 5 presents the number of bytes that have to be stored per reading for each particular sensor. Values have been established based on the worst case considering all managed data, and
separators (e.g., commas) correspond to the elements used to separate the stored values within each reading. For instance, given the storage of 1000 readings, in the case of the battery, 1000 × 26 = 26,000 B (26 kB) have to be stored. Another example is ALB: in this case, 1000 × (26 + 6 + 380) = 412,000 B (412 kB) should be stored. As the average size of a photo in a smartphone such as the iPhone 6S is 2.59 MB [
29], the size of the sensor readings to be stored in the device is completely reasonable, i.e., much smaller than the size of a single picture. According to these results, storage requirements are affordable for current smartphone devices.
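The storage arithmetic can be checked with a few lines. The per-reading sizes (26 B for the battery and 26 + 6 + 380 B for ALB) are the ones quoted in the text; the names used below and the decimal interpretation of the 2.59 MB photo size are assumptions for illustration.

```python
# Worst-case bytes per stored reading, using the figures quoted in the text.
BYTES_PER_READING = {"B": 26, "ALB": 26 + 6 + 380}
PHOTO_BYTES = 2_590_000  # 2.59 MB, assuming decimal megabytes


def storage_bytes(n_readings, sensor):
    """Bytes needed to keep `n_readings` readings of the given sensor set."""
    return n_readings * BYTES_PER_READING[sensor]


for sensor in ("B", "ALB"):
    size = storage_bytes(1000, sensor)
    print(sensor, size, "B,", round(100 * size / PHOTO_BYTES, 1), "% of a photo")
```

Multiplying out gives 26,000 B for the battery and 412,000 B for ALB, so even the largest configuration in this example stays well below the size of a single photo.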
5.2.4. Readiness Assessment
In order to assess readiness, it is necessary to measure the amount of training data that is required for the mechanism to work, that is, the length of the learning process. In other words, this analysis focuses on measuring the extent to which the proposed mechanism benefits from the authorized user carrying the device for a period of time before the robbery.
To understand the results, it is essential to recall that DSM techniques limit the amount of memory used. This means that only the freshest readings are considered when analysing a new input. Moreover, KNN labels the input with the majority label among its k nearest neighbours. Thus, in practice, no knowledge from the readings removed from memory is considered when analysing a new input. In other words, the system does not build a monotonically incremental model of the user.
Considering these aspects and in line with expectations,
Figure 7a–c show that
is promising after a short user learning process. The best choice is the B sensor, since the detection time is longer with the remaining sensorial sources. However, it must be recalled that B is sampled twice as fast as both the A and L sensors. Going into the details, the detection time remains below two minutes when the system learns from the user for around 6.9 h (i.e., 5000 readings) using the B sensor and a moderate level of usability (i.e.,
). Interestingly, with that usability level as a limit, it can be seen that increasing the length of the learning process (particularly when moving from 5000 to 10,000 readings) does not lead to a significant benefit.
In light of these results, it becomes clear that KNN imposes an effective threshold on the amount of knowledge the system gets from the user. Indeed, it is limited by the number of instances that are kept in memory.
5.3. Discussion: Towards the Best Settings for a Security—Usability Balance
The previous analyses have shown how the different parameters and settings affect each of the goals of the proposal. However, it is necessary to have a global view of which configuration achieves a good trade-off between security and usability.
According to the obtained results, the use of battery readings with KNN and achieves the best results. In particular, after 500 readings for learning from the user (which is equivalent to 2500 s) and requiring the storage of 1000 readings, the mechanism is ready to work properly. In order to keep a balance between security and usability, seems to be the best choice to promote fast detection while reducing the likelihood of improper protection. In this way, robbery can be detected in 150 s for , while in 100 s for .
In addition, though AL and ALB are not the best alternatives, they get their best results with the same setting as the battery, that is, with KNN and k = 3, as well as with 1000 readings. However, should be limited to 10 because is significantly high in more challenging cases. In particular, for ALB (which gets better results than AL), robbery can be detected in 600 s when and in 250 s when .