1. Introduction
Human activity recognition (HAR) is the automatic identification of human actions, enabling a wide range of applications in activity monitoring, fitness tracking, healthcare, and human–computer interaction [1]. Existing HAR approaches can be broadly categorized into camera-based and wearable sensor-based systems. While the former can achieve high accuracy, they raise critical challenges related to privacy, environmental constraints, and computational cost [2]. The latter, by employing inertial measurement units (IMUs), offer a practical alternative thanks to their low cost, portability, energy efficiency, and ability to support continuous monitoring in daily life [3]. Many studies in sensor-based HAR have focused on smartphones and wrist-worn devices (e.g., smartwatches), which provide rich motion data in a convenient manner. IMUs embedded in head-mounted devices, such as earbuds or smart eyewear, have been less explored in the literature, despite offering a promising sensing modality capable of capturing motion patterns that are directly linked to head and upper-body activities.
Established methods for HAR span from classical hand-crafted feature extraction and machine learning (ML) approaches to end-to-end deep learning (DL) models [4], achieving high overall accuracy. HAR systems are generally designed under a closed-set assumption, where the set of activities to be classified is fixed and fully known during the training phase. In real-world scenarios, this assumption is rarely met: users may perform activities not originally included in the training set, either because the set of target activities was incomplete or because entirely new behaviors emerge over time. Such an open-set condition requires classifiers to recognize not only known activities but also samples belonging to previously unseen classes. The open-set recognition (OSR) problem has been extensively investigated for image classification [5], but less effort has been devoted to time-series analysis for HAR [6].
Although OSR has been widely investigated in computer vision, it remains a largely unexplored problem in IMU-based human activity recognition. HAR models process multivariate temporal signals characterized by strong inter-axis correlations, pronounced inter-subject variability, and high overlap between activity classes. Activities such as walking, running, and stair climbing often share similar motion patterns over short temporal windows, resulting in less separable representations. These characteristics make OSR behavior in IMU-based HAR nontrivial and worth exploring.
Two main challenges characterize the application of OSR methods to HAR from wearables: (i) the model must be trained without explicit samples from the unknown classes, making it necessary to detect novelty with a classifier trained only on the distribution of known classes; and (ii) OSR solutions must remain computationally efficient to be deployable on resource-constrained platforms. Many existing OSR approaches, such as adversarial learning or specialized architectures [5], offer strong performance but require complex training pipelines or additional model parameters, which can hinder deployment in embedded systems.
To address these challenges, we present an open-set HAR framework for head-mounted inertial data acquired through smart eyewear. The proposed approach leverages neural classifiers trained on raw time series from a multichannel IMU and addresses OSR through lightweight, post hoc scoring strategies applied to the classifier outputs. Temporal dynamics and inter-axis relationships are modelled at the representation level by a convolutional backbone, while open-set performance is systematically evaluated under identical training conditions through a sound validation protocol. This formulation enables a comprehensive assessment of the robustness and suitability of different lightweight OSR methods for wearable HAR scenarios.
3. Materials and Methods
3.1. Datasets and Sensor Setup
Inertial data were collected (sampling frequency 100 Hz) using a 6-axis IMU sensor (three-axis accelerometer and three-axis gyroscope) embedded in an eyewear frame prototype. The sensor was positioned on the left temple arm of the glasses; further hardware specifications and implementation details of the utilized smart eyewear platform can be found in [
32]. Two experimental protocols were designed: static and dynamic. In the former protocol, thirty healthy volunteers (15 M and 15 F; median age [25th percentile; 75th percentile]: 27 [24.3; 32.3] years) were recruited to perform, in a seated position, a core set of head-related activities, including chewing, drinking, nodding, shaking, and breathing. In the dynamic protocol, a different group of ten subjects (5 M and 5 F; age 27.5 [24.3; 31.5] years) performed other activities, such as speaking, walking, stair climbing, and cycling. All recordings were carried out in a controlled laboratory environment. Each activity lasted between 1 and 5 min; details on the duration of each activity per subject in the static and dynamic protocols are reported in Table 1 and Table 2, respectively. Specific instructions on how to perform each designated activity were given to participants, with no constraints to artificially restrict or modify their natural movements.
All participants provided written informed consent prior to enrollment. The study procedures were conducted in accordance with the Declaration of Helsinki (1975, revised in 2013), and ethical approval was obtained from the Ethics Committee of Politecnico di Milano for both protocols (opinions n. 33/2023 and n. 20/2024 for the static and dynamic protocols, respectively).
Subjects with missing or incomplete data were excluded from further analysis. In particular, in this proprietary dataset 25/30 subjects were retained from the static protocol and 5/10 subjects from the dynamic protocol.
In addition, the publicly available UCA-EHAR dataset was also used for external validation and ease of reproducibility of our findings. This dataset includes inertial signals collected from a 6-axis IMU embedded in smart glasses, sampled at 26 Hz, from twenty subjects (12 M and 8 F, age 30.6 ± 12.0 years) performing a comparable set of daily-life activities. Complete details on such dataset composition and acquisition procedures can be found in [
10]. For the study on the UCA-EHAR dataset, the following activities were selected: walking, running, lying horizontally, stairs, drinking, and still (obtained by merging the original sitting and standing classes).
The inclusion of UCA-EHAR serves to complement the proprietary dataset by enabling external validation on an independent cohort, different sampling frequency, and a distinct activity set. Together, the two datasets allow for the assessment of OSR behavior across heterogeneous recording conditions that are representative of realistic smart eyewear deployments.
3.2. Data Preprocessing
Raw inertial data were segmented using a sliding window approach with a 3 s window and 50% overlap (1.5 s), resulting in 11,231 samples for the proprietary dataset and 13,420 for the UCA-EHAR dataset.
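As an illustration, the sliding-window segmentation described above can be sketched as follows; the function name and the use of NumPy are assumptions for this sketch, not the authors' code.

```python
import numpy as np

def segment(signal, fs=100, win_s=3.0, overlap=0.5):
    """Slice a (n_samples, n_channels) recording into fixed-length
    windows: 3 s at 100 Hz -> 300-sample windows with a 150-sample hop."""
    win = int(win_s * fs)
    step = int(win * (1 - overlap))
    starts = range(0, signal.shape[0] - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

# 60 s of synthetic 6-axis data -> windows of shape (39, 300, 6)
windows = segment(np.random.randn(6000, 6))
```

With 50% overlap, the second half of each window coincides with the first half of the next one.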
Table 3 and
Table 4 report the number of samples available for each activity across subjects for the proprietary dataset and the UCA-EHAR dataset, respectively.
As can be noticed, neither dataset was uniformly distributed across activities. No additional signal processing, filtering, or normalization was applied, allowing the model to learn relevant feature representations directly from the raw sensor data.
3.3. Problem Formulation
Let X ⊆ R^(T×d) denote the space of temporal segments extracted from the inertial sensor data, where T is the number of samples in each segment and d is the number of sensor channels (d = 6 in our study, involving a three-axis accelerometer and a three-axis gyroscope). Each window x ∈ X is associated with an activity label y. In a conventional closed-set HAR scenario, the learning problem assumes a fixed set of activity classes Y_k = {1, …, C} available during training, and the goal is to learn a classifier f_θ: X → Y_k, parametrized by θ, trained on a labelled dataset to minimize the expected classification loss.
In OSR, at test time, the input data may belong to an extended label space Y = Y_k ∪ Y_u, where Y_u denotes the set of unknown classes not observed during training. The OSR problem requires the system to correctly classify the samples from Y_k while detecting when a sample belongs to Y_u. To this end, we associate each input x with an open-set score S(x) to quantify the degree to which x is considered unknown. Higher values of S(x) indicate a higher likelihood that y ∈ Y_u.
In this work, we consider neural networks as classifiers f_θ; thus, S(x) was derived from the C-dimensional logit vector z(x), i.e., the raw outputs of the neural network before the softmax layer. All the proposed methods define S(x) directly on z(x) while leaving the classifier unchanged. The definition of S(x) is described in detail in Section 3.5. This design choice reflects our focus on practical OSR solutions that do not require any modification of the backbone training procedure, thus preserving the standard HAR pipeline and facilitating deployment in resource-constrained wearable devices.
In the following, known samples will refer to those belonging to one of the classes available during training, while unknown samples will refer to those originating from classes for which no representative training data exist.
3.4. HAR Classifier
As the effectiveness of OSR strongly depends on the closed-set performance, the first step is to build a highly accurate classifier under the closed-set assumption [
33]. The proposed classification backbone consists of a two-layer one-dimensional CNN, designed to prioritize efficiency and ease of deployment while maintaining sufficient discriminative power. The neural network is composed of two convolutional blocks (Conv1D–ReLU–MaxPool) with increasing channel dimensionality (16 → 32), followed by adaptive average pooling to aggregate temporal features and layer normalization to obtain a compact 32-dimensional representation. This feature vector is then passed through a fully connected layer with ReLU activation and dropout regularization and finally projected into the class space through a linear output layer. The resulting logit vector z(x) ∈ R^C is used both for closed-set classification, through ŷ = argmax_c z_c(x), and as the basis for the open-set score S(x), which is derived using the methods described in the following sections. A representation of the network architecture is reported in Figure 1.
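The described backbone can be sketched in PyTorch as follows; the kernel sizes, hidden width, and dropout rate are assumptions not specified in the text, so this is an illustrative sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class HARBackbone(nn.Module):
    """Two-block 1D-CNN sketch: Conv1D-ReLU-MaxPool (16 -> 32 channels),
    adaptive average pooling, layer norm, FC head producing C logits."""
    def __init__(self, in_ch=6, n_classes=5, hidden=64, p_drop=0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),          # aggregate temporal features
        )
        self.norm = nn.LayerNorm(32)          # compact 32-d representation
        self.head = nn.Sequential(
            nn.Linear(32, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_classes),     # logit vector z(x)
        )

    def forward(self, x):                     # x: (batch, channels, T)
        h = self.features(x).squeeze(-1)      # (batch, 32)
        return self.head(self.norm(h))        # (batch, C)

# Four 3 s windows of 6-axis data at 100 Hz -> four C-dimensional logit vectors
logits = HARBackbone()(torch.randn(4, 6, 300))
```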
The network was trained exclusively on samples belonging to the known classes using cross-entropy loss with class weights to account for dataset imbalance. Training was performed with the Adam optimizer using a learning rate of 10−3 and weight decay of 10−5 over a maximum of 50 epochs. To monitor generalization performance, the samples belonging to the known classes were split into training and validation sets with an 80/20 ratio. A stratified split was employed to preserve the original class distribution in both subsets, preventing biased validation estimates in the presence of class imbalance. Model selection was based on the macro-averaged F1 score on the validation set, which also drove learning rate scheduling and early stopping. All experiments were made reproducible by fixing random seeds across PyTorch v2.5.1, NumPy v2.0.2, and data loaders.
3.5. Open-Set Recognition Methods
After training the CNN classifier on the closed set Y_k, the open-set score S(x) was defined on the output logits z(x). High values of S(x) indicate that the sample likely belongs to the unknown class set Y_u. The following lightweight OSR scoring strategies were considered, all requiring no modification to the training procedure.
3.5.1. Maximum Logit Score
The first approach is the Maximum Logit Score (MLS), which builds on the intuition that confident predictions correspond to a high activation for one specific class. Given the logit vector z(x), MLS is defined as the maximum component:

MLS(x) = max_{c=1,…,C} z_c(x).

In previous studies [33,34], inputs that do not belong to one of the known classes usually result in a significantly smaller MLS. To maintain a consistent interpretation across all the tested OSR methods, where higher values of the open-set score indicate a higher likelihood of being unknown, the score was defined as the negative of MLS:

S(x) = −MLS(x).

With this convention, samples from known classes generate lower (and negative) scores, whereas samples from unknown classes yield higher scores, thus aligning the distribution of S(x) with the general OSR objective.
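A minimal sketch of the negative-MLS score, assuming NumPy arrays of logits (illustrative, not the authors' implementation):

```python
import numpy as np

def mls_score(z):
    """Negative Maximum Logit Score: S(x) = -max_c z_c(x).
    A strongly activated class yields a low (negative) score; weak,
    flat activations yield a higher score, flagging possible unknowns."""
    return -float(np.max(np.asarray(z, dtype=float)))

s_confident = mls_score([6.3, 0.2, -1.1, 0.5])  # S = -6.3 (known-like)
s_flat = mls_score([0.4, 0.3, 0.5, 0.2])        # S = -0.5 (unknown-like)
```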
3.5.2. Energy-Based
As an alternative to MLS, we also considered the energy score [18], which provides a smoother measure of the confidence associated with a logit vector. Given the logits z(x), the energy function is defined as:

E(x) = −T log Σ_{c=1}^{C} exp(z_c(x)/T),

where T > 0 is a temperature parameter. E(x) tends to be lower for samples of known classes and higher for uncertain predictions, so the OSR score can be defined directly as the energy:

S(x) = E(x).
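Under the definition above (and following the energy formulation of the cited work), the score reduces to a negated log-sum-exp of the logits; a NumPy sketch:

```python
import numpy as np

def energy_score(z, T=1.0):
    """Energy-based open-set score S(x) = E(x) = -T * logsumexp(z/T).
    Known samples with a dominant logit get a very negative score;
    flat, uncertain logits get a higher one."""
    z = np.asarray(z, dtype=float) / T
    m = z.max()
    return -T * (m + np.log(np.exp(z - m).sum()))  # numerically stable logsumexp

s_known = energy_score([7.0, 0.5, -1.0, 0.2])  # approx. -7.0
s_flat = energy_score([0.2, 0.1, 0.3, 0.25])   # approx. -1.6
```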
3.5.3. Nearest-Neighbor Distance Ratio (NNDR)
This method was inspired by the Open-Set Nearest Neighbor (OSNN) classifier proposed by [35]. In their approach, OSR was achieved by extending the traditional nearest-neighbor classifier with the Nearest Neighbor Distance Ratio (NNDR), defined as the ratio between the distance of a test sample to its closest neighbor of the predicted class and the distance to the closest neighbor of a different class. A threshold on this ratio was then used to decide whether the sample should be accepted as known or rejected as unknown. A key advantage of this method is that the OSNN is inherently multiclass, meaning that its efficiency is not affected as the number of available classes increases.
In our setting, this principle was adapted to the logit space produced by the CNN backbone. Given the predicted class ĉ = argmax_c z_c(x) and denoting with Z_c the set of training logits corresponding to class c, we computed:

d_same(x) = min_{z′ ∈ Z_ĉ} ‖z(x) − z′‖,  d_other(x) = min_{c ≠ ĉ} min_{z′ ∈ Z_c} ‖z(x) − z′‖.

The NNDR score is defined as:

S(x) = d_same(x) / d_other(x).

Large values of S(x) indicate that the sample lies relatively far from its own class and closer to a different class, suggesting novelty.
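A logit-space NNDR sketch with NumPy (function and variable names are illustrative assumptions):

```python
import numpy as np

def nndr_score(z, train_logits, train_labels):
    """NNDR in logit space: distance to the nearest training logit of the
    predicted class over the distance to the nearest logit of any other
    class. Values approaching (or exceeding) 1 suggest novelty."""
    z = np.asarray(z, dtype=float)
    c_hat = int(np.argmax(z))
    d = np.linalg.norm(np.asarray(train_logits, dtype=float) - z, axis=1)
    same = np.asarray(train_labels) == c_hat
    return d[same].min() / d[~same].min()
```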
3.5.4. Gaussian Mixture Model (GMM)
This approach aims to directly model the distribution of the logit vectors for the known classes using Gaussian Mixture Models (GMMs). For each class c ∈ Y_k, a single Gaussian component was fitted to the training logits, thus obtaining class-conditional densities N(z; μ_c, Σ_c) with prior weights π_c estimated from the training data. The overall likelihood of a test sample logit z(x) was then computed as:

p(z(x)) = Σ_{c=1}^{C} π_c N(z(x); μ_c, Σ_c).

The open-set score was defined as the negative log-likelihood:

S(x) = −log p(z(x)).

Intuitively, samples consistent with the distribution of the known classes yield low scores, whereas samples lying outside the support of the known data have small likelihood and thus large scores. In our framework, we restrict each class to a single Gaussian component for efficiency and robustness while maintaining the ability to capture class-specific covariance structures.
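The per-class Gaussian mixture can be sketched as below; the covariance regularizer and the SciPy-based density evaluation are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_scorer(train_logits, train_labels, reg=1e-6):
    """One Gaussian component per known class, fitted in logit space;
    the open-set score is the negative log-likelihood under the
    prior-weighted mixture."""
    comps = []
    for c in np.unique(train_labels):
        zc = train_logits[train_labels == c]
        pi = len(zc) / len(train_logits)                    # empirical prior
        cov = np.cov(zc, rowvar=False) + reg * np.eye(zc.shape[1])
        comps.append((pi, zc.mean(axis=0), cov))
    def score(z):
        lik = sum(pi * multivariate_normal.pdf(z, mu, cov)
                  for pi, mu, cov in comps)
        return -np.log(lik + 1e-300)                        # guard log(0)
    return score
```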
3.5.5. Kernel Density Estimation (KDE)
In addition to parametric Gaussian models, a non-parametric approach based on Kernel Density Estimation (KDE) [36] was also explored to model the distribution of logits. For each known class c ∈ Y_k, a Gaussian kernel density estimator was fitted to the corresponding training logits Z_c. The resulting class-conditional densities were combined with empirical class priors π_c to define the overall likelihood of a test logit vector:

p(z(x)) = Σ_{c=1}^{C} π_c p̂_c(z(x)),

where p̂_c indicates the kernel density estimator for class c. Similarly to the GMM approach, the open-set score is given by the negative log-likelihood:

S(x) = −log p(z(x)).

The intuition is that samples yielding a logit vector consistent with the distribution of known-class logits achieve high density under the KDE model, resulting in low scores, whereas unknown samples yield low density and thus high scores. Compared to GMMs, KDE does not impose a parametric form on the class distribution and can therefore capture more flexible logit-space geometries. However, KDE typically requires more training samples than GMM to achieve stable estimates and can be computationally more demanding in high-dimensional settings, in particular when the logit-space dimension C, which also corresponds to the number of classes, is large.
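A sketch of the per-class KDE mixture; the use of SciPy's `gaussian_kde` and its default bandwidth (Scott's rule) are assumptions, since the paper does not specify the bandwidth selection.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_kde_scorer(train_logits, train_labels):
    """Per-class Gaussian KDE over training logits, mixed with empirical
    class priors; the open-set score is the negative log-likelihood."""
    parts = []
    for c in np.unique(train_labels):
        zc = train_logits[train_labels == c]
        parts.append((len(zc) / len(train_logits), gaussian_kde(zc.T)))
    def score(z):
        z = np.asarray(z, dtype=float).reshape(-1, 1)       # (d, 1) query point
        lik = sum(pi * kde(z)[0] for pi, kde in parts)
        return -np.log(lik + 1e-300)
    return score
```

Unlike the GMM scorer, every training logit is retained, so memory and inference cost grow with the training set size, as noted in the text.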
3.5.6. OpenMax
The OpenMax method was originally proposed by Bendale and Boult [
14] as one of the first approaches to OSR in deep neural networks. The core idea is to adjust the network’s activations by estimating the probability that a sample lies far from the distribution of known classes in the activation space. Specifically, for each known class, the Mean Activation Vector (MAV) is computed from correctly classified training samples. The distance of a test activation to its class MAV is then modelled using Extreme Value Theory (EVT): a Weibull distribution is fitted to the tail of the distance distribution for each class. At inference time, OpenMax adjusts the most relevant class activations by estimating how likely it is for a sample to be an outlier. The α-highest activations are selected, and for each of them, an outlier probability is computed using the Weibull model. These activations are then reduced in proportion to this probability, and the removed activation mass is reassigned to an additional unknown class, producing an output vector over
C + 1 categories.
In our framework, this procedure was adopted directly in the logit space of the CNN backbone: per-class MAVs were computed from the training logits, Weibull models were fitted on the extreme distances using the combined Euclidean–cosine metric, and the OpenMax recalibration was applied at inference. The open-set score was defined as the probability assigned to the additional unknown class:

S(x) = p̂_{C+1}(x).
This formulation differs from the original use of OpenMax mainly in two aspects: (i) the activation space was restricted to the logit vectors rather than intermediate deep features, thus keeping the backbone unchanged and compatible with our logit-based framework; (ii) the unknown probability was interpreted as a continuous open-set score rather than applying a fixed decision threshold.
These modifications aligned OpenMax with the general scoring-based formulation adopted throughout this work, thus allowing for a consistent comparison with the other considered OSR methods.
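The logit-space OpenMax procedure can be sketched as follows. This is a simplified sketch: plain Euclidean distance replaces the combined Euclidean–cosine metric, and the tail size and α are illustrative assumptions.

```python
import numpy as np
from scipy.stats import weibull_min

def fit_openmax(train_logits, train_labels, tail=20):
    """Per-class mean activation vectors (MAVs) plus a Weibull fit on the
    largest MAV distances (Extreme Value Theory on the distance tail)."""
    models = {}
    for c in np.unique(train_labels):
        zc = train_logits[train_labels == c]
        mav = zc.mean(axis=0)
        tail_d = np.sort(np.linalg.norm(zc - mav, axis=1))[-tail:]
        models[int(c)] = (mav, weibull_min.fit(tail_d, floc=0))
    return models

def openmax_score(z, models, alpha=2):
    """Recalibrate the alpha highest logits with the Weibull outlier
    probability and return the softmax mass of the extra unknown class."""
    z = np.asarray(z, dtype=float)
    z_hat, unknown = z.copy(), 0.0
    for rank, c in enumerate(np.argsort(z)[::-1][:alpha]):
        mav, (shape, loc, scale) = models[int(c)]
        w = weibull_min.cdf(np.linalg.norm(z - mav), shape, loc, scale)
        w *= (alpha - rank) / alpha            # rank-based damping
        unknown += z[c] * w                    # mass moved to "unknown"
        z_hat[c] = z[c] * (1 - w)
    e = np.exp(np.append(z_hat, unknown) - max(z_hat.max(), unknown))
    return float(e[-1] / e.sum())              # S(x) = p(unknown | x)
```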
3.6. Evaluation Strategy
Two complementary evaluation protocols were conducted to assess the model’s abilities, both to classify known activities and to detect previously unseen activities.
To measure how accurately the CNN can distinguish among the target activities, a five-fold subject-wise cross-validation was employed. This protocol enforces strict subject independence between sets, preventing leakage of individual movement patterns. For the proprietary dataset, each fold was composed of 6 held-out subjects for testing (5 from the static protocol and 1 from the dynamic protocol), with the remaining subjects used for training. For the UCA-EHAR dataset, the same five-fold cross-validation was used across subjects. Considering that each subject performed the activity for a similar duration, the proportion of represented activities in each set was balanced. Closed-set performance was reported using accuracy and macro-F1, the latter being more informative for imbalanced HAR scenarios. The results were then aggregated across folds and presented as mean ± standard deviation.
To evaluate OSR performance, a nested validation scheme was designed combining leave-one-activity-out (LOAO) and leave-one-subject-out (LOSO) cross-validation (
Figure 2). At the outer level (i.e., LOAO), one activity class was held out during training and treated as unknown at test time while the classifier was trained on the remaining activities. At the inner level (i.e., LOSO), data from one subject were excluded during training and used exclusively for testing. Here, the set of activities remained the same, but the system was exposed to previously unseen individuals, thereby testing its ability to generalize across subjects.
For each fold in LOAO and LOSO, the base classifier was trained on the corresponding training set, and the logits were collected on the held-out test partition. The open-set scores, defined in the previous section, were calibrated on the training-set logits, which contained only known classes, and then computed for every test instance. Depending on the held-out condition, each test set naturally contains a mixture of known and unknown samples, with ground-truth labels available for evaluation. This enabled a direct and systematic evaluation of OSR performance across multiple folds.
Discrimination capability was assessed through the area under the receiver operating characteristic curve (AUROC), which summarizes the trade-off between the true positive rate and false positive rate across thresholds.
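Since all considered methods output continuous scores, AUROC can be computed directly from the two score populations. A minimal NumPy sketch using the rank-sum (Mann–Whitney) identity, which ignores ties, is shown below as an illustration:

```python
import numpy as np

def auroc(scores_known, scores_unknown):
    """AUROC as the probability that a randomly drawn unknown sample
    scores higher than a randomly drawn known one (ties ignored)."""
    s = np.concatenate([scores_known, scores_unknown])
    ranks = s.argsort().argsort() + 1.0        # 1-based ranks
    n_k, n_u = len(scores_known), len(scores_unknown)
    r_u = ranks[n_k:].sum()                    # rank sum of unknown scores
    return (r_u - n_u * (n_u + 1) / 2) / (n_k * n_u)
```

Perfect separation (every unknown scoring above every known) yields 1.0, while chance-level scores yield 0.5.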
This evaluation strategy was consistently applied across both the proprietary and the public UCA-EHAR datasets.
5. Discussion
Our results showed that, with the proposed framework, OSR from head-mounted inertial data was both feasible and effective using lightweight, post hoc scoring strategies.
The CNN backbone provided a strong closed-set baseline, achieving accuracy above 90% on both datasets. The network achieved near-perfect recognition for static or head-centric activities such as breathing, nodding, and shaking, confirming that head-mounted IMUs can capture fine-grained motion patterns associated with subtle head and upper-body movements. The most challenging class was speaking, which exhibited both a low average F1 score and high variability across folds. This drop can be attributed to both confusion with similar activities and data scarcity. Indeed, speaking was included only in the dynamic protocol; thus, each training fold contained recordings from four subjects and a total duration of approximately 1 min per subject, considerably shorter than other activities (e.g., 9 min of walking or 15 min of cycling per subject) or static activities (20 subjects in the training fold).
The open-set results revealed that performance is influenced by the choice of the excluded activity, confirming that OSR performance is class-dependent, as some classes are easier to separate from others. For example, in experiments involving the proprietary dataset, when breathing was excluded and treated as unknown, all methods failed to effectively separate it from the known activities. This outcome suggests that breathing acts as a baseline activity, characterized by low-motion patterns that overlap with many other behaviors. Consequently, omitting it from the training set deprives the model of a reference for “neutral” motion, impairing novelty detection. This behavior was not observed on the UCA-EHAR dataset, likely because even when still was excluded, another low-dynamic activity (lying) remained in the training set, effectively serving as a substitute baseline. These findings highlight that the availability of at least one baseline or low-motion class during training appears crucial for stable OSR in HAR when using head-mounted sensors.
Additional class-specific failure modes emerge for highly dynamic activities. In the UCA-EHAR dataset, when running is treated as an unknown class, it is often confused with known activities such as walking and stairs (which also exhibit large acceleration values), leading to confident but incorrect predictions, particularly for confidence-based scoring methods such as the energy score. In these cases, the learned representations for unknown samples fall in regions in the logit space characterized by other activities, resulting in poor open-set discrimination. Overall, these observations indicate that accurate recognition of unknown activities is achieved when known activity classes are sufficiently well separated in the learned representation space, and when baseline or low-motion activities are available during training.
Across both datasets, density-based approaches (KDE, GMM) consistently achieved the most stable AUROC values, indicating that modelling the logit distribution of known classes provides a reliable boundary between familiar and unseen activities. OpenMax also performed competitively, particularly on UCA-EHAR. Simpler confidence-based methods (MLS, energy) are also promising as computationally efficient alternatives.
Taken together, the closed-set and open-set results highlight a relationship between base classification quality and open-set behavior. While in computer vision it is widely acknowledged that an accurate closed-set classifier is needed for open-set recognition [33], in our experiments we found that accurate closed-set performance does not guarantee robust open-set recognition. In fact, the separation boundaries established by the open-set classifier are influenced by the location of all the known activity classes in the latent space. For example, a class with very distinctive patterns (e.g., speaking, an activity during which the head is mostly still) but poorly represented in the training set may still be easily detected as unknown when excluded from the known set, even if its closed-set classification performance is low due to class imbalance. However, when weak or highly variable classes remain part of the known set, they can reduce the stability and separability of learned representations, increasing overlap in the logit space and negatively impacting open-set discrimination. Overall, our results confirm that open-set recognition performance depends not only on closed-set accuracy but also on the stability and separability of learned representations across classes.
Beyond the quantitative performance of individual methods, the results highlight domain-specific challenges that distinguish IMU-based HAR from image recognition, a field where OSR techniques have been extensively explored in the literature. Inertial data are multivariate temporal signals characterized by strong inter-axis coupling, substantial inter-subject variability, and overlap between semantically related activities, such as walking, running, and stair climbing. Moreover, user-specific traits suggest that HAR models need to be easily personalizable. The customary operating conditions of HAR and image classification also differ markedly: the former typically runs on embedded devices, whereas the latter can also rely on more powerful computing hardware. Training datasets are also typically of different sizes. All in all, these characteristics result in substantially different DL models for the two domains and underline the importance of investigating lightweight and portable OSR solutions in HAR.
An important aspect for practical deployment on wearable platforms concerns the computational and memory costs associated with the different OSR strategies. The considered OSR methods exhibit distinct inference time complexities and storage requirements. Confidence-based approaches, such as MLS and energy scoring, operate directly on the logit output and introduce negligible computational overhead, requiring only simple arithmetic operations and no additional stored parameters beyond the classifier weights. Distance- and non-parametric density-based methods require storing training logits and performing comparisons with multiple samples at inference time; in particular, NNDR and KDE incur computational costs and memory usage that scale with the number of stored training examples. In contrast, GMM-based scoring models each known class using a small, fixed set of parameters (mean and covariance), leading to inference complexity that scales with the number of training classes. OpenMax introduces an additional recalibration step based on Extreme Value Theory, involving distance computations and Weibull evaluations for the most activated classes, with both computation and memory linearly scaling with the number of known classes. Overall, while all evaluated methods remain lightweight compared to retraining-based or generative OSR approaches, confidence-based and parametric density-based strategies exhibit the lowest computational and memory overhead, making them particularly suitable for resource-constrained wearable HAR systems.
A further practical aspect for deployment concerns the selection of the decision threshold used to separate known from unknown samples. In this work, thresholds are not fixed a priori but implicitly explored through the AUROC metric, allowing a fair comparison across OSR methods to identify the one yielding the best separation among known and unknown classes (repeated in an LOAO fashion). In real-world wearable scenarios, however, the operating threshold must be selected to control the rate of false rejections of known activities, according to the application requirements. For instance, safety-critical applications may favor conservative thresholds that reduce the risk of accepting unknown activities as known, while operating at low false alarm rates is key in non-critical continuous monitoring scenarios. Importantly, since all considered OSR methods produce continuous scores defined on the logit space, threshold calibration can be performed post hoc using a small validation set representative of the deployment context without retraining the classifier.
6. Limitations and Future Directions
The results presented in this work indicate that reliable OSR performance can be achieved from head-mounted inertial data using lightweight, post hoc scoring methods. Although the open-set decision process does not explicitly incorporate temporal dependencies or inter-axis dynamics, these characteristics are captured at the representation level by the neural classifier trained on raw time-series data from a multichannel IMU. This suggests that a substantial portion of the temporal and multivariate information relevant for novelty detection is already preserved in the classifier outputs and can be effectively exploited by simple scoring rules. Recent studies [
37,
38,
39] on OSR for time-series data have shown that explicitly modelling temporal and multivariate structure within the open-set mechanism, through approaches such as contrastive learning across time or frequency domains, reconstruction-based objectives, or time-series-specific similarity measures, can further improve novelty detection. However, these methods typically introduce additional model components, training objectives, or computational overhead, making them less efficient and less practical than a lightweight OSR module cascaded onto the latent representation.
Building on these observations, a promising direction for future work is to investigate how temporal-aware open-set mechanisms could be integrated with lightweight post hoc scoring strategies with the aim of improving robustness while preserving the efficiency and embeddability required by wearable HAR systems.
A further limitation of this study concerns the scale and distribution of the available data. The proprietary dataset includes a limited number of subjects, particularly for the dynamic protocol, and both the proprietary and public datasets exhibit class imbalance. These characteristics reflect realistic constraints of wearable IMU data collection, where acquisition protocols, user compliance, and activity frequency are inherently heterogeneous. These factors may affect the statistical characterization of class-conditional logit distributions and, in turn, influence the stability of OSR for underrepresented or highly variable activities. Future work will explore the extension of this analysis to larger and more diverse cohorts, as well as strategies to improve robustness to class imbalance in open-set HAR.