UCA-EHAR: A Dataset for Human Activity Recognition with Embedded AI on Smart Glasses

Human activity recognition can help in elderly care by monitoring the physical activities of a subject and identifying a degradation in physical abilities. Vision-based approaches require setting up cameras in the environment, while most body-worn sensor approaches can be a burden on the elderly due to the need to wear additional devices. Another solution consists in using smart glasses, a much less intrusive device that also leverages the fact that the elderly often already wear glasses. In this article, we propose UCA-EHAR, a novel dataset for human activity recognition using smart glasses. UCA-EHAR addresses the lack of usable data from smart glasses for human activity recognition. The data are collected from a gyroscope, an accelerometer and a barometer embedded in smart glasses, with 20 subjects performing 8 different activities (STANDING, SITTING, WALKING, LYING, WALKING_DOWNSTAIRS, WALKING_UPSTAIRS, RUNNING, and DRINKING). Results of the classification task are provided using a residual neural network. Additionally, the neural network is quantized and deployed on the smart glasses using the open-source MicroAI framework in order to provide a live human activity recognition application based on our dataset. Power consumption is also analysed when performing live inference on the smart glasses' microcontroller.


Introduction
With the growth of the senior population, elderly care is becoming an important topic in society. One aspect of elderly care is fall prevention, which is still challenging to tackle depending on the subject's health condition. In this context, artificial intelligence can be leveraged to warn of an increased risk. To achieve this goal, one solution consists in monitoring the subject's behaviour to detect changes that could indicate a degradation of their mobility.
Human activity recognition (HAR) can be used for that purpose. In this article, HAR is solved as a machine learning problem that predicts activities of daily living performed by a subject using sensor data that can be of different modalities. Two sensor categories are mainly used for human activity recognition: vision-based and body-worn sensors. Vision-based sensing relies on cameras placed in the environment to capture a video stream of a subject performing activities of daily living [1]. Body-worn sensors rely on inertial measurement units (IMU), including an accelerometer, a gyroscope and sometimes additional sensors (magnetometer, barometer, etc.) to measure the subject's movements. Various devices such as smartphones [2], wearables [3] or application-specific devices [4] can be used to collect data, some being more invasive than others. Body-worn sensors generate less data than cameras and do not require a specific environment setup. They are therefore easier to embed in autonomous devices.
Our approach is based on an inertial measurement unit embedded in smart glasses. Smart glasses are less invasive than some other devices such as dedicated IMU devices or even smartphones, especially for the elderly, for whom wearing glasses is common. However, and to the best of our knowledge, there is no available and usable dataset for human activity recognition based on smart glasses. Moreover, data would vary from one device to another due to sensors having different orientations, ranges, accuracy and sampling rates.
In this article, we present a new dataset [5] called UCA-EHAR with data collected from Ellcie Healthy's smart glasses [6]. Our dataset provides raw data collected from an accelerometer, a gyroscope and a barometer for 8 classes of activity performed by 20 subjects.
Additionally, for privacy, connectivity and latency reasons, all the data processing related to human activity recognition is performed directly on the smart glasses. Therefore, the machine learning algorithm performing the classification task is executed on the smart glasses' microcontroller. In previous works, we presented our MicroAI framework for end-to-end training, quantization and deployment of deep neural networks on microcontrollers [7]. This framework is now available as open-source [8]. In this work, the MicroAI framework is used to deploy a deep neural network model performing human activity recognition on the smart glasses. Quantization with 8-bit and 16-bit fixed-point representations is used to optimize the memory footprint and the inference time, thus reducing the power consumption as well.
Section 2 gives an overview of some of the available datasets and approaches for human activity recognition. Section 3 presents the smart glasses used for collecting data and performing live inference. Section 4 details the dataset and the protocol used to collect the data. Section 5 describes the deep neural network architecture used to classify activities from our dataset as well as the training phase. Section 6 summarizes the key characteristics of our MicroAI framework, such as its quantization and deployment process. In Section 7, classification results using our dataset are given and power consumption on the smart glasses is analysed. Finally, Section 8 concludes this work and discusses future perspectives.

State of the Art
Datasets for human activity recognition using various modalities have been flourishing for the past decade [9]. In this article, we mainly focus on body-worn sensors since vision-based or other environmental sensor approaches are significantly different from the smart glasses approach.
The most iconic dataset for human activity recognition using an inertial measurement unit is likely the Human Activity Recognition dataset hosted by the University of California Irvine, commonly dubbed UCI-HAR [2]. This dataset is built from a 3-dimensional accelerometer and a 3-dimensional gyroscope sampled at 50 Hz, embedded into a smartphone attached to the subject's waist. The acceleration signal is filtered to create an additional signal without gravity. Therefore, there is a total of nine channels of sensor data. The data are windowed over 2.56 s with 50% overlap to create windows of 128 samples. The data are provided in two forms: vectors of 128 samples for each of the nine sensor channels, and vectors of 561 features computed from the 128 × 9 values. A total of 30 subjects participated in the experiments, performing 6 activities: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LYING. A total of 21 subjects are used for training while the 9 others are used for testing, representing 7352 and 2947 vectors, respectively. As will be seen later, some aspects of our dataset, such as some classes and the window duration, are inspired by UCI-HAR.
The UCI-HAR dataset was extended in [10] to provide the transitions between static activities: STAND_TO_SIT, SIT_TO_STAND, SIT_TO_LIE, LIE_TO_SIT, STAND_TO_LIE, LIE_TO_STAND. This SBHAR dataset was used to evaluate the Transition-Aware Human Activity Recognition [11] system along with two other datasets: PAMAP2 and REALDISP.
Instead of using a single smartphone with an accelerometer and a gyroscope, the PAMAP2 dataset [4] uses dedicated IMU devices called Colibri Wireless from Trivisio. One device is placed on the wrist, another one on the chest and a last one on the ankle. Each device contains a 3-dimensional accelerometer, a 3-dimensional gyroscope and a 3-dimensional magnetometer, along with a temperature sensor, all sampled at 100 Hz. Additionally, one heart-rate monitoring device is sampled at 9 Hz. In this dataset, nine subjects performed 12 to 18 activities. This setup is much more intrusive than UCI-HAR as multiple dedicated devices are used at specific locations, making this approach harder to use in real conditions for live human activity recognition.
The REALDISP [12] dataset has an even more complex setup, using 9 IMU devices from Xsens sampled at 50 Hz, each with a 3-dimensional accelerometer, a 3-dimensional gyroscope and a 3-dimensional magnetometer. The IMU devices also provide orientation estimates in quaternion format (4D) [13]. This dataset contains more classes performed by more subjects than PAMAP2: 33 classes and 17 subjects, respectively. Its purpose was to study the impact of sensor placement.
Other popular human activity recognition datasets include UniMiB SHAR [14] containing accelerometer samples captured from a smartphone, Real-Life HAR [15] also collected from a smartphone but focusing on real-life situations (for example inactive, active or driving) rather than a laboratory setting, and OPPORTUNITY [16] that uses many sensors of different modalities.
Apart from these datasets using data collected from smartphones or specific devices, there are few other datasets based on wearables available from the market. We can cite WISDM [3] using a combination of a smartphone and a smartwatch (LG G Watch) to collect data from 51 subjects performing 18 activities. Other datasets for human activity recognition, such as [17] relying on a Microsoft Band 2, have been created from consumer smartwatches. However, these datasets have not been released so far.
More specifically, smart glasses are still not a popular device to use for human activity recognition. Nonetheless, prior work in [18] built a dataset for smart devices including smart glasses. This dataset makes use of Jins MEME smart glasses as well as a smartphone and a smartwatch to collect data from different sensors. The smart glasses provide data from an embedded IMU. This dataset has, however, some noticeable drawbacks. First, only one subject participated in the experiment. Moreover, there is no well-defined set of activities or well-defined protocol, which makes it difficult to evaluate or to extend. Some efforts have been made in [19] to develop a system for activity recognition using smart glasses (Google Glass Explorer Edition XE 22). The authors compare the classification performance of a Support Vector Machine (SVM) between data collected either from a smartphone or smart glasses for 4 activities (Biking, Jogging, Movie Watching, and Video Gaming). Their system can perform inference on the Android smartphone but not on the smart glasses themselves.
However, as stated in the introduction, each dataset has its own characteristics depending on which device has been used. The device itself and its position greatly influence the angle of the acceleration (both gravity and linear acceleration) as well as the signal shape for some movements. Additionally, the sensors themselves can have varying sensitivity and sampling rate. Therefore, using an existing dataset for a different device or application will produce poor classification results. For this reason, we created our own dataset for Ellcie Healthy's smart glasses.

Ellcie Healthy Smart Glasses
Ellcie Healthy (EH) smart connected glasses are a multi-purpose wearable device designed for e-health and road safety applications such as driver drowsiness detection, fall detection for elderly people or human activity recognition to prevent a fall. The Ellcie Healthy smart connected glasses shown in Figure 1 contain infrared proximity sensors embedded inside the rims for oculography purposes. Other sensors such as a barometer, a thermometer, a triaxial accelerometer and a gyroscope are integrated within the frame temples. The accelerometer and the gyroscope are located on the same inertial measurement unit component. The barometric sensor and the temperature sensor are located in another component. The accelerometer provides each component of the three-dimensional acceleration vector along the orthogonal coordinate system shown in Figure 2. When the glasses are placed on a table, for example, most of the acceleration vector modulus (i.e., gravity) is projected onto the Z axis, giving approximately 9.81 m·s⁻². Depending on how the subject is wearing the glasses, the shape of the nose and other physiological factors, gravity may not be perfectly projected onto the Z axis. The frame also includes a 32-bit microcontroller. The STM32L451RE microcontroller from STMicroelectronics has been chosen for its low power consumption while still being versatile. This microcontroller relies on a Cortex-M4F core running at 40 MHz in active mode, alongside 512 KiB of Flash memory and 160 KiB of SRAM. The microcontroller runs a real-time operating system to handle the various concurrent tasks. Additionally, a Bluetooth Low Energy (BLE) transceiver is integrated inside the frame to enable wireless communication with a gateway (typically a smartphone). Finally, a 350 mWh lithium polymer battery placed on the left temple of the frame provides the energy to the whole system using a flat flexible cable. This cable allows energy and data to flow back and forth through the bridge, the rims and the temples. Embedded algorithms, signal processing and data collection can therefore be directly executed on the smart glasses to provide health indicators and/or security information to users. Alerts can be triggered when a risk event (e.g., driver drowsiness) is detected.

UCA-EHAR Dataset
UCA-EHAR is our proposition of a dataset to address the lack of usable data for human activity recognition using smart glasses.
In order to build the UCA-EHAR dataset, we enrolled 20 adult subjects, 8 women and 12 men (mean age 30.6 years, standard deviation 12 years). Adults or children below 1.60 m in height and people with conditions such as limping or backache were excluded.
The choice of activities has been inspired by the UCI-HAR dataset as presented in Section 2. Additionally, these activities are simple to perform, common and relevant for elderly activity monitoring.
STANDING, SITTING, and LYING are static activities where the subject stays in the same position for a given duration. However, the subject does not need to stay completely still, but rather be natural as long as they keep either a STANDING, SITTING or LYING position.
WALKING, WALKING_DOWNSTAIRS, WALKING_UPSTAIRS and RUNNING are dynamic activities associated with mobility. The RUNNING activity is closer to fast walking than to a sprint.
DRINKING is an activity that has been specifically added because we believe dehydration can be a risk for the elderly. The DRINKING activity is performed by drinking from a glass or a bottle, sip by sip.
The composition of the dataset can be seen in Appendix A.

Data Collection Protocol
Each subject was given a table stating the guidelines of the recording. One voice recording per session was acquired. The entire signal recorded during a session can contain multiple status and transition classes as shown in Table 1.
Each data recording corresponds to one session as described in the table. Each session is described with 2 lines that must be read from left to right. The first line indicates the activity, while the second line gives the expected activity duration. Each session is a succession of activities. In order to provide a compact representation of sessions, an activity can be replaced by "repeat x times". In that case, no duration is indicated; it is replaced by the number of the activity to start again from. Subjects did not necessarily repeat the activities as many times as recommended due to time constraints or physical conditions. It is well known that balanced classes can be of prime importance for reaching good accuracy with some families of neural networks. As a transition is by nature shorter than a status class, the number of transition samples is very small compared to the number of status class samples. Even though the transitions are labelled in the dataset, they are not considered meaningful for classification in this article and are therefore filtered out for the classification results.
The recording process is performed using two mobile phones. One phone, running the so-called "research application" from Ellcie Healthy, is connected to the smart glasses through a Bluetooth Low Energy connection. The research application records the accelerometer, gyroscope and barometer samples sent by the smart glasses. The other phone is used to record the voice of the subject. The subject or the test assistant must pronounce the keyword corresponding to the activity that the subject is currently performing.
Examples of recordings of approximately 20 s for each session are shown in Appendix C.

Data Format
The accelerometer, gyroscope and barometer, respectively, have 3 values for acceleration, 3 values for the angular velocity and 1 atmospheric pressure value.
The full sensitivity range is ±2g (g = 9.81 m·s⁻²) for the accelerometer and ±2000 dps (degrees per second) for the gyroscope. The Ellcie Healthy glasses used in this experiment sample the 6 signals from the accelerometer and the gyroscope at a rate of 26 Hz, whereas the barometer is sampled at 6.66 Hz.
Before the labelling process, an interpolation routine has to be executed within the Matlab environment to provide the interpolated atmospheric pressure values for each accelerometer timestamp, so that a merged file containing one timestamp and 7 columns is produced. It is worth noting that the barometer, the gyroscope and the accelerometer share the same sampling time origin. The values are provided in m·s⁻², rad·s⁻¹ and hPa.
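The merging step can be illustrated with a short Python sketch. It is not the authors' Matlab routine: linear interpolation, the handling of timestamps outside the barometer range, and the function name `interpolate_pressure` are all assumptions made for illustration.

```python
from bisect import bisect_right


def interpolate_pressure(baro_t, baro_p, imu_t):
    """Linearly interpolate barometer readings at each IMU timestamp.

    baro_t must be strictly increasing. IMU timestamps that fall outside
    the barometer range reuse the nearest available pressure value
    (an assumption, the source does not describe this case).
    """
    out = []
    for t in imu_t:
        i = bisect_right(baro_t, t)
        if i == 0:
            out.append(baro_p[0])
        elif i == len(baro_t):
            out.append(baro_p[-1])
        else:
            t0, t1 = baro_t[i - 1], baro_t[i]
            p0, p1 = baro_p[i - 1], baro_p[i]
            out.append(p0 + (p1 - p0) * (t - t0) / (t1 - t0))
    return out
```

Because both sensors share the same time origin, the two timestamp lists can be compared directly without any offset correction.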
The voice recording and additional supporting Matlab routines are used to determine the right label for each sample. Files are provided in CSV format with a semicolon as the column delimiter. The files contain one line every 40 ms approximately, with nine columns labelled "T" for the timestamp, "Ax", "Ay" and "Az" for the accelerometer, "Gx", "Gy" and "Gz" for the gyroscope, "P" for the atmospheric pressure and "CLASS" for the activity label. All numeric values are provided with 2 decimals. Finally, the name of the file is a combination of the identifier of the subject and the session name. The identifier of the subjects is numbered T1 to T21; however, T11 is skipped due to not having performed enough activities. Some recordings have been performed in two sessions, in such a case "_1" or "_2" is appended to the filename.
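As an illustration of this format, the following Python sketch parses one recording with the standard library. It assumes that each file starts with a header row carrying the nine column names and that "CLASS" holds a textual label; the function name `read_recording` is hypothetical and the two data rows are made up, not taken from the dataset.

```python
import csv
import io

NUMERIC_COLUMNS = ("T", "Ax", "Ay", "Az", "Gx", "Gy", "Gz", "P")


def read_recording(stream):
    """Parse one recording into a list of dicts, one per time sample."""
    rows = []
    for row in csv.DictReader(stream, delimiter=";"):
        sample = {name: float(row[name]) for name in NUMERIC_COLUMNS}
        sample["CLASS"] = row["CLASS"].strip()
        rows.append(sample)
    return rows


# Made-up excerpt for illustration only (not real dataset values).
example = io.StringIO(
    "T;Ax;Ay;Az;Gx;Gy;Gz;P;CLASS\n"
    "0.00;0.12;-0.34;9.78;0.01;0.00;-0.02;1013.25;STANDING\n"
    "0.04;0.15;-0.30;9.80;0.02;0.01;-0.01;1013.25;STANDING\n"
)
samples = read_recording(example)
```

With a real file, the same function can be called on `open(path, newline="")`.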

Machine Learning for Embedded Classification
In this section, a machine learning method to perform classification on the UCA-EHAR dataset is presented. Our aim is to provide a baseline for classification performance, so that these results can be used by other works for comparison. It is also the model used later on to perform inference for live human activity recognition on the smart glasses.

Data Pre-Processing
As the objective is to perform live inference directly on the smart glasses, the amount of computation done before entering the artificial neural network must be minimized. Consequently, only a windowing pre-processing task is performed. The neural network indeed requires time series, in other words a context around each data point. The windowing process uses windows of 64 time samples, each time sample containing a value for the three accelerometer and gyroscope axes. Each window overlaps the previous one by 25%. Since data are sampled at 26 Hz, each window has a duration of approximately 2.46 s. This is close to the choice made by the authors of the UCI-HAR dataset [2]. The raw data from the dataset have one label per time sample. Time samples in a window may have different labels. During windowing, the labels are reduced to one per window by selecting the label with the highest number of occurrences in the window. Although the barometer data are provided in the dataset, they are not used in the embedded experiments since the barometer is not sampled at the same rate as the accelerometer and gyroscope. To use the barometer data during live inference, the data coming from the sensor would have to be resampled on the smart glasses.
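The windowing and label reduction described above can be sketched in Python as follows. This is a minimal illustration, not the MicroAI implementation: the function name `make_windows` is hypothetical, the stride is derived from the overlap fraction (which is kept as a parameter), and ties between labels are resolved in favour of the label encountered first in the window.

```python
from collections import Counter


def make_windows(samples, labels, window=64, overlap=0.25):
    """Cut a recording into fixed-size windows with one label each.

    samples: list of per-time-step vectors (e.g. 6 IMU values).
    labels:  list of per-time-step labels, same length as samples.
    Returns a list of (window_samples, majority_label) pairs.
    """
    stride = max(1, int(window * (1 - overlap)))
    out = []
    for start in range(0, len(samples) - window + 1, stride):
        chunk_labels = labels[start:start + window]
        # Keep the label with the highest number of occurrences.
        majority = Counter(chunk_labels).most_common(1)[0][0]
        out.append((samples[start:start + window], majority))
    return out
```

At 26 Hz, a window of 64 samples spans 64 / 26 ≈ 2.46 s. Incomplete windows at the end of a recording are dropped in this sketch.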

Train/Test Split
The dataset is split in two parts: one for training and one for testing. There are 14 subjects in the training set and six subjects in the testing set, representing approximately 77% and 23% of the total number of samples, respectively. Subjects number 5, 15, 17, 18, 19 and 20 have been chosen for the testing set since they have completed all activities. Moreover, these subjects have the lowest standard deviation on the percentage of samples for each class in the testing set. Therefore, as seen at the bottom of Appendix A, activities are balanced as much as possible between the training and testing sets.
The training and testing sets contain 563,469 and 170,150 time samples, respectively. After windowing, they contain 35,213 and 10,631 vectors, respectively. The distribution of time samples before windowing by subjects and activities for both the training set and the testing set can be seen in Appendix A.

Data Augmentation
In order to mitigate overfitting and improve generalization, three different data augmentation techniques have been used during training: time shifting, time warping and 3D rotations. Time shifting performs a uniformly distributed random rotation over the time axis in order to shift the centre of the window. Time warping performs a dilation over the time axis in order to speed up or slow down the movement. The dilation scale factor is chosen randomly from a normal distribution with a mean µ = 0 and a standard deviation σ = 0.15. 3D rotation performs a three-dimensional rotation over the three accelerometer and gyroscope axes. The three rotation angles are randomly chosen from a normal distribution with a mean µ = 0 and a standard deviation σ = 0.15.
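Two of these augmentations, time shifting and 3D rotation, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the time shift is implemented as a circular shift, the three angles are interpreted as radians and applied as rotations about X, then Y, then Z, and the same rotation is applied to the accelerometer and gyroscope triplets. All function names are hypothetical. Time warping is left out of the sketch.

```python
import math
import random


def time_shift(window, rng=random):
    """Circularly shift a window by a uniformly drawn number of samples."""
    k = rng.randrange(len(window))
    return window[k:] + window[:k]


def rotation_matrix(ax, ay, az):
    """Rotation about X, then Y, then Z (R = Rz * Ry * Rx), angles in radians."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def rotate3(vec, rot):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(rot[i][j] * vec[j] for j in range(3)) for i in range(3)]


def random_rotation(window, sigma=0.15, rng=random):
    """Rotate the accelerometer and gyroscope axes of every sample.

    Each sample is [Ax, Ay, Az, Gx, Gy, Gz]; one rotation, with angles
    drawn from N(0, sigma), is shared by the whole window.
    """
    rot = rotation_matrix(*(rng.gauss(0.0, sigma) for _ in range(3)))
    return [rotate3(s[:3], rot) + rotate3(s[3:6], rot) for s in window]
```

A rotation leaves the norm of each triplet unchanged, so this augmentation simulates a different orientation of the glasses on the head without altering the intensity of the movement.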

Artificial Neural Network Architecture
A deep neural network is used as the machine learning algorithm. More specifically, a residual neural network has been used as it performed well on the UCI-HAR dataset in previous works [7]. Moreover, this type of network is easy to scale down for embedded hardware by changing the number of filters per convolutional layer. In this work, a one-dimensional ResNetv1-6 [20] network is used to classify time series from our dataset. All convolutional layers have the same number of filters f. The ResNetv1-6 architecture is illustrated in Figure 3. The neural network is trained over 750 epochs using stochastic gradient descent (SGD) with momentum set to 0.9 and weight decay set to 5 × 10⁻⁴. The batch size is set to 768. The initial learning rate is set to 0.025 and divided by 10 at epochs 200, 400, 600 and 675.
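The learning rate schedule can be written as a small Python function. The sketch assumes that epochs are counted from 0 and that each division takes effect from the milestone epoch onwards; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial=0.025, milestones=(200, 400, 600, 675), factor=10.0):
    """Step schedule: divide the rate by `factor` at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return initial / factor ** drops
```

In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.1`; whether the authors relied on it is not stated here.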

Quantization and Deployment of Deep Neural Networks with MicroAI
In order to perform human activity classification in real time on the microcontroller of the smart glasses, our MicroAI framework [7,8] is used. MicroAI is an open-source, end-to-end deep neural network training, quantization and deployment framework mainly targeting microcontrollers. MicroAI is designed as an alternative to other embedded inference engines such as TensorFlow Lite for Microcontrollers [21] and STM32Cube.AI [22]. TensorFlow Lite is complex and hard to extend, while STM32Cube.AI is proprietary. Our framework aims at being more easily extensible and tailored to specific use cases. MicroAI is divided in two parts: a neural network training tool that relies on Keras or PyTorch, and a tool to generate a lightweight and portable C inference library from a trained model. MicroAI enables the quantization of deep neural networks onto 8 or 16 bits in fixed-point representation. Quantization can be done using either Post-Training Quantization (PTQ) or Quantization-Aware Training (QAT).
The general flow for the end-to-end training and deployment process is illustrated in Figure 4. The entire process is automated and based on a configuration file. The process begins with a data preprocessing phase in order to apply transformations such as windowing. Then, training is performed on a workstation using Keras or PyTorch. After the initial training, the model can be quantized with quantization-aware training or post-training quantization. Finally, the model is deployed and evaluated on the microcontroller.

Quantization of Deep Neural Networks
After the initial training phase, the trained model can be quantized to perform inference using a fixed-point data format instead of floating point. Quantization is done after freezing the weights of the model as a post-training quantization step. Optionally, before freezing the weights, the model can be fine-tuned while taking into account the quantization error as a quantization-aware training step. While the values are quantized, a floating-point data type is still used during quantization-aware training. The data type conversion from floating-point to integers using fixed-point representation happens during the generation of the C inference library, both for quantization-aware training and for post-training quantization.
The quantization scheme of MicroAI does not make use of advanced quantization techniques such as non-power-of-two scale factors or asymmetric ranges [23]. Instead, a less complex quantization scheme is used: uniform quantization, per-layer power-of-two scale factor and symmetric ranges. Additionally, biases are quantized the same way as weights. Activations are quantized using a separate scale factor.
As will be shown in Section 7, post-training quantization with 16-bit integers has no impact on accuracy. Moreover, the same fixed-point coding, set to Q7.9 [24] in our case, is used for all layers.
On the other hand, quantizing over 8-bit integers does negatively affect the accuracy. To mitigate the quantization loss, the fixed-point coding can be different between layers and is chosen considering the range of the training set values. In practice, this conversion method starts by finding m, the number of bits required to represent the largest unsigned integer part. In the fixed-point representation, one bit is used for the sign, m bits are used for the integer part and the remaining bits are used for the fractional part. Each floating-point value is then multiplied by 2ⁿ, where n is the number of fractional bits, and cast to an integer, truncating the fractional part. In the following experiments, quantization-aware training is not used since it did not bring a significant improvement over post-training quantization.
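The conversion can be illustrated with a short Python sketch. It is a simplified stand-in for the MicroAI code generator, with hypothetical function names. It assumes the usual fixed-point convention in which the scale factor is `2**n`, n being the number of bits left for the fractional part once the sign bit and the m integer bits are reserved. The final saturation step is an added safety measure that is not described in the text.

```python
def fixed_point_format(values, width):
    """Return (m, n): integer and fractional bit counts of a signed format."""
    max_abs = max(abs(v) for v in values)
    m = int(max_abs).bit_length()  # bits for the largest unsigned integer part
    n = width - 1 - m              # one bit is reserved for the sign
    return m, n


def quantize(values, width):
    """Scale by a power of two, truncate toward zero and saturate.

    Returns the list of integers and n, the number of fractional bits.
    """
    _, n = fixed_point_format(values, width)
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return [min(hi, max(lo, int(v * 2.0 ** n))) for v in values], n
```

For instance, the values 0.5, -1.25 and 3.0 quantized over 8 bits need m = 2 integer bits, which leaves n = 5 fractional bits; a quantized integer q is mapped back to a real value with `q / 2**n`.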

Deployment of Deep Neural Networks on Microcontrollers
With MicroAI, various deep learning models such as multi-layer perceptrons, convolutional neural networks and residual neural networks can be deployed onto microcontrollers.
More generally, MicroAI can deal with the following types of layers: fully-connected, 1D convolution, 1D max pooling, 1D average pooling and element-wise addition. Development is currently ongoing to add support for the 2D variant of these layers. ReLU activation is fused with the previous layer. In order to deploy the model onto an embedded target for inference, a C inference library is generated. For each layer in the graph of the model, a C inference function is generated from a template file. Arrays containing the weights are also generated if applicable. Then the main inference function containing the call chain to the layers and the allocation of their output buffers is generated. Finally, the code is cross-compiled using the GCC compiler with the -Ofast optimization level. MicroAI can optionally make use of the CMSIS-NN [25] library for faster 8- or 16-bit fixed-point inference, taking advantage of the so-called DSP instructions available in the ARMv7E-M instruction set architecture of the Cortex-M4 core. The inference time can then be measured directly on the target by sending input vectors through the virtual serial port and waiting for the output of the deep neural network inference. Alternatively, the C inference library can be included into a third-party firmware, such as the firmware for Ellcie Healthy's smart glasses, in order to perform live inference with real data.

Training and Prediction Results
The residual neural network described in Section 5.4 is trained for 8, 16, 24, 32, 40, 48, 64, and 80 filters per convolution. It is then quantized using the methods described in Section 6.
Concerning the memory used by the parameters, Figure 6 shows that the 16-bit fixed-point model is the most efficient, using half the memory of the 32-bit floating-point model but without any loss of accuracy. On the other hand, the 8-bit fixed-point model is less efficient than the 32-bit floating-point model since a noticeable loss of accuracy can be observed.
The confusion matrix, shown in Figure 7 and extracted from one training for 80 filters per convolution, highlights the difficulty for an artificial neural network to differentiate the SITTING and STANDING activities from the collected data. The reason is that the orientation of the smart glasses remains the same for both classes, and the signals mostly stay constant for both of these motionless activities as seen in Figures A3 and A4.
An evaluation per subject has also been performed and is reported in Figure 8. The training set and the parameters are the same as the ones used for the previous confusion matrix. However, inference is evaluated using each subject of the testing set one by one. It is important to note that since the classes are unbalanced, the accuracy in the "TOTAL" column does not represent the average of each class's accuracy. Instead, it is the accuracy over all the test vectors of a given subject, and classes with more test vectors will have a greater influence on the resulting percentage of correct predictions. For example, for subject T20 the "TOTAL" of 75% is most influenced by the "STANDING" activity, which has many more samples than the other activities and brings the accuracy down. The same applies to the "TOTAL" line, since subjects do not all have the same number of test vectors per class. The bottom right cell, at the intersection of the "TOTAL" line and the "TOTAL" column, represents the accuracy over the entire testing set. Results show a discrepancy between subjects for some activities such as WALKING_DOWNSTAIRS, WALKING_UPSTAIRS and DRINKING, while other activities are more homogeneous. The STANDING activity, however, is hard to classify for all subjects. The reason is a large confusion with the SITTING activity, as previously shown in the confusion matrix.
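The difference between the "TOTAL" accuracy and the average of the per-class accuracies can be illustrated with a few lines of Python. The labels and predictions below are made up for the example and are not results from the dataset; the function name is hypothetical.

```python
def overall_and_per_class_accuracy(y_true, y_pred):
    """Return (accuracy over all vectors, {class: accuracy of that class})."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_class


# Made-up example: a large class that is often misclassified dominates the total.
y_true = ["STANDING"] * 8 + ["DRINKING"] * 2
y_pred = ["STANDING"] * 4 + ["SITTING"] * 4 + ["DRINKING"] * 2
overall, per_class = overall_and_per_class_accuracy(y_true, y_pred)
mean_per_class = sum(per_class.values()) / len(per_class)
```

In this toy case the overall accuracy is 0.6 while the mean of the per-class accuracies is 0.75, because the larger class weighs more in the total.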

Deployment on Smart Glasses
A ResNetv1-6 is integrated into Ellcie Healthy's smart glasses firmware version 6.1.2 using the C inference library generated by MicroAI. In this firmware version, only 77,604 B of Flash (for the inference code and the weights) and 40,572 B of RAM (for the intermediate computation and the layers' output buffers of the deep neural network) can be used. Therefore, these memory limitations constrain the neural network that can be executed on the microcontroller. For the 32-bit floating-point inference, the largest ResNetv1-6 that can be deployed only contains 32 filters per convolution. Since the 16-bit fixed-point quantization provides the best memory efficiency, we also deployed a 16-bit ResNetv1-6 with 48 filters per convolution to get the best possible accuracy on the smart glasses. It is worth noting that the same deep neural network without quantization (i.e., using 32-bit floating point) does not fit in Flash memory.
The memory footprint in Flash and the statically allocated RAM for each configuration are summarized in Table 2.
Table 2. Flash usage and static RAM allocation of the deep neural network (code and data). Columns: Data Type; Optimizations; Flash (available: 77,604 B); RAM (available: 40,572 B); Accuracy.
As expected, 8-bit and 16-bit quantizations reduce both the Flash and RAM usage. Therefore, models with more parameters can be deployed compared to the original 32-bit network. Using a 16-bit quantization, a network with 48 filters per convolution can indeed be deployed on the smart glasses. For this network, almost all the available memory is used: 94.43% of Flash and 98.43% of statically allocated RAM. On the other hand, a maximum of 32 filters per convolution can be used for the 32-bit network. For this network, the available memory is used as follows: 91.89% of Flash and 86.75% of statically allocated RAM. Inference is performed each time 64 samples have been collected by the inertial measurement unit (IMU), whose sampling rate is 26 Hz. As the barometer sampling rate is 6.66 Hz, this sensor is not used in these experiments since resampling the signal would be required.
The power consumption of the smart glasses is measured using a Qoitech Otii Arc laboratory power supply, supplying 3.75 V in place of the LiPo battery. Energy values are computed by the Otii software from the current and voltage over a one-minute window starting from the beginning of an inference. The measurement obtained over one inference period is shown in Figure 9. In Figure 9, the inference task starts at the very beginning of the measurement. After the 173 ms of inference, 64 new samples are collected from the IMU. The figure clearly shows that the inference task requires much less time than collecting 64 samples. Therefore, in this configuration, the inference time does not have a significant impact on the overall energy consumption. Over one inference period (i.e., approximately 2.6 s), the approximately 10,200 nWh consumed is the sum of the energy for the inference (1120 nWh) and the energy to collect the samples (9100 nWh).
Results (see Table 3) show that quantization also helps to reduce the inference time and therefore the energy consumption for one inference. The original 32-bit floating-point network requires 140 ms on average for one inference, while its 16-bit quantized version only takes 88 ms for the same accuracy. Furthermore, the 8-bit quantized version only requires 53 ms but, as seen previously, with a noticeable degradation of accuracy. However, the overall energy consumption over one minute does not significantly change with quantization. The overall energy is reduced by at most 7% between the 32-bit floating-point network and its 8-bit quantized version. As observed in Figure 9, the inference time is indeed small compared to the time required to collect data. For that reason, the impact of inference on the overall energy consumption is small.
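The breakdown of one inference period can be sketched as follows. The sketch assumes, as described for Figure 9, that inference and sample collection happen one after the other; the function names are hypothetical, and the constants are the values quoted in the text.

```python
# Values quoted in the text.
INFERENCE_TIME_S = 0.173      # inference duration visible in Figure 9
WINDOW_SAMPLES = 64           # IMU samples per inference
IMU_RATE_HZ = 26.0            # IMU sampling rate
E_INFERENCE_NWH = 1120.0      # energy of one inference
E_COLLECTION_NWH = 9100.0     # energy to collect 64 samples


def inference_period_s(inference_s: float, n_samples: int, rate_hz: float) -> float:
    """Duration of one cycle: inference followed by sample collection."""
    return inference_s + n_samples / rate_hz


def share(part: float, other: float) -> float:
    """Fraction that `part` represents of `part + other`."""
    return part / (part + other)


period_s = inference_period_s(INFERENCE_TIME_S, WINDOW_SAMPLES, IMU_RATE_HZ)
time_share = INFERENCE_TIME_S / period_s
energy_share = share(E_INFERENCE_NWH, E_COLLECTION_NWH)

print(f"inference period: {period_s:.2f} s")
print(f"inference share of time: {time_share:.1%}")
print(f"inference share of energy: {energy_share:.1%}")
```

Since inference accounts for only a small fraction of both the time and the energy of a cycle, shortening it through quantization can only reduce the per-minute consumption by a few percent.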
Therefore, even if the largest network that fits in memory (48 filters per convolution with 16-bit quantization) is used, the autonomy of the smart glasses would not be significantly impacted as long as the inference execution time remains small compared to the inference period. Indeed, the energy consumption over one minute only grows by 2% when using a 16-bit quantized network with 48 filters per convolution rather than 32 filters per convolution.
Ellcie Healthy's smart glasses embed a 350 mWh battery. Therefore, when the 16-bit quantized network with 48 filters per convolution is used (this network consumes 237 µWh per minute), the autonomy can reach 1476 min, i.e., 24.6 h. This estimated lifetime takes into account neither additional applications that could run concurrently nor battery ageing.
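The autonomy estimate is a simple division of the battery capacity by the per-minute consumption, as the following sketch shows. The function name is hypothetical; the two constants come from the text, and the calculation ignores concurrent applications and battery ageing, as stated above.

```python
BATTERY_CAPACITY_MWH = 350.0        # battery capacity from the text
CONSUMPTION_UWH_PER_MIN = 237.0     # 16-bit network, 48 filters per convolution


def autonomy_minutes(capacity_mwh: float, consumption_uwh_per_min: float) -> float:
    """Ideal battery lifetime in minutes (1 mWh = 1000 µWh)."""
    return capacity_mwh * 1000.0 / consumption_uwh_per_min


minutes = autonomy_minutes(BATTERY_CAPACITY_MWH, CONSUMPTION_UWH_PER_MIN)
hours = minutes / 60.0

print(f"autonomy: {int(minutes)} min, i.e., {hours:.1f} h")
```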
The larger the neural network, the larger the memory footprint and the higher the energy consumption. However, in our case study, the memory footprint is a far more limiting parameter than the energy consumption, which primarily makes artificial intelligence on the smart glasses a memory-bound problem.

Live Human Activity Recognition on Smart Glasses
The ResNetv1-6 model with 48 filters per convolution, 16-bit fixed-point quantization and CMSIS-NN optimizations has been trained using the UCA-EHAR dataset. This network has then been integrated into the smart glasses firmware to perform live human activity recognition. Data are collected from the accelerometer and the gyroscope of the smart glasses when worn by a subject. The smart glasses' microcontroller performs the classification and sends the label of the recognized activity to a computer for visualization over a Bluetooth Low Energy link. Additionally, the accelerometer and gyroscope data are also sent for visualization, even though the classification is not performed on the computer. A 30-second sample of such a live recognition has been extracted and can be seen in Figure 10. In this extract, the following sequence of activities has been performed by the subject: walking downstairs, walking upstairs, walking, stopping in a standing position and finally drinking a sip of water. No quantitative evaluation of the live recognition performance has been done so far. However, qualitatively, the performance follows the results presented in the confusion matrix. Activities such as WALKING, WALKING_DOWNSTAIRS, WALKING_UPSTAIRS and DRINKING are generally recognized properly, while the STANDING and SITTING activities cannot be distinguished properly.
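The control flow of such a live recognition loop can be illustrated with a minimal Python sketch. This is not the firmware code, which relies on the C inference library generated by MicroAI. The sketch assumes non-overlapping windows of 64 samples and six channels (three accelerometer axes and three gyroscope axes, an assumption), and `dummy_classifier` is a hypothetical stand-in for the quantized ResNetv1-6.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

WINDOW_SAMPLES = 64   # samples per inference (from the text)
N_CHANNELS = 6        # assumed: 3 accelerometer axes + 3 gyroscope axes

Sample = Sequence[float]


def live_recognition(
    samples: Iterable[Sample],
    classify: Callable[[List[Sample]], str],
    window: int = WINDOW_SAMPLES,
) -> Iterator[str]:
    """Yield one activity label each time `window` samples are collected."""
    buffer: List[Sample] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == window:
            yield classify(buffer)   # on the glasses, the label is sent over BLE
            buffer = []              # start collecting the next window


def dummy_classifier(window: List[Sample]) -> str:
    """Hypothetical stand-in for the neural network: thresholds mean magnitude."""
    total = sum(abs(value) for sample in window for value in sample)
    mean_abs = total / (len(window) * len(window[0]))
    return "WALKING" if mean_abs > 0.5 else "STANDING"


# Synthetic stream: one quiet window, one active window, one incomplete window.
stream = (
    [[0.0] * N_CHANNELS] * WINDOW_SAMPLES
    + [[1.0] * N_CHANNELS] * WINDOW_SAMPLES
    + [[0.0] * N_CHANNELS] * 10
)
labels = list(live_recognition(stream, dummy_classifier))
print(labels)
```

The trailing incomplete window produces no label, which mirrors the fact that an inference is only triggered once a full window of 64 samples is available.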

Conclusions
In this article, a novel dataset for human activity recognition called UCA-EHAR has been presented. This dataset gathers data collected from the accelerometer, the gyroscope and the barometer of smart glasses. UCA-EHAR is the first publicly available dataset dedicated to human activity recognition on activities of daily living using smart glasses. To provide a comparison baseline for a classification task, we evaluated the performance of a residual neural network on our dataset and provided accuracy results as well as a confusion matrix. The accuracy for this dataset using a floating-point ResNetv1-6 with 80 filters per convolution is 80.2%. However, such a floating-point implementation does not meet the embedded constraints of the smart glasses. Therefore, the neural network has been quantized using 8-bit and 16-bit fixed-point inference in order to reduce the memory footprint and the inference time, and thus the energy consumption. The results show that the 16-bit quantization provides the best trade-off between accuracy and memory efficiency. To illustrate the energy that can be saved by quantization, we deployed a deep neural network onto the smart glasses using our MicroAI framework. We then measured the current and voltage during a human activity recognition task running on the smart glasses. Using the 16-bit quantized network with 48 filters per convolution, we have shown that human activity recognition can run for up to 24 h on the smart glasses. In the future, we will build a dataset including more classes such as transitions (SIT_TO_STAND, STAND_TO_SIT, SIT_TO_LIE, LIE_TO_SIT) or other activities (DRIVING). We would also like to explore unsupervised online learning using this dataset. To do so, collecting data for some subjects over a longer period of time will be required. Preliminary results were already presented in [26] using the UCI-HAR dataset.
Unsupervised online learning will be implemented in our MicroAI framework to automatically train, quantize and deploy a network composed of convolutional layers and unsupervised layers onto the smart glasses.

Funding:
This research is funded by "Université Côte d'Azur", "CNRS", "Région Sud Provence-Alpes-Côte d'Azur" and "Ellcie Healthy".

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of CER (Comité d'Ethique de la Recherche) (protocol code n° 2022-033, date of ethical approval: 8 April).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.5659336 (accessed on 24 November 2021).