UniMiB SHAR: a new dataset for human activity recognition using acceleration data from smartphones

Smartphones, smartwatches, fitness trackers, and ad-hoc wearable devices are increasingly used to monitor human activities. Data acquired by the hosted sensors are usually processed by machine-learning-based algorithms to classify human activities. The success of those algorithms mostly depends on the availability of labeled training data that, if made publicly available, would allow researchers to make objective comparisons between techniques. At present, publicly available datasets are few, often contain samples from subjects with overly similar characteristics, and very often lack the information needed to select subsets of samples according to specific criteria. In this article, we present a new dataset of acceleration samples, acquired with an Android smartphone, designed for human activity recognition and fall detection. The dataset includes 11,771 samples of both human activities and falls performed by 30 subjects of ages ranging from 18 to 60 years. Samples are divided into 17 fine-grained classes grouped into two coarse-grained classes: one containing samples of 9 types of activities of daily living (ADL) and the other containing samples of 8 types of falls. The dataset is stored together with all the information needed to select samples according to different criteria, such as the type of ADL, the age, the gender, and so on. Finally, the dataset has been benchmarked with four different classifiers and with two different feature vectors. We evaluated four different classification tasks: fall vs. no fall, 9 activities, 8 falls, 17 activities and falls. For each classification task we performed a subject-dependent and a subject-independent evaluation. The major findings of the evaluation are the following: i) it is more difficult to distinguish between types of falls than between types of activities; ii) subject-dependent evaluation outperforms the subject-independent one.


Introduction
Nowadays, many people lead a sedentary life because of the conveniences that increasingly pervasive technologies offer. Unfortunately, insufficient physical activity is recognized as one of the 10 leading risk factors for global mortality: people with poor physical activity are subject to a risk of all-cause mortality that is 20% to 30% higher than people performing at least 150 minutes of moderate-intensity physical activity per week [1]. Another important global phenomenon currently affecting our society is population aging: the decline or even decrease of the natural population growth rates due to a rise in life expectancy [2] and to a long-term downtrend in fertility (especially in Europe [3]). Falls are a major health risk that impacts the quality of life of elderly people. Indeed, among elderly people, accidental falls occur frequently: 30% of the population over 65 falls at least once per year, and the proportion increases rapidly with age [4]. Moreover, fallers who are not able to get up are more likely to require hospitalization or, even worse, to die [5].
Thus, research on techniques able to recognize activities of daily living (ADLs), also known as human activities (HA), and to detect falls has been very active in recent years: the recognition of ADLs makes it possible to infer the amount of physical activity that a subject performs daily, while a prompt detection of falls may help in reducing the consequences (even fatal ones) that a fall may cause, mostly in elderly people.
Corresponding author: micucci@disco.unimib.it
ADL recognition and fall detection techniques usually accomplish their task by analyzing samples from sensors, which can be physically deployed in the environment (ambient sensors, e.g., cameras, vibration sensors, and microphones) or worn by people (wearable sensors, e.g., accelerometers, gyroscopes, and barometers) [6]. To train and evaluate their techniques, researchers usually build their own dataset of samples and rarely make it publicly available [7,8,9]. This practice makes it difficult to compare the many newly proposed techniques and implementations objectively, because a common source of data is lacking [10,9,11]. Only very recently, Janidarmian et al. combined 14 publicly available datasets focusing on acceleration patterns in order to conduct an analysis on feature representations and classification techniques for human activity recognition [12]. Unfortunately, they do not make the resulting dataset available for download.
The few publicly available datasets can be primarily divided into three main sets: those acquired by ambient sensors, those acquired by wearable devices, and those acquired by a combination of the two. Recently, a lot of attention has been paid to wearable sensors because they are less intrusive, work outdoors, and are often cheaper than ambient ones. This is confirmed by the increasing number of techniques that are based on wearable sensors (see for example the survey by Luque et al. on fall detection techniques relying on data from smartphones [13]).
Wearable sensors are divided into two main groups: ad-hoc wearable devices (e.g., SHIMMER sensor nodes) and smartphones (e.g., Android). Concerning fall detection, several studies concluded that, in order to be used, fall detection devices must neither stigmatize people nor disturb their daily life [14,15,16]. Unfortunately, ad-hoc wearable devices and ambient sensors are not well accepted by elderly people, mostly because of their intrusiveness. In contrast, smartphones are good candidate devices for hosting fall detection systems: they are widespread and used daily by a very large number of people, including elderly people. This, on the one hand, reduces costs and, on the other, eliminates the problem of having to learn how to use a new device. Moreover, studies demonstrated that samples from smartphone sensors (e.g., accelerometer and gyroscope) are accurate enough to be used in clinical domains, such as ADL recognition [17]. This is also confirmed by the number of publications that rely on smartphones as acquisition devices for fall detection systems [18,13] and ADL recognition.
For these reasons we concentrate our attention on smartphones as acquisition devices for both ADL recognition and fall detection. Thus, we searched for publicly available datasets acquired with smartphones in order to identify their strengths and weaknesses, so as to outline an effective method for carrying out a new acquisition campaign. We searched the most common repositories (IEEE, ACM, Google, and Google Scholar) by using in our queries the terms ADL dataset and Fall dataset in combination with the following words: smartphone, acceleration, accelerometer, inertial, IMU, sensor, and wearable. We selected the first 100 results for each query. After removing duplicate entries, we obtained fewer than 200 different references. Then we manually examined the title, the abstract, and the introduction to eliminate references unrelated to ADL recognition and fall detection, and references that were based on ambient sensors such as cameras, microphones, or RFID tags. We then carefully read the remaining references and discarded those that do not make publicly available the dataset used in the experimentation. Finally, we added the relevant references that we missed with our searches but that were cited in the papers we selected. At the end of this step, we had identified 13 datasets with data from smartphones and 19 with data from wearable ad-hoc sensors. We then included only those datasets that have been recorded starting from 2012, mostly because the oldest dataset including samples from smartphones is dated 2012. This choice makes the datasets homogeneous with respect to the technology of the acquisition sensors, which evolves rapidly year by year.
At the end of the process, we had identified 13 datasets with data from smartphones and 13 with data from wearable ad-hoc sensors. In the following, we detail some relevant characteristics of the 13 datasets from smartphones, since our aim was to build a new dataset containing acceleration patterns from smartphones able to complement the existing ones. As will be presented in Section 2, the datasets from ad-hoc wearable devices have been examined with the aim of identifying the most common ADLs and falls. Table 1 shows the publicly available datasets recorded by means of smartphones and their characteristics. To ease the comparison, the last row of Table 1 also includes the dataset we built.
The total number of datasets decreases to 11 because MobiAct and UCI HAPT are updated versions of MobiFall and UCI HAR respectively. Thus, in the following we will refer to 11 datasets overall, discarding MobiFall and UCI HAR.
The 11 datasets have been recorded in the period 2012 to 2016 (column Year). Only 5 datasets out of 11 contain both falls (column Falls) and ADLs (column ADLs).
The average number of subjects per dataset is 18 (column Nr. of subjects). The datasets that specify the gender of the subjects (which are MobiAct, RealWorld (HAR), Shoaib PA, Shoaib SA, tFall, and UMA Fall) contain on average 6 women and 13 men (columns Gender - Female and Gender - Male respectively).
DMPSBFD, UCI UIWADS, and WISDM do not specify the age of the subjects (column Age). In the remaining 8 datasets, subjects are aged between 21 and 43 on average, with a standard deviation of 4 and 11 respectively.
Finally, only Gravity, MobiAct, RealWorld (HAR), tFall, and UMA Fall datasets provide detailed information about the height and the weight of the subjects (columns Height and Weight respectively).
The detailed information reported in Table 1 has been collected from the web site hosting each dataset, the readme files of each dataset, and the related papers. It is worth noting that in many cases such information gets lost in the downloaded dataset. Grey cells in Table 1 indicate that samples are stored so that they can be filtered according to the information contained in the cell. For instance, in all the datasets, with the exception of tFall, it is possible to select subsets of samples according to the specific ADL (column ADLs). For example, it is possible to select all the samples that have been labeled walking. tFall is an exception because the samples are simply labeled as generic ADL, without specifying which kind of ADL they are.
Concerning falls (column Falls), all the datasets have organized samples so as to maintain the information about the specific type of fall they are related to (e.g., forward).
As specified in column Nr. of subjects, the samples are linked to the subjects that performed the related activities and, where provided, falls. This means that in all the datasets (with the exception of Shoaib PA) it is possible to select samples related to a specific subject. However, this information is unhelpful if there is no information on the physical characteristics of the subject. Looking at the double column Gender, only Mobi-
In view of this analysis, only MobiAct, RealWorld (HAR), and UMA Fall allow samples to be selected according to several dimensions, such as the age, the sex, and the weight of the subjects, or the type of ADL. MobiAct and UMA Fall also allow samples to be selected according to the type of fall. Unfortunately, the other datasets are not suitable for some experimental evaluations, for example the evaluation of the effects of personalization in classification techniques [30], which takes into account the physical characteristics of the subjects, that is, operating leave-one-subject-out cross-validation [31].
To further contribute to the worldwide collection of accelerometer patterns, in this paper we present a new dataset of smartphone accelerometer samples, named UniMiB SHAR (University of Milano Bicocca Smartphone-based Human Activity Recognition). The dataset was created with the aim of providing the scientific community with a new dataset of acceleration patterns captured by smartphones to be used as a common benchmark for the objective evaluation of both ADLs recognition and fall detection techniques.
The dataset has been designed keeping in mind, on the one hand, the limitations of the currently available public datasets and, on the other, the characteristics of MobiAct, RealWorld (HAR), and UMA Fall, so as to create a new dataset that sits alongside and complements MobiAct, RealWorld (HAR), and UMA Fall with respect to the data that is missing. Thus, such a dataset would have to contain a large number of subjects (more than the average of 18), with a large number of women (to compensate for MobiAct, RealWorld (HAR), and UMA Fall), with subjects over the age of 55 (to extend the range of UMA Fall), with different physical characteristics (to maintain heterogeneity), performing a large number of both ADLs and falls (to be suitable in several contexts). Moreover, the dataset would have to contain all the information required to select subjects or ADLs and falls according to different criteria, such as, for example, all the women whose height is in the range 160-168 cm, all the men whose weight is in the range 80-110 kg, or all the walking activities of the subjects whose age is in the range 45-60 years.
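The kind of selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout and the field names (`gender`, `height_cm`, `weight_kg`, `age`, `activity`) are hypothetical and are not the format in which the dataset is actually distributed.

```python
def select(samples, **criteria):
    """Keep the samples whose fields satisfy every criterion.

    A criterion is either a single value (exact match) or a
    (low, high) tuple interpreted as an inclusive range.
    Field names are hypothetical, chosen only for this sketch.
    """
    def matches(sample):
        for field, wanted in criteria.items():
            value = sample[field]
            if isinstance(wanted, tuple):
                low, high = wanted
                if not (low <= value <= high):
                    return False
            elif value != wanted:
                return False
        return True

    return [s for s in samples if matches(s)]


# Toy records standing in for annotated samples.
samples = [
    {"gender": "F", "height_cm": 165, "weight_kg": 58, "age": 30, "activity": "Walking"},
    {"gender": "M", "height_cm": 180, "weight_kg": 85, "age": 50, "activity": "Walking"},
    {"gender": "F", "height_cm": 172, "weight_kg": 70, "age": 22, "activity": "Running"},
]

women_160_168 = select(samples, gender="F", height_cm=(160, 168))
men_80_110 = select(samples, gender="M", weight_kg=(80, 110))
walking_45_60 = select(samples, activity="Walking", age=(45, 60))
```

Such queries are only possible when every sample keeps a link to the subject and to the subject's physical characteristics, which is the requirement stated above.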
To fulfil those requirements, we built a dataset including 9 different types of ADLs and 8 different types of falls. The dataset contains a total of 11,771 samples describing both activities of daily living (7,579) and falls (4,192) performed by 30 subjects, mostly females (24), of ages ranging from 18 to 60 years. Each sample is a vector of 151 accelerometer values for each axis. Each accelerometer entry in the dataset maintains the information about the subject that generated it. Moreover, each accelerometer entry has been labeled by specifying the type of ADL (e.g., walking, sitting, or standing) or the type of fall (e.g., forward, syncope, or backward).
We benchmarked the dataset by performing several experiments. We evaluated four classifiers: k-Nearest Neighbour (k-NN), Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Random Forest (RF). Raw data and magnitude have been considered as feature vectors. Finally, for each classification task we performed a subject-dependent (5-fold cross-validation) and a subject-independent (leave-subject-out) evaluation. Results show how challenging the proposed dataset is with respect to a set of classification tasks.
The article is organized as follows. Section 2 describes the method used to build the dataset. Section 3 presents the dataset evaluation and Section 4 discusses the results of the evaluation. Finally, Section 5 provides final remarks.

Dataset Description
This section describes the method used to acquire and preprocess samples in order to produce the UniMiB SHAR dataset.

Data acquisition
The smartphone used in the experiments was a Samsung Galaxy Nexus I9250 running Android OS version 5.1.1 and equipped with a Bosch BMA220 acceleration sensor. This is a triaxial low-g acceleration sensor that measures acceleration along three perpendicular axes, with acceleration ranges from ±2g to ±16g and sampling rates from 1 kHz to 32 Hz. The Android OS limits the acceleration range to ±2g with a resolution of 0.004g, and takes samples at a maximum frequency of 50 Hz. However, the Android OS does not guarantee any consistency between the requested and the effective sampling rate. Indeed, the acquisition rate usually fluctuates during the acquisition. For the experiments presented in this paper, we resampled the signal in order to have a constant sampling rate of 50 Hz, which is commonly used in the literature for activity recognition from data acquired through smartphones [24,23,25]. At each time instant, the accelerometer signal is a triplet of numbers (x, y, z) that represents the accelerations along each of the 3 Cartesian axes.
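The resampling step can be illustrated with a minimal sketch. The text does not state which interpolation scheme was used, so linear interpolation is an assumption here, as are the function name `resample_uniform` and the convention that timestamps are in seconds and strictly increasing.

```python
from bisect import bisect_right


def resample_uniform(timestamps, values, rate_hz=50.0):
    """Linearly interpolate irregularly timed samples onto a uniform grid.

    Assumes timestamps are in seconds and strictly increasing.
    Linear interpolation is an assumption of this sketch.
    """
    if len(timestamps) != len(values) or len(timestamps) < 2:
        raise ValueError("need at least two (timestamp, value) pairs")
    step = 1.0 / rate_hz
    start, end = timestamps[0], timestamps[-1]
    # Small epsilon guards against floating-point error in the division.
    n_out = int((end - start) / step + 1e-9) + 1
    out = []
    for i in range(n_out):
        t = start + i * step
        # Index of the last input sample at or before t, clamped so that
        # a right-hand neighbour always exists.
        j = min(bisect_right(timestamps, t) - 1, len(timestamps) - 2)
        t0, t1 = timestamps[j], timestamps[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out


# Jittery timestamps, as produced by a fluctuating acquisition rate.
ts = [0.0, 0.015, 0.045, 0.06, 0.1]
x_axis = [2.0 * t for t in ts]
x_50hz = resample_uniform(ts, x_axis)
```

In practice the same function would be applied independently to the x, y, and z components of each triplet.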
We also used the smartphone's built-in microphone to record audio signals with a sampling frequency of 8,000 Hz, which are used during the data annotation process.
The subjects were asked to place the smartphone in their front trouser pockets: half of the time in the left one and the remaining time in the right one.
Acceleration triplets and the corresponding audio signals have been recorded using a mobile application specially designed and implemented by the authors, which stores the data in two separate files in the memory of the smartphone.
Concerning ADLs, Figure 1 shows the most common ones across the 24 datasets we analyzed (11 with samples from smartphones, summarized in Table 1, and 13 with samples from wearable ad-hoc devices listed above). The y axis represents the number of datasets that include the specified ADL. ADLs are grouped by category. The following categories have been identified by analyzing the datasets: Context-related, which includes activities that in some way deal with the context (e.g., Stepping in a car); Motion-related, which includes activities that imply some kind of physical movement (e.g., Walking); Posture-related, which includes activities in which the person maintains the position for a certain amount of time (e.g., Standing); Sport-related, which includes any kind of activity that requires a physical effort (e.g., Jumping); and Others, which includes activities that are present in one dataset only (e.g., Vacuuming in category Housekeeping-related). The Jogging and Running activities deserve a clarification. In the datasets we analyzed they are mutually exclusive, that is, datasets that contain Running do not contain Jogging and vice versa. The datasets REALDISP and MHEALTH are an exception because they include both activities. These datasets, besides being produced by the same institution, are primarily oriented towards the recognition of physical activities (warm up, cool down, and fitness exercises). Moreover, none of the datasets analyzed specifies exactly what the Jogging and Running activities refer to. Thus, even though they may be considered very similar activities, we have decided to keep them separate in order not to lose their specificity. We classify Jogging as a sport-related activity (in the sense, for instance, of jogging in the park), and Running as a motion-related activity (in the sense, for instance, of running for the bus). For each category, the x axis shows all the ADLs we found that are present in at least 2 datasets.
Under the label Others fall all the ADLs for the corresponding category that have been included in one dataset only (e.g., Walking left-circle in category Motion-related).
Table 2 shows the 9 ADLs we have selected among the most popular ones included in the analyzed publicly available datasets. UniMiB SHAR includes the top 5 most popular Motion-related activities (i.e., Walking, Going upstairs, Going downstairs, Sitting down, and Running). Moreover, we detailed the generic Standing up by including the Standing up from sitting and Standing up from laying activities. Finally, we also included Lying down from standing.
In the Sport-related category, we did not include Jogging, even though it is the most popular activity in its category, because we included the Running activity in the Motion-related category. Instead, we chose the Jumping activity, which is the second most popular one in the Sport-related category.
Our dataset does not include Posture-related activities. Indeed, we were interested in acquiring acceleration data from activities related to movements, both because from them it is possible to estimate the overall physical activity performed by a person and because people are more likely to fall during movements [41].
We do not include ADLs belonging to categories such as Housekeeping-, Cooking-, or Personal care-related (those fall in the Others category in Figure 1), because we are interested in low-order activities of daily living, which include simple activities such as Standing, Sitting down, and Walking, rather than high-order activities of daily living, which include complex activities such as Washing dishes, Combing hair, and Preparing a sandwich. The same holds for context-related activities, which are intended as high-order activities. This choice was also motivated by the fact that these activities are scarcely present in the analyzed datasets (in particular, each activity belonging to the above-mentioned categories is present in only one of the 24 analyzed datasets).
Finally, among the ADLs related to movements, we selected the ADLs most used in literature as demonstrated by the analysis we performed, which is also confirmed by Pannurat et al. in [42].
Concerning falls, we analyzed the DMPSBFD, Gravity, MobiAct, tFall, and UMA Fall datasets from smartphones (see Table 1), and the DLR v2, EvAAL, MMsys, SISFall, UMA Fall, and UR Fall Detection datasets from wearable ad-hoc devices, since they are the only datasets that contain falls. From this set, we excluded DLR v2 and EvAAL because they do not specify the type of fall. Figure 2 shows the most common falls in the resulting 9 datasets we analyzed. The y axis represents the number of datasets that include the specified fall. As with ADLs, falls are grouped by category. Falling backward, Falling forward, and Falling sideward include backward, forward, and sideward falls respectively. The Sliding category can be further specialized to include Sliding from a chair, Sliding from a bed, and Generic sliding, which does not specify details about the type of sliding. Finally, the category Specific fall includes different types of falls that have not been further specialized.
For each category, the x axis shows all the types of falls we found. Under the label Others fall all the falls for the corresponding category that have been included in one dataset only. The Specific fall category is an exception since it includes fall types that are not particularly common in the analyzed datasets.
The choice of which falls to include in the dataset was driven by the following considerations: the number of falls should be comparable to that of the other datasets, and the dataset should include a set of representative types of falls. Thus, having four categories (not considering Sliding, which includes only one type of fall that has been considered by two datasets only), we selected two falls from each of them. In each category, we selected the two most popular falls. The category Falling sideward is an exception since we preferred to choose the two most specific falls instead of including the overly generic Generic falling sideward. Table 3 shows the 8 falls that we selected according to the adopted criterion.
Finally, studies on this topic confirm the falls we selected are common in real-life [43,44,45,24].

Subjects
Thirty healthy subjects were involved in the experiments: 24 women and 6 men. The subjects, whose data are shown in Table 4, are aged between 18 and 60 years (27 ± 12 years), have a body mass between 50 and 82 kg (64.4 ± 9.7 kg), and a height between 160 and 190 cm (169 ± 7 cm). Note that we included more women and older subjects to compensate for what MobiAct lacks.
All the subjects performed both ADLs and Falls. The subjects gave written informed consent and the study was conducted in accordance with the WMA Declaration of Helsinki [46].

Protocols
To simplify the data annotation process, we asked each subject to clap her hands shortly before and after she performed the activity or fall to be recorded. Moreover, to reduce background noise, we asked each subject to wear gym trousers with front pockets.
Concerning ADLs, in order to avoid mistakes by the subjects due to overly long sequences of activities, recordings have been subdivided into the three protocols shown in Table 5. Each protocol has been performed by each subject twice, the first time with the smartphone in the right pocket and the second in the left. Those smartphone positions were chosen both because they are the most natural ones and because they are exactly the positions used in the analyzed references dealing with smartphones.
Protocol 1 includes the Walking and Running activities. We opted for moderate walking and running so as to include even older people. Protocol 2 includes activities related to both climbing and descending stairs, and jumps. In our recordings, we selected straight flights of stairs, and asked each volunteer to perform jumps with a moderate elevation, with little effort, and spaced about 2 seconds apart. Protocol 3 includes ascending and descending activities. The Sitting down and Standing up from sitting activities have been performed with a chair without armrests; the Lying down from standing and Standing up from laying activities have been performed on a sofa. The durations of the activities are on average in line with those reported in [47].
Falls have been recorded individually, always following the pattern of making a start and an end clap (see Table 6). In cases where the volunteer ended in a prone position, the clap was performed by another person to avoid as far as possible any movements that might lead to recording events outside the study. To carry out the simulation safely, a mattress about 15 centimeters high was used. Each fall was repeated six times, the first three with the smartphone in the right pocket, the others in the left. Finally, all falls were simulated, started from an upright standing position, and were self-initiated.

Table 6 (columns: Action, Iteration): Start the registration; put the smartphone in the pocket; clap; fall; clap; pull the smartphone from the pocket; stop the registration.

Segmentation and preprocessing
The audio files helped in the identification of the start and stop time instants for each recorded activity. From the labelled recorded accelerometer data, we extracted a signal window of 3 sec each time a peak was found, that is, when the following conditions were verified: 1. the magnitude of the signal m_t at time t was higher than 1.5g, with g being the gravitational acceleration; 2. the magnitude m_{t-1} at the previous time instant t-1 was lower than 0.
Each signal window of 3 sec was centered around a peak, so subsequent windows are likely to overlap. We adopted this segmentation technique instead of selecting overlapped sliding windows because our dataset is mostly focused on motion-related recognition of ADLs and falls. The choice of a 3 sec window has been motivated by two considerations: i) the cadence of an average person walking is within [90, 130] steps/min [48,49]; ii) at least a full walking cycle (two steps) is preferred in each window sample. Figure 3 shows samples of acceleration shapes. For each activity, we displayed the average magnitude shape obtained by averaging all the subjects' shapes.
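The peak-centred segmentation can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `peak_windows`, the parameter names `high` and `low`, and the decision to drop windows that would run past the recording are assumptions. The upper threshold of 1.5g comes from the text; the bound on the previous sample is passed in as `low` rather than hard-coded.

```python
def peak_windows(mag, high, low, rate_hz=50, window_s=3.0):
    """Return (start, end) index pairs of windows centred on detected peaks.

    A peak is detected at index t when mag[t] > high and mag[t - 1] < low.
    `end` is exclusive, so at 50 Hz each window holds 75 + 1 + 75 = 151 samples.
    Windows that would extend beyond the recording are skipped (an assumption).
    """
    half = int(round(window_s * rate_hz / 2))
    windows = []
    for t in range(1, len(mag)):
        if mag[t] > high and mag[t - 1] < low:
            start, end = t - half, t + half + 1
            if start >= 0 and end <= len(mag):
                windows.append((start, end))
    return windows


# Synthetic magnitude signal in units of g, with one spike at index 100.
mag = [1.0] * 100 + [2.0] + [1.0] * 100
windows = peak_windows(mag, high=1.5, low=1.2)

# Slowest cadence in the cited range: 90 steps/min, so a full walking
# cycle (two steps) lasts 2 * 60 / 90 seconds and fits in a 3 s window.
slowest_cycle_s = 2 * 60 / 90
```

Because every qualifying index produces its own window, peaks that are closer than 3 seconds apart yield overlapping windows, as noted in the text.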
Since the device used for data acquisition records accelerometer data with a sampling frequency of 50 Hz, for each activity the accelerometer data vector is made of 3 vectors of 151 values, one for each acceleration direction (a vector of size 1x453 overall). The dataset is thus composed of 11,771 samples describing both ADLs (7,579) and falls (4,192), not equally distributed across activity types. This is because the running and walking activities were performed by the subjects for a longer time than the other activities. Originally, 6,000 time windows of the running activity were found. In order to make the final dataset as balanced as possible, we have deleted about 4,000 samples related to running activities. The resulting sample distribution is plotted in Figure 4, where the samples related to running activities are about 2,000. On our web site we release both datasets, the balanced one and the original one.
We preprocessed the acceleration signal s(t) in order to remove the gravitational component g(t). Since the gravitational force is assumed to have only low-frequency components, we applied a Butterworth (BW) low-pass filter with a cutoff frequency of 0.3 Hz [48]: g(t) = BW(s(t), 0.3). The accelerometer data without the gravitational component is then obtained as the difference s(t) - g(t).
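A minimal sketch of this gravity-removal step is given below. The filter order is not stated in the text, so a first-order Butterworth low-pass (discretized with the pre-warped bilinear transform) is assumed here; the function names and the choice to initialize the filter at steady state are also assumptions of the sketch.

```python
import math


def lowpass_butter1(signal, cutoff_hz=0.3, rate_hz=50.0):
    """First-order Butterworth low-pass filter (bilinear transform, pre-warped).

    The order is an assumption; the source only gives the 0.3 Hz cutoff.
    """
    if not signal:
        return []
    k = math.tan(math.pi * cutoff_hz / rate_hz)
    b = k / (k + 1.0)            # feed-forward coefficients b0 == b1
    a1 = (k - 1.0) / (k + 1.0)   # feedback coefficient
    out = []
    # Start from steady state so a constant input produces no transient.
    prev_x = prev_y = signal[0]
    for x in signal:
        y = b * (x + prev_x) - a1 * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


def remove_gravity(signal, cutoff_hz=0.3, rate_hz=50.0):
    """Return s(t) - g(t), where g(t) is the low-pass filtered signal."""
    gravity = lowpass_butter1(signal, cutoff_hz, rate_hz)
    return [s - g for s, g in zip(signal, gravity)]


# A phone lying still measures only gravity on one axis (in m/s^2).
at_rest = remove_gravity([9.81] * 200)

# A sudden change passes through at first and is then absorbed by g(t).
step = remove_gravity([0.0] * 10 + [1.0] * 5000)
```

The filter is applied to each axis separately. A higher-order or zero-phase filter would separate the two components more sharply, at the cost of a longer implementation.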

Dataset Evaluation
We organized the accelerometer samples in order to evaluate four classification tasks: 1. AF-17 contains 17 classes obtained by grouping all the 9 classes of ADLs and 8 classes of FALLs. This subset makes it possible to evaluate the capability of the classifier to distinguish among different types of ADLs and FALLs; 2. AF-2 contains 2 classes obtained by considering all the ADLs as one class and all the FALLs as one class. This subset makes it possible to evaluate the robustness of the classifier in distinguishing between ADLs and FALLs, whatever the type of ADL or FALL; 3. A-9 contains 9 classes obtained by considering all the 9 classes of ADLs. This subset makes it possible to evaluate how well the classifier distinguishes among different types of ADLs; 4. F-8 contains 8 classes obtained by considering all the 8 classes of FALLs. This subset makes it possible to evaluate how well the classifier distinguishes among different types of FALLs.
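The four tasks differ only in how the 17 fine-grained labels are mapped or filtered, which the following sketch makes explicit. The function name `task_label` and the representation of a sample as an activity name plus an `is_fall` flag are assumptions made for illustration.

```python
def task_label(activity, is_fall, task):
    """Map one sample to its class in a task, or None if the task excludes it.

    `activity` is the fine-grained label and `is_fall` tells whether that
    label is one of the 8 fall types (representation assumed for this sketch).
    """
    if task == "AF-17":
        return activity                       # all 17 fine-grained classes
    if task == "AF-2":
        return "FALL" if is_fall else "ADL"   # two coarse-grained classes
    if task == "A-9":
        return None if is_fall else activity  # ADL samples only
    if task == "F-8":
        return activity if is_fall else None  # fall samples only
    raise ValueError(f"unknown task: {task}")


labelled = [("Walking", False), ("Syncope", True), ("Jumping", False)]
af2 = [task_label(a, f, "AF-2") for a, f in labelled]
a9 = [task_label(a, f, "A-9") for a, f in labelled]
f8 = [task_label(a, f, "F-8") for a, f in labelled]
```

Samples mapped to `None` would simply be dropped before training and testing on the corresponding task.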
We initially evaluated the classifiers by performing a traditional 5-fold cross-validation, that is, all the data have been randomly split into 5 folds. Each fold in turn has been considered as test data and the remaining ones as training data. Results are computed by averaging the results obtained on each test fold. The folds have been obtained by applying stratified random sampling, which ensures that samples of the same subject appear in both the test and the training folds.
To make the dataset evaluation independent from the effect of personalization, we conducted another evaluation by performing a leave-subject-out cross-validation. Each test fold is made of accelerometer samples of one user only, namely the test user, while the training folds contain accelerometer samples of all the other users except the samples of the test user.
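The leave-subject-out split described above can be sketched as a small generator. The function name and the representation of the dataset as a parallel list of subject identifiers are assumptions of this sketch.

```python
def leave_subject_out(subject_ids):
    """Yield (test_subject, train_indices, test_indices), one fold per subject.

    `subject_ids[i]` is the identifier of the subject who produced sample i.
    """
    for test_subject in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != test_subject]
        test = [i for i, s in enumerate(subject_ids) if s == test_subject]
        yield test_subject, train, test


# Six samples produced by three subjects.
folds = list(leave_subject_out([1, 1, 2, 3, 3, 3]))
```

With 30 subjects this yields 30 folds, and no subject ever contributes samples to both the training and the test side of the same fold.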
Previous studies demonstrated that classifiers trained on raw data perform better than classifiers trained on other types of feature vector representations, such as magnitude of the signal, frequency, or energy [43,50]. However, in order to make the experiments comparable with other experiments presented in the state of the art, we considered two feature vectors: 1. raw data: the 453-dimensional patterns obtained by concatenating the 151 acceleration values recorded along each Cartesian direction; 2. magnitude of the accelerometer signal, that is, a feature vector of 151 values.
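The two feature vectors can be built from one sample window as sketched below. The concatenation order (x, then y, then z) is an assumption, since the text only says that the three axes are concatenated; the function names are likewise illustrative.

```python
import math


def raw_feature(x, y, z):
    """Concatenate the three axes into a single 3 * len(x) vector (x, y, z order assumed)."""
    return list(x) + list(y) + list(z)


def magnitude_feature(x, y, z):
    """Per-sample Euclidean norm of the (x, y, z) acceleration triplet."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]


# One synthetic window of 151 samples per axis.
x, y, z = [3.0] * 151, [4.0] * 151, [0.0] * 151
raw = raw_feature(x, y, z)
mag = magnitude_feature(x, y, z)
```

The magnitude vector is three times shorter and independent of the orientation of the phone, but it discards the per-axis information that the raw vector retains.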
We experimented with four different classifiers: 1. k-Nearest Neighbour (k-NN) with k = 1; 2. Support Vector Machines (SVM) with a radial basis kernel; 3. Artificial Neural Networks (ANN). We set up a three-layer feed-forward network with back propagation. The network architecture includes an input layer, a layer of hidden neurons, and an output layer that includes a softmax function for class prediction. The number of hidden neurons n has been set so that n = √(m × k), where m is the number of neurons in the input layer and k is the number of neurons in the output layer, namely the number of classes [51]; 4. Random Forest (RF): bootstrap-aggregated decision trees with 300 bagged classification trees.
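As a worked example of the rule for sizing the hidden layer of the ANN, the sketch below evaluates n = √(m × k) for the feature vectors and tasks used in this paper. Rounding to the nearest integer is an assumption, since the text does not say how a non-integer value is handled.

```python
import math


def hidden_neurons(m, k):
    """Hidden-layer size n = sqrt(m * k), rounded to the nearest integer (rounding assumed)."""
    return round(math.sqrt(m * k))


# m: input size (453 for raw data, 151 for magnitude); k: number of classes.
sizes = {
    ("raw", "AF-17"): hidden_neurons(453, 17),
    ("raw", "AF-2"): hidden_neurons(453, 2),
    ("magnitude", "AF-17"): hidden_neurons(151, 17),
    ("magnitude", "AF-2"): hidden_neurons(151, 2),
}
```

For raw data on AF-17, for instance, m × k = 7701 and its square root is about 87.8, so the hidden layer would have 88 neurons under this rounding.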
All the classifiers have been implemented using the MATLAB Statistics and Machine Learning Toolbox and the Neural Network Toolbox.

Evaluation metrics
As shown in Figure 4, each of the 17 sets containing samples related to a specific activity is different in size. To cope with the class imbalance problem of the dataset we used as metric the macro average accuracy [52].
Given the set E of all the activity types, a ∈ E, NP_a the number of times a occurs in the dataset, and TP_a the number of times the activity a is recognized, MAA (Macro Average Accuracy) is defined by Equation 1:
MAA = \frac{1}{|E|} \sum_{a \in E} Acc_a = \frac{1}{|E|} \sum_{a \in E} \frac{TP_a}{NP_a}    (1)
MAA is the arithmetic average of the accuracy Acc_a of each activity. It allows each partial accuracy to contribute equally to the evaluation.
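A minimal sketch of the metric follows, using the definitions of NP_a and TP_a given above. The function name is illustrative, and taking E to be the set of classes that occur among the true labels is an assumption of the sketch.

```python
from collections import defaultdict


def macro_average_accuracy(true_labels, predicted_labels):
    """Average of the per-class accuracies TP_a / NP_a over the classes present."""
    occurrences = defaultdict(int)  # NP_a: times class a occurs
    recognized = defaultdict(int)   # TP_a: times class a is correctly recognized
    for truth, pred in zip(true_labels, predicted_labels):
        occurrences[truth] += 1
        if truth == pred:
            recognized[truth] += 1
    per_class = [recognized[a] / occurrences[a] for a in occurrences]
    return sum(per_class) / len(per_class)


# Imbalanced toy example: 8 walking samples, 2 fall samples.
truth = ["walk"] * 8 + ["fall"] * 2
pred = ["walk"] * 8 + ["walk", "fall"]
maa = macro_average_accuracy(truth, pred)
plain_accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)
```

In this toy case the plain accuracy is 0.9 while the MAA is 0.75, because the poorly recognized minority class weighs as much as the majority class.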

Results and Discussion
In the following, we discuss separately the results achieved with the traditional 5-fold cross-validation and the leave-subject-out cross-validation.

Subject-dependent evaluation (5-fold evaluation)
The k-fold evaluation is the most widely employed evaluation scheme in the literature [53]. This evaluation considers a training set and a test set made of activity samples performed by all the human subjects. The resulting classifier is subject-dependent and usually exhibits very high performance. Results of the k-fold evaluation scheme (here with k = 5) are shown in Table 7 for raw data and magnitude. Overall, the performance achieved using raw data is better than that obtained using magnitude as feature vector. This confirms a result already reported in previous works [43,50].
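The subject-dependent nature of this scheme can be sketched as follows. The snippet builds a shuffled 5-fold partition over sample indices and checks which subjects appear on both sides of a split. The shuffling, the seed, and the toy data (30 subjects with 10 samples each) are assumptions made for illustration; the text does not state how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold partition."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of similar size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical toy data: 30 subjects with 10 samples each.
subjects = [s for s in range(30) for _ in range(10)]
splits = list(k_fold_indices(len(subjects), k=5))

train, test = splits[0]
# Subjects that contribute samples to both the training and the test fold.
shared = {subjects[i] for i in train} & {subjects[i] for i in test}
```

Because samples are partitioned without regard to who performed them, the same subjects contribute to both training and test folds, which is what makes the resulting classifier subject-dependent.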
The AF-17 recognition task is quite challenging, with a MAA of about 83% in the case of raw data with k-NN, and a MAA of about 66% in the case of magnitude with RF. This means that it is quite difficult to distinguish among types of activities, especially when magnitude is adopted as feature vector. Figure 5 shows the confusion matrix of the k-NN experiment in the case of raw data.
The A-9 classification task is quite easy: the MAA obtained by raw data with RF is about 88%, while the MAA obtained by magnitude with SVM is about 79%. This means that it is quite easy to distinguish between types of activities. Looking at the confusion matrix in Figure 5, the most misclassified pairs of activities are Standing up from laying and Standing up from sitting, Lying down from standing and Sitting down, Going upstairs and Walking, Going downstairs and Walking, Jumping and Going downstairs.
The F-8 recognition task is quite challenging: the MAA is about 78% and 57% in the case of raw data with k-NN and magnitude with RF, respectively. This result suggests that distinguishing among falls is very complicated. The most misclassified pairs of falls are Falling with protection strategies and Generic falling forward, Syncope and Falling leftward, Generic falling backward and Falling backward-sitting-chair, Falling rightward and Falling with protection strategies, Falling rightward and Syncope.
In contrast, the AF-2 recognition task is very easy for all the classifiers and for both raw data and magnitude with a MAA of about 99% achieved with raw data and SVM. These results are similar to those obtained by previous researchers on a similar classification task performed on different datasets [50,24]. This means that it is very easy to distinguish between falls and no falls.
To summarize, F-8 and AF-17 are quite challenging classification tasks. The use of this dataset for those tasks will permit researchers:
• to design and evaluate more robust feature representations as well as more robust classification schemes for human activity recognition;
• to study more robust features to deal with accelerometer samples of different types of falls.

Subject-independent evaluation (leave-subject-out evaluation)
Table 8: Leave-subject-out. Macro Average Accuracy for each classification task using raw data and magnitude of the signal as feature vectors. In bold the best results for each classification task and feature vector employed.
Table 8 shows the results obtained by performing the leave-subject-out evaluation. In this case the training set is made of activity samples of subjects not included in the test set. This evaluation is also known as subject-independent evaluation and shows the feasibility of a real smartphone application for human activity recognition [54,55,56], where data of a given subject are usually not included in the training set of the classifier.
The results show an evident drop in performance with respect to the 5-fold evaluation. Human subjects perform activities in different ways, and this influences the recognition accuracy, especially when it is necessary to distinguish between fine grained types of activities, that is, in the case of the AF-17, A-9 and F-8 recognition tasks. In particular, in the case of AF-17 the best MAA is 56.58% using RF and magnitude. In the case of A-9 the best MAA is 73.17% using RF and raw data. In the case of F-8 the best MAA is 49.35% using SVM and magnitude. In contrast, distinguishing between coarse grained activities, such as falls vs no falls, is quite easy, with a MAA of 97.57% with SVM for both raw data and magnitude. Overall, the magnitude feature vector performs slightly better than raw data. This suggests that using the magnitude as feature vector in subject-independent evaluations could be more reliable than using raw data.
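The leave-subject-out protocol can be sketched as a generator of splits in which no subject appears on both sides. The sketch assumes that exactly one subject is held out per fold, which is the usual reading of the name but is not spelled out in the text; the toy data (30 subjects with 4 samples each) and the function name are hypothetical.

```python
def leave_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold.

    Assumption: one subject per fold, iterated in sorted order.
    """
    for held_out in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train, test


# Hypothetical toy data: 30 subjects with 4 samples each.
subject_ids = [s for s in range(1, 31) for _ in range(4)]
folds = list(leave_subject_out(subject_ids))
```

Each fold trains on 29 subjects and tests on the remaining one, so the reported MAA reflects how well a classifier generalizes to a person it has never seen.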
The low performance achieved in the subject-independent evaluation invites researchers to investigate the following issues:
• the study of a more robust feature vector that is able to reduce as much as possible the performance gap between the subject-dependent and subject-independent evaluations;
• the study of on-line learning classification schemes that, with the use of a few subject-dependent data, improve the performance as much as possible.

Conclusion
Almost all publicly available datasets from smartphones do not allow the selection of samples based on specific criteria related to the physical characteristics of subjects and the activities (and/or falls) they performed. Of the 11 datasets containing smartphone measurements, only MobiAct, RealWorld (HAR), and UMA Fall are exceptions. These three datasets include more men than women. Considering only datasets that include falls (MobiAct and UMA Fall), the maximum age of the subjects is 47 years. Our goal was therefore to create a new dataset that would be complementary to the more complete ones and that also includes falls. The result is the UniMiB SHAR dataset, which includes 9 ADLs and 8 falls performed by 30 subjects, mostly female, with a wide range of ages, from 18 to 60 years.
The classification results obtained on the proposed dataset showed that raw data perform better than magnitude as feature vector in the case of subject-dependent evaluation and that, conversely, magnitude performs better than raw data in the case of subject-independent evaluation. The classification of different types of activities is simpler than the classification of different types of falls. It is very easy to distinguish between falls and no falls for both raw data and magnitude. The subject-independent evaluation showed that recognition performance strongly depends on the subject data.
The UniMiB SHAR dataset will permit researchers to study several issues, such as: i) robust features to deal with falls; ii) robust features and classification schemes to deal with personalization issues.
We are planning to carry out an evaluation of the state-of-the-art techniques for ADL recognition on both UniMiB SHAR and all the publicly available datasets of accelerometer data from smartphones, in order to have an objective comparison. Moreover, we plan to experiment with personalization by using those datasets that include information about the characteristics of the subjects. We want to investigate whether a training set containing samples acquired by subjects with characteristics similar to those of the testing subject may result in a more effective classifier. Finally, we are planning to check if and how data from smartwatches and smartphones can jointly improve the performance of the classifiers. To this end, we are improving the data acquisition application used for UniMiB SHAR.