Predicting Physical Exercise Adherence in Fitness Apps Using a Deep Learning Approach

The use of mobile fitness apps has been on the rise for the last decade, and especially during the worldwide SARS-CoV-2 pandemic, which led to the closure of gyms and to reduced outdoor mobility. Fitness apps constitute a promising means of promoting more active lifestyles, although their attrition rates are remarkable and adherence to their training plans remains a challenge for developers. The aim of this project was to design an automatic classification of users as adherent or non-adherent, based on their training behaviour over the first three months of app usage, for which purpose we proposed an ensemble of regression models to predict their behaviour (adherence) in the fourth month. The study was conducted using data from a total of 246 Mammoth Hunters Fitness app users. Firstly, pre-processing and clustering steps were taken in order to prepare the data and to categorize users into similar groups, taking into account the first 90 days of workout sessions. Then, an ensemble of regression models, each trained on users belonging to the same cluster, was used to predict user training behaviour during the fourth month and to reach a conclusion regarding adherence status. The approach that combined the affinity propagation (AP) clustering algorithm with long short-term memory (LSTM) networks rendered the best results (87% accuracy and 85% F1-score). This study illustrates the capacity of the system to anticipate future adherence or non-adherence, potentially allowing fitness app creators to pursue advanced measures aimed at reducing app attrition.


Introduction
Physical inactivity and sedentary behaviour have been described as a worldwide pandemic [1]. More than a quarter of the world's adult population is considered to be physically inactive [2] and is consequently at 20-30% increased risk of death, compared to people who are sufficiently active [2]. Inactivity has been associated with at least 35 different chronic conditions [3] and is considered a leading cause of mortality from non-communicable diseases [1,2], accounting for approximately 3.2 million deaths per year [4]. Parallel to health-related concerns, this situation also has a direct global economic cost, estimated at $54 billion per year as of 2013 [4], with an additional $14 billion attributable to lost productivity [5].
Current WHO guidelines recommend that adults undertake 150-300 min of moderate-intensity activity per week, a target met by as little as 25% of the world's adult population [4]. Physical activity should additionally be complemented by regular muscle strengthening, a recommendation met by only 17.3% of adults in Europe [6], and by a reduction in sedentary behaviour [7]. Generally speaking, it is women and people living in high-income countries who account for the worst share [2]. Recent data from Germany has shown that ∼80% of adults fail to meet the combined activity and strengthening guidelines [6], which would seem to be in line with the aforementioned statements. The healthcare industry now seems to have realised the extent of the problem, as reflected in the increasingly frequent prescription of physical activity by healthcare workers [8]. In spite of this, physical activity levels declined by as much as 5% between 2001 and 2016 in high-income countries [2]. In order to bring about a shift in this trend, the WHO has made it a worldwide priority to reduce physical inactivity rates by 15% by the year 2030 [4].
Some authors have pointed to high attrition rates as a key element in explaining increased sedentary behaviour [9]. Given the widespread use of mobile technology, apps constitute a highly feasible means for delivering health interventions and may offer an opportunity to reach that part of the population that is predisposed to starting an exercise programme [10]. Several studies, including some systematic reviews, have attempted to analyse adherence to exercise apps. Nonetheless, in the early stages of our research [11,12], we noticed a remarkable lack of consensus regarding the conceptualisation of this term. Most commonly, adherence has been assessed in clinical research environments involving individuals with health conditions who followed some type of therapeutic intervention. To this extent, an expected "dosage" of exercise is typically prescribed, and adherence is expressed in terms of frequency, with 80-99% of the recommended exercise dosage being considered "satisfactory" or "adherent" [13]. Yang et al. determined that the definition of adherence to physical activity apps could be categorized into four dimensions: (i) frequency of app usage; (ii) intention/motivation to sustain use of the app; (iii) degree of function use within the app; and (iv) duration of app usage [11]. Previous researchers had, however, used the word "engagement" to refer to duration of usage, number of challenges/programmes started and number of physical activity log days [14]. Earlier in time, a meta-analysis by Cugelman et al. concluded that the term adherence was the opposite of attrition and therefore equalled the percentage of people who continue to use the app over time [15]. They further called this "intervention adherence". In contrast, they defined "study adherence" as the proportion of participants remaining in a study at a given time, compared to baseline.
Already back in 2011, authors agreed that coding "intervention adherence" was more challenging as they used different criteria to measure it [15]. Adherence has also been defined as the number of weeks before the participant becomes inactive for an entire week [16], while other authors have chosen to call this "retention" [17].
Our preliminary literature search revealed studies on adherence to PA guidelines, but not specifically on user behaviour in fitness apps. Some of the studies retrieved included biological age prediction, sports performance forecasting and human physical activity recognition [18][19][20][21]. El-Kassabi et al. [19] proposed different DL models to forecast an athlete's performance in sports tournaments and to guide their strategies accordingly to obtain improved results. They were able to evaluate the effectiveness of the athlete's training by predicting their race time results upon completing each additional training. Their results showed that the DL model managed to improve race time prediction accuracy over baseline ML models such as linear regression. In [21], the authors proposed a model to predict physical activity based on an LSTM, with the aim of providing personalized services based on data collected from mobile devices. To provide personal services and to select the model variables, the characteristics and surrounding data circumstances were considered, and some of the variables taken into account in order to provide such customized services included: sleep hours, travel distances, mean heart rate, cadence, weather, mean daily temperature and humidity. The LSTM was trained to learn the dynamic characteristics and to then predict users' physical activity, while model performance was evaluated using the root mean square error (RMSE). The model obtained good mean RMSE results. Similarly, researchers in [20] used an LSTM to process raw data from gyroscope and accelerometer sensors and to classify six activities of daily living. In [18], a deep convolutional LSTM approach was used to estimate biological age in human adults.
Big data techniques could be valuable for the analysis of user behavioural features and could lead to actions that encourage behavioural change. Deep learning (DL) has recently outperformed other machine learning (ML) methods on many fronts, with image recognition, audio classification and natural language processing being just a few of the many examples [22][23][24][25][26]. Modern DL provides a powerful, adaptive and flexible framework for supervised learning and can be considered a way to automate predictive analytics. By adding more components (layers and units within a layer) to the network, DL is able to work with more complex problems [27]. Provided there is a large number of training samples with their respective labels and a sufficiently large and well-designed model architecture, many of the tasks that were traditionally performed by individuals (e.g., mapping input to output vectors) can now be carried out through DL [27]. Each layer hierarchically applies a non-linear transformation to its input and uses what it learns to create an output; iterations then continue until the output reaches an acceptable level of accuracy. Once training has been completed (i.e., back propagation has been repeatedly applied to update the weights in the model, so as to achieve the desired accuracy on the training data), the resulting model can then be used to make predictions on new data to which the model has not previously been exposed. This all means that the DL approach requires much less feature engineering and is less reliant on domain knowledge.
The working hypothesis of our study may be stated as follows: "It is possible to predict training adherence behaviour for a subsample of MH app users, over a given period of time, by processing data from previous training sessions with artificial intelligence algorithms." User behaviour was analysed in terms of training behaviour, whereby a prediction was established for the fourth month by taking into consideration user activity over the first three months. Section 2 explains the materials and methods used to conduct the project, Section 3 describes the experimental setup along with the results obtained, and Section 4 provides a discussion of those results in terms of positive and negative highlights, as well as future directions. Lastly, Section 5 offers the conclusions drawn from this pilot study.

Data Acquisition
This was an observational, retrospective pilot study. The research protocol observed the principles set out in the Declaration of Helsinki, and study approval was granted by the Research Ethics Committee at Ramon Llull University (nr. 1920003P). A total of 777 participants, who voluntarily agreed to participate and granted their informed consent, were recruited from the company Mammoth Hunters S.L. (MH).
MH is a smartphone application that provides physical workouts. Upon registration, users are presented with a series of workouts that can either form part of a predefined training programme or be personalized according to an individual's characteristics and preferences. Users do their workouts by following instructional videos on their screens, with indications regarding exercise performance and the number of repetitions or time per exercise. Upon completion of each workout, a summary of total repetitions and total workout time is displayed on the screen. MH stores all training sessions in its databases, as seen in Figure 1. It additionally sends information to an analytics platform called MixPanel, where additional information (e.g., IP, operating system, phone device) is stored. MH obtained explicit consent from all participants in this research project and ensured all registries remained confidential by sharing only anonymized data for analysis. The description of the data acquired from the MH app is shown in Table 1. A total of 523 individuals met the inclusion criteria (i.e., MH app users older than 18 years). These users were almost equally split between males and females, with an average age of 40 years, an average weight of 71 kg, an average body mass index (BMI) of around 25 and an average body fat of 24%. Almost 85% of the participants were equally split between a desire to lose weight and a desire to increase muscle mass, as shown in Table 2.
Based on previous literature [28][29][30][31][32], a period of four months was established as a determining factor, and only those participants who had been subscribed for at least four consecutive months after enrolment (n = 246) were ultimately considered eligible for the prediction analyses. Table 3 shows the description of the per-month data for the 246 users. From this analysis, it can be seen that the average duration of the training sessions is between 300 and 400 s per day each month. However, little can be concluded from the mean alone, since the standard deviation clearly indicates that the data is very dispersed. This fact is confirmed by the difference between the maximum and minimum values observed for each month. Therefore, in order to gain an idea of the data distribution, we computed the frequency of the data in time intervals, as described in Table 4. Table 4 shows the number of users falling in each range of exercise time per day, over the four months. For example, column one shows that 41 users did not exercise on a daily basis in the first month. It also reflects that the majority of users trained in the ranges of 0 to 300 and 300 to 1800 s in the first month. In other words, most individuals trained less than 30 min per day, and very few trained more than half an hour. It is worth stressing that, on average, most users tended to do less exercise in the fourth month.
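As an illustration of how such a frequency table can be computed, the sketch below bins hypothetical daily workout durations into the ranges discussed for Table 4. The values are simulated stand-ins, not the MH data.

```python
import numpy as np

# Hypothetical per-user average daily workout durations (seconds) for one
# month; the real values come from the MH session logs summarised in Table 3.
rng = np.random.default_rng(0)
active = rng.gamma(shape=2.0, scale=250.0, size=205)
durations = np.concatenate([np.zeros(41), active])  # 41 users with no exercise

# Bin edges mirroring the Table 4 ranges: no exercise, under 5 min/day,
# 5-30 min/day, and over 30 min/day.
edges = [0.0, 1.0, 300.0, 1800.0, np.inf]
counts, _ = np.histogram(durations, bins=edges)
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo}, {hi}) s/day: {n} users")
```

The first bin collects users with essentially zero daily activity, matching the reading of column one in Table 4.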

Proposed Framework
The goal of this study was to develop a model able to predict user adherence (continuation of training sessions) to the MH fitness app at month four. Figure 2 shows a general diagram of our proposed framework, which is further detailed in Figure 3. Tables 5 and 6 show the pseudocode of the algorithms used in this study. Table 5 presents the proposed framework for the training process, which can be summarised as the creation of ensembles of regression models from users who trained similarly during the first three months. Table 6 presents the proposed framework for testing new users, which can be summarised as the determination of adherence or non-adherence for new users who sign up for the MH fitness app: every new user is assigned to a cluster, and the mean output of the ensemble comprising people who exercised similarly during their first three months is returned. The raw data received from the MH app contained user information on age, gender, city, weight, height, body fat, body type, body fat target, individual's goal, profile creation, programmes, current programme and sessions, as shown in Table 1. To meet our study goals, we needed historical data, so as to design an intelligent system that could learn from past workout behaviour and then forecast future behaviour. We selected the session data, given that it provided time-dependent information from the moment the user performed their first training session. The original data contained in this field corresponded to a JSON file with several workout details.

Pre-Processing
In data analytics, and in the case of intelligent systems, it is very important to identify any incorrect or corrupted data, as this could significantly affect the decision-making process. There is already plenty of evidence from fields such as finance, business, health systems and smart cities to suggest that incorrect or inaccurate decisions may have unfortunate consequences [33]. In order to identify corrupted or duplicate data and to build a reliable dataset, several techniques (e.g., data cleaning) could be used. In this paper, a pre-processing stage was first developed, in order to create new variables that allowed us to represent user evolution over time (i.e., determining the length of a workout session from start to finish). Upon completion of the pre-processing stage, a new dataset was built with the duration of training sessions, and so this variable contained the accumulated workout time, in seconds. It should be noted that the information corresponded to the duration of the actual workout session, as opposed to the time spent on the app.
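A minimal sketch of this pre-processing step, assuming hypothetical session records with start/finish timestamps (the actual field names in the MH JSON may differ):

```python
import pandas as pd

# Hypothetical raw session records, mimicking the "sessions" field: each row
# holds a user id and the start/finish timestamps of one workout.
raw = pd.DataFrame({
    "user_id": [1, 1, 2],
    "start":  pd.to_datetime(["2021-01-04 18:00:00",
                              "2021-01-06 18:10:00",
                              "2021-01-04 07:30:00"]),
    "finish": pd.to_datetime(["2021-01-04 18:25:00",
                              "2021-01-06 18:40:00",
                              "2021-01-04 07:55:00"]),
})

# Duration of the actual workout (not time spent in the app), in seconds.
raw["duration_s"] = (raw["finish"] - raw["start"]).dt.total_seconds()

# Accumulated workout time per user and day, the variable used downstream.
daily = (raw
         .assign(day=raw["start"].dt.date)
         .groupby(["user_id", "day"])["duration_s"]
         .sum()
         .reset_index())
print(daily)
```

Aggregating to one value per user per day is what makes the later longitudinal analysis (12 weeks of training data, 4 weeks of prediction) possible.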
Given the differences between users in app registry duration (some users had been active for two years, others for six months, etc.), the next step required was to determine a longitudinal time period. In previous literature, we found that other authors had used periods of 12 or 16 weeks for their PA-related research [28][29][30][31][32] and so, based on their findings, we decided to focus our system on a period of 16 weeks. Of these, the first 12 weeks were used to train the system and the last four weeks were used to predict the workout sessions. A summary of the data can be seen in Table 7. The clustering block represents the approach chosen to categorize users into groups, based on their training frequency during the first three months. Instead of using all the models trained with the entire set of users' sessions, the clustering approach grouped models trained with data from users with similar exercise features.
Three clustering approaches were used: K-means, Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and Affinity Propagation (AP).
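All three algorithms are available through sklearn's common fit/predict interface. The sketch below runs them on synthetic two-group data merely to illustrate the calls; the parameter values are illustrative, not those tuned in this study.

```python
import numpy as np
from sklearn.cluster import KMeans, Birch, AffinityPropagation

# Toy stand-in for the per-user training features (e.g., missed sessions
# per month); the real features are listed in Table 11.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[2, 2, 2],    scale=1.0, size=(50, 3)),  # frequent exercisers
    rng.normal(loc=[20, 22, 25], scale=2.0, size=(50, 3)),  # mostly inactive
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
birch  = Birch(n_clusters=2, threshold=1.5).fit(X)
ap     = AffinityPropagation(damping=0.8, random_state=0).fit(X)

print(len(set(kmeans.labels_)), len(set(birch.labels_)), len(set(ap.labels_)))
```

Note that K-means and BIRCH are told the number of clusters, whereas AP discovers it from the data, which is the property exploited later in the experiments.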
• K-means: K-means is one of the preferred methods among unsupervised ML strategies. It is widely used in manufacturing, education and business [34], and is based on minimizing the sum of distances between each object and the centroid of its cluster. Once the number of clusters (K) is chosen, the first initialisation step is to place K centroids among the data. Samples are then assigned to their closest centroids, after which the positions of the centroids are updated so as to minimize the distances between the elements of each cluster. The assignment and update steps are repeated until no points change clusters or, equivalently, until the centroids stay the same. Assigning a point to the closest centroid requires a proximity measure that quantifies the notion of "closest". Commonly, the Euclidean distance is used, and the goal is then to minimize the objective function given by the squared distance of each point to its closest centroid [35]. The sum of squared errors (SSE) accumulates the error of each point (i.e., the squared Euclidean distance from each point to its closest centroid) [35,36]. The SSE is defined by Equation (1):

SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist(m_i, x)^2, (1)

where dist is the standard Euclidean distance, x is an object, C_i is the i-th cluster and m_i is the centroid (mean) of cluster C_i. What minimizes the SSE of a cluster is its mean, defined by Equation (2):

m_i = (1/n_i) Σ_{x ∈ C_i} x, (2)

where n_i corresponds to the number of objects in the i-th cluster. The basic K-means functionality is described in Table 8.
• BIRCH: This is an unsupervised algorithm. Due to its ability to find a good clustering with only a single scan of the data, it is especially suitable for larger datasets or streaming data [37]. This characteristic was especially relevant to our research, since we expect to obtain a larger dataset in the near future and needed a process that would facilitate the upscaling of our application.
In order to understand how the BIRCH algorithm works, the concept of a cluster feature (CF) needs to be introduced. A CF is a set of three summary statistics representing a single cluster of data points. The first statistic, the count, quantifies how many data values are present in the cluster. The second, the linear sum, is a measurement representing the cluster's location. Finally, the squared sum is the sum of the squared coordinates, representing the spread of the cluster. The last two statistics are equivalent to the mean and variance of the data points [37]. BIRCH is frequently explained in two steps: (1) building the CF tree, and (2) global clustering. Phase 1 — Building the CF tree: firstly, the data is loaded into memory by building a CF tree, for which purpose a sequential clustering approach is used. Thus, the algorithm simultaneously scans and records the data, and then determines whether a point should be added to an existing cluster or a new cluster should be created. Phase 2 — Global clustering: secondly, an existing clustering algorithm is applied to the sub-clusters (the CF tree leaf nodes), so as to assemble these sub-clusters into clusters. This could, for instance, be achieved using the agglomerative hierarchical method.
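The three CF statistics and their relation to the mean and variance can be verified numerically with a small sketch over toy points:

```python
import numpy as np

# A cluster feature (CF) summarises a set of points with three statistics:
# count N, linear sum LS and squared sum SS. The mean and variance can be
# recovered from them without storing the points themselves.
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
N  = len(points)
LS = points.sum(axis=0)          # linear sum: cluster location
SS = (points ** 2).sum(axis=0)   # squared sum: cluster spread

mean = LS / N                    # centroid of the cluster
var  = SS / N - mean ** 2        # per-dimension variance
print(mean, var)
```

This compactness is what lets BIRCH cluster with a single scan: merging two CFs only requires adding their statistics, never revisiting the raw points.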
The basic BIRCH algorithm is described in Table 9.
• Affinity propagation (AP): AP is a clustering algorithm based on the exchange of messages between data points [38]. Unlike other methods, AP does not require the number of clusters as a parameter, and it is commonly used in experiments where many clusters are expected. AP works with three matrices: the similarity matrix, the responsibility matrix and the availability matrix. Similarity matrix: this is the first matrix obtained, and is calculated by negating the sum of the squares of the differences between participants [39]. The elements on the main diagonal of the similarity matrix initially equal 0 (zero), and a value needs to be selected to fill them: the algorithm will converge on few clusters if the selected value is low and, vice-versa, on many clusters if the selected value is high. Responsibility matrix: once the similarity matrix has been calculated, the next step is to calculate the responsibility matrix, given by Equation (3):

r(i, k) ← s(i, k) − max_{k′ ≠ k} { a(i, k′) + s(i, k′) }, (3)
where i corresponds to the number of rows and k to the number of columns in the associated matrix.
Availability matrix: the availability matrix is then calculated. All elements are initially set to zero, and Equations (4) and (5) are then used to calculate the off-diagonal and diagonal elements, respectively:

a(i, k) ← min(0, r(k, k) + Σ_{i′ ∉ {i, k}} max(0, r(i′, k))), for i ≠ k, (4)

a(k, k) ← Σ_{i′ ≠ k} max(0, r(i′, k)). (5)

In essence, Equation (4) sums all the positive responsibility values in a column, except for the value in the row corresponding to the given point. Criterion matrix: finally, the algorithm calculates the criterion matrix, which equals the element-wise sum of the availability matrix and the responsibility matrix, as given by Equation (6):

c(i, k) = a(i, k) + r(i, k). (6)
The highest criterion value of each row designates the exemplar. The pseudocode of AP can be seen in Table 10.
• Long short-term memory (LSTM): LSTMs belong to the family of recurrent neural networks (RNNs) [27], which are well-known and widely used to process time series data and for natural language processing [40,41]. These networks are built upon the idea of feeding the output of the previous step, along with the next input of the sequence, as input to the next step. This ability gives the network the opportunity to model sequences, and facilitates modelling cases in which the relationships between variables are not simply parallel but sequential (the value of a given variable at one time may determine the value of another at a later or earlier time). Sequential data can be trained as complete sequences, forward or backward sequences, or a combination of them. Figure 4 illustrates the basic architecture of a recurrent neural network. Given an input vector sequence x = (x_1, x_2, ..., x_T) passed through N recurrently connected hidden layers, the hidden vector sequences h^n = (h^n_1, ..., h^n_T) and the output vector sequence y = (y_1, ..., y_T) are computed. When N = 1, the architecture reduces to a single hidden layer. The hidden layer activations are calculated as follows:

h^1_t = H(W_{x h^1} x_t + W_{h^1 h^1} h^1_{t-1} + b_{h^1}),
h^n_t = H(W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b_{h^n}), for n = 2, ..., N,

where H corresponds to the hidden layer function, the W terms are weight matrices (e.g., W_{h^1 h^1} is the recurrent connection at the first hidden layer) and the b terms denote bias vectors (e.g., b_y is the output bias vector). The output is then computed as:

y_t = W_{h^N y} h^N_t + b_y.

LSTM is one of the most famous types of RNN architecture [43]. It can memorize over long and short periods of time using a gating mechanism, which makes it possible to control the information that has to be kept over time, the duration it has to be kept for and the time at which it can be read through the memory cell [44].
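As an illustrative aside, the gating mechanism just described can be sketched in plain NumPy. This is a minimal single-step cell, omitting the peephole terms (W_ci, W_cf, W_co) for brevity; all weights are random toy values, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: i, f, o are the input/forget/output gates, c the cell."""
    z = W @ x + U @ h_prev + b      # stacked pre-activations, shape (4H,)
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])             # input gate: how much new info to write
    f = sigmoid(z[H:2*H])           # forget gate: how much old cell to keep
    o = sigmoid(z[2*H:3*H])         # output gate: how much cell to expose
    g = np.tanh(z[3*H:4*H])         # candidate cell update
    c = f * c_prev + i * g          # new memory cell state
    h = o * np.tanh(c)              # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                         # input and hidden sizes (toy values)
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):   # run a length-5 input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)
```

Because h = o · tanh(c) with o in (0, 1), the hidden state is always bounded in (−1, 1), which is part of what keeps training stable over long sequences.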
The architecture of an LSTM cell, as described in [42], is shown in Figure 5, where H is implemented by the following composite function:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f),
c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),
h_t = o_t tanh(c_t),

where σ is the logistic function, i_t, f_t, o_t, c_t and h_t correspond to the input gate, forget gate, output gate, memory cell and hidden state at time t, respectively, and x_t refers to the input of the system at time t.
• Support vector machine (SVM): this is an algorithm based on statistical learning which has gained great popularity over the last decade. It is useful in several classification and regression problems [36,45,46]. SVM takes the structural risk minimization principle into account and attempts to find the locations of decision boundaries (also known as hyperplanes) which produce the optimal separation between classes [47,48]. This paper used support vector regression (SVR), a generalization of the classification problem in which the model returns continuous values. The generalization from SVM to SVR is achieved by introducing an ε-insensitive region around the function, referred to as the ε-tube. The optimization problem is then reformulated in order to find the tube that best fits the function, while balancing model complexity and prediction error [49]. The SVR problem formulation derives from a geometrical representation, and its continuous-valued function can be approximately represented by:

f(x) = w^T x + b, (16)

where w and b correspond to the weight and bias vectors, respectively. In real applications, however, data tends to be non-linear and non-separable, and kernel functions are therefore used to extend the concept of the optimal hyperplane. In multidimensional data, x is augmented by one dimension and b is included in the w vector for simpler mathematical notation (see Equation (17)). Multivariate regression in SVR then formulates the function approximation problem as an optimization which attempts to find the narrowest tube centred around the surface [49].
The objective function is shown in Equation (18), where ||w|| equals the magnitude of the normal vector to the surface:

min_w (1/2) ||w||^2. (18)

The grid search method was used to tune the hyperparameters, whereby three different kernel functions (i.e., radial basis function, polynomial kernel and sigmoid kernel) were tested. These three kernels are defined by Equations (19)-(21), respectively:

K(x_i, x_j) = exp(−γ ||x_i − x_j||^2), (19)

K(x_i, x_j) = (γ x_i^T x_j + r)^d, (20)

K(x_i, x_j) = tanh(γ x_i^T x_j + r), (21)
where γ controls the influence of individual training samples on the classification outcome; large values of γ lead to overfitting, while small values result in under-fitting [50].
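As a hedged sketch of this tuning step, scikit-learn's GridSearchCV can search over the three kernels and γ. The data below is synthetic, standing in for one user's daily workout durations; the grid values are illustrative, not those selected in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Toy regression problem: day index over the first three months as input,
# daily workout seconds as the target (simulated with a weekly rhythm).
rng = np.random.default_rng(0)
X = np.arange(90, dtype=float).reshape(-1, 1)
y = 600.0 + 200.0 * np.sin(X.ravel() / 7.0) + rng.normal(0, 30, 90)

# Grid over the three kernels of Equations (19)-(21) and the gamma
# parameter discussed in the text (large gamma -> overfitting).
grid = GridSearchCV(
    SVR(),
    param_grid={"kernel": ["rbf", "poly", "sigmoid"],
                "gamma": [1e-3, 1e-2, 1e-1],
                "C": [1.0, 10.0, 100.0]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Cross-validated grid search guards against picking a γ that merely memorises the 90 training days.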

Ensemble Models
Once we had trained the LSTM and SVM algorithms and used grid search to tune the hyperparameters for each user, we then sought a strategy that would allow us to combine several trained base models in order to produce one optimal predictive model. We decided to use an ensemble approach for this purpose. An ensemble consists of a set of individually trained classifiers or regressors whose predictions are combined in order to obtain better results than with single methods [51]. In this research, we used an approach similar to that of bagging methods [52]: our objective was to compensate for the errors generated by individual models, while grouping the models by cluster, based on the similarity of their training data. The final prediction was the mean output value of the combined models from the same cluster, as shown in Figure 3.
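A minimal sketch of the ensemble-mean idea, using plain linear regressors as stand-ins for the per-user LSTM/SVR models (all names and data here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each member model is trained on one user's sessions; the cluster-level
# prediction for a new user is the mean of the member models' outputs.
rng = np.random.default_rng(1)
days = np.arange(90, dtype=float).reshape(-1, 1)  # first three months

models = []
for _ in range(5):                                # five users in one cluster
    y = 500 + 3 * days.ravel() + rng.normal(0, 50, 90)
    models.append(LinearRegression().fit(days, y))

# Predict the fourth month (days 90-119) and average across the ensemble.
month4 = np.arange(90, 120, dtype=float).reshape(-1, 1)
ensemble_pred = np.mean([m.predict(month4) for m in models], axis=0)
print(ensemble_pred.shape)
```

Averaging across similarly behaving users is what lets the errors of individual models cancel out, which is the bagging-style effect described above.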

Output
For a better understanding of the output of the proposed framework, we shall refer to Figure 6, which contains an example simulating three different user profiles: high exercise frequency, medium exercise frequency and low exercise frequency. The full steps a user goes through in the system are explained below.

• Step 1 (Input data): Raw data from the three different users is given as input to the system. Users can belong to one of the three aforementioned categories (depicted in different colours).
• Step 2 (Pre-processing): In the pre-processing step, all data cleaning procedures, as well as other operations, are applied in order to obtain the data from workout sessions.
• Step 3 (Clustering): When using the K-means algorithm, if three clusters are selected and the characteristics are the mean of accumulated seconds per day over three months, the clusters categorize the users into three groups: people with high PA (orange), people with medium PA (blue) and people with low PA (green).
• Step 4 (Ensemble models): Assuming the LSTM is used as the regression method, the data corresponding to the first three months of use is fed to the corresponding LSTM ensemble (orange, blue or green). The ensembles then use the pre-trained models to calculate all regressions, and the output is the mean of the corresponding ensemble: Ē1, mean of ensemble 1 (green); Ē2, mean of ensemble 2 (blue); Ē3, mean of ensemble 3 (orange).

• Step 5 (Output): The system output corresponds to the average regression of the models within a given cluster. Since our aim was to determine adherence to training using a fitness app, we turned again to the literature in order to follow a rule that defined user adherence. Previous researchers have established that exercise-derived health benefits taper off after 4-5 weeks of training cessation [53][54][55][56][57]. Taking this into account, we determined that a user would be considered non-adherent if he/she showed no training activity over a full month (the fourth month).
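The adherence rule can be expressed as a one-line predicate over the predicted fourth-month activity. This is a sketch; `is_adherent` is a hypothetical helper, not part of the MH codebase.

```python
import numpy as np

def is_adherent(month4_seconds):
    """Adherence rule from the text: a user is non-adherent if they show no
    training activity over the entire fourth month.

    month4_seconds: predicted daily workout seconds for the ~30 days of
    month four (a hypothetical input shape).
    """
    return bool(np.asarray(month4_seconds).sum() > 0)

print(is_adherent([0] * 30))           # no activity at all -> non-adherent
print(is_adherent([0] * 29 + [600]))   # one 10-min session -> adherent
```

Applying this threshold to the ensemble's regression output is what turns the continuous prediction into the binary adherent/non-adherent label evaluated later.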

Experiments and Results
This section shows all the implementation and results obtained throughout all stages of the project.

Implementation
We used the Python programming language, version 3.8.5. For code writing, we used the Jupyter client-server application. All the dependencies were installed using the Conda package manager, version 4.9.2. Additionally, we used the open-source sklearn library [58] for the clustering algorithms, while the DL algorithms were implemented using the Keras library, version 2.4.3.

Clustering
Three different algorithms were used to allocate new users into similar groups, taking into account some of their characteristics (training data calculated over the first 90 days). These features and their corresponding descriptions are shown in Table 11. The purpose of creating these variables was to provide the clustering algorithms with information so that, whenever a new user signs up, our framework extracts these features from the first three months and the clustering algorithm categorizes the user based on their training behaviour in the app. The variables mean_week_n were calculated from week 1 until week 12; Table 11 shows a simplification of this process.
Three methods were selected in order to test different scenarios for the clustering algorithms. We first wanted to test the scenario where few clusters were expected, and chose the K-means algorithm for this purpose, as no more than 3 or 4 groups were anticipated. The elbow curve in Figure 7 confirms that 3, 4 or 5 clusters are sufficient. The BIRCH algorithm caught our attention due to its ability to work with larger samples, which could prove useful in our future work. Finally, we used AP for scenarios with several distinct groups of exercisers. With our data, the number of clusters was calculated automatically, and results showed variations between 15 and 30 groups.

K-Means
The parameter to adjust in the K-means clustering algorithm was the number of clusters, and we used the elbow method to set it, as shown in Figure 7. From this curve, we can note that the optimal number of clusters is k = 4; however, we also tested the results for k = 3. The curve in Figure 7 was obtained using three features (missed_first_month, missed_second_month, missed_third_month).
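The elbow computation can be sketched as follows, with synthetic four-group data standing in for the real missed-sessions features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic users with four behavioural profiles over the three
# missed-sessions features (hypothetical values, not the MH data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.5, size=(60, 3))
               for c in ([2, 2, 2], [10, 15, 20], [25, 5, 3], [28, 28, 28])])

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # within-cluster SSE, as in Equation (1)

# The "elbow" is where adding clusters stops reducing the SSE sharply.
for k, sse in zip(range(1, 9), inertias):
    print(k, round(sse, 1))
```

Plotting k against the inertia values reproduces the kind of curve shown in Figure 7; the bend at the true number of groups is the elbow.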
As for Figure 8, the top-left scatter plot represents the number of skipped sessions in the first month vs. the number of skipped sessions in the second month. The top-right scatter plot represents the number of skipped sessions in the second month vs. the number of skipped sessions in the third month, while the bottom 3D scatter plot combines the three axes into a single plot. The clusters show that users could be categorized into four groups: 1. users who, for the last three months, have been exercising for an average of one session every two days (yellow colour); 2. users who exercised moderately during the first month, but reduced their PA in the following months (green colour); 3. users with low PA in the first month, but who increased their PA in the second and third months (blue colour); and finally, those who evidenced very low PA across all three months (purple colour). The 3D scatter plot shows the separability between the four clusters.

BIRCH
The parameter used to configure the BIRCH algorithm was the threshold, which represents the radius of the subcluster obtained by merging a new sample with its closest subcluster; this radius must remain below the threshold [58]. It is therefore important to adjust the threshold carefully, because samples are represented by explicit coordinates in a Euclidean space, and special attention should be paid to units when building the clusters. In this dataset, features that depend on accumulated time (e.g., mean_first_month, mean_week_1, ...) are expressed in seconds, while features that depend on the accumulated number of missed workout sessions (e.g., missed_first_month, missed_second_month) are unitless. Figure 9 shows three scatter plots of the clustering produced by the BIRCH algorithm, taking "missed first month", "missed second month" and "missed third month" as features; similarities to the K-means clusters can be identified.
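A minimal sketch of the threshold adjustment with scikit-learn's Birch follows; the feature values are illustrative, and the unit caveat above is handled by standardizing the mixed-unit features before clustering:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Mixed-unit features: session means in seconds, missed-session counts unitless
seconds = rng.normal(1800, 300, size=(200, 2))
missed = rng.integers(0, 30, size=(200, 2)).astype(float)
X = StandardScaler().fit_transform(np.hstack([seconds, missed]))  # put units on one scale

# The threshold is the maximum subcluster radius: a smaller threshold yields
# more, tighter subclusters; n_clusters=None returns the raw subcluster labels
loose = Birch(threshold=1.5, n_clusters=None).fit_predict(X)
tight = Birch(threshold=0.1, n_clusters=None).fit_predict(X)
```

Comparing the number of distinct labels in `tight` and `loose` shows directly how the threshold controls cluster granularity.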

Affinity Propagation
The sklearn implementation of the AP algorithm has two notable characteristics. On the one hand, when the fit function does not converge, the array of cluster centres is empty and all training samples are labelled −1; the predict function will then assign the label −1 accordingly. On the other hand, when all training samples have equal similarities and equal preferences, the assignment of cluster centres and labels depends on the preference value: if the preference is smaller than the similarities, the fit function results in a single cluster with label 0 for every sample; otherwise, every training sample becomes its own cluster centre and is assigned a unique label [58]. Taking all of the above into account, we tested AP by varying the damping parameter, which the library specifies must lie in the range [0.5, 1). Following several tests on our data, we obtained the best results with damping values of 0.7 and 0.8.
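The damping parameter and the non-convergence behaviour described above can be sketched with scikit-learn on synthetic blob data (illustrative, not the MH features):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=5, random_state=2)

# damping must lie in [0.5, 1); higher values slow the message updates,
# which often helps convergence at the cost of more iterations
ap = AffinityPropagation(damping=0.7, random_state=0).fit(X)
n_clusters = len(ap.cluster_centers_indices_)

# A non-convergent fit would leave cluster_centers_indices_ empty
# and label every sample -1, as described in the text
converged = n_clusters > 0
```

Unlike K-means, the number of clusters is not fixed in advance, which matches the 15 to 30 groups found automatically in our data.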

Regression Results
The regression model was one of the most important stages in our research, and the LSTM model was selected for its ability to work with sequences and to memorize long- and short-term dependencies [43,59]. Here, we were interested in characterizing the training of MH users, with the aim of predicting future workout sessions during the fourth month, based on their training during the first three months.
The first step in this module was to convert the time series data into supervised learning data; the conversion procedure is shown in Figure 10. A lookback value of 7 was used, meaning that the LSTM learns from 7-day periods, with the eighth day as the system output. This process was carried out using the TimeseriesGenerator function from the Keras library. User behaviour differs greatly, as can be observed in the first 90 days (marked in red) in Figures 11-13. Hence, the strategy was to build a general architecture for different types of user, combining the creation of ensembles of similar users (via the previously mentioned clustering stage) with hyperparameter tuning. Figures 11-13 reflect how different user behaviour was found for months 1-3 (in red) and for month 4 (in purple). This general approach makes it unnecessary to train and adjust hyperparameters for each newly-arrived user. Additionally, Figure 11 shows that the LSTM is able to ignore the first month of workout sessions and learn from months two and three, so as to predict an adherent user for the fourth month. The LSTM prediction for the adherent user in Figure 12 proved similar, with a small upward shift relative to the test data. Finally, a non-adherent user prediction is shown in Figure 13.
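The conversion from a daily series to supervised pairs can be sketched without Keras as a plain sliding window; this is a simplified stand-in for TimeseriesGenerator, with an illustrative series in place of real workout data:

```python
import numpy as np

def to_supervised(series, lookback=7):
    """Turn a 1-D daily series into (X, y) pairs: 7 days in, the 8th day out."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

daily = np.arange(90)                    # stand-in for 90 days of workout data
X, y = to_supervised(daily, lookback=7)
# 90 days with lookback 7 yield 83 input windows of 7 days each,
# every window paired with the following (eighth) day as the target
```

Each row of `X` is one 7-day window, and the matching entry of `y` is the day the LSTM learns to predict.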

Hyperparameter Tuning
Given the sophisticated, automatic connections between their inputs and their outputs, neural networks have the ability to learn very complicated patterns [27]. Adding a hyperparameter tuning process optimizes the network configuration and leads to better regression results. Different techniques exist for hyperparameter tuning. Some are based on optimization methods that aim to obtain the best network configuration, including Bayesian optimization to adaptively select configurations [60,61], and Hill Climbing with Random Restart and Tabu List algorithms [46,62,63]. Classical methods such as grid search have shown great effectiveness, but they have high computational complexity, especially when a wide range of parameter values needs to be tuned; in this regard, other researchers have explored alternatives such as suboptimal grid search [64]. In this paper, we used the grid search technique for hyperparameter tuning of the LSTM and SVR, since it was necessary to optimize a large number of models with different behaviour.
• LSTM grid search: following a large number of tests with different users, a range of hyperparameters was chosen for which the models generally fitted the regression curve best; the same hyperparameters and numbers of neurons were then used for all users. Specifically, three dropout values were applied after the first and second dense layers; three batch sizes (1, 2 and 4) were used, taking into consideration the number of days in a week; and five neuron values were applied to the first (LSTM) layer. The ranges of these values are highlighted in Table 12. The remaining hyperparameters were selected from the existing literature (previous work on prediction, even if aimed at different types of application) reporting high performance with the proposed architectures; hence, we followed the hyperparameter tuning strategies of [22,65], with a detailed explanation of all values in Table 12. The lookback value was selected in agreement with the MH team, as it was considered most appropriate for analysing the evolution of training over the weeks, given that people generally change their routines or lose motivation within a period of one week [66]; following some experiments, we also verified that the curves fitted better with a value of 7 days. The number of epochs was set to 50 after observing in experiments that overfitting occurred beyond 50 epochs. Next, based on [67] and several experiments, we selected 4 hidden layers. The activation function selected was ReLU, since it resolves the problem of negative values and has performed well in previous research [68]; it was applied to all layers (including LSTM and dense) except the final one. Early stopping with a patience of 15 epochs was configured in our architecture in order to avoid overfitting.
• SVR grid search: the hyperparameters varied in the case of SVR were the kernel, with the choices 'poly', 'rbf' and 'sigmoid', and C, for which a range of 0 to 500 was chosen, since the MAE was not reduced beyond this range with any combination of kernel or other hyperparameter settings. The hyperparameter gamma, set to the 'scale' option, and epsilon, with a value of 10, were left unchanged.
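The LSTM search space described in the first bullet can be enumerated as a plain grid. The dropout and neuron values below are placeholders, since the paper's exact values live in Table 12; only the batch sizes 1, 2 and 4 are stated in the text:

```python
from itertools import product

dropouts = [0.1, 0.2, 0.3]           # placeholder: three dropout values (see Table 12)
batch_sizes = [1, 2, 4]              # stated in the text
lstm_neurons = [8, 16, 32, 64, 128]  # placeholder: five values for the first (LSTM) layer

grid = list(product(dropouts, batch_sizes, lstm_neurons))
# 3 x 3 x 5 = 45 candidate configurations to train and compare per ensemble
```

Each tuple in `grid` is one model configuration to be trained and scored; the best-scoring tuple is then reused for all users, as described above.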
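The SVR grid search of the second bullet can be sketched with scikit-learn's GridSearchCV. The data is synthetic, and since C must be strictly positive in sklearn, the 0-500 range from the text is sampled at a few illustrative points:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 3))                                   # illustrative features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Kernels and the C range come from the text; gamma='scale' and epsilon=10
# are held fixed, as stated. The C values here are illustrative sample points.
param_grid = {"kernel": ["poly", "rbf", "sigmoid"], "C": [1, 10, 100, 500]}
search = GridSearchCV(SVR(gamma="scale", epsilon=10), param_grid,
                      scoring="neg_mean_absolute_error", cv=3).fit(X, y)
best = search.best_params_  # kernel/C pair with the lowest cross-validated MAE
```

Using MAE as the scoring criterion mirrors the error measure mentioned in the bullet above.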

Validation Metrics
We used the following performance metrics to evaluate our ensemble approach: accuracy, precision, recall, specificity and F1_score.
-Confusion matrix: this displays and compares the actual values with the values predicted by the model. In the context of ML, a confusion matrix is used to analyse how a machine or deep learning classifier performs on a dataset. It consists of the following 4 elements:
• True Positives (TP): users who were correctly predicted to exercise in the fourth month.
• True Negatives (TN): users who were correctly predicted not to exercise in the fourth month.
• False Positives (FP): users who were predicted to exercise but did not actually exercise in the fourth month.
• False Negatives (FN): users who were predicted not to exercise in the fourth month, but who actually did.
-Accuracy: this metric evaluates the performance of classification models. It indicates the fraction of correct predictions of adherent and non-adherent users over the total set evaluated. It is defined by (22).
-Precision: this value tracks the performance of a model in terms of positive-class classification, taking into account the users who were correctly and incorrectly predicted to exercise.
-Recall: this equals the number of genuine positive examples (i.e., users who were correctly predicted to exercise), divided by the sum of the true positives and the false negatives (i.e., users who were incorrectly predicted not to exercise).
-F1_score: this is the harmonic mean of recall and precision.
-Specificity: this quantifies the TN rate, i.e., the proportion of users whom the model predicted not to exercise in the fourth month and who, in fact, did not train.
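All of the metrics above follow directly from the four confusion-matrix counts; a minimal sketch, using illustrative counts rather than the study's results:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the paper's validation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # a.k.a. sensitivity
    specificity = tn / (tn + fp)            # true-negative rate
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1_score": f1}

# Illustrative counts, not the study's confusion matrix
m = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```

With these counts, for example, accuracy is (90 + 80) / 200 = 0.85 and precision is 90 / 100 = 0.90.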

Results
After the pre-processing stage, the total number of users in our model was 246 (Figure 7), of whom 112 were adherent and 134 non-adherent, as shown in Table 7. We trained with all 246 users in order to get the most out of the proposed ensembles. It should be noted that the testing data (i.e., the last 30 days, or fourth month, of training) was never shown to our LSTM models. We then tested the model over the last 30 days for each user, using the regression results obtained from the remaining 245 users. Table 13 presents the classification results obtained using the regressions with a single ensemble, ignoring the clustering stage. The ensemble output corresponds to the average of the fourth-month regressions over all trained models, while the system output (class 0: non-adherent; class 1: adherent) was obtained using the rule mentioned in Section 2.2.6. Table 13 shows high accuracy and specificity, while recall and F1_score were not so high. This indicated that the system was unable to correctly classify users who actually exercised during the fourth month, and justified the need for a clustering stage to categorize users at the system input. Thus, our results are presented using two regression methods, SVR and LSTM, each preceded by three different clustering methods. The following tables contain the results obtained by combining 4 types of features and varying some clustering algorithm parameters; the values in bold reflect the highest scores for each combination of parameters and features. Table 14 shows that the best results were obtained using the number of skipped sessions in the first, second and third months as features. Additionally, it can be noticed that recall and F1_score improved, although the values remained unacceptable. The best results were obtained for k = 4 and threshold 0.1 (see Table 15).
The features which produced the best results were the same as those used in the K-means algorithm. It is also worth noting that the threshold values shown in the table were those which, after a number of tests, produced the greatest changes: low threshold values did not change the results, while large increases caused them to worsen.
In the AP algorithm, the damping parameter was varied within the allowed range of 0.5 to 1, and Table 16 clearly reflects how the algorithm did not converge for values lower than 0.66; for this reason, only values higher than 0.66 are presented. No major changes were found in the results. The following tables show the results obtained using the LSTM architecture while changing the clustering method. We can observe that the LSTM model in combination with K-means clustering obtained significant results compared to those using SVR, as seen in Table 17, with a recall value of 70% (acceptable results). Modifying the number of clusters led to no variation in results; however, accuracy did increase by 7% compared to the best results obtained using SVR and clustering combinations. In Table 18, the best results for the BIRCH algorithm with LSTM were obtained by combining the following features: the mean of completed sessions in the first, second and third months (in seconds), the means of completed sessions in weeks 8-12 (in seconds), and skipped workout sessions.
The results of the LSTM model preceded by AP clustering are shown in Table 19, which shows that it is possible to obtain good results with several features, such as the mean of completed sessions (in seconds) per training week, or skipped workout sessions per month, with accuracy values of up to 86% and 87%, respectively. The best results were obtained using all the features combined, with a damping value of 0.68.

Discussion
Our results confirm our initial premise that user adherence can be predicted. To achieve this, we used ML to categorize users into groups, and DL-based regression ensembles to predict user adherence to fitness app training.
Another finding was that the LSTM model outperforms the SVR model (a 12% increase in accuracy) when combined with clustering by affinity propagation. This is due to the LSTM's ability to learn from a series of past observations in order to predict the next value in the sequence. Adding a clustering block before the regression model also increased accuracy (>15%) and recall (35%), which means that grouping users into similar categories can help achieve better predictions of physical activity behaviour in new users.
Both for the LSTM and the SVR models, the features which best helped to determine whether a user would adhere to app training were mainly the combination of all extracted features, as shown in Table 11. We also obtained good results when clustering by the K-means method using solely the skipped workout sessions feature. This could intuitively lead one to think that the number of missed workouts has greater significance than other features when determining adherence to training, and that it could be relevant when predicting training behaviour in general. Somewhat in line with our results, previous research has associated mood improvement with the completion of the session rather than with its duration [69]. If sticking to the training plan leads to improvements in mood, perceived pleasantness and usefulness could be positively affected, and these two factors have been hypothesized to correlate directly with adherence to the training plan [70].
On a different note, the confusion matrix showed that our proposed system leads to high numbers of FN predictions. These FNs become obvious in cases where we know that the user completed workouts in month four, while our framework predicted that they would not. While this may seem a harmless mistake, given that the client is not actually at risk of dropping out, such incorrect predictions could have consequences. For instance, a non-adherence prediction could trigger a series of unnecessary and inadequate motivational strategies, and the consequences of these otherwise unnecessary actions are unknown.
Our results (see Table 19) showed 5 false positive predictions, i.e., the system predicted that 5 individuals would continue to train during month four when they actually failed to do so. These predictions correspond to 95% precision and 96% specificity, which we interpret as satisfactory. Although this is a low figure, it still implies that 5 clients could drop out unexpectedly, and we will work towards reducing this figure in the future.
In this paper, we identify user patterns in a group of people who have demonstrated a desire to perform physical exercise. Predicting user behaviour will thus give us the capacity to identify and target users at risk of drop-out. As published by other authors, programme modifications and/or psychological interventions specifically targeted at these subgroups will allow for behavioural modifications which can, eventually, lead to increased levels of physical activity among these users [71]. Increased physical activity entails various physical, mental and psychological benefits and constitutes one of the best ways to prevent health problems [2]. Greater adherence to regular physical activity would most likely have a very significant impact on public health, and this fact needs to be acknowledged. Further steps are required to find the tools for effective motivation that may help our communities become more physically active.
Our work was developed via a general approach, so that it takes advantage of the models already trained to predict workouts, instead of training a model for each user who joins the application. In this way, the processing time of the system is, and will remain, low, even when the number of users becomes high. This approach highlights the characteristics of flexibility and resource efficiency, which are within the definition of "Industry 4.0" and its two directions of development [72]. In order to advance to the industrialization stage and contribute to the development of Industry 4.0 applications, our proposed system could next be deployed in production.
Future processes could study more flexible longitudinal periods (other than 4 months), while variables of both engagement to the exercise plan and user motivation (intrinsic vs. extrinsic) could be included, possibly by previously applying a motivational questionnaire. This would enable analysts to more accurately identify user patterns and predict training behaviour. Fitness app developers would, accordingly, be in a position to undertake motivational intervention to promote training adherence and reduce app attrition.

Conclusions
Our paper adds to the scarcely researched area of training behaviour in fitness app users. There is still no consensus on the exact definition of fitness app adherence, and there would seem to be no previous research work that uses a deep learning approach to predict fitness app adherence over time. To the best of our knowledge, this is the first framework whose aim is to predict user adherence to training via a fitness app. The framework consists of two main stages: (i) the characterization of users into groups, based on their training behaviour during the first three months; and (ii) regression prediction for new users via an ensemble approach. Our results show that it is possible to take advantage of stored time-dependent data in order to predict adherence over a given period of time. Of the features studied, training frequency seems to be more relevant than time spent training. For their part, our ensembles consisted of DL regressors and achieved good performance metrics. In the near future, we plan to incorporate demographic factors, as well as variables of involvement in the workout regimen and user motivation (intrinsic vs. extrinsic), into the DL architecture. Additionally, we believe flexible longitudinal periods would be worth studying; for these stages, we expect to use a larger user database, which should enable thorough testing. The approach should continue to adhere to the principles of flexibility and resource efficiency, which will be essential in the creation of Industry 4.0 applications.

Acknowledgments:
The authors would like to thank Mammoth Hunters for the data provided to conduct this research.

Conflicts of Interest:
Authors must identify and declare any personal circumstances or interests that may be perceived as influencing the representation or interpretation of reported research results.

Abbreviations
The following abbreviations are used in this manuscript: