Healthy Knee Kinematic Phenotypes Identiﬁcation Based on a Clustering Data Analysis

: The purpose of this study is to identify healthy phenotypes in knee kinematics based on clustering data analysis. Our analysis uses the 3D knee kinematics curves, namely, ﬂexion/extension, abduction/adduction, and tibial internal/external rotation, measured via a KneeKG™ system during a gait task. We investigated two data representation approaches that are based on the joint analysis of the three dimensions. The ﬁrst is a global approach that is considered a concatenation of the kinematic data without any dimensionality reduction. The second is a local approach that is considered a set of 69 biomechanical parameters of interest extracted from the 3D kinematic curves. The data representations are followed by a clustering process, based on the BIRCH (balanced iterative reducing and clustering using hierarchies) discriminant model, to separate 3D knee kinematics into homogeneous groups or clusters. Phenotypes were obtained by averaging those groups. We validated the clusters using inter-cluster correlation and statistical hypothesis tests. The simulation results showed that the global approach is more efﬁcient, and it allows the identiﬁcation of three descriptive 3D kinematic phenotypes within a healthy knee population.


Introduction
Establishing the phenotypes of healthy or asymptomatic gait from biomechanical data, such as kinematic, kinetic, and electromyographic measurements, is of primary importance. It provides a better understanding of the potential variability in a healthy individual's locomotion function and more personalized approach when designing treatment plans [1]. More importantly, identifying healthy phenotypes sets targets in treatment planning and evaluation of therapeutic efficiency [2].
However, the characterization of asymptomatic gait phenotypes representing clearly distinct gait categories is difficult for two reasons. First, the biomechanical data are given in the form of a vector of measurements of high dimension for each subject, causing a dimensionally cursed phenomenon [3,4]. Second, there is significant variability in the biomechanical data, even in healthy asymptomatic population gait. This high variability might be caused by the difference in the data collection methodology (motion capture systems, walking speed, data processing, etc.) [5].
Previous studies have shown the presence of several different patterns in the asymptomatic gait data [6]. Indeed, the presence of multiple phenotypes in knee kinematic patterns was first demonstrated by experiments using intra-cortical pins to measure movements [7]. While prior works have also identified differences in male and female gait patterns [2,8,9], other studies used 3D kinematic and kinetic to identify how age, sex, and body mass index (BMI) can influence normal gait [10]. However, both the high variability and high dimensionality of biomechanical data have forced many studies to limit their analyses to using simple descriptions, e.g., the mean or the local extrema. Indeed there have been various biomechanical patterns describing normal three-dimensional (3D) kinematic patterns of the knee (flexion/extension, abduction/adduction, and tibial internal/external rotation) during functional activities such as level walking [7,[11][12][13][14][15][16][17]-where a single normal pattern with bands of deviation is used as a reference to identify abnormalities in a patient's pattern [18].
Most previous studies have focused on assessing pathological rather than healthy asymptomatic kinematics [12]. Pattern recognition and machine learning techniques are typically used to characterize gait patterns. Various clustering techniques have been applied such as hierarchical clustering [19][20][21], c-means [22][23][24][25], artificial neural networks [26], and fuzzy clustering [27]. Hierarchical clustering was used to distinguish five gait patterns in young subjects using 3D peak muscle powers [19,20]. Hierarchical c-means clustering algorithms were employed to derive gait patterns [22], to classify kinematic data of patients following a stroke [23], and to identify gait pattern deviations in children with cerebral palsy [21,24,25]. In this same pathological population, fuzzy clustering was applied to normal and pathological subjects [27]. Similarly, artificial neural networks were used to distinguish between asymptomatic and osteoarthritis gaits [26].
Previous studies used only a small set of descriptors such as the mean of the measured data or local information to identify patterns in the data. The use of a small set of descriptors seriously limits the analysis and interpretation of the data due to the complexity of the kinematic data [28]. Therefore, the objective of this study is to identify healthy gait phenotypes using a clustering model that deals with high dimensional data analysis, i.e., the BIRCH (balanced iterative reducing and clustering using hierarchies) model. We investigated two data representation approaches based on a joint analysis of the three dimensions. The first, named global approach, considered the raw data without any dimensionality reduction. The second is a local approach, which is considered a large set of biomechanical parameters of interest extracted on the 3D kinematic patterns. The objectives are: (1) to identify the phenotype using an important cohort of healthy participants, (2) to explore a clustering approach using 3D knee kinematics data representations that consider the three planes of motion together, and (3) to compare global and local representations.

Materials and Methods
The methodology used is described in the block diagram presented in Figure 1 and involves the following steps: (1) kinematics data collection, (2) kinematics data standardization, (3) kinematics representation, (4) clustering for phenotypes identification, and (5) cluster validation.

Kinematic Data Collection
Three-dimensional (3D) knee kinematics data (flexion/extension, abduction/adduction and internal/external rotation), from the sagittal, frontal, and transverse planes, respectively, were recorded while each participant walked on a treadmill at a self-selected, comfortable speed ( Figure 2). The KneeKG™ system (Emovi Inc., Canada), which utilizes a knee marker attachment system designed to reduce skin-motion artifacts [29], was installed on participants' knees to record 3D kinematics during gait trials. A number of representative trials, generally 15, were averaged to obtain a mean pattern per plane of movement per participant. This was followed by interpolation and resampling from 1% to 100% of the gait cycle (GC), therefore giving 100 measurement points per plane of movement for each participant. The 3D kinematic data collection was performed on 75 participants (34 men and 41 women). All the participants had data recorded for both of their knees, meaning there were 150 total data sets. Volunteers were excluded if they had any lower limb pathology, musculoskeletal disorder, or previous lower limb surgery. Ethical approvals were obtained from the ethics committee of the Hôpital Maisonneuve-Rosemont (reference: 11098). All subjects provided informed consent before participation.

Kinematic Data Standardization
Kinematics data describe the knee movement in 3 dimensions. As the movement amplitude is different according to the movement plane (sagittal, frontal, and transverse), the obtained data are heterogeneous (i.e., range at different scales). A standardization was performed to take this aspect into account. Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation (SD).

Kinematic Data Representation
In this study, we investigated two approaches for kinematic data computation: global and local representations.
In the global approach, the kinematics data from the three movement planes (sagittal, frontal, and transverse) were concatenated in order to form, for each participant, single vector data. Considering that each movement plane is characterized by 100 points, the concatenation leads to a resulting vector of 300 dimensions. This approach has the advantage of dealing with the entire curve without any loss of information.
The local approach consists of the extraction of a set of biomechanical parameters of interest from the 3D kinematic curves. We extracted 69 parameters distributed over the three planes: 30 from the frontal plane (abduction/adduction curve), 17 from the sagittal plane (flexion/extension curve) and 22 from the transverse (internal/external rotation). These parameters were based on an exhaustive literature review [30][31][32] and on variables routinely assessed in clinical biomechanical studies of knee osteoarthritis populations, such as maximums, minimums, varus and valgus thrust, angles at initial contact, mean values, and range of motion (ROM) throughout GCs or GC sub-phases (loading, stance, swing, etc.) [33,34].

Clustering for Phenotype Identification
The purpose of clustering is to extract meaningful patterns by separating the 3D kinematic data into homogeneous groups. The mean patterns, obtained by averaging the kinematic curves within each homogeneous group, form the knee kinematic phenotypes.

Clustering Model
In this study, we use a hierarchical clustering algorithm named balanced iterative reducing and clustering using hierarchies (BIRCH) [35] (Table 1). In its first stage, BIRCH clusters numeric data by merging hierarchical clustering (i.e., global clustering). Then, at later stages, it uses other existing clustering techniques (i.e., cluster refinement). BIRCH uses the clustering feature (CF) concept to describe a cluster. It represents the cluster hierarchy as a CF tree. Phase 1: Build the CF Tree. Load the data into the memory by building a cluster-feature tree (CF tree, defined below). Optionally, condense this initial CF tree into a smaller CF. Phase 2: Global clustering. Apply an existing clustering algorithm on the leaves of the CF tree. Optionally, refine these clusters.
The main advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional data in an attempt to produce the best quality clustering. This is suitable for kinematic data and offers the opportunity to perform clustering without reducing the dimensionality of the data. Moreover, BIRCH can find high-quality clustering with only a single scan of a dataset and can efficiently handle noise caused by the kinematic data variability [28].

Optimum Number of Clusters
The optimum number of clusters is determined by the Calinski-Harabaz index (CHindex), which is a function of both the within-cluster-sum-of-squares (SS w ) and the betweencluster-sum-of-squares (SS b ) [36]. SS w measures the compactness of the cluster, while the SS b measures the separation between clusters.
A higher value of the CH-index means the clusters are dense and well separated. Although there is no "acceptable" cut-off value, we need to choose the number of clusters that gives the highest value of CH-indices.

Cluster Evaluation
The clustering model has been evaluated using the intra-class correlation coefficient (ICC), which informs how similar elements in the same cluster are. It provides a measure of homogeneity within the clusters. The ICC is conventionally calculated using the variance within clusters and the variance between clusters according to the following Equation (1): where σ b is the between-cluster variance and σ w is the within-cluster variance.
The values of ICC range from 0 to 1. From (1), as the within-cluster variance (σ 2 w ) moves toward 0, ICC becomes closer to 1. A very small value for ICC implies that the within-cluster variance is much greater than the between-cluster variance, and an ICC of 0 shows that there is no correlation of responses within a cluster.

Statistical Analysis: Hypothesis Testing and Phenotype Interpretation
The statistical analysis aims to compare the identified phenotypes in order to assess differences between them. We used the 1D statistical parametric mapping (SPM), which is increasingly used for continuum data (e.g., kinematic data) by the biomechanics research community to assess the overall difference on each plane through a complete GC [37]. Chi-square and ANOVA tests were used to assess between-cluster differences in the demographic characteristics of the participants (i.e., sex, age, BMI).

Results
The experiments were implemented in Python 3.7 (ww.python.org accessed on 26 October 2021) using standard scientific and visualization packages: sklearn 0.24.2, matplotlib 0.99. The simulations were performed under the open source web application Jupyter Notebook.
Following 3D kinematic data standardization, we performed global and local data representations to generate, for each participant, input vectors of dimensions 300 and 69, respectively.
The optimum number of clusters was determined using the CH-index. Figure 3a,b plots the CH-index as a function of the number of clusters for the global and local representations, respectively. For both representations, the highest value of CH-indices is obtained with k = 3 clusters, indicating that the optimum number of clusters is three. For the rest of the paper, we will only consider three clusters.  The identified phenotypes are described in Figures 4 and 5 based on global and local kinematic data representations, respectively (Line 1 of each figure). The three phenotypes are obtained by averaging the 3D kinematic curves within each homogeneous group identified by BIRCH clustering. In the following, the words 'phenotypes' and 'clusters' could be interchanged, but in fact, the phenotype is determined by averaging the curves identified in each cluster. The second line figures are related to 1D-SPM statistical analysis, which shows a statistical difference of the identified phenotypes in the three planes except for the flexion-extension when using a local representation (a statistical difference for 80% of the gait cycle). These results suggest a higher efficiency of the global representation, which will be retained for the following analysis.  We validated the obtained clusters using the intra-cluster correlation coefficient (ICC). Table 2 summarizes the ICC for the global representation. These values range from 0.27 to 0.76, showing a correlation between the different knee kinematic curves within the clusters. The differences on the participants' demographic characteristics between the clusters obtained with the global approach are shown in Table 3. There were significantly more men in Cluster 1 (60% of men and 40% of women) compared to Clusters 2 (30% of men and 70% of women) and 3 (42% of men and 58% of women, p < 0.01 and p = 0.03, respectively). Based on an ANOVA test, participants from Cluster 2 had a significantly higher BMI than participants from Cluster 1 (25.61 vs. 23.68 kg/m 2 , p < 0.01). There was no differences in terms of age between the three clusters (p = 0.60). Table 3. Between-cluster differences on the demographic characteristics of the participants (units ± SD).

Characteristics
All

Discussion
The SPM analysis showed that the global approach was more efficient than the local approach to identify distinct clusters ( Figure 4). Indeed, there was an overall statistical difference in all three planes (throughout 100% of the gait cycle). The statistical difference between clusters was limited to 80% of the gait cycle in the frontal plane based on the local representation method ( Figure 5). Notably, the optimum number of clusters determined by the CH-index was similar using both approaches. It is interesting to highlight that the mean ICC in the sagittal plane was lower than the two other planes and that this was true for the majority of the clusters. This result was predictable, as the bundle of curves is narrower for the sagittal plane (Figure 2b) than for the two other planes (Figure 2a,c). This makes the partition of the data into three groups more difficult and hence the obtained phenotypes more similar.
The global approach further led to a separation of the 3D kinematic data into homogeneous groups and meaningful phenotypes. Based on the ICC values, the kinematical similarity between members of the same cluster was maximized in the frontal plane (i.e., varus-valgus alignment).
Although the local approach gathers a representative list of parameters of interest, it performs less well than the global approach. Indeed, the latter keeps all the kinematic curves and thus naturally includes the totality of the information. The use of the global representation that assumes high-dimensional data vectors (a vector of 300 dimensions corresponding to the concatenation of the 3 planes of kinematic data) is made possible by the utilization of the BIRCH algorithm. It is especially suitable for clustering very large datasets [35] and the first clustering algorithm to be proposed that handles noise effectively [38]. These strengths made it very suitable for kinematic data analysis as it can handle high dimensional data compared to other clustering algorithms. This is due to the fact that when dimensionality increases, only a small number of dimensions are relevant for certain clusters. Thus, data in irrelevant dimensions may act as noise and mask the formation of the real clusters, which can be avoided with the BIRCH algorithm.
The three clusters allow the occurrence of three specific phenotypes of knee kinematics. The first phenotype (Cluster 1, c.f. blue curve of Figure 4) describes participants walking with a predominant dynamic varus alignment (i.e., adduction) through the entire GC, starting with a mean varus angle at heel strike greater than 5°. In the transversal plane, these participants accentuate their internal tibia rotation during the first half of their swing phase. Participants from Cluster 2 (c.f. orange curve) show a predominant dynamic valgus alignment, especially during stance and swing phases of their GC, and an external tibia rotation throughout their GC. Finally, participants from Cluster 3 (c.f. green curve) had the most neutral alignment in the frontal and transverse plane. They also show greater flexion contracture (i.e., flexum) at heel strike and throughout the stance phase.
The cluster with a majority of men (i.e., Cluster 1) showed greater varus alignment compared to the other clusters. Alternatively, Cluster 2 was composed mainly of women (more than 2 out of 3 participants) and showed a greater dynamic valgus alignment. These results are similar to what was shown in a previous study in which men were typically more varus aligned and women more neutral or valgus aligned on radiographs [39].
Results from this study are an important step in better understanding 3D knee kinematics in a healthy asymptomatic population. When investigating the impact of an injury or developing treatment plans to restore knee function, it is crucial to first understand the "normal" state. Phenotyping has shown to be valuable in precision medicine. Indeed, it can be used to define groups of patients that are more prone to develop chronic conditions or that may respond better to specific treatments. Therefore, the method proposed in this paper could be applied to different populations to help in the assessment of knee conditions and to monitor and personalize treatment pathways.

Conclusions
A global approach considering kinematic data from all planes of movement together allowed a more efficient identification of three distinct phenotypes than a local approach in a healthy knee population. The three identified 3D kinematic phenotypes may allow us to better understand the "normal" knee movement during gait and refine the target for the treatment of mechanical knee disorders. Further research is needed to better characterize these phenotypes and quantify differences in additional participants' characteristics.  Institutional Review Board Statement: Ethical approvals were obtained from the ethics committee of the Hopital Maisonneuve-Rosemont (reference: 11098).

Informed Consent Statement:
All subjects provided informed consent before participation.

Data Availability Statement:
The data presented in this study should be available on request from the corresponding author. The data are not publicly available due to ethical restrictions.