Fusion Models for Generalized Classification of Multi-Axial Human Movement: Validation in Sport Performance

We introduce a set of input models for fusing information from ensembles of wearable sensors supporting human performance and telemedicine. Veracity is demonstrated in action classification related to sport, specifically strikes in boxing and taekwondo. Four input models, formulated to be compatible with a broad range of classifiers, are introduced, and two diverse classifiers, dynamic time warping (DTW) and convolutional neural networks (CNNs), are implemented in conjunction with the input models. Seven classification models fusing information at the input level, the output level, and a combination of both are formulated. Action classification for 18 boxing punches and 24 taekwondo kicks demonstrates that our fusion classifiers outperform the best DTW and CNN uni-axial classifiers. Furthermore, although DTW is ostensibly an ideal choice for human movements experiencing non-linear variations, our results demonstrate that deep learning fusion classifiers outperform DTW. This is a novel finding given that CNNs are normally designed for multi-dimensional data and do not specifically compensate for non-linear variations within signal classes. The generalized formulation enables subject-specific movement classification in a feature-blind fashion with trivial computational expense for trained CNNs. A commercial boxing system, ‘Corner’, has been produced for real-world mass-market use based on this investigation, providing a basis for future telemedicine translation.


Introduction
Mechatronic systems recognizing human activity are now fundamental components in biophysical analysis, with strong impact in fields such as physiotherapy, telemedicine, smart homes, rehabilitation, human-robot interface, and athletics (e.g., [1][2][3][4][5][6][7][8]). A wide range of activity-aware systems including smartphone apps (e.g., Galaxy Moves App, iPhone Moves App, iPhone Health Mate App, iPhone Fitbit App), athletic wearables (e.g., Nike Fuelband, Jawbone UP24, Fitbit Flex, Fitbit One, Fitbit Zip, Digi-Walker SW-200) and fall detection devices (e.g., Philips Lifeline, Lively Mobile, Sense4Care, Angel4) are commercially available today. Despite this range, most wearables remain limited to simple metrics such as step count, heart rate, and calories expended [9]. Though initial sales are promising, a staggering one third of users abandon wearable devices [10], pointing to clear challenges in transience and sustainability.
[...] to non-linear signal variations and the exploitation of coupling between uni-axial signals have not been specifically addressed in the detailed manner described in this study.

Multisensor Fusion Validation Application: Combat Sport
We have chosen to test and validate our fusion classifier models in combat sport given the diversity of arm and leg movements, the fact that the movements are representative of those necessitating multi-axial recognition, and the capacity to collect and test large sets of meaningful data. The use of IMUs in combat sports has grown in recent years (reviewed in [6]), though existing approaches tend to be focused on metrics or specific signal features. In general, classification systems that exploit information from ensembles of multi-axial sensors are capable of significantly improving performance over uni-axial classifiers [40][41][42][43][44][45]. However, multi-sensor classifiers tend to be more complex because of the need to incorporate fusion methods to combine "information" from the multiple sensors. The fusion methods can be divided into "input-level fusion" and "output-level fusion". For input-level fusion, also called early fusion, the information can be input data or features extracted from the data. The information in output-level fusion, also called late fusion, is typically the decisions of the uni-axial classifiers or some measure at the outputs of the uni-axial classifiers.
The fusion models developed in this study for classifying combat sport movement are formulated generically and then validated by classifying 24 classes of kicking movements in taekwondo and 18 classes of punching movements in boxing. To our knowledge this is the first set of generalized non-feature specific models demonstrated on such a large number of classes in either activity [6].

Investigation Goals
Our first goal is to introduce data input models which: (a) facilitate fusion of information at the input and output levels and (b) are generalizable for use in conjunction with a broad range of diverse classifiers. The second goal is to design DTW and CNN classifiers for human movement identification using these input models. The third goal is to design experiments to classify boxing and taekwondo strikes. The final goal is to compare the DTW and CNN-based classification systems with respect to accuracy, complexity, flexibility, and the potential to obtain further improvements in performance. We offer these findings as a basis for translation of wearables for a range of human performance and healthcare applications.

Organization of Paper
Section 3 describes the four movement classifier input models. Sections 4 and 5 describe the DTW and CNN models that are used in conjunction with the input models. Section 6 describes data collection and the strike movements for the validation studies in combat sports. Section 7 outlines classification results for 24-class kicking and 18-class punching movements. Section 8 briefly describes translation to a commercial product as evidence of novelty and impact while Section 9 summarizes conclusions from the investigation.

Classifier Input Models
We propose four input models, which differ in the way the multi-axial sensor signals are presented as inputs to the subsequent classification stages. The four classifier input arrangements are summarized in Figures 1-4. The models can be contrasted by noting the level of fusion incorporated in each. In the formulations of the classification models, a movement is represented by $I$ and it is assumed that the movement belongs to one of $H$ movement classes, $\omega_h$, $h = 1, 2, \ldots, H$. The models are assumed to have $G$ multi-axial sensors represented by $S_g$, $g = 1, 2, \ldots, G$, and an output of sensor $S_g$ is represented by $s_{gm}$, $m = 1, 2, \ldots, m_g$, where $m_g$ is the number of multi-axial outputs of sensor $S_g$. The term "non-linear variations" will be used to encompass latency shifts (shifts in peak positions) and expansions/compressions in signal segments.

Vector Input (VI) Model
The VI model, shown in Figure 1, is straightforward because it does not involve any form of input-level fusion. The figure simply shows the labeling of the sensors and the sensor outputs. This model is suitable for systems that classify each uniaxial signal $s_{gm}$ independently. The number of independent classifiers in such a system, therefore, is $M_G = \sum_{g=1}^{G} m_g$. Systems using this input model need fusion at the output level to determine the class of the movement signal. Of the four input models, the VI model is the most versatile because the sensors can be heterogeneous and can have a different number of axes. Furthermore, the sensor outputs can have different durations and do not have to be synchronized with respect to non-linear variations. However, the resulting classifiers are the most complex because they require a classifier for each multi-axial signal and output-level fusion to combine the information from the $M_G$ classifier outputs in order to determine the input class.

Local Matrix Input (LMI) Model
The LMI model is designed for systems that classify the uniaxial outputs of each sensor separately by fusing the signals of each sensor into a matrix as shown in Figure 2. That is, the outputs of each multi-axial sensor $S_g$ are fused into a local intra-sensor matrix

$$Z_g(m, n), \quad m = 1, 2, \ldots, m_g; \; n = 1, 2, \ldots, n_g \qquad (1)$$

where $n_g$ is the duration of the outputs of sensor $S_g$ (assumed equal in each sensor). The number of matrices is equal to the number of sensors $G$. The intra-sensor matrix to classify the signals of sensor $S_g$ can be written as

$$Z_g(m, n) = \nabla_{m=1}^{m_g} s_{gm}, \quad g = 1, 2, \ldots, G$$

where the fusion operation is represented by $\nabla$. Each matrix can be classified independently, and some form of output-level fusion can be applied to determine the class of the movement signal. The resulting classification system, therefore, is a hybrid system which includes both input-level and output-level fusion. The LMI input model is more restrictive than the previous model because the multi-axial sensor outputs must have the same durations within each sensor (not across all sensors) in order to fuse them into a matrix. Moreover, the multi-axial sensor outputs are assumed to experience synchronized non-linear variations within each sensor. The advantage of the LMI model is that the number of classifiers is reduced to $G$, compared with the $M_G$ classifiers needed in the previous VI model.

Global Matrix Input (GMI) Model
The third model, involving only input-level fusion in the classifiers, is designed to classify the uniaxial sensor signals of all sensors by fusing the signals into a global inter-sensor matrix, shown in Figure 3. The inter-sensor matrix is formed by fusing all multi-axial outputs into a matrix $Z(m, n)$, $m = 1, 2, \ldots, M_G$; $n = 1, 2, \ldots, N$, where $N$ is the duration of each sensor output (assumed equal). That is, each row of $Z(m, n)$ is an output of a multi-axial sensor. The global input matrix is, therefore, given by

$$Z(m, n) = \nabla_{g=1}^{G} \nabla_{m=1}^{m_g} s_{gm}$$

This fusion operation is equivalent to fusing the LMI matrices into a matrix; therefore, the global matrix can also be written as

$$Z(m, n) = \nabla_{g=1}^{G} Z_g(m, n)$$

Unlike the two previous models, classifiers using this input model do not require output-level fusion because only a single classifier is needed to classify the global matrix. However, it is important to note that the resulting classifier is more restrictive than the two previous models because the following assumptions are made:
1. The multi-axial outputs have the same durations within and across all sensors in order to fuse them into a global matrix.
2. The multi-axial sensor outputs experience synchronized non-linear variations within and across all sensors.

Global Cuboid Input (GCI) Model
If the number of uniaxial outputs and the output durations of all $G$ sensors are assumed equal, the LMI matrices can also be fused into a cuboid which can be represented by

$$Z(m, n, g) = \Delta_{g=1}^{G} Z_g(m, n), \quad m = 1, 2, \ldots, M; \; n = 1, 2, \ldots, N; \; g = 1, 2, \ldots, G$$

where $\Delta$ is the cuboid fusion operation, $M$ is the number of uniaxial outputs of each sensor, and $N$ is the duration of each uniaxial signal. This input model, shown in Figure 4, is the most restrictive because it requires an additional condition to be met, viz., the sensors must have an equal number of uniaxial outputs.
In summary, there is a trade-off between classifier complexity and flexibility using the four input models. A suitable way to overcome the equal duration restriction is to duration-normalize the signals to a common length through linear expansion or compression. For example, the common length can be chosen to be the average duration of all sensor outputs for the GMI model and the average duration of the outputs of each sensor for the LMI model. However, there is no simple way to overcome the non-linearity restriction, and the performance of the GMI and LMI models can be expected to drop if this assumption is violated. The GCI model cannot be implemented if the number of outputs across all sensors is not equal.
Another issue to take into account is the similarity of the sensors. If the sensors are heterogeneous, the heterogeneous signal amplitudes have to be normalized, for example, using min-max normalization. Even if the sensors are homogeneous, normalization within each sensor is also required to account for the varying ranges of the uni-axial signal amplitudes.
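The four input arrangements and the two normalization steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names (`resample_linear`, `min_max`, `vi_input`, `lmi_input`, `gmi_input`, `gci_input`) are hypothetical, each sensor is assumed to be given as a list of 1-D arrays (one per uni-axial output), min-max scaling is applied per uni-axial signal, and the common length is taken as the rounded average duration.

```python
import numpy as np

def resample_linear(signal, length):
    """Linearly expand or compress a 1-D signal to `length` samples."""
    signal = np.asarray(signal, dtype=float)
    old = np.linspace(0.0, 1.0, num=signal.size)
    new = np.linspace(0.0, 1.0, num=length)
    return np.interp(new, old, signal)

def min_max(signal):
    """Scale a 1-D signal to [0, 1]; a constant signal maps to zeros."""
    signal = np.asarray(signal, dtype=float)
    span = signal.max() - signal.min()
    if span == 0:
        return np.zeros_like(signal)
    return (signal - signal.min()) / span

def _average_length(signals):
    """Rounded average duration of a collection of 1-D signals."""
    return int(round(float(np.mean([len(s) for s in signals]))))

def vi_input(sensors):
    """VI model: flat list of the M_G uni-axial signals, no input fusion."""
    return [min_max(s) for sensor in sensors for s in sensor]

def lmi_input(sensors):
    """LMI model: one (m_g x n_g) matrix per sensor."""
    matrices = []
    for sensor in sensors:
        n_g = _average_length(sensor)  # common length within this sensor
        rows = [min_max(resample_linear(s, n_g)) for s in sensor]
        matrices.append(np.vstack(rows))
    return matrices

def gmi_input(sensors):
    """GMI model: a single (M_G x N) matrix over all sensors."""
    flat = [s for sensor in sensors for s in sensor]
    n = _average_length(flat)  # common length across all sensors
    return np.vstack([min_max(resample_linear(s, n)) for s in flat])

def gci_input(sensors):
    """GCI model: an (M x N x G) cuboid; needs equal axes per sensor."""
    if len({len(sensor) for sensor in sensors}) != 1:
        raise ValueError("GCI needs the same number of outputs per sensor")
    n = _average_length([s for sensor in sensors for s in sensor])
    mats = [np.vstack([min_max(resample_linear(s, n)) for s in sensor])
            for sensor in sensors]
    return np.stack(mats, axis=2)
```

For two tri-axial sensors, `lmi_input` returns two matrices with three rows each, `gmi_input` returns one matrix with six rows, and `gci_input` returns a cuboid of depth two. Note that linear resampling only addresses the equal-duration restriction; it does nothing for unsynchronized non-linear variations.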
The sections that follow present the formulations of the three DTW-based models and the four CNN-based models to classify multi-axial multiple-sensor movement signals using the four input models. It can be shown that for the DTW implementations, the GCI model is equivalent to the GMI model; therefore, the GCI model is not implemented. A DTW classifier is explicitly split into two operations: discrepancy computation and a decision rule. The discrepancy computation operation determines the dissimilarity score between the aligned test and reference signals of movements, and the decision rule uses these discrepancy scores to assign the test movement to one of the movement classes.

Dynamic Time Warping (DTW) Classifier
DTW has been applied in numerous applications to measure the dissimilarity between pairs of sequences that experience non-linear variations in their segments. Applications employing DTW include speech recognition [19,23], shape recognition [15,16,25], clustering [17,24], gene expression [26], financial time series matching [29], and classifying human actions in sports [27]. To facilitate the understanding of the formulations of the three DTW-based fusion models, a brief description of the one-, two-, and three-dimensional DTW algorithms follows.

Dynamic Time Warping (DTW) Algorithms
Given a pair of signals $X$ and $Y$ and a local cost function $d[w(k)]$ to reflect the discrepancy between the elements of $X$ and $Y$, the goal of DTW is to determine an alignment function $W = \{w(1), w(2), \ldots, w(K)\}$, where $w(k) = (i(k), j(k))$ pairs element $i(k)$ of $X$ with element $j(k)$ of $Y$, such that the overall normalized cost $A_{XY}$ is minimized subject to a set of end-point, monotonicity, and continuity constraints. $A_{XY}$ is a measure of the discrepancy between signals $X$ and $Y$ after optimal alignment. The most often used cost functions include the Euclidean, Manhattan, and Euclidean-squared distance metrics. Dynamic programming is used to solve the optimization problem.
The steps to align one-, two-, and three-dimensional signals are quite similar except for the computation of the local cost function. For example, if the Euclidean distance is used, the cost function for the one-dimensional (vector) DTW algorithm is

$$d[w(k)] = |X(i(k)) - Y(j(k))|$$

For the two-dimensional (matrix) case, the cost function is given by

$$d[w(k)] = \|X(:, i(k)) - Y(:, j(k))\| \qquad (7)$$

where the notation $Z(:, t)$ is used to denote column $t$ of a matrix $Z$. Note that the number of rows in the two matrices must be equal but the number of columns can be different. Similarly, the cost function for the three-dimensional (cuboid) extension is given by

$$d[w(k)] = \|X(:, i(k), :) - Y(:, j(k), :)\|$$

where the notation $Z(:, t, :)$ is used to denote depth-frame $t$ of a cuboid $Z$. For this case, the number of rows and the depth of the two cuboids must be equal but the number of columns can be different. Also note that cuboid alignment can be implemented as matrix alignment by fusing the frames into an augmented matrix, because the resulting column-to-column cost function is equal to the frame-to-frame cost function. However, matrix alignment cannot be implemented as cuboid alignment if the numbers of rows in the frames are unequal.
In this study, which involves the classification of sensor signals arranged as vectors, matrices, and cuboids, the corresponding DTW classifiers will be referred to as V-DTW, M-DTW, and C-DTW, respectively.
In order to design a DTW classifier for a given problem, a reference template for each pattern is typically estimated from the signals in their respective training sets. The sample mean vector is often used because it best represents the signals in the training set in the sense of minimizing the sum of squared distances from itself to the signals in the training set. However, this does not necessarily imply that the sample mean is the best template choice for a particular problem.
Modified averaging procedures which take non-linear variations into account have been proposed to generate templates that can be used in DTW algorithms [18]. In fact, other measures of central tendency (C-T) such as the median, Winsorized mean, trimmed mean, and tri-mean can also be used. What is important to note is that a better C-T estimate does not necessarily result in a better template for classification problems. Therefore, attempting to predict which C-T estimate will yield the best template for a given problem is not easy and the selection of a template is usually determined through trial-and-error.
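The vector and matrix alignment procedures described above can be illustrated with a short dynamic-programming sketch. It is an illustration under assumptions rather than the algorithm used in this study: the function names are hypothetical, the local cost is the Euclidean distance between aligned columns, the step pattern allows diagonal, vertical, and horizontal moves, and the accumulated cost is first minimized and then divided by the length $K$ of the resulting warping path (a common simplification, since the normalization used here is not specified). The helper `cuboid_to_matrix` shows the augmented-matrix view of a cuboid, so the same routine also covers the three-dimensional case.

```python
import numpy as np

def dtw_discrepancy(x, y):
    """DTW discrepancy between two signals after alignment.

    x, y: 1-D arrays (vector case) or 2-D arrays of shape
    (rows, duration) (matrix case). Row counts must match;
    durations may differ.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    y = np.atleast_2d(np.asarray(y, dtype=float))
    if x.shape[0] != y.shape[0]:
        raise ValueError("signals must have the same number of rows")
    n, m = x.shape[1], y.shape[1]
    # Local cost d[i, j] = ||x(:, i) - y(:, j)|| for every column pair.
    d = np.linalg.norm(x[:, :, None] - y[:, None, :], axis=0)
    cost = np.full((n + 1, m + 1), np.inf)    # accumulated cost
    steps = np.zeros((n + 1, m + 1), dtype=int)  # path length so far
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (cost[i - 1, j - 1], steps[i - 1, j - 1]),  # diagonal
                (cost[i - 1, j], steps[i - 1, j]),          # vertical
                (cost[i, j - 1], steps[i, j - 1]),          # horizontal
            ]
            best_cost, best_steps = min(candidates)
            cost[i, j] = d[i - 1, j - 1] + best_cost
            steps[i, j] = best_steps + 1
    # Assumption: normalize by the warping path length K.
    return cost[n, m] / steps[n, m]

def cuboid_to_matrix(cuboid):
    """Fuse an (M, N, G) cuboid into an (M * G, N) augmented matrix."""
    cuboid = np.asarray(cuboid, dtype=float)
    return cuboid.transpose(0, 2, 1).reshape(-1, cuboid.shape[1])
```

A signal and a time-stretched copy of it align with zero discrepancy, which is the behavior that makes DTW attractive for movements with latency shifts and local expansions or compressions.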

DTW Implementation of the VI Model (DTW-1)
The DTW-based classification model which uses the VI model is illustrated in Figure 5. The model, referred to as DTW-1, consists of one independent V-DTW classifier for each multi-axial sensor output. Therefore, the number of V-DTW classifiers is $M_G$. The discrepancy scores of the $M_G$ classifiers are fused through averaging in order to determine the class of the impact signal.
Sensors 2021, 21, x FOR PEER REVIEW

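Discrepancy computation: The output of the V-DTW operator for the uni-axial signal $s_{gm}$ is the discrepancy score vector $A_{gm} = (A_{gm}^{1}, A_{gm}^{2}, \ldots, A_{gm}^{H})$, where $A_{gm}^{h}$ is the discrepancy between a test sequence $s_{gm}^{T}$ and the reference sequence $s_{gm}^{h}$.
Output Fusion Rule: The discrepancy scores of the $M_G$ V-DTW operators are averaged and the resulting averaged discrepancy fusion vector is given by

$$A = (A^{1}, A^{2}, \ldots, A^{H})$$

where

$$A^{h} = \frac{1}{M_G} \sum_{g=1}^{G} \sum_{m=1}^{m_g} A_{gm}^{h}$$

Decision Rule: The test movement $I^{T}$ is assigned to the class $\omega^{*}$ that yields the least discrepancy using the following rule:

$$\omega^{*} = \arg\min_{h} \left[ A^{h} \right], \quad h = 1, 2, \ldots, H \qquad (10)$$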
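The output fusion and decision rules amount to a column-wise average followed by an arg-min. The sketch below uses a hypothetical helper, `fuse_and_decide`, and made-up discrepancy scores for three classifiers and three classes; it is only meant to make the two rules concrete. The same helper applies whether the rows come from $M_G$ vector classifiers or from $G$ matrix classifiers.

```python
import numpy as np

def fuse_and_decide(scores):
    """Average discrepancy vectors and pick the least-discrepancy class.

    scores: array-like of shape (n_classifiers, H), one discrepancy
    score vector per classifier. Returns (class_index, fused_vector),
    where class_index is zero-based.
    """
    scores = np.asarray(scores, dtype=float)
    fused = scores.mean(axis=0)           # output fusion by averaging
    return int(np.argmin(fused)), fused   # decision rule: arg min over h

# Illustrative (made-up) scores: rows are classifiers, columns are classes.
scores = [
    [0.2, 0.5, 0.9],
    [0.6, 0.3, 0.8],  # this classifier alone would pick the second class
    [0.1, 0.4, 0.7],
]
label, fused = fuse_and_decide(scores)
print(label, fused)  # label is 0; fused is approximately [0.3, 0.4, 0.8]
```

In this example the second classifier on its own would select the second class, but averaging across classifiers assigns the movement to the first class.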

DTW Implementation of the LMI Model (DTW-2)
The use of M-DTW in conjunction with the LMI model is illustrated in Figure 6. In this hybrid input-level and output-level fusion approach, each intra-sensor matrix is classified independently using M-DTW and the class of the movement is determined by averaging the discrepancy scores of each classifier.
Local Discrepancy Computation: The system has one M-DTW classifier for the outputs of each multi-axial sensor. For the M-DTW classifier for sensor $S_g$, let $Z_{g,T}(m, n)$ and $Z_{g,h}(m, n)$ be the local input matrices of a test movement and a reference movement of class $h$, respectively, and let the output of the M-DTW operator be the discrepancy vector $A_g = (A_g^{1}, A_g^{2}, \ldots, A_g^{H})$. The element $A_g^{h}$ is the discrepancy score between $Z_{g,T}(m, n)$ and $Z_{g,h}(m, n)$.
Output Fusion Rule: For this case, the outputs (discrepancy scores) of the $G$ M-DTW operators are fused using an averaging operation. The averaged discrepancy fusion vector is given by

$$A = (A^{1}, A^{2}, \ldots, A^{H})$$

where

$$A^{h} = \frac{1}{G} \sum_{g=1}^{G} A_g^{h}$$

Decision Rule: The test movement $I^{T}$ is assigned to the class $\omega^{*}$ using the rule in Equation (10).

DTW Implementation of the GMI Model (DTW-3)
The DTW classifier that uses the GMI model is illustrated in Figure 7. In this input-level fusion approach, the system has one M-DTW classifier to classify the global inter-sensor matrix. The discrepancy scores between a test movement and reference movements are computed and the test movement is assigned to the class which yields the smallest score.
Output Fusion Rule: none required.
Decision Rule: The test movement $I^{T}$ is assigned to the class $\omega^{*}$ using the rule in Equation (10).

Convolutional Neural Network (CNN) Classifiers
CNNs are a class of deep learning networks that are capable of performing well in computer vision problems such as large-scale object classification and detection in images [46][47][48][49][50][51]. One of the most striking features of CNNs when compared with other traditional classifiers, including fully connected neural networks (FCNs), is that a minimal amount of preprocessing is required to generate the input to the network. For example, an image can be processed directly without having to convert it into a vector. Converting images to vectors results in a very long input vector, which can lead to the curse of dimensionality in traditional classifiers and a large network for FCNs, which in turn results in a large number of network parameters and overfitting problems. Though seldom discussed, converting an image into a vector also leads to a poor representation of the input image because it loses the relationship between a pixel and its vertical and diagonal neighbors, which is important for local feature detection. The most often used method to overcome the dimensionality-related problems is feature extraction. However, as noted in the introduction, selecting a set of features for a given problem is more an art than a science and features are typically selected through trial-and-error. CNNs overcome these problems by applying feature-extracting filters directly to the image and, most importantly, learning the filter weights through training rather than using prior knowledge to hand-engineer the weights. Moreover, the overfitting problem is reduced through parameter sharing, in which the same filter is used to determine each element in the feature map.
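The effect of parameter sharing on network size can be made concrete with a small count. The layer sizes below (a 64 x 64 single-channel input, 100 fully connected units, and 16 filters of size 3 x 3) are hypothetical values chosen only for illustration and are not the architectures used in this study; the helper names are likewise illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv_params(filter_h, filter_w, in_channels, n_filters):
    """Weights plus one bias per filter; shared over all positions."""
    return (filter_h * filter_w * in_channels + 1) * n_filters

# Hypothetical sizes: a 64 x 64 single-channel input.
flattened = 64 * 64
print(dense_params(flattened, 100))  # 409700 parameters
print(conv_params(3, 3, 1, 16))      # 160 parameters
```

The fully connected layer needs one weight per input element per unit, whereas the convolution layer reuses the same small filters at every position, so its parameter count does not depend on the input size.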
A typical CNN has an input layer, an output layer, and hidden layers consisting of convolution, pooling, and fully connected layers. The network architecture is defined by the number and arrangement of the convolution and pooling layers. Figure 8 is an illustration of a CNN with two convolution layers $C^{\{1\}}$ and $C^{\{2\}}$ followed by a pooling layer $P^{\{1\}}$ and an FCN with layers $F^{\{1\}}$, $F^{\{2\}}$, and $F^{\{3\}}$ (output layer). The input to the first fully connected layer is the flattened (concatenated) output from the pooling layer. In general, the dimension of a convolution layer depends on the number of convolution filters, the filter stride, and the type of convolution (valid or same). A pooling layer dimension depends on the size and stride of the pooling filters. For classification problems, the output layer is typically a softmax layer with one output for each pattern class. The network is trained using the gradient descent backpropagation algorithm.
The two notable operations performed in CNNs are convolution and pooling. Each convolution layer contains a set of filters whose spatial dimensions are much smaller than those of the image; the depth (number of channels), however, is usually the same as that of the input. A bias is added to the filtered outputs, which are then passed through a nonlinear activation such as the ReLu function to yield the feature maps. The feature maps are stacked into cuboids to form the output of the convolution layer, in which the number of channels is equal to the number of filters. If the convolution layer is followed by a pooling layer, the spatial dimension is reduced by subsampling blocks in each feature map in the convolution layer output. Max pooling, which replaces a block with the maximum value, is the most often used pooling operation. Pooling serves two purposes: it progressively reduces the spatial dimension, thus decreasing the overfitting problem through the reduction in the number of parameters, and it selects the most robust features.
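As an illustration of the activation and pooling steps, the following minimal sketch applies ReLu to a small feature map and then max pools it. The 4 × 4 values, the 2 × 2 block size, and the stride of 2 are hypothetical choices made only for this example, and the function names are not taken from any library.

```python
import numpy as np

def relu(x):
    """ReLu activation: negative entries are replaced with zero."""
    return np.maximum(0.0, x)

def max_pool_2d(fmap, size, stride):
    """Replace each (size x size) block of a 2-d feature map with its maximum."""
    height, width = fmap.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = fmap[i * stride:i * stride + size,
                         j * stride:j * stride + size]
            out[i, j] = block.max()
    return out

# Hypothetical filtered output (bias already added) of one convolution filter.
filtered = np.array([[ 1.0, -2.0,  3.0,  0.0],
                     [ 4.0,  5.0, -6.0,  7.0],
                     [-1.0,  0.0,  2.0,  2.0],
                     [ 3.0, -3.0,  1.0, -4.0]])

feature_map = relu(filtered)
pooled = max_pool_2d(feature_map, size=2, stride=2)
print(pooled)  # each spatial dimension is halved: 4 x 4 -> 2 x 2
```

Pooling has no trainable weights, so the reduction in parameters comes from the smaller input presented to the layers that follow.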
The actual operation that is performed in the convolution layer is correlation and not convolution. The term "convolution", therefore, is incorrectly used. However, if the input or the filter is folded (1-d case) or rotated (2-d and 3-d cases), the correlation and convolution operations are equivalent. Therefore, it is assumed that one of the inputs has been pre-folded or pre-rotated prior to the actual correlation operation performed in the convolution layer.
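The equivalence can be checked numerically for the 1-d case. The sketch below uses NumPy's `correlate` and `convolve` routines on a hypothetical signal and an asymmetric three-tap filter; the values are arbitrary and serve only to show that correlation with a filter equals convolution with the folded filter.

```python
import numpy as np

# Hypothetical uni-axial signal and an asymmetric filter (illustrative values).
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
w = np.array([1.0, 0.0, -1.0])

# Operation performed in a CNN "convolution" layer: the filter slides unfolded.
corr = np.correlate(x, w, mode="same")

# True convolution applied to the pre-folded (reversed) filter.
conv_folded = np.convolve(x, w[::-1], mode="same")

# True convolution with the filter left as is.
conv_plain = np.convolve(x, w, mode="same")

print(np.allclose(corr, conv_folded))  # correlation == convolution with folded filter
print(np.allclose(corr, conv_plain))   # differs when the filter is asymmetric
```

Because the filter weights are learned, the distinction has no practical effect on training: the network simply learns the folded version of whatever filter a true convolution would have required.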
The following sections describe four implementations of CNNs that use the vector, matrix, and cuboid input models. The models can be distinguished by the convolution operations in the first stage and the output-level fusion operation. To describe them in a common notation, the input and output of the first convolution layer are assumed to be generalized cuboids with dimensions $(H^{[0]} \times W^{[0]} \times D^{[0]})$ and $(H^{[1]} \times W^{[1]} \times D^{[1]})$, respectively. Using this notation, a d-dimensional vector and an (m × n) matrix are represented as (1 × d × 1) and (m × n × 1) generalized cuboids, respectively. If a pooling layer follows, the output of the pooling layer is assumed to have dimensions $(H^{[1,p]} \times W^{[1,p]} \times D^{[1,p]})$. The filters in the first convolution layer have dimensions $(f_h^{[1]} \times f_w^{[1]} \times D^{[0]})$ and the pooling filter has dimensions $(f_h^{[1,p]} \times f_w^{[1,p]} \times 1)$. The dimensions are related as follows:
$$H^{[1]} = \left\lfloor \frac{H^{[0]} + 2p - f_h^{[1]}}{s_c} \right\rfloor + 1, \quad W^{[1]} = \left\lfloor \frac{W^{[0]} + 2p - f_w^{[1]}}{s_c} \right\rfloor + 1, \quad D^{[1]} = K^{[1]}$$
$$H^{[1,p]} = \left\lfloor \frac{H^{[1]} - f_h^{[1,p]}}{s_p} \right\rfloor + 1, \quad W^{[1,p]} = \left\lfloor \frac{W^{[1]} - f_w^{[1,p]}}{s_p} \right\rfloor + 1, \quad D^{[1,p]} = D^{[1]}$$
where $p$, $s_c$, $s_p$, and $K^{[1]}$ represent the zero-padding amount, convolution stride, pooling stride, and the number of filters in the first stage, respectively. Zero-padding is employed in "same convolution" to keep the input and output dimensions equal. If p = 0, the output of the "valid convolution" operation has smaller dimensions than those of the input.
Just as in the development of the DTW classifiers in the previous section, the CNN classifiers are explicitly split into two operations: computation of the posterior class probabilities and a decision rule. The posterior class probabilities are computed by the CNN and the decision rule uses these probabilities to assign the test movement to one of the movement classes. Although a pooling layer may or may not follow a convolution layer, it will be assumed that a convolution layer is followed by a pooling layer for consistency in the formulations. It will also be assumed that the convolutions are "same." The output dimensions can be easily adjusted if the convolutions are "valid."
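The usual relationship between input size, filter size, zero-padding, and stride can be sketched in a few lines. The helper names below are hypothetical, and the signal length of 100 samples is an assumed value chosen only to make the arithmetic concrete; padding is given separately for height and width so that a unit-height vector is not padded vertically.

```python
def conv_output_dims(in_dims, filt_hw, num_filters, pad=(0, 0), stride=1):
    """Dimensions (H, W, D) of a convolution layer output.

    in_dims     : (H, W, D) of the input generalized cuboid
    filt_hw     : (f_h, f_w) spatial size of each filter (depth matches the input)
    num_filters : number of filters K, which becomes the output depth
    pad         : zero-padding applied to (height, width)
    """
    h, w, _depth = in_dims
    f_h, f_w = filt_hw
    p_h, p_w = pad
    out_h = (h + 2 * p_h - f_h) // stride + 1
    out_w = (w + 2 * p_w - f_w) // stride + 1
    return (out_h, out_w, num_filters)

def pool_output_dims(in_dims, pool_hw, stride):
    """Dimensions of a pooling layer output; pooling leaves the depth unchanged."""
    h, w, depth = in_dims
    g_h, g_w = pool_hw
    return ((h - g_h) // stride + 1, (w - g_w) // stride + 1, depth)

# A uni-axial signal of (assumed) length 100 as a (1 x 100 x 1) generalized cuboid.
signal = (1, 100, 1)

same = conv_output_dims(signal, (1, 3), num_filters=32, pad=(0, 1))
valid = conv_output_dims(signal, (1, 3), num_filters=32)
pooled = pool_output_dims(same, (1, 2), stride=2)

print(same)    # (1, 100, 32): "same" convolution keeps the width
print(valid)   # (1, 98, 32): "valid" convolution shrinks the width
print(pooled)  # (1, 50, 32)
```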

CNN Implementation of the VI Model (CNN-1)
The CNN-1 classification model which uses the VI model is illustrated in Figure 9.
The CNN-1 model is characterized by vector convolutions in the first layer to extract local intra-axial features and output-level fusion for combining the $M_g$ classifier outputs. Because the uni-axial classifiers are identical, the CNN classifier for one uni-axial signal $s_{gm}$ is described first, followed by the method for combining the outputs of the $M_g$ classifiers.
Posterior Probability Computation: In the first convolution layer, the input vector $s_{gm}$ with generalized cuboid dimensions $(1 \times n_{gm} \times 1)$ is convolved with $K^{[1]}$ filters, each with dimensions $(1 \times f_w^{[1]} \times 1)$. Because the convolution is assumed "same", the output $\hat{s}_{gm}^{[1,k]}$ of the kth filter has the same dimensions as the input $s_{gm}$. A bias $b_{gm}^{[1,k]}$ is added to the filtered output and the sum is passed through the nonlinear ReLu activation function, so that the activation of filter k in the first layer is given by
$$s_{gm}^{[1,k]} = \mathrm{ReLu}\left[\hat{s}_{gm}^{[1,k]} + b_{gm}^{[1,k]}\right]$$
where $\mathrm{ReLu}[\delta] = \mathrm{Max}[0, \delta]$. The output of the first convolution layer is the $K^{[1]}$ activations combined into a $(1 \times n_{gm} \times K^{[1]})$ unit height cuboid represented by $S_{gm}^{[1]}$. If pooling follows and the stride and size of the pooling filter are $r$ and $(1 \times \gamma \times 1)$, respectively, the output $S_{gm}^{[1,p]}$ of the pooling layer will have dimensions $(1 \times [(n_{gm} - \gamma)/r + 1] \times K^{[1]})$.
In the second convolution layer, if each filter has dimensions $(1 \times f_w^{[2]} \times K^{[1]})$, the output $\hat{s}_{gm}^{[2,k]}$ of the kth filter will have dimensions $(1 \times [(n_{gm} - \gamma)/r + 1] \times 1)$. Note that although the two functions convolved are unit height cuboids, the output is a vector. After adding a bias and passing each filtered output through the ReLu activation function, the $K^{[2]}$ activations are combined into a unit height cuboid. The width of the unit height cuboid is adjusted according to the stride if a pooling layer is added. If necessary, the convolution and pooling operations can be repeated. A flattening operation is employed to combine the rows of the last cuboid into a vector which forms the input to a fully connected feed forward neural network with $N_{gm}$ layers. Typically, the sigmoidal or tanh functions are used as activations in the intermediate hidden layers and the softmax activation is used in the output layer of the fully connected network (FCN). Cross-entropy is employed as the loss function. Because of the softmax activation function, the outputs can be regarded as estimates of posterior probabilities given by
$$p_{gm}(h) = \frac{e^{q_h}}{\sum_{j=1}^{H} e^{q_j}}, \quad h = 1, 2, \ldots, H$$
where $q_h$ is the weighted sum of the inputs into neuron h in the output layer. The output of the CNN classifier for signal $s_{gm}$ is represented by the vector
$$P_{gm} = \left(p_{gm}(1), p_{gm}(2), \ldots, p_{gm}(H)\right); \quad g = 1, 2, \ldots, G, \quad m = 1, 2, \ldots, m_g$$
Decision Rule: The H probabilities of the $M_g$ CNN classifiers are averaged into a probability fusion vector represented by
$$P = \left(P_{\omega_1}, P_{\omega_2}, \ldots, P_{\omega_H}\right)$$
where
$$P_{\omega_h} = \frac{1}{M_g} \sum_{g=1}^{G} \sum_{m=1}^{m_g} p_{gm}(h)$$
Using the maximum response rule, the CNN assigns the input movement to the class associated with the output that yields the largest value. That is, a test movement is assigned to class $\omega_h$ if $P_{\omega_h} > P_{\omega_j}$ for all $j \neq h$. Equivalently, the test movement is assigned to the class given by
$$\omega^{*} = \arg\max_{\omega_h} P_{\omega_h} \quad (25)$$
The CNN-1 model shares similarities with the Channel-Based Late Fusion models (CB-LF) described in [35,39] in the sense that there is one CNN per axis. The main difference is that the CB-LF model has one FCN and the input to the FCN is the concatenation of the features from the last convolution layer of each axis. The late fusion, therefore, is a form of inter-channel feature fusion. The CNN-1 model has one FCN for each axis and the late fusion is a form of decision fusion that occurs at the outputs of the CNNs.
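The decision fusion step of CNN-1 can be sketched as follows: each uni-axial classifier produces a softmax vector, the vectors are averaged, and the class with the largest averaged probability is selected. The three classifiers, four classes, and output-layer sums below are hypothetical values chosen for illustration; in the experiments there would be six classifiers (one per axis).

```python
import numpy as np

def softmax(q):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    q = np.asarray(q, dtype=float)
    e = np.exp(q - q.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_decide(posteriors):
    """Average the posterior vectors of all classifiers and apply the
    maximum response rule.

    posteriors : array of shape (number of classifiers, H)
    returns    : (fused probability vector of length H, winning class index)
    """
    fused = np.mean(posteriors, axis=0)
    return fused, int(np.argmax(fused))

# Hypothetical weighted sums q_h at the output layer of three uni-axial CNNs.
q = np.array([[2.0, 1.0, 0.1, 0.0],
              [0.5, 1.5, 0.2, 0.1],
              [1.8, 0.3, 0.2, 0.0]])

P = softmax(q)                     # one posterior vector per classifier
fused, label = fuse_and_decide(P)

print(np.argmax(P, axis=1))  # individual decisions of the three classifiers
print(label)                 # decision after output-level fusion
```

In this example the second classifier on its own prefers a different class from the other two; averaging the posteriors resolves the disagreement in favor of the class with the strongest combined support.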

CNN Implementation of the LMI Model (CNN-2)
The CNN-2 classification model which uses the LMI model is illustrated in Figure 10. It is characterized by one CNN classifier for each sensor, matrix convolutions in the input layer to extract local intra-sensor features, and output-level fusion for combining the outputs of the G CNN classifiers.
Posterior Probability Computation: In the first layer, the sensor matrix $Z_g(m, n)$ with generalized dimensions $(m_g \times n_g \times 1)$ is convolved with $K^{[1]}$ filters, each with dimensions $(f_h^{[1]} \times f_w^{[1]} \times 1)$. The output $\hat{Z}_g^{[1,k]}(m, n)$ of the kth filter is a matrix with the same dimensions as the input. A bias $b_g^{[1,k]}$ is added to the filtered output and the sum is passed through the nonlinear ReLu activation function. The activation of the filter, therefore, is given by
$$Z_g^{[1,k]}(m, n) = \mathrm{ReLu}\left[\hat{Z}_g^{[1,k]}(m, n) + b_g^{[1,k]}\right]$$
The $K^{[1]}$ filtered outputs are combined into an $(m_g \times n_g \times K^{[1]})$ cuboid $Z_g^{[1]}(m, n, k)$. If pooling follows and the stride and size of the pooling filter are $r$ and $(\gamma \times \gamma \times 1)$, respectively, the output is the cuboid $Z_g^{[1,p]}(m, n, k)$, $m = 1, 2, \ldots, m_g^{[1,p]}$, $n = 1, 2, \ldots, n_g^{[1,p]}$, $k = 1, 2, \ldots, K^{[1]}$, where
$$m_g^{[1,p]} = \frac{m_g - \gamma}{r} + 1, \quad n_g^{[1,p]} = \frac{n_g - \gamma}{r} + 1$$
In the next convolution stage, the cuboid is convolved with $K^{[2]}$ cuboid filters with dimensions $(f_h^{[2]} \times f_w^{[2]} \times K^{[1]})$. Each filtered output $\hat{Z}_g^{[2,k]}(m, n)$ resulting from the cuboid convolution is a matrix. The series of convolution and pooling operations terminates in an FCN with a softmax output layer. If $p_g(h)$ is the output of neuron h in the output layer, then the output of the classifier for matrix $Z_g(m, n)$ can be represented by the vector
$$P_g = \left(p_g(1), p_g(2), \ldots, p_g(H)\right); \quad g = 1, 2, \ldots, G$$
Decision Rule: The outputs of the G classifiers can be averaged and represented by the vector
$$P = \left(P_{\omega_1}, P_{\omega_2}, \ldots, P_{\omega_H}\right) \quad (28)$$
where
$$P_{\omega_h} = \frac{1}{G} \sum_{g=1}^{G} p_g(h)$$
A test movement is then assigned to class $\omega_h$ using the rule in Equation (25). The CNN-2 model is somewhat similar to the Sensor-Based Late Fusion models (SB-LF) described in [34,39] in the sense that there is one CNN per sensor. The main difference is that the SB-LF model has one FCN and the input to the FCN is the late fusion of the features from the last convolution layer of each sensor. The CNN-2 model has one FCN for each sensor and the late fusion is a form of decision fusion that occurs at the outputs of the CNNs.

CNN Implementation of the GMI Model (CNN-3)
The CNN-3 classification model using the GMI model, shown in Figure 11, is characterized by matrix convolutions in the first layer to extract local intra-sensor features and no output-level fusion. A small number of inter-sensor features are also extracted from the bordering uni-axial outputs from adjacent sensors in the input matrix. The input is the global matrix Z(m, n).
The output of the kth filter yields a matrix $\hat{Z}^{[1,k]}(m, n)$. A bias $b^{[1,k]}$ is added to the filtered output and the sum is passed through the nonlinear ReLu activation function. The activation of the filter, therefore, is given by
$$Z^{[1,k]}(m, n) = \mathrm{ReLu}\left[\hat{Z}^{[1,k]}(m, n) + b^{[1,k]}\right]$$
The $K^{[1]}$ filtered outputs are combined into an $(M_g \times N \times K^{[1]})$ cuboid $Z^{[1]}(m, n, k)$ which is pooled to give the cuboid $Z^{[1,p]}(m, n, k)$. The cuboid pooling operation is not described because it is similar to the one used in the previous model. The pooled cuboid is filtered by $K^{[2]}$ cuboid filters and the output of the kth filter is a matrix represented by
$$\hat{Z}^{[2,k]}(m, n), \quad m = 1, 2, \ldots, M^{[1]}, \quad n = 1, 2, \ldots, N^{[1]} \quad (31)$$
where $M^{[1]}$ and $N^{[1]}$ are the height and width of the pooled output $Z^{[1,p]}(m, n, k)$. The series of convolution and pooling operations terminates in an FCN with a softmax output layer which gives an estimate of the H movement probabilities. The softmax output is represented by the vector $P = (P_{\omega_1}, P_{\omega_2}, \ldots, P_{\omega_H})$.
Decision Rule: a test movement is assigned to class ω h using the rule in Equation (25). The CNN-3 model is similar to the Early Fusion (EF) model described in [37,39]. The difference is mainly in the selection of the dimensions of the filters in the convolution layers.

CNN Implementation of the CI Model (CNN-4)
The CNN-4 classification model, shown in Figure 12, is implemented using the cuboid representation which is obtained by fusing the LMI local matrices into a cuboid. Cuboid convolutions in the first layer extract coupled intra-sensor and inter-sensor features throughout the input.
Posterior Probability Computation: The cuboid input $Z(m, n, g)$ is convolved with $K^{[1]}$ cuboid filters, each with dimensions $(f_h^{[1]} \times f_w^{[1]} \times G)$, and the output of the kth filter is represented by
$$\hat{Z}^{[1,k]}(m, n), \quad m = 1, 2, \ldots, M, \quad n = 1, 2, \ldots, N$$
Note that convolving two cuboids with the same depth results in a matrix. The $K^{[1]}$ filtered outputs are combined into an $(M \times N \times K^{[1]})$ cuboid after the biases are added and the sums are passed through the ReLu activation function. The height and width of the cuboid are adjusted if a pooling layer follows the convolution layer. Subsequent convolutions are also cuboid convolutions which result in matrices which are then combined into cuboids. An FCN with softmax outputs is implemented after the last pooling layer. The softmax output is represented by the vector
$$P = \left(P_{\omega_1}, P_{\omega_2}, \ldots, P_{\omega_H}\right) \quad (34)$$
Decision Rule: a test movement is assigned to class $\omega_h$ using the rule in Equation (25).
The CNN-4 model is unique because, to the best of our knowledge, there are no similar models which combine the uniaxial signals of each sensor into matrices, combine the matrices into a cuboid, and extract a combination of intra-sensor and inter-sensor features.
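The cuboid construction and the first-layer cuboid filtering can be illustrated with a small NumPy sketch. It assumes two three-axis sensors and a segment length of 50 samples (an arbitrary value), uses random numbers in place of real strike data, and applies randomly initialized filters; the function names are hypothetical. As in a CNN layer, the filter is slid without rotation.

```python
import numpy as np

def stack_local_matrices(local_matrices):
    """Fuse G local (m x n) sensor matrices into an (m x n x G) cuboid."""
    return np.stack(local_matrices, axis=-1)

def cuboid_filter_same(cuboid, filt, bias=0.0):
    """Apply one cuboid filter with zero-padded "same" output, then ReLu.

    The filter depth equals the cuboid depth, so every output element is a
    sum over height, width, and depth, and the result is a matrix.
    """
    M, N, G = cuboid.shape
    f_h, f_w, f_d = filt.shape
    assert f_d == G, "filter depth must equal cuboid depth"
    assert f_h % 2 == 1 and f_w % 2 == 1, "odd filter sizes assumed"
    p_h, p_w = f_h // 2, f_w // 2
    padded = np.pad(cuboid, ((p_h, p_h), (p_w, p_w), (0, 0)))  # zero-padding
    out = np.empty((M, N))
    for m in range(M):
        for n in range(N):
            out[m, n] = np.sum(padded[m:m + f_h, n:n + f_w, :] * filt)
    return np.maximum(0.0, out + bias)

rng = np.random.default_rng(0)
accel = rng.standard_normal((3, 50))   # stand-in for the accelerometer matrix
gyro = rng.standard_normal((3, 50))    # stand-in for the gyroscope matrix

Z = stack_local_matrices([accel, gyro])            # (3 x 50 x 2) cuboid
filters = rng.standard_normal((4, 3, 3, 2))        # four (3 x 3 x 2) filters
activations = np.stack(
    [cuboid_filter_same(Z, f, bias=0.1) for f in filters], axis=-1)

print(Z.shape)            # (3, 50, 2)
print(activations.shape)  # (3, 50, 4): one matrix per filter, stacked in depth
```

Because each filter spans both sensors in depth, every activation mixes accelerometer and gyroscope samples from the same axes and time window, which is the coupled intra-sensor and inter-sensor feature extraction described above.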

Experimental Data Collection
Motion capture for both taekwondo and boxing was conducted using custom Inertial Measurement Units (IMUs), developed in previous motion tracking research [7] as a basis for a commercial product. The IMU consists of a 3-axis accelerometer and a 3-axis gyroscope; the ranges of the two sensor modules were set at ±16 g and ±2000 dps, respectively, to capture the full range of motion in both sports. Sampling frequency was constant for both sports, at 100 Hz. Data was streamed in real time from the IMU to the control computer via Bluetooth 4.0 communication. The IMU module was placed on the striking limb and held by Velcro straps. A pouch was sewn on the inside of the strap to keep IMU positioning consistent throughout the data collection process. Positioning and axis orientation of the IMU for boxing and taekwondo are outlined for sample movements in Figures 13 and 14, respectively. Note that these axes are relative and rotate along with the limb.
Experiments were designed to demonstrate the application and evaluation of the three DTW and four CNN classification models developed in this study. Motion capture data was collected from 15 martial artists of varying experience. Eighteen classes of boxing punches and 24 classes of taekwondo kicks (6 kicks for each leg for shadow and bag strikes) were collected, consistent with the entire range of movements for each sport. The classification models were given no a priori information on sensor placement or left/right limb to make the systems robust to using either sensor on either limb without polarization. Moreover, the models were not presented any a priori movement features. To our knowledge, no previous system has classified this wide range of movement and no existing classification model has demonstrated generalizability to both sports [6].
Boxing punches were acquired by placing the IMU on the wrists of each martial artist for 6 different punch classes during shadow boxing, punching a heavy bag, and with a trainer holding pads (2880 strikes, 18 classes). For taekwondo, an IMU was placed on the ankle of each martial artist executing kicking motions (2880 strikes, 24 classes). The signals were segmented using a signal-energy based algorithm [52] to locate the start and end-points of the strikes. The boxing punch classes and taekwondo kick classes are listed in Tables 1 and 2, respectively. Note that the right- and left-hand classes will be switched for left-handed (Southpaw) boxers. Table 3 shows examples of superimposed ensembles of boxing and taekwondo strikes. For clarity, Figures 15 and 16 show enlarged versions of the ensembles of one boxing and one taekwondo strike extracted from Table 3. Given that the number of sensors G is 2, the number of axes $m_g$ in each sensor is 3, and the total number of axes $M_g$ is 6, the 4 input models are characterized by the following:

System Training and Convergence
Each data set was divided randomly into a training set and a test set containing approximately 80% and 20% of the strikes, respectively. The average classification accuracy for the test set was determined. The random partitioning into training and test sets was repeated 100 times, with classification accuracies across repetitions averaged to obtain a final estimate of the classification accuracy.
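The repeated random partitioning procedure can be sketched as follows. The synthetic three-class data and the nearest-mean classifier are stand-ins introduced only to make the example runnable; they are not the strike data or the DTW and CNN classifiers of this study, and the function names are hypothetical.

```python
import numpy as np

def repeated_holdout(X, y, fit_predict, repeats=100, train_frac=0.8, seed=0):
    """Average test accuracy over repeated random train/test partitions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    accuracies = []
    for _ in range(repeats):
        perm = rng.permutation(n)                 # new random partition
        train, test = perm[:n_train], perm[n_train:]
        predicted = fit_predict(X[train], y[train], X[test])
        accuracies.append(np.mean(predicted == y[test]))
    return float(np.mean(accuracies))

def nearest_mean(X_train, y_train, X_test):
    """Stand-in classifier: assign each test sample to the closest class mean."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Synthetic, well-separated data: 3 classes, 40 samples each, 4 features.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 40)
X = rng.standard_normal((120, 4)) + 5.0 * y[:, None]

accuracy = repeated_holdout(X, y, nearest_mean, repeats=100, train_frac=0.8)
print(f"mean test accuracy over 100 partitions: {accuracy:.3f}")
```

Averaging over many partitions reduces the dependence of the reported accuracy on any single, possibly fortunate or unfortunate, split of the strikes.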
For the DTW classifiers, the reference templates for the strike classes were determined by averaging the signals in their respective training sets. The CNN classifiers were initialized with a different set of random weights for each random partitioning of the data sets. Consequently, the final classification accuracy was obtained by averaging the results of 100 different CNNs. In order to keep the comparisons fair, the numbers of convolution, pooling, and FC layers were fixed for all experiments. Moreover, the ordering of the layers was fixed. Given that the dimensions of the data were relatively small (2 sensors, 3 axes/sensor), a deep network with a large number of convolution and pooling layers was not needed. The CNN, therefore, consisted of two convolution layers, a pooling layer, and 2 FC layers in which the first FC layer used sigmoidal activation functions and the last FC layer used softmax activation functions. The "same" operation was used in the convolution layers and max pooling was used in the pooling layer. Each of the two convolution layers used 32 filters. The filter dimensions in the first and second convolution layers were as follows: (1 × 3 × 1) and (1 × 3 × 32) for CNN-1, (3 × 3 × 1) and (3 × 3 × 32) for CNN-2, (3 × 3 × 1) and (3 × 3 × 32) for CNN-3, and (3 × 3 × 2) and (3 × 3 × 32) for CNN-4, respectively. The networks were implemented using the Keras library [53][54][55].
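As a worked example of how the filter dimensions translate into trainable parameters, the short calculation below counts the weights and biases in the two convolution layers of a single network for each model. It assumes one bias per filter, which is the common convention; the counts exclude the FC layers, whose sizes depend on the signal length.

```python
def conv_params(f_h, f_w, depth, num_filters):
    """Weights plus one bias per filter in a convolution layer."""
    return (f_h * f_w * depth + 1) * num_filters

NUM_FILTERS = 32

# Filter dimensions (height, width, depth) listed for each model.
first_layer = {"CNN-1": (1, 3, 1), "CNN-2": (3, 3, 1),
               "CNN-3": (3, 3, 1), "CNN-4": (3, 3, 2)}
second_layer = {"CNN-1": (1, 3, 32), "CNN-2": (3, 3, 32),
                "CNN-3": (3, 3, 32), "CNN-4": (3, 3, 32)}

counts = {
    name: (conv_params(*first_layer[name], NUM_FILTERS),
           conv_params(*second_layer[name], NUM_FILTERS))
    for name in first_layer
}

for name, (layer1, layer2) in counts.items():
    print(f"{name}: layer 1 = {layer1}, layer 2 = {layer2}")
# CNN-1: layer 1 = 128, layer 2 = 3104
# CNN-2: layer 1 = 320, layer 2 = 9248
# CNN-3: layer 1 = 320, layer 2 = 9248
# CNN-4: layer 1 = 608, layer 2 = 9248
```

These are per-network figures. CNN-1 trains one network per axis (six in total) and CNN-2 one per sensor (two in total), whereas CNN-3 and CNN-4 each train a single network.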
Training times were benchmarked for each input model and classifier. Time efficiency profiling was conducted using the MATLAB 2021a internal profiler for DTW models and the Python 3.8.12 cProfile module for CNN models. All evaluations were conducted on a system using Windows 10 Home Edition with an Intel Core i7-6700K 4 GHz quad-core CPU, a GeForce GTX 980 Ti GPU, and 32 GB RAM. Figure 17 shows the total time in seconds for model parameterization. The CNN training time increase is expected given the repeated layering design as opposed to the single pass in the DTW. CNN1 and CNN2 are also set up to train multiple neural networks, hence their increase in training times. CNN3 and CNN4, however, show parameter convergence in time comparable to DTW because a single network is trained. It is also interesting to note that, in the CNN implementations, the boxing data took longer to train than the taekwondo data or showed negligible differences, despite boxing being an 18-class problem versus a 24-class problem. Boxing punches vary more between the dominant and non-dominant sides, and the boxing classifiers must also distinguish between pad (human-held) and bag strikes, which could account for the comparable or longer time to converge. We also note that differences between an uppercut and a hook punch can include highly nonlinear variations in movement waveform.
While the training times for CNN1 and CNN2 were significant, it is also important to note that this has no impact on online use for real-time classification. Training, in particular for such large data sets, will be done offline, with identification occurring in real time using trained models. Deep neural networks have well-established properties for computationally compact representations of nonlinear models, enabling use online even in time-critical applications with limited computational resources (e.g., [56]). Complete identification of strikes occurs in negligible (millisecond) timeframes for all CNN models. DTW models are not as computationally lean for online use as they require a comparison of each data point of an incoming motion to each data point in the movement class. However, the time to execute this in real time is still suitable for most online human-motion tracking applications. We envision training for individualized movements to be completed on cloud servers with online use parameters updated to edge devices to execute in firmware, as implemented in our commercial systems.
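The point-by-point comparison that makes DTW more expensive online can be seen in a textbook dynamic-programming implementation for two uni-axial signals. This is a minimal sketch of standard DTW with an absolute-difference local cost; it is not the multi-axial DTW classifiers developed in this study, and the example signals are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Standard DTW distance between two 1-d signals.

    Every sample of `a` is compared with every sample of `b`, so the cost
    grows with len(a) * len(b).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated cost table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # stretch b
                                 D[i, j - 1],      # stretch a
                                 D[i - 1, j - 1])  # advance both
    return float(D[n, m])

template = [0.0, 1.0, 2.0, 1.0, 0.0]
warped = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]   # same shape, stretched in time
other = [0.0, 2.0, 0.0, 2.0, 0.0]

print(dtw_distance(template, warped))  # 0.0: warping absorbs the time stretch
print(dtw_distance(template, other))   # larger: the shapes genuinely differ
```

A test strike is compared in this way against the template of every class, so the online cost also scales with the number of classes, whereas a trained CNN requires a single forward pass.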

Results and Analysis
Outputs of the classification experiments after training are summarized in Tables 4 and 5. Table 4 shows the results of classifying each uniaxial signal independently for both data sets. For each classifier (table row), the result of the best axis channel is shown in boldface. For example, for the uni-axial boxing DTW classifiers, the best result of 65.1% was obtained from the x-axis channel of the accelerometer. An accuracy of 65.1% may not be strong; however, by comparison, an accuracy of only 5.6% can be expected from random classification in an 18-class problem.
The classification accuracies of the seven fusion classifiers are shown in Table 5 for both data sets. The best result for each data set is shown in boldface. Note that for each classifier type and data set, the worst fusion result in Table 5 is much better than the best uni-axial result in Table 4. This clearly demonstrates the merits of fusing information from multiple sensors and axes. Also note that in Tables 4 and 5, the accuracies of the CNN classifiers are much higher than those of the DTW classifiers. The fact that the CNN classifiers performed better than the DTW classifiers is quite unexpected for the following reasons: (a) unlike DTW classifiers, CNNs do not readily appear to be a good choice for classifying signals that are not naturally in a 2-d or multidimensional array format, and (b) unlike the design of DTW classifiers, the design of CNN classifiers does not typically focus on addressing the non-linear variations problem. From the results, it is interesting to note the following:
(a) The CNN classifiers performed well in spite of the fact that the uni-axial strike signals typically experience non-linear variations, as seen in Table 3 and in Figures 15 and 16. The reason for this performance can be explained by noting that features are detected locally and not globally. Consequently, the local features tend to be invariant to latency shifts. Moreover, the local features are unaffected in the segments that do not experience non-linear variations. The trial-to-trial variations of the signals within each training set can be regarded as a natural form of "data augmentation", which is a technique commonly used to artificially increase the diversity in the training set without having to collect additional data. The CNN classifiers are capable of learning the typical variations in the signals when the network is presented with representative signals during training.
(b) Using the same input data, the CNN implementations using the four input models extracted different types of local features for classification. The CNN classifiers, therefore, offer many choices of local features which can be selected depending on the type of coupling assumed or desired between the intra- and inter-sensor outputs. For example, if the uni-axial outputs of all sensors are assumed independent, CNNs using the VI model can be selected. CNNs using the LMI model can be selected if the channels in a sensor carry complementary information for determining the output class. If complementary information is shared across all sensors, CNNs using the GCI model will be an effective choice. The manner in which the inputs are fused can take other factors into account, for example, the geographical locations (co-located or dispersed) of the sensors. The sensor outputs can also be fused in other ways. For example, the x-axis channels of all sensors can be combined into a matrix. The y-axis and z-axis channels can be combined in a similar manner. The intra- and inter-sensor coupling assumptions can, therefore, be used to choose a particular classification model for a given problem.
(c) It is unlikely that the performance of DTW classifiers can be improved by increasing the size of the training set because the template, which is the training set average, will change only marginally after a certain point and this marginal change will have little effect on the performance. In contrast, CNN classifiers have the potential to improve performance by extracting more complex features as the network depth and training data are increased. Furthermore, with increased network depth and training data, CNNs are capable of accurately classifying a larger number of classes, whereas the performance of traditional classifiers such as DTW classifiers will tend to drop as the number of classes increases.
(d) It is interesting to compare the performances of the one-, two-, and three-dimensional classifiers resulting from the VI, LMI & GMI, and GCI input models, respectively. By comparing the results for the DTW classifiers in Table 5, it is first noted that the classification accuracies vary marginally for all DTW classifiers across both sets of data. The best results for the boxing and taekwondo data were obtained by the 2-dimensional DTW-2 classifiers. The classification accuracies also varied marginally across the CNN classifiers for both data sets. The best results were obtained by the 2-dimensional CNN-3 and 3-dimensional CNN-4 classifiers for the boxing and taekwondo data, respectively.
It is also worth noting that CNNs offer particularly intriguing potential for widespread use in commercial wearables due to the low computational expense of online use. A cloud-based training system working in conjunction with an embedded wearable would enable real-time training feedback coupled with updates and adaptation as movements change with time.
Finally, it should be stressed that the results presented here are designed to demonstrate the capacity of the input modelling and classification approaches in the most challenging of circumstances. Tables 4 and 5 show output for the maximum number of classes with zero knowledge of movement and the broadest class of athletes. While the accuracies by themselves are sufficient for commercial use, further improvements are readily achievable in practice. The eighteen-class boxing data, for example, yielded fusion classification accuracies of 95% or higher for CNN-3 with the subset (1/3) of boxers who were not beginners. Furthermore, it is unlikely that boxers or martial artists, even with simple training, will mix striking a heavy bag, shadow boxing, or pad striking in the same round. Exploiting such constraints should virtually eliminate misclassification, such that the only remaining errors are erratic strikes from the user that do not fit any strike model.

Translation for Mass Market Athletic Training
The research executed in this investigation has led to the design, fabrication, and commercial translation of a complete IoT sensor system for smart boxing. Our design reflects the evolution of wearables from a 'device' to a 'systems' perspective [9], and consists of original sensors, embedded code, smart phone apps for data collection, and cloud computing for data storage and visualization. The system, shown in Figure 18, has been released as a commercial product by Corner Wearables based in Manchester, UK. It was first trialed with the boxing team at Imperial College London and subsequently expanded into a full product for sale worldwide. The integrated system consists of a small sensor in the boxer's hand wraps that fits under boxing gloves. All code for punch identification is embedded onboard with a microcontroller to detect and classify movement history, which is sent via Bluetooth to a smart device for display and storage through a smart phone app. Because the embedded code performs all pattern classification, only summary statistics need to be transmitted, which avoids sending raw data over Bluetooth. The first-generation commercial system tracks 6 classes of punches (dominant hand: cross, hook, uppercut; non-dominant hand: jab, hook, uppercut). Subsequent releases will classify the full 18-class problem outlined in Section 6 using the full deep learning architecture outlined in this investigation. Corner, featured in IEEE Spectrum [57], is the first-ever smart boxing tracker that does not need polarized (left-right specified) sensors. It was recently assessed in a boxing study as a part of this special issue of Sensors [58] as having the capacity to track both beginners and experienced boxers, though beginner punches are less consistent due to immaturity of technique. Thousands of devices are currently in use, providing an intriguing database for analysis in future work.
The commercial system has also been used in live boxing matches, including the World Series of Boxing, to provide real-time statistics to spectators, judges, and trainers to evaluate match performance.

Conclusions
The goal of this investigation was to develop models to classify human movement by fusing information from ensembles of wearable multi-axial inertial sensors. The specific contributions resulting from the investigation include: (a) the introduction of four multi-sensor multi-axial input models that can be used in conjunction with diverse classifiers, (b) demonstrating the use of the input models to develop three DTW and four CNN fusion-based classifier models that do not require a set of predetermined hand-engineered features, (c) testing the validity of the classifiers on boxing and taekwondo sport data, (d) demonstrating the merits of multi-axial fusion by showing that the worst fusion classifiers outperform the best uni-axial classifiers, (e) demonstrating that high classification accuracies can be obtained with the CNN fusion classifiers on signals that experience large non-linear variations and on signals belonging to a large number of classes, (f) demonstrating the surprising result that the CNN fusion classifiers outperform the DTW classifiers, (g) explaining the ability of the CNN classifiers to extract local features which depend on the type of coupling assumed or desired between the intra- and inter-sensor outputs, and (h) noting that CNN classifiers have the potential to improve performance and handle a larger number of classes through both training and network scaling.
To our knowledge, this is the first set of models demonstrated on a problem class this large in either activity [7], and the first generalized, non-feature-specific classification over multiple movement ranges. It is also noteworthy that, owing to the generalized formulations, the classifiers can be easily adapted to classify multi-dimensional signals from multiple sensors in various other applications.
Future work involves refining the system for exact learning of individual users for performance assessment, analyzing time series data from the training of large groups of athletes, and implementation for live performance streaming in professional fights to enhance spectator experience and support fight scoring. As a completely feature-blind generic classification strategy, translation is also underway in other sports (e.g., tennis) as well as in wearables for telemedicine in neural motor dysfunction conditions such as stroke and Parkinson's Disease [59,60]. We believe these results provide a foundation for a new set of human movement classification paradigms based on fusion and deep learning.
Funding: This work was supported by the EPSRC (EP/R511547/1), the EPSRC CDT in Neurotechnology, the Department of Mechanical Engineering and UK DRI CR&T at Imperial College London (ICL) and Athletec Inc.

Institutional Review Board Statement: The study was approved by the Imperial College Research Ethics Committee (ICREC) under study reference 15IC3068. Human subject permission does not explicitly give permission for data release.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: Please contact the corresponding authors for data availability. Ethics permission does not explicitly give permission for public release of human performance data at the time of publication.