The RF labels every frame as either transitional movement (class 0) or sign movement (class 1). As a result, without any assumptions, we obtain an accurate initial temporal segmentation into pure activity sub-sequences. This segmentation is sensitive to the spatial as well as the velocity components of the motion. Moreover, as described in the following, the robustness of the resulting split proposals can be enhanced by post-processing based on the classifier's confidence values.
3.1.1. Frame-Wise Classification
For each frame $i$ in the motion sequence, we compute a feature vector $\mathbf{x}_i$ which captures phases of temporal and directional change. $\mathbf{x}_i$ consists of a general geometric feature descriptor $\mathbf{f}_i$ that represents the spatial (and angular) relations between body joints over a certain time span, and an additional kernel descriptor $\mathbf{k}_i$ that represents their non-linear relationships, obtained from a Laplacian kernel transformation of $\mathbf{f}_i$.
From any representation of angular joint relations, $\mathbf{f}_i$ can be computed as a spatio-temporal representation over a window of neighboring frames around frame $i$. Joint relations that have been shown to provide reasonable segmentation results are the combination of angular and distance features between line segments of pairs of joints introduced in [33], and the joint angular displacement transformation by Kumar et al. [34]. Another possible variation is to utilize the point of intersection between selected line segments as a reference point for a subsequent computation of joint or inter-segment angular displacements. To find the best-suited representation, we evaluate the performance of the RF with respect to all three variations adapted to the given setting: (a) the 2D version of the original line segment relation features (in the following referred to as LS), (b) its modified version encoding joint angular distances with respect to the point of intersection between the line segments (in the following referred to as LS-IS), and (c) a joint angular displacement transformation (in the following referred to as JAD) utilizing the same joint pair combinations as LS and LS-IS.
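As an illustration of variation (c), the following sketch computes a simplified joint angular displacement feature from 2D joint positions. The joint-pair indexing and the plain frame-to-frame angle difference are illustrative assumptions, not the exact descriptor of [34].

```python
import numpy as np

def joint_angles(frame, joint_pairs):
    """Angle of each joint-pair line segment w.r.t. the x-axis (radians)."""
    return np.array([np.arctan2(frame[b, 1] - frame[a, 1],
                                frame[b, 0] - frame[a, 0])
                     for a, b in joint_pairs])

def jad_features(seq, joint_pairs):
    """Frame-wise angular displacement between consecutive frames.

    seq: (t, n_joints, 2) array of 2D joint positions.
    Returns a (t, n_pairs) matrix; the first frame has zero displacement.
    """
    angles = np.stack([joint_angles(f, joint_pairs) for f in seq])
    disp = np.zeros_like(angles)
    # wrap differences into (-pi, pi] so the displacement is rotation-minimal
    d = np.diff(angles, axis=0)
    disp[1:] = (d + np.pi) % (2 * np.pi) - np.pi
    return disp
```

A segment rotating from the x-axis to the y-axis between two frames, for instance, yields a displacement of $\pi/2$ for that pair.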
For all three variations, the frame-wise kernel matrix $K \in \mathbb{R}^{t \times t}$ for a motion sequence of $t$ frames and its corresponding geometric feature matrix $F = [\mathbf{f}_1, \dots, \mathbf{f}_t]$ is defined as

$$K_{ij} = k(\mathbf{f}_i, \mathbf{f}_j) = \exp\!\left(-\gamma \, \lVert \mathbf{f}_i - \mathbf{f}_j \rVert_1\right),$$

where $K_{ij}$ characterizes the similarity between the spatial feature vectors $\mathbf{f}_i$ and $\mathbf{f}_j$ of frames $i$ and $j$ in terms of the Laplacian kernel function $k(\cdot,\cdot)$. The kernel feature vector $\mathbf{k}_i$ for a given frame $i$ and window size $w$ is then defined as the flattened upper triangular part of the sub-kernel $K_{i-w:i+w,\,i-w:i+w}$. As discussed in [33], the idea here is to derive further high-level understanding of skeleton movement dependencies over time.
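Under the definitions above, the kernel computation can be sketched as follows. The bandwidth $\gamma$ and the clamping of the window at sequence borders are illustrative assumptions.

```python
import numpy as np

def laplacian_kernel_matrix(F, gamma=1.0):
    """K[i, j] = exp(-gamma * ||f_i - f_j||_1) for frame-wise features F (t, d)."""
    l1 = np.abs(F[:, None, :] - F[None, :, :]).sum(axis=-1)  # pairwise L1 distances
    return np.exp(-gamma * l1)

def kernel_feature(K, i, w):
    """Flattened upper triangle of the sub-kernel centered on frame i."""
    lo, hi = max(i - w, 0), min(i + w + 1, K.shape[0])       # clamp at borders
    sub = K[lo:hi, lo:hi]
    iu = np.triu_indices(sub.shape[0], k=1)                  # strict upper triangle
    return sub[iu]
```

The diagonal of $K$ is always 1 (each frame is maximally similar to itself), and only the strict upper triangle is kept since $K$ is symmetric.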
Following the original work, we apply the proposed window sizes around frame $i$ to concatenate all corresponding feature vectors into $\mathbf{f}_i$ and to build $\mathbf{k}_i$. The concatenated frame-wise data representation $\mathbf{x}_i$ is computed for each frame within a given training and test set. The resulting classification label for every feature vector is then used to determine an initial sentence segmentation proposal as the sequence vector $\mathbf{s} = (s_1, \dots, s_t)$ with $s_i \in \{0, 1\}$. Consecutive frames classified as belonging to class 1 are interpreted as the segments of interest, which also yields their start and end points. Frames classified as class 0 and positioned between such segments are considered non-gestures or transitional frames.
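Extracting the segments of interest from the label sequence $\mathbf{s}$ amounts to finding runs of consecutive 1-labels, which can be sketched as:

```python
import numpy as np

def label_segments(s):
    """Extract (start, end) index pairs of consecutive class-1 runs.

    s: 1D array of 0/1 frame labels. End indices are inclusive.
    """
    s = np.asarray(s)
    # pad with zeros so runs touching the borders produce clean transitions
    edges = np.diff(np.concatenate(([0], s, [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    return list(zip(starts, ends))
```

Every returned pair marks one candidate sign segment; the gaps between pairs are the transitional (class-0) parts.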
3.1.2. Confidence-Based Split
The RF classifies each frame $i$ on the basis of its confidence $p_i$ about the respective frame's affiliation to class 1: every frame with confidence $p_i \ge 0.5$ is labeled as 1, and as 0 otherwise. Ideally, every motion frame would be assigned to class 1 and every transition frame to class 0. In practice, however, it is nearly impossible to obtain such a perfect split within a natural signing flow. To obtain a robust segmentation of sign and transitional movements, we refine the actual sentence segment prediction with the following signal-based post-processing.
The distribution of confidence values $p_i$ for class 1 over all $t$ motion sequence frames defines a temporal prediction curve $c = (p_1, \dots, p_t)$ with $p_i \in [0, 1]$. Over the progression of a signed expression, $c$ follows a fundamental sine-like pattern: parts of high and low confidence alternate, reflecting the reciprocal occurrence of transitional and sign movements. We therefore compute a strongly smoothed version $c_s$ of $c$ with a Gaussian filter of kernel deviation $\sigma_s$. The idea is that $c_s$ is robust to smaller erroneous parts of mislabelling within multi-directional words or complex movements. As such, $c_s$ eliminates disturbances caused by classifier predictions that are inconsistent in their initial confidence rating. The values of $c_s$ can then be used to obtain a smoothed segmentation proposal $\mathbf{s}_s$.
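A minimal sketch of this smoothing step, assuming SciPy's 1D Gaussian filter and an illustrative value for $\sigma_s$ (the actual kernel deviation is a tuning choice):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smoothed_proposal(c, sigma_s=5.0, threshold=0.5):
    """Strongly smooth the class-1 confidence curve and re-threshold it.

    c: per-frame class-1 confidences in [0, 1].
    Returns (c_s, s_s): smoothed curve and binary segmentation proposal.
    """
    c_s = gaussian_filter1d(np.asarray(c, dtype=float), sigma=sigma_s)
    s_s = (c_s >= threshold).astype(int)
    return c_s, s_s
```

A single-frame confidence dip inside an otherwise high-confidence segment, which would break the raw proposal $\mathbf{s}$ into two pieces, is absorbed by the filter and leaves $\mathbf{s}_s$ intact.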
Sources of noise (e.g., stutter, fast movements between two separate words, or the signing of complex, multi-directional words) may add peaks to, or remove peaks from, the basic evolution of $c$ that cannot be represented in $\mathbf{s}$ or $\mathbf{s}_s$. Since the main objective of our work is to investigate the recognition of NMEs, we aim to obtain sentence splits that are as robust as possible. Therefore, we include additional information on the number of sign words occurring per sentence pattern. It should be noted that this strategy is an optional, minor fine-tuning step which can be omitted in the absence of respective word count information. As such, it does not significantly influence the following system evaluation.
For the optional segmentation refinement, we generate two additional, modified versions of the prediction curve $c$: a mildly smoothed version $c_m$ of $c$ by applying a Gaussian filter of kernel deviation $\sigma_m < \sigma_s$, and a weakly smoothed version $c_w$ of $c$ by utilizing a Savitzky–Golay filter of window length $l$ and polynomial order $o$. Plotting the filtered confidence curves (Figure 2), one can see that strong Gaussian smoothing results in confidence curves which are more robust to smaller erroneous parts (Figure 2b), but which also miss significant information in parts of quick signing and flowing transitions between signs (Figure 2c). To account for such information loss, we determine the number of word segments given by $\mathbf{s}_s$ and compare it to the actual word count. All expressions whose segment count is smaller than the sentence's word count then undergo a more detailed signal-based segmentation check. This enhances the probability that noise-induced peaks which were correctly smoothed out in $c_s$ will be left unconsidered.
To start, we identify all significant peaks of $c$ that were smoothed out in $c_s$ but remain present in $c_m$. In concrete terms, we detect all maximal peak locations $p_{\max}$ and all minimal peak locations $p_{\min}$ of $c_m$ and identify their labels as given by the respective $\mathbf{s}_s$ values. Next, we identify the correlating reference locations $r$ of maximal (for $p_{\max}$) and minimal (for $p_{\min}$) $c_s$ values within a window $W$. Here, $W$ is defined as the area between the greatest lower bound (glb) and the least upper bound (lub) of the intersections of $c_s$ and $c_m$ around every peak in $p_{\max}$ and $p_{\min}$. For a maximal peak $p \in p_{\max}$, this for example holds

$$W_p = \left[\,\mathrm{glb}(p),\ \mathrm{lub}(p)\,\right], \quad \text{with } \mathrm{glb}(p) = \max\{\, i < p \mid c_m(i) \le c_s(i) \,\} \text{ and } \mathrm{lub}(p) = \min\{\, i > p \mid c_m(i) \le c_s(i) \,\}. \tag{2}$$

Lastly, we replace all $\mathbf{s}_s$ labels within the specific $W_p$ with the labels induced by $c_m$. This gives us a refined segmentation proposal $\mathbf{s}_r$.
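The window construction and label replacement for maximal peaks can be sketched as follows. The crossing-based bounds are a simplified reading of the glb/lub definition, and the fallback to the sequence borders when no crossing exists is an assumption.

```python
import numpy as np
from scipy.signal import find_peaks

def crossing_window(c_s, c_m, p):
    """[glb, lub] window around a maximal peak p of c_m, bounded by the
    nearest frames where c_m falls back below c_s."""
    below = np.flatnonzero(c_m <= c_s)
    left, right = below[below < p], below[below > p]
    glb = left[-1] if left.size else 0                # fall back to sequence start
    lub = right[0] if right.size else len(c_m) - 1    # fall back to sequence end
    return glb, lub

def refine_labels(c_s, c_m, s_s, threshold=0.5):
    """Copy mildly-smoothed labels into s_s inside every peak window."""
    s_r = s_s.copy()
    peaks, _ = find_peaks(c_m)
    for p in peaks:
        glb, lub = crossing_window(c_s, c_m, p)
        s_r[glb:lub + 1] = (c_m[glb:lub + 1] >= threshold).astype(int)
    return s_r
```

A hump of $c_m$ rising above a flat, over-smoothed $c_s$ is thereby reinstated as a class-1 segment inside its crossing window.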
Based on the number of mismatches $m$ between the word count of $\mathbf{s}_r$ and the actual known sentence word count, we further refine the proposal in case of $m > 0$. To this end, we target all parts of $c$ that are most likely to be misclassified by the RF, namely the locations of minimal peaks currently labeled as class 1 and the locations of maximal peaks currently labeled as class 0. Similarly to the previous procedure, we utilize windows of intersection between two smoothed confidence curves around all peak values of interest. To retrieve peak values that were smoothed out by the high filter value $\sigma_s$, we determine the points of intersection between $c_s$ and $c_w$. We find the $m$ most likely mislabeled locations following the definition given in Equation (2). Lastly, we replace their determined respective window areas with either class 0 or class 1 labels accordingly. This provides us with the final segmentation proposal $\mathbf{s}_f$.
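This final step can be sketched as follows, assuming $m$ is known. Peak prominence as the criterion for "most likely mislabeled" and the threshold-consistent window growth are simplifying stand-ins, not the exact selection rule of Equation (2).

```python
import numpy as np
from scipy.signal import find_peaks, peak_prominences

def final_refinement(c_w, s_r, m, threshold=0.5):
    """Relabel the m most prominent contradictory peaks of the weakly
    smoothed curve c_w: maxima currently labeled 0 become class-1 windows,
    minima currently labeled 1 become class-0 windows."""
    s_f = s_r.copy()
    candidates = []                                   # (prominence, frame, new label)
    for curve, new_label in ((c_w, 1), (-c_w, 0)):    # maxima pass, then minima pass
        peaks, _ = find_peaks(curve)
        proms = peak_prominences(curve, peaks)[0]
        for p, prom in zip(peaks, proms):
            if s_f[p] != new_label:                   # peak contradicts current label
                candidates.append((prom, p, new_label))
    for _, p, new_label in sorted(candidates, reverse=True)[:m]:
        # grow a window over the contiguous frames whose thresholded class
        # already matches the new label (simplified stand-in for Eq. (2))
        lo = hi = p
        while lo > 0 and (c_w[lo - 1] >= threshold) == bool(new_label):
            lo -= 1
        while hi < len(c_w) - 1 and (c_w[hi + 1] >= threshold) == bool(new_label):
            hi += 1
        s_f[lo:hi + 1] = new_label
    return s_f
```

With $m = 1$, for example, a single high-confidence maximum inside an all-zero proposal is promoted to a class-1 segment, bringing the segment count in line with the known word count.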