Action Classification for Partially Occluded Silhouettes by Means of Shape and Action Descriptors

Abstract: This paper presents an action recognition approach based on shape and action descriptors that is aimed at the classification of physical exercises under partial occlusion. Regular physical activity in adults can be seen as a form of prevention of non-communicable diseases, and may be aided by digital solutions that encourage individuals to increase their activity level. The application scenario includes workouts in front of the camera, where either the lower or upper part of the camera's field of view is occluded. The proposed approach uses various features extracted from sequences of binary silhouettes, namely the centroid trajectory, shape descriptors based on the Minimum Bounding Rectangle, an action representation based on the Fourier transform, and leave-one-out cross-validation for classification. Several experiments combining various parameters and shape features are performed. Despite the presence of occlusion, it was possible to obtain about 90% accuracy for several action classes with the use of elongation values observed over time and the centroid trajectory.


Introduction
Human action recognition is a very popular topic in the field of computer vision. There is a large variety of applications associated with action recognition and classification based on Visual Content Analysis; some examples include surveillance systems [1,2], video retrieval and annotation [3], human-computer interaction [4,5], and quality-of-life improvement systems [6,7]. The variety of applications implies different approaches to the selection and description of action features due to varied action types. In [8], actions are defined as "simple motion patterns usually executed by a single person and typically lasting for a very short duration (order of tens of seconds)". In turn, the author of [9] indicates diverse characteristics of an action, ranging from very primitive to cyclic body movements. As stated in [10], actions are primitive movements lasting up to several minutes: an action is less complicated than an activity and more complex than a single motion (a gesture). Taking the above into account, an action is a collection of simple movements that are organized in a short period of time and can have periodic or non-periodic characteristics. Examples of actions include walking, hand waving and bending, and it is easy to notice that their basic movements are common to some physical exercises. In this paper, the focus is put on the recognition of actions that are classified based on exercise types. However, a problematic aspect is introduced, namely occlusion, which is a very challenging problem in computer vision [11,12].
According to the recommendations of the World Health Organization (WHO) [13], a healthy adult should do at least 150 min of moderate-intensity (or 75 min of vigorous-intensity) physical activity every week to maintain general health and lower the risk of non-communicable diseases (e.g., cardiovascular disease, hypertension, diabetes, mental health conditions). Physical activity refers to a variety of exercises, such as jogging, weight lifting, bike riding, swimming and many more. However, the latest situation in the world related to the COVID-19 pandemic has been forcing people to change their daily routines, and lockdowns, in particular, have been limiting physical activity [14,15]. There are studies from all over the world researching the adverse consequences of the pandemic on different aspects of life and well-being that indicate the importance of physical activity for health improvement [16][17][18][19]. The WHO proposed some updated general recommendations on physical activity during the pandemic, emphasizing several activity types that can be done at home, such as online exercise classes, dancing, playing active video games, jumping rope, as well as muscle strength and balance training [20]. A narrative review in [15] summarises physical activity guidelines from 29 papers and gives some final joint recommendations. Generally, exercise types should include aerobic activity (e.g., running, walking), muscle strength activity (e.g., squats, various jumps), flexibility, stretching and relaxation (e.g., stretching arms, bending, yoga), and balance training. A brief report presented in [14] provides the results of a survey on the preferred home exercises for digital programs. Nearly 70% of the 15,261 participants from different countries were willing to work out at home, and were mostly interested in flexibility, resistance (strength training) and endurance (e.g., aerobic training) exercises.
In our work, a specific application scenario is assumed in which a person performs exercises in front of a computer camera, based on the displayed examples. The camera captures the exercises and the system analyses them during the recognition process. This can be related to active video games and online exercise classes. In the examined scenario, the focus is put on incomplete action data, where a part of a silhouette is occluded as a result of improper camera positioning or the person being at the wrong distance from the camera. Then, usually the upper or lower part of the video frame is occluded and a part of the foreground silhouette is missing (above the shoulders or below the knees). Due to the fact that each pose can be affected in a different manner, an action descriptor should rely on the pose variability (changes between frames) rather than on the exact shape contour itself. Effective recognition needs robust shape features which can be calculated despite the lack of a part of a silhouette. This paper addresses the problem by applying shape generalization and a combination of simple shape features that are analysed over time. Exercise types are recognised by means of binary silhouette sequences extracted from the Weizmann database [21] with added occlusions and two-stage classification. Firstly, coarse classification based on the centroid trajectory divides actions into two subgroups: actions performed in place and actions with a changing location of a silhouette. Since all exercises can be performed with repetitions, periodicity is not taken into account. Then, each silhouette is represented using a selected shape descriptor and all silhouette descriptors of a sequence compose an action representation. Action representations are transformed using the Discrete Fourier Transform (DFT) and classified using the leave-one-sequence-out procedure. The proposed action recognition algorithm can be applied in a physical activity training system offering digital programs for home exercises.
Types of exercises can be selected according to personal preferences and age group. For instance, in older adults, recommended exercises include mobility, balance and flexibility activities that lead to fall prevention [15]. Some of these exercises can be performed while sitting or holding a chair, which can lead to occlusions. Given that the proposed approach can employ various shape features, its parameters can be adapted to specific applications.
Using a taxonomy presented in [12], the proposed action representation belongs to the category of holistic representations that are based on global features of a human body shape and movements. Body parts are not localised and discriminative information is extracted from regions of interest. Section 2 covers some examples of holistic approaches and other related works, as well as describes several action recognition challenges related to object occlusion in video sequences. Section 3 explains consecutive steps of the proposed approach together with applied methods and algorithms. Section 4 describes experimental conditions and presents the results. Section 5 discusses the results and concludes the paper.

Related Works on Action Recognition and the Problem of Object Occlusion
Methods for recognizing human activities are widely applied in various systems and approaches in order to recognize and classify gestures, actions and behaviours. A multitude of applications and activity types results in a variety of features that can be used in the recognition process. However, a typical human activity recognition system consists of several common modules that perform motion segmentation, object classification, human tracking, action recognition and semantic description [2]. If a holistic representation is considered, action recognition is then based on the detected objects of interest (foreground binary masks, usually the shapes of human silhouettes) with known locations (trajectory or other motion information). Popular solutions include space-time techniques, where all silhouettes from a single sequence are accumulated before features are extracted, and shape-based methods, where features are extracted from each shape and combined into an action representation afterwards.

Selected Related Works
One of the first works on spatio-temporal action recognition is reported in [22]. The authors propose Motion History Image (MHI) and Motion Energy Image (MEI) templates that represent how an object is moving and where the motion is present, respectively. In [23], the variants of MEI and MHI are combined to propose a new representation called temporal key poses. Instead of using one representation for a whole video sequence, a temporal template is assigned to each video frame. Then, k-nearest neighbour (KNN) and majority voting are applied for action classification. MHI is also used in [24] to create binary patterns. A texture operator, Local Binary Pattern (LBP), extracts the direction of motion from MHI and then the Histogram of Oriented Gradients (HOG) is used to represent features from LBP. Classification is performed with a Support Vector Machine (SVM).
Another solution that uses aggregated silhouettes from the entire sequence is described in [25]. Each sequence is represented using Average Silhouette Energy Images, where higher intensity pixels refer to static motion and low intensity values represent changing movements. Based on these, Edge Distribution of Gradients, Directional Pixels and the R-transform produce feature vectors that are combined into a final action representation. The authors of [26] investigate several contour- and shape-based descriptors, indicating the superiority of silhouette-based approaches. Again, action features are extracted from accumulated silhouette images, called ASI. It is indicated that almost perfect accuracy can be obtained using HOG and KNN with the Euclidean distance.
In [27], binary silhouettes are represented separately. Shape contours are converted into time series and then encoded into a short symbolic representation called SAX-Shape. A set of SAX vectors represents an action, and classification is performed using a random forest algorithm. A popular concept for data reduction is the selection of key poses [28,29]. The authors of [29] use silhouette contours and divide them into equal radial bins using the centroid location. Then, a summary value of each bin is calculated based on the Euclidean distances from the centroid to every contour point. The proposed feature is used in a learning algorithm to build a bag of key poses, and sequences of key poses are classified using Dynamic Time Warping and the leave-one-out procedure. The idea of a radial scheme is applied in [30] as well; however, the entire shape area is used (the contour with its interior). Firstly, the human silhouette is divided into smaller regions in a radial fashion. Then, each radial bin is represented using region-based shape descriptors: geometrical and Hu moments. The obtained feature vectors are fed into a multi-class SVM that indicates action classes. Radial coordinates and the centroid location are also related to the polar transform, which is proposed for shape description in [31]. Three polar coordinate systems are employed to represent the whole human body and the upper and lower parts of the body. For each of these systems, a histogram is generated based on radial bins. The normalized histograms are concatenated and represent a human posture. Classification is performed on a predefined number of frames.

Action Recognition under Occlusion
The approaches for action recognition mentioned in Section 2.1 yield high classification accuracy on popular benchmark datasets; however, they have not been demonstrated on partially occluded action sequences so far. Occlusion is an important and challenging problem for vision-based systems, and foreground object occlusion can be perceived as a loss of data. In a single-camera setup, self-occlusion or inter-object occlusion may be present [11]. Moreover, the object of interest can remain partially occluded (temporarily or not) by the edge of the video frame; in other words, a part of the silhouette is outside the camera's field of view. Occlusion can coexist with other problems, such as changes in scene illumination, a cluttered background, moving shadows or a foreground similar to the background, all of which can contribute to artefacts in images resulting from the background subtraction process. Artefacts in binary images have the form of extra pixels or missing pixels in a silhouette area; thus, they can be referred to as false data or data loss, respectively.
Occlusions are very problematic in real environments and busy scenes, making it difficult to detect and track objects. The selection of the experimental database depends mainly on the recognition problem, application and required action types [11,32]. Literature sources indicate various ways to evaluate an action recognition method in the presence of occlusion, some of which add the occlusion to existing videos or foreground masks (e.g., [33][34][35]). In [33], a method for human action recognition based on sequences of depth maps is proposed. Each depth map consists of a human silhouette on a black background, where the area of the silhouette is coloured according to depth information. In order to avoid joint tracking, temporal dynamics are modelled using an expandable graphical model (action graph) and the postures are represented using a bag of 3D points. A test dataset was collected by the authors and includes actions such as high arm wave, hand clap, bend, side kick, jogging or tennis serve, grouped into three subsets. The proposed solution is tested on each subset under simulated occlusion, where a depth map is divided into quadrants and one or two quadrants of a silhouette are occluded. Compared to the experiments on unoccluded data, the recognition accuracy did not decrease significantly, except in the case where the upper body was under heavy occlusion.
Another action recognition technique for corrupted silhouettes is proposed in [34]. It extracts normalized regions of interest containing as little background as possible (called silhouette blocks) and represents each region as a binary vector; a sequence of frames is thus transformed into a sequence of vectors. Each silhouette block is partitioned into 12 subparts and partial template matching is applied. The results are integrated using voting or the sum of distances. During the experiments, two types of regions are superimposed on image sequences to simulate occlusion. These include horizontal and vertical stripes of different widths and heights, which enlarged the test dataset. The proposed approach obtained almost perfect accuracy on the original Weizmann database and is quite effective for horizontal occlusions.
In [35], several local descriptors are tested in the presence of static occlusion, namely Trajectory, Histogram of Oriented Gradients, Histogram of Orientation Flow and Motion Boundary Histogram. These methods are tested in combination with a standard bag-of-features representation and classified using an SVM. The experiments were performed on the KTH benchmark dataset. Silhouette regions of interest were extracted and 10%, 25%, 50% or 75% of the area of each region was occluded. A combination of Trajectory and Motion Boundary Histogram (MBH) yielded an average accuracy of 90% for partially occluded silhouettes.

Proposed Approach for Extraction and Classification of Action Descriptors
In this paper, an approach for action recognition is evaluated in the presence of static occlusion. A procedure consisting of several processing steps is proposed. It combines centroid locations that capture motion information, simple shape descriptors that represent the characteristics of silhouettes and the Fourier transform to extract features from action descriptors. A previous version of the approach was presented in [36], where it was tested using more than twenty shape features and three matching measures. The convex hulls of silhouettes were used as input data in a scenario involving the recognition of eight types of actions corresponding to home exercises. The aim of the experiments was to indicate the most accurate shape features for action classification without any occlusions. In turn, here the evaluation of the approach is carried out in the presence of occlusion for a specific scenario concerning the recognition of various exercises performed by a single person in front of a static camera. It is assumed that either the upper or lower part of the silhouette is occluded because the camera is positioned incorrectly or the person is not far enough from the camera. Only foreground masks extracted from the video sequences are used for recognition, due to the fact that they carry information about an object's shape, pose and position in the video frame. A static occlusion is added to every image, which results in removing a part of a silhouette from a foreground binary mask. It is assumed that all images are occluded; however, in several cases, a silhouette could temporarily be out of occlusion. For example, if there is an upper occlusion and a person bends down, the silhouette is fully visible for several frames. A similar situation may occur for jumping in place when a person bends the legs. This can be seen as a dynamic occlusion.
However, since static occlusion is added to every frame and whole sequences are analysed, the small amount of dynamic occlusion is of minor importance and simply results from the intrinsic characteristics of some action classes. The following subsections explain in more detail how silhouette sequences are processed and classified. Figure 1 gives an overview of the proposed approach.

Database and Preprocessing
The experiments on the proposed approach are performed using the Weizmann database [21] (the results are presented in Section 4). Here, the reasons for the data selection are explained and the preprocessing steps are described. The Weizmann dataset has several advantages. Actions are captured on a static background and human silhouettes are easily distinguishable, fully visible and of equal scale. There are ten action types in the database which correspond to recommended exercises [14,15,37], as follows:
• Aerobic exercise, e.g., running, walking, skipping, jumping jack (activities in which the body's large muscles move in a rhythmic manner);
• Balance training, e.g., galloping sideways, jumping in place, jumping forward on two legs (activities increasing lower body strength);
• Flexibility exercise, e.g., one- or two-hand waving, bending (activities preserving or extending the range of motion around joints).
The database consists of 90 video sequences (144 × 180 px, 50 fps, 1-3 s) and corresponding foreground masks extracted for each video frame, obtained using background subtraction. Each mask is a binary image containing a single human silhouette (white pixels) on a black background (see Figure 2 for examples). Some silhouettes are incomplete or have additional pixels (Figure 3): artefacts resulting from the foreground extraction process. Depending on the shape descriptor used, artefacts may have an insignificant influence on the classification results. Binary images with background only, or in which a silhouette is too close to the left or right edge of the video frame, are removed. The direction of movement is standardized, so all actors move from left to right.
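For illustration, the preprocessing described above can be sketched in Python with NumPy. This is a minimal sketch, not the authors' implementation: the function name, the edge-margin tolerance and the centroid-based direction test are our assumptions, as the paper does not specify exact values.

```python
import numpy as np

def preprocess_sequence(masks, edge_margin=5):
    """Drop frames that are empty or whose silhouette lies too close to the
    left/right frame edge, then mirror the sequence horizontally if the actor
    moves right-to-left, so that all actions proceed from left to right.

    `masks` is a list of 2-D binary (0/1) NumPy arrays; `edge_margin` is an
    assumed tolerance in pixels.
    """
    kept = []
    for m in masks:
        ys, xs = np.nonzero(m)
        if xs.size == 0:                                   # background-only frame
            continue
        if xs.min() < edge_margin or xs.max() >= m.shape[1] - edge_margin:
            continue                                       # silhouette too close to an edge
        kept.append(m)
    if len(kept) < 2:
        return kept
    # Compare the first and last centroid columns to find the movement direction.
    first_x = np.nonzero(kept[0])[1].mean()
    last_x = np.nonzero(kept[-1])[1].mean()
    if last_x < first_x:                                   # moving right-to-left: mirror
        kept = [np.fliplr(m) for m in kept]
    return kept
```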

Extracting Motion Information and Adding Occlusion
Preprocessed binary images are used to obtain the centroid trajectory. The centroid coordinates are calculated as the average coordinates of all pixels belonging to the silhouette's area. The trajectory length (relative to the bottom edge of the video frame) is used as a condition for dividing the database into two subgroups. This step is referred to as coarse classification. Actions are often divided into periodic and non-periodic ones, but in the case of physical activity all exercises can be repeated periodically. Therefore, here the database is divided automatically into actions performed in place (the trajectory is short, e.g., bending) and actions during which a person changes location across the video frame (the trajectory is long, e.g., walking). For a single video sequence, centroid positions are calculated for each silhouette and accumulated in a separate image whose size is equal to the size of a video frame (144 × 180 px). The trajectory length is measured relative to the bottom edge of the video frame and, if it is shorter than 20 pixels, an action is classified as performed in place. Longer trajectories refer to actions with a changing location of a silhouette. Coarse classification allows for the use of different features and parameters for each subgroup.
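The coarse classification step can be sketched as follows. This is a minimal sketch under one assumption: the trajectory length "relative to the bottom edge" is interpreted here as the horizontal extent of the centroid trajectory (its projection onto the bottom edge); the function names and the subgroup labels are illustrative.

```python
import numpy as np

def centroid(mask):
    """Average coordinates of all silhouette pixels: (column, row)."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def coarse_classify(masks, threshold=20):
    """Divide a sequence into one of two subgroups by trajectory length.

    The horizontal extent of the centroid trajectory is compared with the
    20-pixel threshold stated in the paper.
    """
    xs = [centroid(m)[0] for m in masks]
    horizontal_extent = max(xs) - min(xs)
    return "in_place" if horizontal_extent < threshold else "moving"
```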
The assumed application scenario requires the input data to be occluded in a specific manner; therefore, the foreground masks from the Weizmann database were modified in two different ways to occlude the upper and lower parts of the camera's field of view. This type of occlusion simulates a situation in which a person is too close to the camera, and either the person's legs or the head with the neck and shoulders do not appear in the video frame. Foreground silhouettes in the original masks were located at slightly different vertical positions. Therefore, in order to determine the cut-off point, the centroid coordinates of all silhouettes in a given sequence were averaged. The average value was then decreased or increased to experimentally indicate the final cut-off point, which translates to the size of the occlusion. For example, if the averaged centroid row coordinate equals 50 (the image matrix origin is in the upper left corner), then it was decreased by 20 for the upper occlusion and increased by 20 for the lower occlusion. The cut-off point differs only between sequences due to changes in centroid location, but it is constant for all frames within a given sequence (a simulation of a static camera). Figure 4 depicts several examples of occluded silhouettes.
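The occlusion procedure above can be sketched as follows, using the worked example from the text (averaged centroid row ± 20). The function name and the rounding of the cut-off row are our assumptions.

```python
import numpy as np

def add_occlusion(masks, side="upper", offset=20):
    """Simulate a static occlusion for a whole sequence.

    The cut-off row is the centroid row averaged over all frames of the
    sequence, decreased by `offset` for the upper occlusion or increased by
    `offset` for the lower occlusion, and is kept constant within a sequence
    (static camera). Rows above (below) the cut-off are zeroed out.
    """
    mean_row = np.mean([np.nonzero(m)[0].mean() for m in masks])
    if side == "upper":
        cut = int(round(mean_row - offset))
    else:
        cut = int(round(mean_row + offset))
    occluded = []
    for m in masks:
        out = m.copy()
        if side == "upper":
            out[:cut, :] = 0       # remove head/shoulder region
        else:
            out[cut:, :] = 0       # remove leg/feet region
        occluded.append(out)
    return occluded
```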

Shape Description
The shape description algorithm enables obtaining a numerical representation of a shape. It extracts the most distinctive features and reduces a two-dimensional image matrix to a vector of numbers, or to a single value in the case of simple shape descriptors. Such simple descriptors are basic shape measurements or shape factors that capture general characteristics and are usually not used alone [38]. However, when calculated for a sequence of shapes and observed over time, they become more valuable. An appropriate selection of the representation helps to limit the influence of outliers, and shape generalization can be used additionally. Here, the Minimum Bounding Rectangle (MBR) is employed.
An MBR defines the smallest rectangular region that contains all points of a shape [38,39]. The algorithm creates a blob-like object that covers a foreground silhouette's area and is described by the coordinates of four corner points. The idea of using rectangular blobs is also known from object detection and tracking approaches. Based on the corner coordinates, several characteristics can be calculated and combined into shape ratios. Basic MBR measurements include its length and width. The length of the rectangle can be calculated in two different ways. In the first, the length and width are distances between fixed corners: the width is the horizontal measurement and the length is the vertical one (in relation to the coordinate axes). This approach is used to calculate the other MBR-based descriptors. In the second method, the length is always the longer side of the rectangle and the width is the shorter one, no matter which corner points they concern. These are referred to as 'longerMBR' and 'shorterMBR', respectively. Width and length are further used to estimate the perimeter, area, eccentricity, elongation and rectangularity. Eccentricity is the ratio of the width to the length of the MBR, while elongation is the value of the eccentricity subtracted from 1. Rectangularity shows the similarity of the original silhouette to its MBR and is obtained as the ratio of the area of the shape to the area of its MBR. Ultimately, nine different MBR measurements and ratios are considered as shape descriptors for the evaluated approach.
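The nine MBR-based descriptors can be sketched for an axis-aligned bounding rectangle as follows (a minimal sketch; the paper describes the MBR by four corner points, and the axis-aligned interpretation, the function name and the dictionary keys are our assumptions).

```python
import numpy as np

def mbr_descriptors(mask):
    """Nine shape descriptors derived from the Minimum Bounding Rectangle
    of a binary silhouette (axis-aligned variant)."""
    ys, xs = np.nonzero(mask)
    width = xs.max() - xs.min() + 1      # horizontal side
    length = ys.max() - ys.min() + 1     # vertical side
    mbr_area = width * length
    return {
        "width": width,
        "length": length,
        "shorterMBR": min(width, length),      # shorter side, whichever it is
        "longerMBR": max(width, length),       # longer side, whichever it is
        "perimeter": 2 * (width + length),
        "area": mbr_area,
        "eccentricity": width / length,        # width-to-length ratio
        "elongation": 1 - width / length,      # 1 minus eccentricity
        "rectangularity": ys.size / mbr_area,  # shape area / MBR area
    }
```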
In the shape description step, each foreground mask is represented using the selected shape descriptor. As a result, an image is reduced to a single value. The descriptors of all frames from a given sequence are put into a vector and normalized to the range of 0 to 1. The normalized vector can be plotted as a line graph to observe how the shape descriptor changes over time. The number of shape descriptors of a single sequence equals the number of its video frames; therefore, these vectors are still of different lengths.

Action Representation
This step aims to prepare action representations of equal size, based on the normalized vectors of shape descriptors. To achieve this, the one-dimensional Discrete Fourier Transform is applied. The number of the resultant Fourier coefficients is predefined, and various options are tested to indicate the smallest and most accurate one. If the number of coefficients is smaller than the size of a description vector, the latter is truncated. Otherwise, zero-padding is applied, that is, the descriptor vectors are appended with zeros. Adding zeros in the time domain is equivalent to interpolation in the frequency domain. The final action representations contain the absolute values of the Fourier coefficients.
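The normalization, truncation/zero-padding and DFT steps can be sketched together as follows (a minimal sketch; the function name, the default coefficient count and the guard for a constant descriptor vector are our assumptions).

```python
import numpy as np

def action_representation(descriptors, n_coeffs=56):
    """Build a fixed-size action representation from per-frame descriptors.

    The descriptor vector is normalized to [0, 1], then truncated (if longer
    than `n_coeffs`) or zero-padded (if shorter), and the absolute values of
    its one-dimensional DFT are returned.
    """
    v = np.asarray(descriptors, dtype=float)
    rng = v.max() - v.min()
    v = (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
    if len(v) >= n_coeffs:
        v = v[:n_coeffs]                          # truncate
    else:
        v = np.pad(v, (0, n_coeffs - len(v)))     # zero-pad
    return np.abs(np.fft.fft(v))                  # absolute Fourier coefficients
```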

Action Classification
The proposed approach employs two-stage classification. Coarse classification, which creates two subgroups (actions performed in place and actions with a changing location of a silhouette), can be performed at an earlier step. This is because different shape descriptors and action representations can then be applied to each subgroup. The actual classification is performed in each subgroup separately based on standard leave-one-sequence-out cross-validation. This means that each action representation is matched with the rest of the representations from the database and a matching measure is calculated. This procedure is repeated for all instances and the results are accumulated in terms of the percentage of correct classifications. The closest match indicates the probable class of an action under investigation. If a correlation is used for matching (the C1 correlation based on the L1-norm [40]), then the most similar representation is taken. In turn, if a distance is calculated (the Euclidean distance [41]), the nearest neighbour (the least dissimilar one) indicates the recognized class.
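The leave-one-sequence-out procedure with the two matching measures can be sketched as follows. This is an illustrative sketch: the function names are assumptions, and the exact form of the C1 correlation used here (a normalized L1 similarity, where higher means more similar) is our assumption based on the description of an L1-norm-based measure in [40].

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two representation vectors (lower = closer)."""
    return np.sqrt(np.sum((x - y) ** 2))

def c1_correlation(x, y):
    """Assumed L1-norm-based C1 correlation (higher = more similar)."""
    return 1.0 - np.sum(np.abs(x - y)) / np.sum(np.abs(x) + np.abs(y))

def leave_one_sequence_out(representations, labels, measure="euclidean"):
    """Match each representation against all others; the closest/most similar
    one indicates the predicted class. Returns the accuracy in percent."""
    correct = 0
    for i, test in enumerate(representations):
        best_j, best_score = None, None
        for j, ref in enumerate(representations):
            if i == j:
                continue                     # leave the test sequence out
            if measure == "euclidean":
                score = euclidean(test, ref)
                better = best_score is None or score < best_score
            else:
                score = c1_correlation(test, ref)
                better = best_score is None or score > best_score
            if better:
                best_j, best_score = j, score
        if labels[best_j] == labels[i]:
            correct += 1
    return 100.0 * correct / len(representations)
```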

List of Processing Steps
Sections 3.1-3.5 provided a detailed description of the proposed approach, including how to prepare the input data. To summarize the most important elements, a list of the consecutive processing steps is given as follows:

1. The database is preprocessed as explained in Section 3.1, and occlusion is added based on Section 3.2.

2. Motion and shape features are extracted from all sequences. A sequence is composed of binary foreground masks BM_i = {bm_1, bm_2, ..., bm_n} and is represented by a vector with normalized shape descriptors, SD_i = {sd_1, sd_2, ..., sd_n}, where n is the number of frames. Shape descriptors are based on the Minimum Bounding Rectangle measurements, which are explained in Section 3.3. To collect motion information, centroid locations are stored as trajectories (Section 3.2). Centroid coordinates are calculated as the average of the coordinates of all points included in the shape area.

3. Each SD vector is transformed into an action representation AR using the Discrete Fourier Transform (Section 3.4). The one-dimensional DFT of an exemplary vector u(t) of length T (t = 0, 1, ..., T − 1) is as follows [42]:

U(k) = Σ_{t=0}^{T−1} u(t) e^{−j2πkt/T}, k = 0, 1, ..., T − 1.

Then, a selected number of absolute coefficients is used for classification.

4. Coarse classification is performed based on the trajectory length. The database is divided into two subgroups.

5. Final classification (Section 3.5) is performed in each subgroup separately using the leave-one-sequence-out procedure and one of the following matching measures: the Euclidean distance d_E [41] or the C1 correlation c_1 [40]. The respective formulas for two compared vectors x = {x_1, x_2, ..., x_n} and y = {y_1, y_2, ..., y_n} are:

d_E(x, y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)^2),

c_1(x, y) = 1 − (Σ_{i=1}^{n} |x_i − y_i|) / (Σ_{i=1}^{n} (|x_i| + |y_i|)).

Experimental Conditions and Results
The experiments were carried out with the use of the Weizmann database [21], which contains ten action classes with two action types: the first type includes actions performed in place, such as bending (referred to as 'bend'), jumping jack ('jack'), jumping in place ('pjump'), waving with one hand ('wave1') and waving with two hands ('wave2'), and the second type refers to actions with changing location of a silhouette, namely walking ('walk'), running ('run'), galloping sideways ('side'), jumping forward on two legs ('jump') and skipping ('skip'). Action sequences are divided into subgroups during coarse classification and the division is made based on the trajectory length; therefore, the subgroups are referred to as 'short trajectory' and 'long trajectory', respectively. All foreground images from 90 sequences were preprocessed in accordance with Section 3.1. Based on Section 3.2, two versions of the database were prepared by occluding the upper or lower part of the silhouette, and are referred to as the 'upper-occlusion' and 'lower-occlusion'. Each database is tested separately.
In a single experiment, there are several tests investigating the classification accuracy of combinations of the shape descriptor, the matching measure and the size of the action representation. A sequence of binary silhouettes (less than 150 images per action) is taken as the input data for a single action. Firstly, all silhouettes are represented using the selected shape descriptor and the results are combined into a vector which is then normalized. The vectors from all silhouette sequences are transformed into action representations that are subjected to the classification process. This includes coarse classification into two subgroups and final classification into action classes using the leave-one-sequence-out procedure. A test representation is matched with the rest of the representations and the most similar one indicates the probable action class. Two matching measures are compared: the Euclidean distance and the C1 correlation. The experiments employed nine shape descriptors and various action representations based on 2 to 256 DFT coefficients. The best result relates to the experiment with the highest accuracy (correct classification rate) and the smallest representation size. Then, the combination of a shape descriptor and matching measure used in this experiment is considered the most effective version of the approach. The top results for each shape descriptor, obtained for occluded data, are presented in Tables 1 and 2. In turn, Table 3 contains a comparison of the selected results achieved using both occluded databases and the original Weizmann database. Table 1 contains the results of the experiments performed on the 'lower-occlusion' database. The highest classification accuracy for actions performed in place is 91.11%. The second best is 88.89%, and the third one is 84.44%. The most accurate experiment employed elongation for shape description.
The accuracy of 91.11% is achieved for either the Euclidean distance or the C1 correlation, with action representations composed of 57 and 55 coefficients, respectively. For the second group of actions, the maximum recognition rate is 73.33% and was obtained for two shape descriptors: rectangularity (84 coefficients) and shorterMBR (71 coefficients). In both experiments, the C1 correlation is applied. The difference between the highest results in the two subgroups is large and equals nearly 20%. In the case of actions with a changing location of a silhouette, this should not be surprising, since the lower part of the silhouette is occluded and the feet cannot be localised continuously.
The results of the experiments performed on the 'upper-occlusion' database are less strongly diversified between the subgroups (Table 2). The highest classification rate for actions performed in place equals 88.89% and is attributed to the experiment using elongation, 51 Fourier coefficients and C1 correlation. For the other subgroup, an accuracy of 84.44% is achieved in three experiments, using area, shorterMBR and rectangularity. However, if rectangularity is employed, the representation size is smaller (46 coefficients using C1 correlation).

Table 1. Results for the experiments using the 'lower-occlusion' database. Correct classification rates are given for coarsely classified actions, nine shape descriptors based on the Minimum Bounding Rectangle and two matching measures (EU, Euclidean distance; C1, C1 correlation). The number of DFT coefficients is given in brackets.

The proposed approach is also tested on the Weizmann database without any occlusions. A comparison of the experimental results yielding the highest classification accuracy is given in Table 3, and several conclusions can be drawn. Firstly, in several cases, C1 correlation gives results better than, or equal to, the experiments with Euclidean distance. Secondly, in the subgroup of actions with a changing silhouette location, the lower occlusion has a stronger influence on the results. It can be concluded that, for action classification purposes, information about feet positions is more important than the presence of the head and shoulders. This is also confirmed by the results obtained for the 'upper-occlusion' database, which are very similar to those achieved on the unoccluded database. Different conclusions can be drawn from the results obtained for actions performed in place: since the actors do not move across the camera's field of view, feet positions do not influence the results.
Furthermore, the same accuracy is achieved for the 'lower-occlusion' database and the database without any occlusions. It is also worth noting that different shape descriptors are used, while the size of the action representation is similar. In some cases, the number of DFT coefficients is smaller than the number of images in the input sequence. The proposed approach therefore offers data reduction, both by representing each image with a single value and by representing each action with a selected number of absolute spectral-domain coefficients.
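A rough illustration of this two-stage reduction, assuming 180x144 binary frames (the frame size is an assumption made only for this example):

```python
# Two stages of data reduction: pixels -> one descriptor value per
# frame -> a truncated set of DFT magnitudes per action.
# The 180x144 frame size is an assumption for illustration.
frames, width, height = 150, 180, 144
raw_values = frames * width * height   # pixels in the silhouette sequence
descriptor_vector = frames             # one shape-descriptor value per image
representation_size = 55               # e.g. DFT magnitudes kept for elongation

print(raw_values, descriptor_vector, representation_size)
```

Even before spectral truncation, replacing each frame with a single descriptor value reduces the data by a factor equal to the frame area.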

Discussion and Conclusions
This paper presents an approach for action recognition in the scenario of exercise classification under partial occlusion. It uses a combination of shape descriptors, trajectory and spectral-domain features to assign silhouette sequences to action classes. Only general shape features are tracked; therefore, small disturbances or artefacts in the foreground masks are of minor importance. The proposed approach takes the form of a general procedure consisting of several processing steps, each of which can employ different methods. This makes the approach adaptable to other applications or data, and it also made it possible to test different combinations of shape descriptors, action descriptors and matching measures in the assumed scenario. To further improve recognition accuracy, applying other shape features or classification procedures is being considered.
The goal of the experiments was to identify the combination that gives the highest accuracy for the recognition of occluded silhouettes. Results differ between the subgroups of actions performed in place and actions with a changing silhouette location. For the 'lower-occlusion' database, the highest accuracies are 91.11% (elongation, 55 coefficients, C1 correlation) and 73.33% (shorterMBR, 71 coefficients, C1 correlation) for the two subgroups, respectively. In turn, for the 'upper-occlusion' database, the highest accuracy in the first subgroup is 88.89% (elongation, 51 coefficients, C1 correlation) and 84.44% in the second one (rectangularity, 46 coefficients, C1 correlation). This gives an average accuracy of 86.65% for the entire 'upper-occlusion' database, while for the 'lower-occlusion' database it is 82.22%. Compared to the average accuracy on data without occlusions, which is 88.89%, the decrease in classification rate is small. It can also be concluded that the classification accuracy for actions performed in place is close to 90% and higher than in the other subgroup, which makes the results promising, especially under the assumed scenario.
The proposed approach was tested on datasets modified by the authors; therefore, no other results on the same data are available in the literature. However, some general characteristics of the results can be compared with conclusions drawn for other solutions handling occlusion. The authors of [35] also investigated combinations of different descriptors. They experimented with partial occlusion (10% or 25% of the region of interest occluded) and heavy occlusion (50% or 75% occluded), and compared the results with unoccluded data. Their percentage results are similar to those presented in this paper: for experiments combining trajectory with MBH or HOG, the difference in accuracy between partially occluded and unoccluded data amounted to several percent. The authors of [33] proposed an action recognition approach that uses silhouettes with depth data and achieves nearly perfect accuracy for three subgroups of activity classes. Occlusion is added by removing a quarter or a half of a silhouette. The highest accuracy decrease is observed when the top part of a silhouette is fully or partially occluded, especially in the subgroup that includes action classes such as forward punch, high throw, hand clap and bend; in the other cases, the differences are relatively small. In [34], several horizontal and vertical occlusions were added to the Weizmann database and tested with an approach based on spatio-temporal blocks and partial template matching. The comparison of the results shows small differences in accuracy between the experiments performed on databases with horizontal occlusions and without any occlusion, while vertical occlusions cause a larger accuracy decrease.

Conflicts of Interest:
The authors declare no conflict of interest.

Figure A1. Confusion matrices for the experiment using the 'lower-occlusion' database, rectangularity and Euclidean distance.