Image Generation for 2D-CNN Using Time-Series Signal Features from Foot Gesture Applied to Select Cobot Operating Mode

Advances in robotics help reduce the burden that manufacturing tasks place on workers. For example, a cobot could be used as a "third arm" during assembly tasks. This raises the need to design new, intuitive control modalities. This paper presents a foot gesture approach, centered on robot control constraints, to switch between four operating modes. The control scheme is based on raw data acquired by an instrumented insole worn under the user's foot. It is composed of an inertial measurement unit (IMU) and four force sensors. First, a gesture dictionary was proposed and, from the acquired data, a set of 78 features was computed with a statistical approach and later reduced to 3 via an analysis of variance (ANOVA). Then, the collected time-series data were converted into a 2D image and provided as input to a 2D convolutional neural network (CNN) for the recognition of foot gestures. Every gesture was associated with a predefined cobot operating mode. The offline recognition rate appears to be highly dependent on the features considered and on their spatial representation in the 2D image. We achieved a higher recognition rate for a specific representation of the features as sets of triangular and rectangular forms. These results are encouraging for the use of CNNs to recognize foot gestures, which will then be associated with commands to control an industrial robot.


Introduction
The agile, demand-driven manufacturing process creates the need to design adaptive production using collaborative robots, labelled as cobots. As flexibility in the manufacturing process increases with the rapid evolution of technology, the fabrication process grows in complexity, preventing standard robots from operating alone. Therefore, operators are brought to work with collaborative robots (cobots) in the same workspace and to share production activities or working time with them [1]. This human-robot collaboration is intended to contribute to flexibility and agility thanks to the combination of the human's cognition and management abilities with the robot's accuracy, speed, and repetitive work [2]. However, the cobot's acceptance in industry is still weak, as it raises the thorny issues of security and communication. Safeea et al. [3] demonstrated that the greatest drawback in the development and acceptance of cobots in industries comes from the reliability and the

Gesture Recognition Methods
Human gesture recognition is applied to recognize the useful information of human motion. Statistical modeling, such as the discrete hidden Markov model (HMM), was used as a classifier to learn and recognize five gestures performed during motor hose assembly [24]. It was also used to teach robots to reproduce gestures by looking at examples [25], to distinguish between finger and hand gesture classes [26], and to recognize hand gestures in order to command a robot companion [27]. However, HMMs need a large amount of training data, and therefore system performance can be limited by the characteristics of the training data [28]. Dynamic time warping (DTW), an algorithm used for online time-series recognition, is a widely used method in human gesture recognition applications. It can deal with gesture signals varying in amplitude and resolve ambiguities in the recognition result even for multiclass classification. The use of DTW with a set of sequential data of hand gestures is known to give good classification rates [29]. However, it is a dynamic method that focuses mostly on local motion information and has less consideration for the global features of gesture trajectories. Contrary to DTW, the convolutional neural network (CNN) is a recognition method that uses static images of gesture trajectories and, thus, omits the local motion information [30].
Each method has distinct advantages and disadvantages. In fact, both static and dynamic recognition methods (CNN and DTW) were combined to achieve better recognition accuracy in the localization and recognition of digit-writing hand gestures for smart TV systems [30]. However, CNN is more efficient than many traditional classification methods [31]. CNNs are known for their robustness to small input variations and the low amount of pre-processing necessary for their operation [31]. Numerous applications relying on CNNs for the classification of human gestures or actions have been reported, based on either 1D-CNN [32,33], 2D-CNN [34][35][36], or 3D-CNN [37].
Most camera-based applications rely either on 2D-CNN, as it takes a 2D image as input [36], or on 3D-CNN to accurately capture the information in space. For example, 3D-CNNs have been developed for the recognition of human actions from airport surveillance cameras [37]. This model extracts characteristics along the spatial and temporal dimensions by performing 3D convolutions, thus capturing the motion information encoded in several images. Furthermore, for foot-based applications, some research works rely either on 1D-CNN or 2D-CNN when using inertial measurement unit sensors. Those relying on 1D-CNN directly process the time-series signals obtained from the sensors to achieve accurate classification, as shown in [32,33]. However, the classification performance remains low because of the difficulty of efficiently combining all the information received from the different sensors [33]. Furthermore, 2D-CNN appears to be more realistic as it focuses on the analysis of 2D images, rendering it slower than 1D-CNN but more accurate and flexible in the analysis of features extracted from an IMU [38]. However, it requires defining the set of images built from the raw motion sensor data. Many attempts have been recorded. In [34], a 2D-CNN method for fall detection using body sensors was investigated by directly mapping raw motion data into a 2D image without feature extraction, achieving a high accuracy of 92.3%; however, that method only discriminates between two possibilities (fall or no fall). In [35], a similar work was conducted, based on an effective representation of sEMG (surface electromyography) signals as images, using a sliding window to continuously map all the signals obtained from the input to a grayscale image. However, none of these works demonstrates the impact of the spatial representation of the features used to constitute a 2D image on the classification results.
Thus, we formulated two hypotheses: (1) in a foot-based interaction context, a 2D convolutional neural network seems suitable for foot gesture recognition; and (2) the selection of the most important features and their spatial representation in the 2D image greatly impact the recognition process.
By using an instrumented insole and applying a 2D-CNN algorithm, the main contribution of the present study is to develop a new methodology for a foot gesture recognition system used to select a cobot operating mode. The instrumented insole was worn by the worker to acquire the foot gesture signals. More specifically, we suggest a simple feature extraction technique using data acquired from an inertial measurement unit (IMU) and force sensors, as well as a 2D image generation to classify foot gestures. To achieve this goal, we evaluated our system on different gesture scenarios, since those can be performed easily to control a robot. The proposed classification algorithm, trained with backpropagation, is then optimized to recognize gestures. Our results show a new advance in this area and provide interesting directions for future research by highlighting the impact of feature extraction and of the spatial representation of the features in a 2D image on the recognition process. By enhancing existing foot recognition methods, our goal is to ease the operator's work.

Materials and Methods
Since the operator's hands were occupied during his work, this article proposes to use foot movements to control a robot. The overview of the proposed gesture recognition system is illustrated in Figure 1.
The system requires data from a human's foot to be computed and analyzed for selecting one cobot operating mode. The material aspect is presented in Section 3.1. For the treatment process, the gesture recognition system was based on machine learning classification, thus requiring training and validation phases.
The training phase began with defining a set of foot gestures to be assimilated to cobot operating modes (Section 3.2). Once the dictionary was established, we proceeded to data processing and then features selection (Section 3.3) to reduce the complexity of the model. Once completed, the selected features were transmitted to the image generation (Section 3.4) to determine the most relevant representation. The generated images were provided as an input for the 2D-CNN used for foot gestures recognition (Section 3.5).
The testing phase involved testing the classification of foot gestures with the 2D-CNN. The proposed real-time implementation algorithm is summarized in Figure 2. It depicts an initial set of conditions to discriminate between a normal walking pattern and a foot gesture command. Once the algorithm detected that the user had started a gesture, it waited for a time T until the gesture was completed. The detection of the start of a gesture was based on a triggering condition related to the FSR sensors. Using the data inside the sliding window, the algorithm proceeded to compute the features, generate an image, perform the 2D-CNN classification for gesture recognition, and submit an operating mode to the cobot. The cobot then selected an appropriate algorithm from the available operating modes, such as trajectory tracking, collision avoidance, etc.
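The loop below is a minimal Python sketch of this real-time procedure, not the authors' implementation: the trigger condition, the window length T, and the helper functions compute_features, generate_image, classify, and send_mode are placeholders for the steps named above.

```python
FS = 32                 # sampling frequency of the insole (Hz)
T = 1.0                 # assumed gesture duration window (s); the paper waits a time T
WINDOW = int(FS * T)    # number of samples collected once a gesture is triggered

def gesture_triggered(sample):
    """Placeholder trigger: the paper uses a condition on the FSR sensors
    to separate a gesture command from a normal walking pattern."""
    f1, f2 = sample["F1"], sample["F2"]
    return f1 > 0.4 or f2 > 0.4   # hypothetical threshold on normalized FSR sums

def recognition_loop(stream, compute_features, generate_image, classify, send_mode):
    """stream yields one sensor sample (dict) per tick at FS Hz."""
    buffer, collecting = [], False
    for sample in stream:
        if not collecting and gesture_triggered(sample):
            collecting, buffer = True, []
        if collecting:
            buffer.append(sample)
            if len(buffer) == WINDOW:                 # gesture assumed complete after T
                features = compute_features(buffer)   # e.g., Nam, F1m, F2m
                image = generate_image(features)      # 2D grayscale image
                gesture = classify(image)             # 2D-CNN prediction (G1..G5)
                send_mode(gesture)                    # select the cobot operating mode
                collecting = False
```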

Instrumented Insole
While the user made a gesture, the instrumented insole acquired, processed, and wirelessly transmitted the data via TCP to the computer to start the gesture recognition. The proposed enactive insole is a non-intrusive, non-invasive, and inexpensive device. The sampling frequency used in data processing and transmission was 32 Hz (Figure 3). It contained a 9-axis motion processing unit MPU9250 [39], which measured the foot's acceleration, angular velocity, and orientation through a set of 3-axis accelerometer, 3-axis gyroscope, and 3-axis magnetometer combined with a digital motion processor (DMP). Moreover, four force-sensitive resistors (FSR), two in the forefoot position and two in the heel position, were also integrated to measure the pressure applied on the insole. The analog signals acquired from the pressure sensors were converted by an analog-to-digital converter (ADC) ADS1115 [40] with a 16-bit resolution. Finally, an ESP8266-12E WiFi module [41], located at the foot arch position, was used to transmit the data to a local computer. The detailed design of the insole was previously presented in [42].
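For illustration, the following sketch shows how such a 32 Hz TCP stream could be decoded on the computer side. The frame layout (14 little-endian floats: 3 accelerometer, 3 gyroscope, 3 magnetometer, 4 FSR values, and a timestamp), host, and port are purely hypothetical; the real packet format of the insole in [42] may differ.

```python
import socket
import struct

FRAME_FMT = "<14f"                     # hypothetical frame layout (see lead-in)
FRAME_SIZE = struct.calcsize(FRAME_FMT)

def read_samples(host="192.168.4.1", port=5000):
    """Yield decoded samples from the insole's TCP stream."""
    with socket.create_connection((host, port)) as sock:
        pending = b""
        while True:
            chunk = sock.recv(1024)
            if not chunk:              # connection closed by the insole
                break
            pending += chunk
            while len(pending) >= FRAME_SIZE:
                frame, pending = pending[:FRAME_SIZE], pending[FRAME_SIZE:]
                values = struct.unpack(FRAME_FMT, frame)
                yield {"acc": values[0:3], "gyro": values[3:6],
                       "mag": values[6:9], "fsr": values[9:13], "t": values[13]}
```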

Once the material architecture was defined, a cobot operating mode selection based on gesture dictionaries was used for the 2D-CNN training phase, as presented in the next section.

Foot-Based Command: Gesture Dictionaries
Selection between cobot operating modes was based on two gesture dictionaries, one for each kind of sensor (pressure and IMU), for classification purposes. Machine learning classification needs a training phase with a set of grayscale images generated from relevant features for each gesture. This section proposes two foot-based dictionaries utilizing information from the 3-axis accelerometer, 3-axis gyroscope (angular velocity), 3-axis magnetometer, and the four pressure sensors of the insole.
Based on the sensor readings and the different movements of the foot, dictionaries of movements are shown below. Tables 1 and 2 present some basic movements recognizable by each sensor considered alone. Table 1. Dictionary of detectable movements by the accelerometer with ankle as center.

Movements of Rotation and Translation with Ankle at Center / Movements of Rotation and Translation with Toes at Center: horizontal movement of rotation (with heel as center); vertical movement of rotation (with ankle as center); movement of translation (up/down); horizontal movement of rotation (with toes as center); vertical movement of rotation (with toes as center). Note: Each movement is described below its illustration.

Table 2. Active or inactive force sensors (FSR) during the movements. 4 FSRs: the four sensors are inactive (foot is not touching the ground); the four sensors are active (foot flat on the ground). 2 FSRs: the two sensors at the front are active (foot is inclined forward); the two sensors at the back are active (foot is inclined backward); the two outside sensors are active (foot is inclined outwards); the two inside sensors are active (foot is inclined inwards). 1 FSR: only the sensor at the front outside is active (foot is inclined front-outward); only the sensor at the front inside is active (foot is inclined front-inward). Notes: The movements are described below the illustrations. In each illustration (signal sent by the pressure sensors), an empty blue circle represents an inactive sensor while a full blue circle represents an active sensor.
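Because each pose in Table 2 is fully determined by which FSRs are active, the table can be encoded directly as a lookup from the set of active sensors to a pose label. The sensor names in the sketch below are illustrative.

```python
# Sensor labels (illustrative): FO = front outside, FI = front inside,
# BO = back outside, BI = back inside.
FSR_PATTERNS = {
    frozenset():                          "foot not touching the ground",
    frozenset({"FO", "FI", "BO", "BI"}):  "foot flat on the ground",
    frozenset({"FO", "FI"}):              "foot inclined forward",
    frozenset({"BO", "BI"}):              "foot inclined backward",
    frozenset({"FO", "BO"}):              "foot inclined outwards",
    frozenset({"FI", "BI"}):              "foot inclined inwards",
    frozenset({"FO"}):                    "foot inclined front-outward",
    frozenset({"FI"}):                    "foot inclined front-inward",
}

def foot_pose(active_sensors):
    """Map the set of currently active FSRs to the pose labels of Table 2."""
    return FSR_PATTERNS.get(frozenset(active_sensors), "unlisted pattern")
```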
From these simple foot gesture dictionaries, combinations of three or four movements were used to create five gestures, as shown in Table 3. Each movement had several advantages: it was simple to execute and easy to detect at once. Table 3. Representation of the five proposed gestures denoted from G1 to G5.

Once identified, the foot gestures presented in Table 3 needed to be mapped to the defined cobot operating modes. In this study, based on the observations of Alexander et al. [8], the commands and their mapped gestures are presented in Table 4. Additional gestures with different commands could certainly be defined, as described in the introduction, such as physical collaboration [10], autonomous action in shared activities [11], remote control motion [12], and learning new tasks [4]. The proposed foot-based dictionary mapped with the cobot operating modes must be decoded in order to accurately scope the difference between gestures. The next section presents the overall process for data acquisition and features selection.

Data Acquisition and Features Selection
The data presented in Table 3 were acquired by an instrumented insole worn on the left foot. In this study, the gestures of a single participant (one of the authors of this paper, a healthy adult) were recorded. The measurement time of each gesture was set at 15 s. For numerical simulation, signals from the 3-axis accelerometer, the 3-axis gyroscope, and the 4 FSRs were exploited. We also measured the Euler angles and the quaternions from the digital motion processor (DMP). The details of the insole's signals are provided in Table 5. For this study, we only focused on the sum of the FSR sensors rather than considering them individually because, based on our proposed gestures, it is difficult to have only one FSR sensor activated at a time.
Once the insole's data were collected, features enhancement and selection or reduction could be conducted to accurately scope the characteristics of each proposed gesture for classification purposes, thus limiting the complexity of the model [43].
We tried two methods using the proposed dataset. Firstly, we selected 8 features, presented in Table 6, from the acquired data. The choice of these 8 features was based on our observation of the signal behavior for each gesture: we noticed a difference in the signals' variation from one gesture to another. Table 7 presents these signals. Table 7. Signals of the norm of acceleration and the norm of angular velocity related to the five proposed gestures.

In Table 6, F1 and F2 denote the sums of the two FSR sensors located in the forefoot and in the heel, respectively, and Ftot denotes the sum of the four FSR sensors; Table 7 lists, for each gesture G1 to G5, the traces of the norm of acceleration Nam and of the norm of angular velocity Ngy.
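As a minimal sketch, the per-sample quantities listed above can be computed directly from the raw insole arrays; the column ordering of the FSR channels assumed below is illustrative.

```python
import numpy as np

def per_sample_signals(acc, gyro, fsr):
    """Compute the per-sample signals used as features.

    acc, gyro : arrays of shape (N, 3) -- accelerometer / gyroscope axes
    fsr       : array of shape (N, 4)  -- assumed order [front-out, front-in, heel-out, heel-in]
    """
    Na = np.linalg.norm(acc, axis=1)     # norm of acceleration
    Ngy = np.linalg.norm(gyro, axis=1)   # norm of angular velocity
    F1 = fsr[:, 0] + fsr[:, 1]           # sum of the two forefoot FSRs
    F2 = fsr[:, 2] + fsr[:, 3]           # sum of the two heel FSRs
    Ftot = fsr.sum(axis=1)               # sum of the four FSRs
    return Na, Ngy, F1, F2, Ftot
```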
At this point, we could observe that the norm of acceleration for G1 and G2 presents important peaks of about 8000 mm·s−2. However, for G3 and G4, the peak value is lower, at about 4900 mm·s−2. As for G5, the norm of acceleration attains a higher value for a long time. Moreover, the signals obtained from the norm of the angular velocity, the sum of the two FSR sensors located at the forefoot (F1), the sum of the two FSR sensors located at the heel (F2), and the Euler angles are suitable to be selected as different features to discriminate foot gestures.
The second method used in this paper considered feature enhancement and reduction, which consists of using the raw signals obtained from the instrumented insole and then computing enhanced features. This operation led to a set of 78 features presented in Tables 8 and 9. Table 9. Features preselected for statistical analysis part 2.

Statistical Parameters (Abbreviation) and Characteristics: Skewness (Skew): AcXskew, AcYskew, AcZskew; Kurtosis (Kurt): AcXkurt, AcYkurt, AcZkurt; Root Mean Square (Rms): AcXrms, AcYrms, AcZrms. Notes: Ac and Va correspond, respectively, to the acceleration and the angular velocity computed along the X, Y, and Z axes; Na is the norm of the acceleration; P, R, and Y are the Euler angles; q1, q2, q3, q4 are the quaternions.
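A compact way to obtain such an enlarged feature set is to compute the same statistical descriptors for every channel of a gesture window. The sketch below, using SciPy, assumes the window is provided as named 1-D arrays and is not the authors' exact feature list.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def enhance_features(window):
    """Compute statistical descriptors (mean, skewness, kurtosis, RMS) for each
    channel of a gesture window, e.g. {"AcX": ..., "AcY": ..., "Na": ..., "F1": ...}.
    Applying this to every insole channel yields the enlarged feature set."""
    feats = {}
    for name, x in window.items():
        feats[f"{name}_mean"] = float(np.mean(x))
        feats[f"{name}_skew"] = float(skew(x))
        feats[f"{name}_kurt"] = float(kurtosis(x))
        feats[f"{name}_rms"] = float(np.sqrt(np.mean(x ** 2)))
    return feats
```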
Dimension reduction techniques used in this paper extract the relevant features to be used in the image generation process. According to the state of the art, there are mainly two approaches. The first reduces the number of features by searching possible combinations of features to identify the principal components with the highest variance, which are then used for classification purposes. The employed method is usually principal component analysis (PCA), which only focuses on generating new inputs regardless of the data labels, thus posing a problem for feature selection in real-time identification, where the principal components might differ from one time to another. The other solution is feature selection, which consists of choosing, among the set of possible features, the most representative ones. This method is usually based on a statistical analysis that evaluates the importance of each feature for discriminating between gestures. In this work, ANOVA, the most widely used statistical analysis, was used to compare the significant differences in characteristics and to determine whether or not a characteristic allows good identification of gestures, as suggested in [44]. ANOVA's result was then calculated from the null hypothesis, which states that the distributions of all the calculated characteristics are similar. The null hypothesis was rejected when the probability (p-value) was less than 0.05, meaning that the characteristics were significantly different. The ANOVA results, computed with Matlab 2016b for a data set of 100 samples (20 per gesture), are given in Table 10. For each gesture, the ANOVA results determined that there are three main characteristics, which are the norm of acceleration (Nam), the sum of the two FSR sensors located at the forefoot (F1m), and the sum of the two FSR sensors located at the heel (F2m). Figure 4 presents the ANOVA representation of each selected feature and its corresponding values for each of the proposed five gestures, numbered from G1 to G5. An analysis of the proposed ANOVA results shows the possibility to enhance our classification method by means of a threshold. Figure 4a shows that, for the mean of the norm of acceleration Nam, there is a threshold of 0.25. This means that, for gestures where the variation of Nam is important, such as gestures 1, 2, and 5, the measured value is greater than 0.25, whereas, for gestures 3 and 4, the value of Nam is less than 0.25. Therefore, additional conditions were set.
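The feature screening described above can be reproduced with a one-way ANOVA per feature across the five gesture classes, keeping the features whose p-value falls below 0.05. The following SciPy sketch illustrates the idea without claiming to match the exact Matlab computation used in the paper.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_feature_selection(feature_table, labels, alpha=0.05):
    """One-way ANOVA per feature across the gesture classes.

    feature_table : array (n_samples, n_features)
    labels        : array (n_samples,) with gesture ids G1..G5
    Returns the indices of features whose p-value is below alpha."""
    selected = []
    for j in range(feature_table.shape[1]):
        groups = [feature_table[labels == g, j] for g in np.unique(labels)]
        _, p_value = f_oneway(*groups)
        if p_value < alpha:   # null hypothesis rejected: feature discriminates gestures
            selected.append(j)
    return selected
```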
By reproducing the same analysis, a similar set of conditions applied to the mean of the sum of the two FSR sensors in the heel shows a threshold value of 0.4; meanwhile, for the sum of the two FSR sensors located at the forefoot, the threshold appears to be difficult to set. A further histogram analysis, conducted on a more complete data set of about 100 samples per gesture, is presented in Tables 11 and 12.

Histogram analysis of Nam shows the same threshold value of 0.25 as the one presented from ANOVA's result in Figure 4a. The histogram analysis of F2m presents a threshold value of about 0.35, and for F1m the threshold value appears to be 0.38. Those results are mainly the same as those obtained from ANOVA's analysis in Figure 4b for F1m and Figure 4c for F2m. In order to generalize the threshold results, we decided to set it to 0.4 for both F1m and F2m. Table 13 presents a summary of the proposed threshold values used by the processing algorithm to ensure image normalization. Once the features are selected, the next section proposes the 2D-CNN image generation for classification purposes.

2D-CNN Image Generation
A 2D-CNN system was used to recognize gestures. The 2D-CNN system takes as input a 2D image constituted from the features presented above. Independently of the features selected for image generation, the algorithm of the temporal method, involving the signal preprocessing and the image composition, follows five steps: (1) collection of the sensor data; (2) segmentation of the signals (the beginning of each gesture was identified and the first twenty-five samples were recorded from that point); (3) determination of all the maximum values of the insole's sensor measurements; (4) normalization of the data between 0 and 1 (a division of the data by the previously measured maximum); and (5) composition of the matrices of pixels.
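The five steps can be summarized by a short routine. The sketch below assumes a 25-sample window per channel and a simple row-by-row placement of the normalized values; the actual spatial disposition used in the paper is the one shown in Figure 5a.

```python
import numpy as np

def compose_image(raw_window, max_values, side=15):
    """Sketch of the five-step temporal method.

    raw_window : array (n_channels, 25) of segmented sensor data (steps 1-2)
    max_values : array (n_channels,) of per-channel maxima (step 3)
    """
    normalized = raw_window / max_values[:, None]        # step 4: scale into [0, 1]
    flat = (normalized.ravel() * 255).astype(np.uint8)   # grayscale values 0..255
    pixels = np.zeros(side * side, dtype=np.uint8)       # step 5: pixel matrix
    n = min(flat.size, pixels.size)
    pixels[:n] = flat[:n]                                # simple row-by-row placement
    return pixels.reshape(side, side)
```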
For 2D-CNN image generation, we firstly define a set of images based on features selection presented in Table 6. These 8 features were represented in an image according to the spatial disposition presented in Figure 5a. This representation results in a 15 × 15 pixels image and the images obtained from the 5 different foot gestures are shown in Figure 5b Secondly, for complexity reduction purposes, we constructed two sets of image based on the three selected features obtained from ANOVA analysis. A first set of image was constructed based on rectangles representation of the selected features according to Figure 6. Each feature was converted into a pixel and displaced accordingly to the repre sentation in Figure 6a. Since they are grayscale images, the value of each pixel in the ma trix is between 0 (indicating black) and 255 (indicating white). The images presented in Figure 6 are based on a set of rectangles. Images are also made up of 11 × 11 pixels. Secondly, for complexity reduction purposes, we constructed two sets of images based on the three selected features obtained from ANOVA analysis. A first set of images was constructed based on rectangles representation of the selected features according to Figure 6. Each feature was converted into a pixel and displaced accordingly to the representation in Figure 6a. Since they are grayscale images, the value of each pixel in the matrix is between 0 (indicating black) and 255 (indicating white). The images presented in Figure 6 are based on a set of rectangles. Images are also made up of 11 × 11 pixels.
Secondly, for complexity reduction purposes, we constructed two sets of images based on the three selected features obtained from the ANOVA analysis. A first set of images was constructed from a rectangle representation of the selected features according to Figure 6. Each feature was converted into a pixel and placed according to the representation in Figure 6a. Since these are grayscale images, the value of each pixel in the matrix is between 0 (black) and 255 (white). The images presented in Figure 6 are based on a set of rectangles and are made up of 11 × 11 pixels.

To reduce the grid size of the image, a new set of geometric representations was then proposed for modeling the three selected characteristics. A square, a rectangle, and a triangle represent the mean of the norm of acceleration Nam, the mean of the sum of the two FSR sensors integrated in the forefoot position F1m, and the mean of the sum of the two FSR sensors integrated in the heel position F2m, respectively. This method is called "data wrangling" and consists of transforming the raw data into another format to make it easier to use. Figure 7 presents the proposed method for obtaining the set of images, with the thresholds determined from the analysis presented in Section 3.3. The output of this image generation is a 9 × 9 pixel image that characterizes each gesture. Figure 8 shows the theoretical image obtained for each gesture.
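The exact pixel layout of these forms is defined in Figures 6-8, which are not reproduced here. The sketch below shows one way the three thresholded features could be wrangled into a 9 × 9 grayscale matrix of the kind described above; the drawing regions, shape sizes, and intensity scaling are illustrative assumptions rather than the layout actually used.

```python
import numpy as np

def compose_image(nam, f1m, f2m, thr_na=0.25, thr_f=0.4):
    """Sketch of the "data wrangling" image composition.

    nam, f1m, f2m are the three selected features, normalized to [0, 1];
    thr_na and thr_f are the threshold values retained in Table 13.
    """
    img = np.zeros((9, 9), dtype=np.uint8)

    if nam > thr_na:            # square region for Nam (assumed position)
        img[1:4, 1:4] = int(255 * nam)
    if f1m > thr_f:             # rectangle region for F1m (assumed position)
        img[1:3, 5:8] = int(255 * f1m)
    if f2m > thr_f:             # triangle region for F2m (assumed position)
        for r in range(3):
            img[5 + r, 4 - r:5 + r] = int(255 * f2m)
    return img
```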

2D-CNN Classification Method
A grayscale image was used as the input of the CNN. The CNN consists of a succession of layers that include feature maps and subsampling maps. The CNN model is designed with four main building blocks, as shown in Figure 9: (1) convolution; (2) pooling or subsampling; (3) non-linearity (ReLU); and (4) fully connected layers.

Convolution is the first layer of CNN. Indeed, its role consists of extracting the characteristics of the images presented as the input. During this phase, 2D convolution is applied to the image in order to determine its useful information. The filtered images pass through the second layer (pool) of the CNN. The role of this part is to reduce the size of the image while preserving its most important information. Indeed, a sliding window traverses the image and reduces its size by using a local maximum operation. The rectified linear unit (ReLU) is the third layer of the CNN in which each negative value will be replaced by zero. Therefore, the size of the image is not changed in this layer. The fully connected layer is a multilayer perceptron that combines the characteristics of the images and determines the probability of each class presented in the learning phase. In this proposed CNN architecture, the nonlinear function used is the sigmoid function. Figure 9 presents the general structure of the CNN used for gesture recognition.
Based on the structure of the image presented as input, some characteristics of the network were adopted, as given in Table 14.
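The concrete values (filter counts, kernel sizes, layer widths) are those listed in Table 14, which is not reproduced here. The following tf.keras sketch only illustrates the four building blocks in the order described above, with assumed sizes; it is not the authors' exact network.

```python
import tensorflow as tf

def build_cnn(input_size=9, n_classes=5):
    """Illustrative 2D-CNN with the four building blocks of Figure 9."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_size, input_size, 1)),  # grayscale input image
        # (1) convolution and (3) non-linearity (ReLU); sizes are assumptions
        tf.keras.layers.Conv2D(8, kernel_size=3, padding="same", activation="relu"),
        # (2) pooling: local maximum over a sliding window
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        # (4) fully connected layer; the paper uses a sigmoid non-linearity here
        tf.keras.layers.Dense(32, activation="sigmoid"),
        # Output over the five gesture classes (softmax is an assumption)
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    return model
```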

Results
The foot gesture identification task was considered as a pattern recognition problem in which a set of foot movements of one of this paper's authors was recorded for the training and validation steps. The classification of gestures was based on statistical information extracted from their patterns. For every gesture, 70% of the data were defined as training samples, 15% as validation samples, and 15% as test samples. The CNN model was trained using the training and validation sets and tested independently with the testing set. Many tests (100) were performed to obtain an optimized model. The parameters selected to test the CNN model in TensorFlow were obtained from the training process and are given for each type of input image in Table 15. The recognition process is based on the gradient method. The confusion matrix related to each method and the recognition rate for the five foot gestures are presented in Table 16.
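A minimal sketch of this protocol is given below, assuming the generated images and gesture labels are available as NumPy arrays. The 70/15/15 split and the gradient method (sketched here as SGD with momentum, with values of the kind reported in Table 15) follow the description above, while the epoch count, batch size, and function name are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def train_and_test(model, images, labels, learning_rate=0.005, momentum=0.9):
    """Train on 70% of the data, validate on 15%, and test on the remaining 15%."""
    n = len(images)
    idx = np.random.permutation(n)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    train, val, test = np.split(idx, [n_train, n_train + n_val])

    # Gradient-based optimization (SGD with momentum).
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(images[train], labels[train],
              validation_data=(images[val], labels[val]),
              epochs=50, batch_size=16)

    # Independent evaluation on the test set (basis for the confusion matrix).
    return model.evaluate(images[test], labels[test])
```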

Table 15. Parameters of the CNN model obtained for each type of input image.
Image based on temporal analysis: learning rate 0.01, momentum coefficient 0.6.
Image based on ANOVA features (set of rectangles): learning rate 0.00019, momentum coefficient 0.899.
Image based on ANOVA features (set of rectangles and triangles): learning rate 0.005, momentum coefficient 0.9.

Table 16. Recognition rate and observations for each type of input image.
Image based on temporal analysis: The recognition rate is about 60%. By exploiting all 8 features from human observation, only 3 foot gestures are correctly recognized (G2, G4, G5). G1 and G2 are recognized as the same gesture. Furthermore, the system could not accurately identify G3: in 30% of cases G3 is classified as G2 and in 70% of cases as G5.
Image based on ANOVA features, first attempt (set of rectangles): By using the statistical analysis based on ANOVA, the recognition rate (about 74%) is greater than the previous one by about 14%. This set of images, based on a spatial representation of the selected features using a rectangular form, successfully recognizes 3 foot gestures (G1, G3, G4), and the system is able to make a clear distinction between G1 and G2. However, there is still some confusion: G2 is split between G1 and G5 (66.6% and 33.3%, respectively), and G5 is split between G1 and G5 (30% and 70%, respectively).
Image based on ANOVA features, final proposition (set of rectangles and triangles): With the enhancement of the images using ANOVA for feature selection and the modification of the spatial representation of the features using a set of forms (squares, rectangles, and triangles), the system achieves a 100% recognition rate. Therefore, each foot gesture is correctly identified.

Based on these results, it can be inferred that the ANOVA analysis contributed to the large increase (about 14%) in the recognition rate, implying that feature selection has an important place in the recognition process. Furthermore, by using different spatial distributions of the selected features obtained from the ANOVA analysis, we achieved different results: 74% for the first case and 100% for the second one. These results show that the rescaling method applied to the feature data has an important impact on the 2D-CNN-based classification method.

Limit of the Study
The limitations of this study can be seen in several points. Firstly, the recognition process only accounts for one user (the first author of this work), whose characteristics have previously been captured in the convolutional neural network, thus requiring the training process to be repeated for every new user. Secondly, our study was conducted in a strictly supervised environment where noise arising from environmental factors, such as vibrations, was removed, so robustness to disturbances must be enhanced for industrial purposes. Thirdly, the current study has not yet been implemented in a real-time embedded system for online classification tests. Finally, a study of the proposed classification algorithm with a larger set of gestures and participants is yet to be conducted.

Conclusions and Future Works
In this paper, a new method that can be used for human-robot interaction in hybrid work cells is proposed. The goal is to switch between possible cobot operating modes based on foot gesture commands. Therefore, this article presents a foot gesture human-robot interface using an instrumented insole located inside the worker's left shoe. Firstly, two foot gesture dictionaries were formulated, and then five gestures assimilated to five selected commands to control a robot were chosen. Foot gesture signals were collected from the insole and processed for feature selection. In this process, a statistical analysis using a dataset recorded from one person who repeated the different foot gestures several times was computed to identify the most representative features, i.e., the mean of the acceleration norm, the mean of the sum of the two FSR sensors located in the forefoot, and the mean of the sum of the two FSR sensors located in the heel. Then, several sets of grayscale images based on the spatial representation (geometric form) of the above features in the 2D image were proposed to adequately capture the differences between the suggested five gestures. The proposed 2D images were given as input to a 2D convolutional neural network with a backpropagation algorithm for foot gesture recognition. Offline results showed the strong impact of the variance analysis on the recognition process, as we achieved a recognition rate of 74% simply by selecting the relevant features. Furthermore, the spatial representation of the selected features in the 2D images also greatly impacts the recognition process, as there exists a set of geometric configurations for which the recognition rate is very high, nearly 100%. From these results, it can be inferred that the use of foot gesture classification for cobot operating mode selection is possible.
Future research aims to increase the number of chosen gestures in order to have more assimilated commands. Furthermore, for generalization purposes, larger sets of foot gesture executions from different persons are required and, finally, a real-time implementation of the proposed solution on the instrumented insole's processor ought to be attempted.