Optimizing the Sensor Placement for Foot Plantar Center of Pressure without Prior Knowledge Using Deep Reinforcement Learning

We study foot plantar sensor placement using a deep reinforcement learning algorithm without any prior knowledge of the foot anatomical area. To apply a reinforcement learning algorithm, we propose a sensor placement environment and reward system that aim to optimize the fit of the center of pressure (COP) trajectory during a self-selected speed running task. In this environment, the agent places eight sensors within a 7 × 20 grid coordinate system, and the final pattern is the resulting sensor placement. Our results show that this method (1) generates a sensor placement with a low mean squared error in fitting the ground-truth COP trajectory, and (2) robustly discovers an optimal sensor placement among an enormous number of combinations, more than 116 quadrillion. The method can also solve different tasks beyond the self-selected speed running task.


Introduction
Foot plantar pressure is the distribution of force between the sole of the foot and the support surface. Plantar pressure measurement systems have been used in several applications, such as sports performance analysis and injury prevention [1], gait monitoring [2], and biometrics [3]. In the literature, various sensor placement patterns, either adapted from the foot anatomical areas or filled with a mesh-like sensor array, are discussed by Razak et al. [4]. The mesh-like design approach increases measurement accuracy but also increases cost. However, reducing the number of sensors while achieving acceptable accuracy is challenging, and a sensor placement pattern is usually determined by a human expert. By contrast, this paper proposes a new design approach for plantar sensor placement based on plantar pressure data and a deep reinforcement learning (DRL) [5,6] algorithm. This approach uses the center of pressure (COP) trajectory to evaluate the quality of a sensor placement. With this mechanism, we aim to find new placement patterns that human knowledge has not yet discovered.
Reinforcement learning (RL) [7] is a framework that consists of an environment and an agent and trains the agent's policy through feedback from the environment. In many complex domains, reinforcement learning is the only feasible way to train a program to perform at a high level. Deep reinforcement learning (DRL) merges deep learning (DL) [8] with RL. Deep learning is a branch of machine learning that uses artificial neural networks to extract information from high-dimensional data and has led to breakthroughs in computer vision [9][10][11][12] and speech recognition [13,14]. DRL uses a deep neural network as a function approximator [15], which not only allows the algorithm to extract information from high-dimensional data but also scales up its ability to solve more complex problems. Moreover, DRL has accomplished many achievements in modern machine learning, such as mastering the game of Go without human knowledge [16] and defeating world champions in a multiplayer real-time strategy game [17]. For the sensor placement problem, enumerating the enormous number of placement combinations by brute force is not feasible, so we adopted DRL for this problem.
We have organized the rest of this paper in the following way. First, we describe the collection and preprocessing of self-selected speed running plantar pressure videos. Second, we propose the environment and reward system for designing the sensor placement; they aim to optimize the sensor placement for COP accuracy and to suit DRL. Third, we briefly illustrate Soft Actor-Critic Discrete (SAC-Discrete) [18], a discrete version of the Soft Actor-Critic (SAC) RL algorithm [19], and apply it to the sensor placement task with some simple testing data that we created. Fourth, we utilize the Population-Based Training (PBT) [20] method to tune the hyperparameters of SAC-Discrete; applying this method enhances training stability and performance in our sensor placement task. Finally, we feed the plantar pressure videos to the sensor placement environment and present the results and conclusions.

Experimental Protocol
Each subject runs for three minutes at a self-selected speed on the treadmill. The data logger is triggered by an external trigger button once the subject is comfortable with the treadmill's current speed. All subjects wear the same model of shoes in their proper size.

Self Selected Speed Plantar Pressure Video Collection
Plantar pressure video is recorded with the F-Scan [21] system by Tekscan, which measures plantar pressure with an insole pressure sensor array. The system contains a pair of resistive sensor [22] sheets placed on top of the insoles and fixed with double-sided tape to prevent the sheets from slipping during recording. The pressure range of the sensor sheets in this experiment is 1-150 psi (approximately 7-1000 kPa). The F-Scan recording software version is 7.50-07, and the sensor sheets are calibrated by this software before recording. The maximum sampling rate of this F-Scan hardware/software system is 750 Hz. In this experiment, we set the acquisition frequency to 100 Hz, so the recorded video's output frame rate is also 100 Hz. Since the plantar pressure video is exported from the F-Scan software after calibration, its spatial resolution is 21 × 60, and the unit of each pixel value is kPa. The F-Scan system starts recording when a subject is comfortable with the treadmill's current speed and finishes the recording after three minutes. This experiment is illustrated in Figure 1.

Data Preprocessing
Plantar pressure videos collected from the F-Scan system are preprocessed to construct a data set; for each episode, the sensor placement environment randomly selects a plantar pressure video from this data set to calculate rewards. The preprocessing steps are as follows:

1.
A gait cycle consists of the stance phase and the swing phase; during the swing phase, the F-Scan system does not receive any pressure information. Thus, we remove the swing phases within a three-minute plantar pressure video by splitting it into individual stance-phase plantar pressure videos.

2.
To reduce the number of stance-phase plantar pressure videos, we divide them into five equal groups in time order and randomly choose one video from each group.

3.
The stance-phase plantar pressure videos are cropped to remove the white border, i.e., any row or column that receives no pressure throughout the video.

4.
After cropping, each stance-phase plantar pressure video has a different spatial resolution. Thus, we downsample each video to 7 × 20 using the pressure formula P = F/A.
For each subject, this experiment collects two three-minute plantar pressure videos, one for the left foot and one for the right foot. After preprocessing, the data collected from one subject produce ten 7 × 20 stance-phase videos, as Figure 2 shows. Since fifteen subjects joined this experiment, 150 stance-phase plantar pressure videos are used in the sensor placement environment.
Figure 2. (a) The green and yellow videos represent stance-phase plantar pressure videos; each video's total frame count depends on its stance-phase duration. The pink videos represent the stance-phase plantar pressure videos randomly selected from the five equal groups. (b) For steps three and four, one of the chosen videos is used for demonstration; the image beside each video is its pixel-wise accumulated image, used to visualize the cropping and resampling processes. The purple video represents the cropped video, and the orange video is the final result, with a 7 × 20 spatial resolution.
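The block-downsampling in step 4 can be sketched as follows. Since P = F/A, the pressure of a merged cell is the total force on a pixel block divided by the block area, i.e., the mean pressure of the block. This is a minimal sketch, not the authors' code; it assumes the cropped frame is exactly 60 × 21 so that 3 × 3 blocks map onto a 20 × 7 grid:

```python
import numpy as np

def downsample_pressure(frame: np.ndarray, out_h: int = 20, out_w: int = 7) -> np.ndarray:
    """Downsample one pressure frame by merging pixel blocks.

    Because P = F/A, the pressure of a merged cell equals the total force on
    the block divided by the block area, i.e., the block's mean pressure.
    """
    h, w = frame.shape
    bh, bw = h // out_h, w // out_w              # block size, e.g., 3 x 3 for a 60 x 21 frame
    trimmed = frame[:out_h * bh, :out_w * bw]    # drop leftover rows/columns, if any
    return trimmed.reshape(out_h, bh, out_w, bw).mean(axis=(1, 3))

frame = np.random.rand(60, 21) * 100             # one hypothetical 60 x 21 frame in kPa
small = downsample_pressure(frame)
print(small.shape)                               # (20, 7)
```

Each frame of a stance-phase video would be passed through this function to obtain the 7 × 20 video used by the environment.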

Sensor Placement Environment and Reward System
Reinforcement learning is an algorithm that consists of an agent and an environment. At each time step, the environment provides state information to the agent, and the agent uses it to select an action. After the action is taken, the environment updates its state and then offers the next state's information and a reward to the agent. These interactions between the environment and the agent produce a series of state-action pairs. The length of this series depends on the environment's termination condition and could also be infinite. Using those state-action pairs, the RL algorithm reinforces the agent's policy to maximize the environment's cumulative reward. To optimize the sensor placement for the COP trajectory during the self-selected speed running task, we present a sensor placement environment and a reward system.
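The interaction loop described above can be sketched as follows. `CountdownEnv` and `RandomAgent` are toy placeholders for demonstration, not the paper's environment or policy:

```python
# A schematic of the agent-environment loop; the classes are toy placeholders.
class CountdownEnv:
    """Toy environment: the episode ends after three steps, with reward 1 per step."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3   # next state, reward, done

class RandomAgent:
    def select_action(self, state):
        return 0                          # a fixed action stands in for the learned policy

def run_episode(env, agent):
    """Collect one episode of (state, action, reward) tuples and the episodic return."""
    state = env.reset()
    trajectory, total_reward, done = [], 0.0, False
    while not done:
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
    return trajectory, total_reward

traj, ret = run_episode(CountdownEnv(), RandomAgent())
print(len(traj), ret)  # 3 3.0
```

An RL algorithm would use the collected `trajectory` to update the agent's policy toward a higher return.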

Sensor Placement Environment
At the initial state, the sensor placement environment receives a plantar pressure video, which will be used to calculate the reward, and provides an empty 7 × 20 board to the agent. Figure 3 shows the plantar pressure video. The agent owns eight sensors at the beginning, which can be placed on this empty board. At each time step, the agent places one of its sensors on the board. Placing a sensor at a position where other sensors already exist is allowed; in other words, multiple sensors may occupy the same position. The episode terminates when the agent finishes placing all of its sensors. Only at termination does the agent receive the environment's single reward; until the terminal state, every reward is zero. The agent's main objective is to place the sensors at crucial positions to maximize the reward at the end of the episode. Figure 4a illustrates the interaction between the agent and the environment, and Figure 4b shows the reward given by the environment.
Figure 4. (a) Sensors are shown separately, and the number represents the sensor count at each position. The notation S_T represents the terminal state. S_3 demonstrates the situation where the agent selects a position where another sensor already exists. (b) Because the reward and the next state are provided by the environment simultaneously, the first reward appears at S_1. Without reward redistribution, the agent only receives a reward in the terminal state. (c) The redistributed reward is the difference between the current and previous cumulative rewards. Since S_3 has the same masked positions as S_2, they receive the same cumulative reward in this episode, so the redistributed reward at S_3 is zero. A negative reward is also possible, as shown at S_2.
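The environment's dynamics can be sketched as a minimal gym-style class. The names and structure here are our illustration, not the authors' code; `reward_fn` stands in for the COP-based reward described in the next section:

```python
import numpy as np

class SensorPlacementEnv:
    """Sketch of the sensor placement environment (illustrative, not the authors' code).

    The board is a 7 x 20 grid, and each action places one of eight sensors
    at a grid cell; stacking several sensors on one cell is allowed.  The
    reward is zero until all sensors are placed, then computed from the
    final board by `reward_fn`.
    """
    def __init__(self, reward_fn, n_sensors=8, shape=(7, 20)):
        self.reward_fn = reward_fn
        self.n_sensors = n_sensors
        self.shape = shape

    def reset(self):
        self.board = np.zeros(self.shape, dtype=int)
        self.placed = 0
        return self.board.copy()

    def step(self, action):
        row, col = divmod(action, self.shape[1])   # flat action index -> grid cell
        self.board[row, col] += 1                  # stacking sensors is allowed
        self.placed += 1
        done = self.placed == self.n_sensors
        reward = self.reward_fn(self.board) if done else 0.0
        return self.board.copy(), reward, done
```

For example, with `reward_fn=lambda board: float(board.sum())`, an eight-step episode yields zero reward at every intermediate step and a single nonzero reward at termination, mirroring the delayed-reward structure described above.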

Reward System
In reinforcement learning, designing the reward system is essential. A positive reward given by the environment encourages the agent to repeat the actions that produce it. In the sensor placement scenario, the reward encourages the agent to arrange sensor positions that fit the COP trajectory, yielding a higher reward at the terminal state. The environment calculates the reward from a plantar pressure video and the final sensor positions determined by the agent. Each episode can use a different plantar pressure video to calculate the reward; this means that, even if the agent places all of its sensors in the same positions in every episode, the reward can change. To calculate the reward, we first introduce the COP trajectory formula for a plantar pressure video:

$$\mathrm{COP}_{n,x} = \frac{\sum_i \mathrm{pressure}_{n,i} \cdot \mathrm{coordinate}_{x,i}}{\sum_i \mathrm{pressure}_{n,i}}, \qquad \mathrm{COP}_{n,y} = \frac{\sum_i \mathrm{pressure}_{n,i} \cdot \mathrm{coordinate}_{y,i}}{\sum_i \mathrm{pressure}_{n,i}} \tag{1}$$

The COP trajectory is a series of points on the 2D plane, and the length of this series equals the video frame count. In this formula, n is the frame index, pressure_{n,i} is the i-th pixel value within the n-th frame, and coordinate_x and coordinate_y are the relative position of that pixel. Next, we describe how the reward is calculated from the sensor positions given by the agent. The environment uses the sensor positions as a pixel-wise mask on the plantar pressure video to obtain two COP trajectories: one calculated from the original plantar pressure video and another calculated from the masked plantar pressure video. Optimizing the sensor positions to fit the COP trajectory can be achieved by minimizing, for each frame, the distance between the COP positions calculated from the original and the masked videos, as Figure 3 shows. Thus, the reward function is defined as:

$$r = \frac{1}{N+1} \sum_{n=0}^{N} \left[ 1 - \left( \frac{\sqrt{(\widehat{\mathrm{COP}}_{n,x} - \mathrm{COP}_{n,x})^2 + (\widehat{\mathrm{COP}}_{n,y} - \mathrm{COP}_{n,y})^2}}{d_{\max}} \right)^{0.4} \right] \tag{2}$$

where $(\widehat{\mathrm{COP}}_{n,x}, \widehat{\mathrm{COP}}_{n,y})$ denotes the COP position calculated from the original plantar pressure video, $(\mathrm{COP}_{n,x}, \mathrm{COP}_{n,y})$ denotes the masked version, $d_{\max}$ is the maximum COP distance, and N + 1 is the total frame count.
The distance between two COP positions is normalized to [0, 1] by dividing by the maximum distance. Using one minus the normalized distance as the reward, the agent obtains the maximum reward of one when the distance is zero. The exponent 0.4 in this equation increases the precision for smaller distances and encourages the agent to achieve a better score. Finally, the reward is averaged over frames by summing the per-frame rewards and dividing by N + 1.
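The COP and reward computations described above can be sketched as follows. This is a reconstruction from the text, not the authors' code; the frame layout and the choice of the grid diagonal as the maximum distance are assumptions:

```python
import numpy as np

def cop_trajectory(video: np.ndarray) -> np.ndarray:
    """COP per frame: the pressure-weighted mean pixel coordinate (Equation (1))."""
    h, w = video.shape[1:]
    ys, xs = np.mgrid[0:h, 0:w]
    cops = []
    for frame in video:
        total = frame.sum()
        if total > 0:
            cops.append(((frame * xs).sum() / total, (frame * ys).sum() / total))
        else:
            cops.append((np.nan, np.nan))   # swing-phase frames carry no pressure
    return np.array(cops)

def placement_reward(video: np.ndarray, mask: np.ndarray, exponent: float = 0.4) -> float:
    """Equation (2): mean over frames of 1 - (normalized COP distance)**0.4."""
    h, w = video.shape[1:]
    d_max = np.hypot(w - 1, h - 1)              # grid diagonal (our assumption for d_max)
    ref = cop_trajectory(video)                  # COP of the original video
    masked = cop_trajectory(video * (mask > 0))  # sensor positions as a pixel-wise mask
    d = np.hypot(*(ref - masked).T)              # per-frame COP distance
    return float(np.mean(1.0 - (d / d_max) ** exponent))
```

A mask covering every pixel reproduces the original COP trajectory exactly and yields the maximum reward of 1; sparser masks score lower as their COP trajectories drift from the reference.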

Reward Redistribution
Training an agent in a delayed-reward environment is a challenging problem in RL. First, since the agent cannot immediately tell whether an action is good or bad, reinforcing its policy becomes harder. Second, it takes time to propagate the delayed reward back to earlier states, which means training takes much longer. To solve this problem in the sensor placement environment, we used the concept proposed in the RUDDER algorithm [23]. RUDDER's idea is to distribute the delayed reward to the actions that caused it, and it can be implemented by the following steps:

1.
Use a Long Short-Term Memory (LSTM) model to construct a sequence-to-sequence supervised learning task [24]. The series of state-action pairs serves as the input, and the delayed reward serves as the label. The output sequence of this model can be treated as the cumulative reward at each state.

2.
After this supervised learning task is finished, the redistributed reward for each state is calculated as the difference between the current and previous states' cumulative rewards.

3.
Replace the original rewards with the redistributed rewards, then train the agent with any feasible RL algorithm.
Since the cumulative reward in the sensor placement environment can be calculated with Equation (2) for each state, we can skip the first step. The rewards after applying the RUDDER algorithm are shown in Figure 4c.
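The redistribution step reduces to a simple difference of consecutive cumulative rewards. A minimal sketch (the example values are hypothetical):

```python
import numpy as np

def redistribute(cumulative) -> np.ndarray:
    """Redistributed reward = difference of consecutive cumulative rewards.

    `cumulative[t]` is the reward Equation (2) would assign to the sensors
    placed up to step t; a leading zero (no sensors placed) is prepended so
    the first step's reward is its full cumulative value.
    """
    acc = np.concatenate(([0.0], np.asarray(cumulative, dtype=float)))
    return np.diff(acc)

# Hypothetical cumulative rewards over a three-step episode.
print(redistribute([0.2, 0.2, 0.5]))   # approximately [0.2, 0.0, 0.3]
```

By construction, the redistributed rewards sum to the original delayed reward, so the episodic return is unchanged; only its timing moves earlier, which is what makes training feasible.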

Soft Actor-Critic Discrete
Various deep RL algorithms have been proposed in recent years, such as Asynchronous Advantage Actor-Critic (A3C) [25], Proximal Policy Optimization (PPO) [26], and Soft Actor-Critic (SAC) [19]. We chose the discrete version of Soft Actor-Critic (SAC-Discrete) [18] for the following reasons. First, the SAC-Discrete objective optimizes the agent's policy while also maximizing its policy entropy; this objective increases training stability and encourages the agent to explore the environment. Second, SAC-Discrete is an off-policy RL algorithm; it increases data reusability, which reduces training time. In the sensor placement environment, different sensor placement patterns can receive the same reward at the end of an episode, and SAC-Discrete can discover all of those patterns. In this section, we first introduce notation, followed by the maximum entropy reinforcement framework, and finally the SAC-Discrete algorithm.

Notation
An RL problem can be mathematically formulated as a Markov Decision Process (MDP). An MDP P is a five-tuple P = (S, A, R, p, γ), where S is a set of states s (the random variable S_t at time t), A is a set of actions a (the random variable A_t at time t), and R is a set of rewards r (the random variable R_{t+1} at time t). P has a transition-reward distribution p(S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a) conditioned on the state-action pair at time t. γ ∈ [0, 1] is a discount factor that ensures the MDP's return converges. We often equip an MDP P with a policy π. Given a policy π(a_t|s_t), ρ_π(s_t) denotes the state marginal of the trajectory distribution, and ρ_π(s_t, a_t) denotes the state-action marginal.

Maximum Entropy Reinforcement Framework
The maximum entropy reinforcement framework modifies the standard RL objective $\sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}[\gamma^t r(s_t, a_t)]$; this framework maximizes the expected sum of rewards while also maximizing the policy entropy:

$$\pi^* = \arg\max_\pi \sum_{t=0}^{T} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\Big[\gamma^t\big(r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot|s_t))\big)\Big]$$

where π* is the optimal policy, T is the number of time steps, and H(π(·|s_t)) is the entropy of π at state s_t. The temperature parameter α is a hyperparameter that determines the relative importance of the entropy term versus the reward, and it can also be tuned during training; when α approaches 0, this equation falls back to the standard RL objective.
Reinforcing the policy in an RL algorithm alternates between policy evaluation and policy improvement. The discrete setting of soft policy iteration for the maximum entropy reinforcement framework is presented in [18]. First, the policy evaluation step is:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}\sim p}\big[V(s_{t+1})\big]$$
$$V(s_t) = \pi(s_t)^{\top}\big[Q(s_t) - \alpha \log \pi(s_t)\big]$$

In the discrete action setting, the policy outputs a probability for each possible action, π(s_t) ∈ [0, 1]^{|A|}, and Q(s_t) : S → R^{|A|} is the soft Q-function that outputs a Q-value for each action. V(s_t) is the state-value function, defined as the dot product of the action probabilities and the Q-values with the entropy term. Then, the policy improvement step is achieved by a policy gradient method [27] with the objective:

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D}}\Big[\pi_\phi(s_t)^{\top}\big[\alpha \log \pi_\phi(s_t) - Q_\theta(s_t)\big]\Big]$$

The subscripts φ and θ denote the parameters of the policy network and the Q-function network, respectively. Training states s_t are sampled from a replay buffer D, since the maximum entropy reinforcement framework yields an off-policy learning algorithm.
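The discrete soft state value and policy objective above can be computed directly from the action probabilities and Q-values. A minimal numeric sketch (the toy numbers are not from the paper):

```python
import numpy as np

def soft_state_value(probs: np.ndarray, q_values: np.ndarray, alpha: float) -> float:
    """V(s) = pi(s) . (Q(s) - alpha * log pi(s)): soft state value for discrete actions."""
    return float(np.dot(probs, q_values - alpha * np.log(probs)))

def policy_objective(probs: np.ndarray, q_values: np.ndarray, alpha: float) -> float:
    """J_pi = pi(s) . (alpha * log pi(s) - Q(s)); minimizing it improves the policy."""
    return float(np.dot(probs, alpha * np.log(probs) - q_values))

# Toy example with three actions and temperature alpha = 0.1.
probs = np.array([0.7, 0.2, 0.1])
q = np.array([1.0, 0.5, 0.2])
v = soft_state_value(probs, q, alpha=0.1)
```

Note that for a uniform policy over |A| actions with identical Q-values q, the soft value reduces to q + α log |A|, making the entropy bonus explicit.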

SAC-Discrete Algorithm
SAC-Discrete uses the maximum entropy reinforcement framework to train an agent and uses the clipped double-Q trick to avoid Q-value overestimation [28]. A bar over a symbol denotes a target network; the target network is smoothly updated with Polyak averaging using a hyperparameter τ ∈ [0, 1]. SAC-Discrete is given by Algorithm 1.

Algorithm 1 (excerpt):
13: Update the Q-functions by one step of gradient descent.
14: Update the policy by one step of gradient descent.

To apply SAC-Discrete, we need to design the policy and soft Q-function networks. Both networks take a 7 × 20 image as the state information and output, for each position, a logit or a Q-value depending on the network type. Since both networks share the same input and output shapes, we used the same design structure, as Figure 5 shows.
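The clipped double-Q target and the Polyak target update can be sketched as follows. This is our illustration of the standard tricks named above, not the authors' implementation:

```python
import numpy as np

def soft_q_target(reward, done, next_probs, next_q1_t, next_q2_t, alpha, gamma):
    """Bellman target with the clipped double-Q trick.

    `next_q1_t` and `next_q2_t` are the two target networks' Q-values at the
    next state; taking their elementwise minimum curbs overestimation.
    """
    q_min = np.minimum(next_q1_t, next_q2_t)
    v_next = np.dot(next_probs, q_min - alpha * np.log(next_probs))
    return reward + gamma * (1.0 - done) * v_next

def polyak_update(target_params, params, tau):
    """Smooth target-network update: target <- tau * params + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```

With τ close to 0, the target networks trail the learned networks slowly, which stabilizes the bootstrapped targets.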

Testing Sensor Placement Environment with Created Video
We created a testing video with a simple pattern, shown in Figure 6a, in order to test the sensor placement environment. Meanwhile, we set up various temperature hyperparameters for this experiment. The temperature hyperparameter affects the final training reward and the convergence time. We tested ten temperatures from 1 × 10−3 to 10 × 10−3. When the temperature is too low, e.g., α = 1 × 10−3, the agent lacks exploration and training stability and performs the worst, as shown in Figure 6b. Using a higher temperature increases training stability; however, the convergence time also grows as the temperature increases, as shown in Figure 6c. The results show that selecting a proper temperature hyperparameter, e.g., α = 4 × 10−3, is critical: it not only increases training stability and the final reward but also decreases training time.
Figure 6. Illustration of the created testing video and episodic rewards. (a) The first and last frames are empty images without any pressure; the remaining frames are generated with simple increasing and decreasing patterns. (b,c) Episodic rewards using temperature parameters from 1 × 10−3 to 10 × 10−3, filtered by a moving average filter with window size 1000.

Tuning Temperature with Population Based Training
To select a proper temperature hyperparameter, we utilized the Population Based Training (PBT) method [20]. This method combines parallel search and sequential optimization for hyperparameter tuning. First, the PBT method initializes a population of agents with various hyperparameters. After a training period, it exploits agents whose performance is in the top 20% of the population to replace the bottom 20%; meanwhile, it perturbs the hyperparameters to explore the hyperparameter space. This exploit-and-explore process repeats to tune the hyperparameters. For a population P with N training models, the PBT method is given by Algorithm 2.
Applying the PBT method to the sensor placement task, we created a population of size 15 and only allowed the PBT method to optimize the temperature parameter. The temperature parameter is initialized with a log-scale uniform distribution between 1 × 10−3 and 1 × 10−1. The functions invoked in the PBT method are as follows:
• Step: Each training iteration updates by gradient descent with the Adam optimizer [29]; the learning rate is set to 3 × 10−4.
• Eval: We evaluate the current model by averaging the last 10 episodic rewards.
• Ready: A member of the population is ready to go through the exploit-and-explore process when the agent has elapsed 5 × 10^5 agent steps since the last time it was ready.
• Exploit: First, we rank all members of the population by evaluation value. If the current member is in the bottom 20% of the population, we randomly sample an agent from the top 20% and copy its parameters and hyperparameters.
• Explore: We randomly perturb the hyperparameters by a factor of 0.8 or 1.2.
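The Exploit and Explore steps above can be sketched as follows. The population structure and field names are our illustration; only the 20% cutoffs and the 0.8/1.2 perturbation factors come from the text:

```python
import random

def exploit_and_explore(population, perturb_factors=(0.8, 1.2), frac=0.2):
    """One PBT exploit-and-explore pass over a population of member dicts.

    Each member is {'score': float, 'temperature': float}.  Members in the
    bottom `frac` of the population copy the temperature of a randomly chosen
    top-`frac` member (exploit), then multiply it by 0.8 or 1.2 (explore).
    """
    ranked = sorted(population, key=lambda m: m['score'], reverse=True)
    cutoff = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = random.choice(top)              # exploit: copy a strong member
        factor = random.choice(perturb_factors)  # explore: perturb the copy
        member['temperature'] = donor['temperature'] * factor
    return population
```

In the full method, the copied member would also inherit the donor's network parameters; only the hyperparameter handling is shown here.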
The whole training process runs for 10 M agent steps, i.e., 1.25 M episodes, since each episode takes eight agent steps until termination. All agents learned the optimal policy in the testing video's sensor placement environment, as Figure 7a shows; the maximum episodic reward obtainable in the sensor placement environment is 1. Moreover, the PBT method adjusts the hyperparameter during the training process, as shown in Figure 7b.

Results
To optimize the sensor placement for the foot plantar center of pressure without any prior knowledge, we proposed the sensor placement environment and solved it with the SAC-Discrete algorithm. The reward redistribution trick makes the training process feasible, as mentioned in Section 2.2.3, and the PBT method for tuning the temperature hyperparameter makes the training process more stable and better performing, as mentioned in Section 2.4.3. In the testing video task, this mechanism achieves the optimal sensor placement for the COP trajectory, as shown in Figure 7a; this experiment demonstrates the robustness of the training process.
For the self-selected speed running task, we fed 150 stance-phase plantar pressure videos to the sensor placement environment. The SAC-Discrete hyperparameter setup for this experiment can be found in Appendix A, Table A1, and the PBT setup for tuning the temperature parameter can be found in Appendix B, Table A2. We ran this experiment for 17 M agent steps, i.e., 2.125 M episodes. The best agent within the population achieves an average reward of 0.7986 over the final 1000 episodes, as Figure 8a shows. Rewards start to converge around 0.8 M episodes, and so does the temperature hyperparameter, as Figure 8b shows. The final designed sensor placement is presented in Figure 8c. The difference in the COP trajectory between the F-Scan system and the designed eight-sensor setting is shown in Figure 8d. We compared our designed eight-sensor setting with a placement design based on the concept of WalkinSense [30], as Table 1 shows. Table 1 clearly shows that our method obtains a higher average reward. The sensor placement design for WalkinSense can be found in Appendix C, Figure A1.

Discussion
Although this study proposed a method that can find a sensor placement within an enormous number of combinations, we only applied it to finding an eight-sensor placement for the self-selected speed running task. Applying this method to a different task only requires replacing the plantar pressure videos from the self-selected speed running task with those of another task. Since the objective of the optimization is to reduce the average COP distance over the video frames, this method puts more effort into the region where the COP is dense, as Figure 9a shows. This is why our method placed two sensors in the toe region, which also increases accuracy in the toe-off phase, as Figure 9b,c shows. Due to the small number of participants in this experiment, the sensor placement result may not generalize. However, it shows that this method can be applied to more than one subject and performs better in COP trajectory accuracy. On the other hand, applying this method to a single subject can create a personalized custom sensor placement design. Varying the number of sensors, by increasing or decreasing the environment's sensor count, can be studied in future work.

Conclusions
This paper presented a sensor placement environment in which SAC-Discrete, a deep RL algorithm, can find optimal sensor positions for self-selected speed running tasks without any prior knowledge of the foot anatomical area. Furthermore, this work introduced a reward redistribution trick to make the training process feasible and used the PBT method to tune the temperature hyperparameter, making the training process more stable and better performing. The final sensor placement, determined by the best agent, achieved an average reward of 0.7986 within the environment. In summary, the sensor placement environment can find an excellent sensor placement for fitting the COP trajectory without any prior knowledge of the foot anatomical area, and its performance surpassed the human-designed sensor placement.

Appendix C. WalkinSense Sensor Placement
Figure A1. WalkinSense® sensor placement.