Estimation of a Human-Maneuvered Target Incorporating Human Intention

This paper presents a new approach for estimating the motion state of a target maneuvered by an unknown human from observations. To improve the estimation accuracy, the proposed approach associates recurring motion behaviors with human intentions and models the association as an intention-pattern model. The human intentions are labels of continuous states, while the motion patterns characterize the change of continuous states. In the preprocessing, an Interacting Multiple Model (IMM) estimation technique is used to infer the intentions and extract motions, which together construct the intention-pattern model. Once the intention-pattern model has been constructed, the proposed approach incorporates it into estimation using any state estimator, including the Kalman filter. The proposed approach not only estimates the mean more accurately using the human intention but also updates the covariance more precisely. The performance of the proposed approach was investigated through the estimation of a human-maneuvered multirotor. The results first indicate the effectiveness of the proposed approach in constructing the intention-pattern model. The advantage of the proposed approach in state estimation over the conventional technique without intention incorporation is then demonstrated.


Introduction
Most dynamic targets to track or engage are either human-maneuvered or humans themselves. Estimating the state of such a human-maneuvered target is essential and has attracted tremendous interest in the last decades [1][2][3][4]. Despite its importance, the difficulty in estimating a human-maneuvered target lies in the motion uncertainty. Even though the motion model of the target may be precisely known, the control input of the human is often unknown [5]. The motion, as a result, becomes considerably different from the expectation. This gives rise to the need for the ability to handle motion uncertainty [6].
For a human-maneuvered target, estimation techniques proposed in the past to handle motion uncertainty can be classified into two types. In the first, a single accurate motion model is developed and used to describe the motion behavior. Owing to their robust estimation based on past observations, various Bayesian methods, including the parametric Kalman filters (KFs) and the nonparametric particle filters, have been applied by characterizing the estimation problem and identifying the best estimation technique for the problem [7][8][9][10][11][12][13]. Steckenrider and Furukawa [14] proposed to introduce higher-order terms to the motion model through Taylor series expansion and adaptively estimated the target state. Gindele et al. [15] improved the motion model by incorporating the situational context and extending the state space. As the human control is unknown most of the time, conservative motion behaviors such as constant velocity (CV) and constant acceleration (CA) have been incorporated as the most probable human controls [5]. Instead of refining the motion model, Mehra [16] estimated the covariances of the motion noise and the observation noise when the filter is detected to be working suboptimally. Almagbile et al. [17] evaluated three adaptation methods of the noise covariances and showed improvements over the conventional Kalman filter. Such noise adaptation is effective in controlling uncertainty when the accuracy of the deterministic motion model can no longer be improved. In addition to the model and its uncertainty, other work has separated the unknown human control and its uncertainty from the motion noise [5]. The human control dominates the motion behavior when the target has a large unconstrained workspace. Bogler et al. [18] represented the time-varying human control deterministically by piecewise constants and estimated the control in addition to the state. Chakrabarty et al. [19] assumed the exogenous input and its derivative to be bounded for a class of nonlinear systems in state estimation.
Conte and Furukawa [20] used head motion as an additional indicator when the target is a human and improved the estimation accuracy. While more detailed and more adaptive, these single-model representations cannot keep capturing the target motion and estimating its state well, particularly when the motion is drastically changed by the human. This is due to the limited representational power of a single model.
In the second, multiple models, which are either superposed or switched, have been used to estimate more varying motion behaviors [21][22][23][24][25]. The multiple-model (MM) estimation methods extend existing techniques to handle multiple models and cover a wider range of motion behaviors [26]. Blom et al. [21] proposed the interacting MM (IMM) method, which uses a fixed set of motion models with Markovian switching coefficients. The transition probability and the model likelihood were introduced to recursively adapt the model probabilities. Li et al. [27] proposed the variable-structure MM (VSMM) method to overcome the limitations of using a fixed set of models in describing the motion. The VSMM method introduces model-set adaptation besides the model adaptation and can thus describe and estimate an even broader range of motion behaviors. Recently, Xu et al. [28] have addressed the estimation of varying motion behaviors by adapting parameters, where a fixed coarse grid and an adaptive fine grid of the parameters were combined to determine the models that best match the target motion behavior. Despite the wider coverage, it is still insufficient to capture and estimate the target if the human control changes considerably. The MM methods are rather formulated to cover a larger state space given the most probable human control. As a drastic control change may magnify changes in the state space, the resulting target state could be beyond the permissible space of the MM estimation. In addition, the use of deterministic control leads to underestimated uncertainty, as the human control is most uncertain.

This paper presents an approach for estimating the state of a human-maneuvered target by associating the recurring motion behaviors with human intentions. The proposed approach consists of a preprocess, which constructs the so-called intention-pattern model to encapsulate the human intention, and the main process, which allows state estimation using the intention-pattern model.
In the preprocess, the intention-pattern model is constructed from the prior observations by running a revised IMM estimation, extracting the motion behaviors of each human intention, aligning them, and representing the behavior probabilistically. The main process then uses standard state estimation such as the KF while extensively leveraging the probabilistically represented intention-pattern model. The strength of the proposed approach lies in the incorporation of the intention-pattern model, as the incorporation makes the estimation not only accurate in the mean but also precise in the covariance.
The paper is organized as follows. The next section describes the estimation problem and its solution using the IMM estimation method, which is both a generalized formulation and the technique used in the preprocess of the proposed approach. Section 3 presents the proposed estimation approach, including the preprocess and the main process. Numerical validation investigating the effectiveness of both the intention-pattern model and the state estimation is presented in Section 4. Conclusions are summarized in the final section.

Estimation Problem Formulation
Figure 1 shows a schematic diagram of the problem of estimating the state of a human-maneuvered target in case the target is a multirotor. When maneuvering a target, a human operator interacts with a controller using an interface device such as a vehicle panel or a joystick. The controller may be implemented in the interface device, in the target, or both. Some parameters of the controller, such as the maximum speed of the target, are usually configurable to realize different motion behaviors. The human operation and the configurable parameters are not known, as no communication with the target is available. The estimator affects neither the human operator nor the target. Having the target observed in the field of view (FOV) of a fixed sensor such as a stereo camera, the goal of the problem is to design the estimator to estimate the target state from observations. The discrete motion model of the target and the observation model are generically given by

$$\mathbf{x}_{k+1} = \mathbf{f}(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k), \tag{1}$$
$$\mathbf{z}_k = \mathbf{h}(\mathbf{x}_k, \mathbf{v}_k), \tag{2}$$

where $\mathbf{f}$ and $\mathbf{h}$ are the motion and the observation models, respectively; $\mathbf{x}_k$ is the state of the target at step $k$ to estimate; $\mathbf{u}_k$ is the input; $\mathbf{z}_k$ is the observation; and $\mathbf{w}_k$ and $\mathbf{v}_k$ are the motion and observation noises, respectively. Because the target is maneuvered by a human, $\mathbf{f}$ and $\mathbf{w}_k$ may not be well known, while $\mathbf{u}_k$ is fully unknown. On the other hand, $\mathbf{h}$ and $\mathbf{v}_k$ are fully known as they belong to the sensor(s) of the estimator. With a short time interval, it is valid to assume that $\mathbf{w}_k$ and $\mathbf{v}_k$ are Gaussian. The problem is consequently defined as the estimation of $\mathbf{x}_k$ with no knowledge of $\mathbf{u}_k$ and some knowledge of $\mathbf{f}$ and $\mathbf{w}_k$, given $\mathbf{h}$, $\mathbf{z}_k$, and $\mathbf{v}_k$.

IMM Estimation
Lacking information on $\mathbf{f}$ and $\mathbf{u}_k$ results in high motion uncertainty. The MM estimation methods deal with motion uncertainty by describing the motion with several motion behaviors called modes, denoted by $S = \{s_j\}, j \in \mathbb{N}$, where $\mathbb{N}$ represents the natural numbers. With this definition, a mode $s_j$ represents the motion at step $k$ when it approximates the motion behavior well, i.e., $s_k = s_j$. To describe the behavior with minimum complexity, a mode $s_j$ is most commonly described by a mathematical model $m_i$, collectively represented by $M = \{m_i\}, i \in \mathbb{N}$. A model $m_i$ thus represents the motion behavior at a step, i.e., $s_k = s_j = m_i$. The reader is referred to Li et al. [29] for more description. Most MM estimation methods utilize models of known motion behaviors with different parameters, such as variants of the CV and CA models. As an example, suppose that $\mathbf{f}$ of Equation (1) at step $k$ is approximated by a single linear Gaussian model. The motion model $m_i$ is given by

$$\mathbf{x}_{k+1} = \mathbf{A}_k^{(i)} \mathbf{x}_k + \mathbf{B}_k^{(i)} \mathbf{u}_k^{(i)} + \mathbf{w}_k^{(i)}, \tag{3}$$

where $\mathbf{A}_k^{(i)}$ is the motion matrix, $\mathbf{B}_k^{(i)}$ is a control matrix, and $\mathbf{w}_k^{(i)}$ is Gaussian with mean $\mathbf{0}$ and covariance $\mathbf{Q}_k^{(i)}$. The superscript $(i)$ indicates that the model $m_i$ is used. The observation model (2) is also supposed to be linear Gaussian:

$$\mathbf{z}_k = \mathbf{C}_k \mathbf{x}_k + \mathbf{v}_k, \tag{4}$$

where $\mathbf{C}_k$ is the observation matrix, and $\mathbf{v}_k$ is Gaussian with mean $\mathbf{0}$ and covariance $\mathbf{R}_k$. Figure 2 shows the framework of the IMM estimation method for estimating the target state. The motion behavior is described with a set of models $\{m_1, m_2, m_3, \ldots\}$. Having the observation $\mathbf{z}_k$ of the state $\mathbf{x}_k$, a KF is applied for each model (3) under the linear Gaussian assumption. Each KF updates the target state of mean $\hat{\mathbf{x}}_{k|k}^{(i)}$ and covariance $\boldsymbol{\Sigma}_{k|k}^{(i)}$. The output $\hat{\mathbf{x}}_{k|k}$ and covariance $\boldsymbol{\Sigma}_{k|k}$ are calculated by the estimate fusion of all $\hat{\mathbf{x}}_{k|k}^{(i)}$ and $\boldsymbol{\Sigma}_{k|k}^{(i)}$. The mathematical derivation is as follows. The event that the model $m_i$ matches the mode at step $k$ is denoted as $m_k^{(i)}$, and its probability, the model probability, is $\mu_k^{(i)} = \Pr\{m_k^{(i)} \mid \mathbf{z}_{1:k}\}$, where $\Pr\{\cdot\}$ indicates the probability of an event.
The IMM estimator assumes the probability of transitioning from a model $m_i$ at step $k$ to a model $m_j$ at step $k+1$ to be constant and known as $\pi_{ij}$:

$$\pi_{ij} = \Pr\{m_{k+1}^{(j)} \mid m_k^{(i)}\}, \tag{5}$$

where $i, j \in \mathbb{N}$. For one cycle, the predicted model probability, $\hat{\mu}_{k|k-1}^{(i)}$, is given by

$$\hat{\mu}_{k|k-1}^{(i)} = \Pr\{m_k^{(i)} \mid \mathbf{z}_{1:k-1}\} = \sum_j \pi_{ji} \mu_{k-1}^{(j)}, \tag{6}$$

where $\mathbf{z}_{1:k-1}$ are the observations from step 1 to step $k-1$. The weight that $m_j$ at step $k-1$ contributes to $m_i$ at step $k$, the mixing weight, is

$$\mu_{k-1}^{(j|i)} = \frac{\pi_{ji} \mu_{k-1}^{(j)}}{\hat{\mu}_{k|k-1}^{(i)}}. \tag{7}$$

The KF of each model starts with the derivation of its input:

$$\hat{\mathbf{x}}_{k-1|k-1}^{(0i)} = \sum_j \mu_{k-1}^{(j|i)} \hat{\mathbf{x}}_{k-1|k-1}^{(j)}, \quad \boldsymbol{\Sigma}_{k-1|k-1}^{(0i)} = \sum_j \mu_{k-1}^{(j|i)} \left[ \boldsymbol{\Sigma}_{k-1|k-1}^{(j)} + \left( \hat{\mathbf{x}}_{k-1|k-1}^{(j)} - \hat{\mathbf{x}}_{k-1|k-1}^{(0i)} \right) (\cdot)^\top \right]. \tag{8}$$

According to the KF formulation, the predicted mean and covariance are derived as

$$\hat{\mathbf{x}}_{k|k-1}^{(i)} = \mathbf{A}_k^{(i)} \hat{\mathbf{x}}_{k-1|k-1}^{(0i)} + \mathbf{B}_k^{(i)} \mathbf{u}_k^{(i)}, \tag{9}$$
$$\boldsymbol{\Sigma}_{k|k-1}^{(i)} = \mathbf{A}_k^{(i)} \boldsymbol{\Sigma}_{k-1|k-1}^{(0i)} \mathbf{A}_k^{(i)\top} + \mathbf{Q}_k^{(i)}. \tag{10}$$

For correction, the KF gain is first computed through

$$\mathbf{K}_k^{(i)} = \boldsymbol{\Sigma}_{k|k-1}^{(i)} \mathbf{C}_k^\top \left( \mathbf{S}_k^{(i)} \right)^{-1}, \tag{11}$$

where the residual covariance is given by

$$\mathbf{S}_k^{(i)} = \mathbf{C}_k \boldsymbol{\Sigma}_{k|k-1}^{(i)} \mathbf{C}_k^\top + \mathbf{R}_k. \tag{12}$$

The corrected mean and covariance are derived as

$$\hat{\mathbf{x}}_{k|k}^{(i)} = \hat{\mathbf{x}}_{k|k-1}^{(i)} + \mathbf{K}_k^{(i)} \tilde{\mathbf{z}}_k^{(i)}, \quad \boldsymbol{\Sigma}_{k|k}^{(i)} = \left( \mathbf{I} - \mathbf{K}_k^{(i)} \mathbf{C}_k \right) \boldsymbol{\Sigma}_{k|k-1}^{(i)}, \tag{13}$$

where the observation residual is given by

$$\tilde{\mathbf{z}}_k^{(i)} = \mathbf{z}_k - \mathbf{C}_k \hat{\mathbf{x}}_{k|k-1}^{(i)}.$$

For IMM estimation, the model likelihood $L_k^{(i)}$ is assumed as

$$L_k^{(i)} = \mathcal{N}\left( \tilde{\mathbf{z}}_k^{(i)}; \mathbf{0}, \mathbf{S}_k^{(i)} \right), \tag{14}$$

and the model probability $\mu_k^{(i)}$ is updated as

$$\mu_k^{(i)} = \frac{\hat{\mu}_{k|k-1}^{(i)} L_k^{(i)}}{\sum_j \hat{\mu}_{k|k-1}^{(j)} L_k^{(j)}}. \tag{15}$$

The overall mean and covariance are derived as

$$\hat{\mathbf{x}}_{k|k} = \sum_i \mu_k^{(i)} \hat{\mathbf{x}}_{k|k}^{(i)}, \tag{16}$$
$$\boldsymbol{\Sigma}_{k|k} = \sum_i \mu_k^{(i)} \left[ \boldsymbol{\Sigma}_{k|k}^{(i)} + \left( \hat{\mathbf{x}}_{k|k}^{(i)} - \hat{\mathbf{x}}_{k|k} \right) (\cdot)^\top \right]. \tag{17}$$

Owing to the introduction of the transition probability $\pi_{ij}$ and the model likelihood of Equation (14), the model probabilities $\mu_k^{(i)}$ adapt to match the current motion. If the model $m_i$ matches the current mode better, the filter of $m_i$ contributes more to $\hat{\mathbf{x}}_{k|k}$ and $\boldsymbol{\Sigma}_{k|k}$ by having a higher model probability $\mu_k^{(i)}$.

For estimation with one motion model, the single motion model (1) cannot repair its inconsistency with the actual motion when the human has changed the target motion considerably. The IMM method estimates in a larger state space owing to the use of multiple models, but it still relies on the most probable deterministic human controls such as the CV and CA. If the actual control is different, the multiple models of the IMM method may not cover the space of estimation and could lead to a wrong estimate. The uncertainty could also be underestimated since the unknown human control, which is most uncertain, is handled deterministically. This limitation of the conventional techniques affects the quality of estimation when the target is human-maneuvered.
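For concreteness, one cycle of the IMM recursion above (mixing, per-model KF, likelihood weighting, and fusion) can be sketched in NumPy as follows. The model set, matrices, and noise levels are placeholders, not values from the paper; this is a minimal sketch, not the authors' implementation:

```python
import numpy as np

def gaussian_likelihood(r, S):
    """Likelihood of residual r under N(0, S), as in Equation (14)."""
    d = len(r)
    return np.exp(-0.5 * r @ np.linalg.solve(S, r)) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

def imm_cycle(models, mu, x, P, z, Pi, C, R):
    """One IMM cycle. models: list of (A, Q); mu: model probabilities;
    x, P: per-model means/covariances; Pi[i, j] = transition prob. from m_i to m_j."""
    n = len(models)
    mu_pred = Pi.T @ mu                                 # predicted model probabilities (Eq. (6))
    w = (Pi * mu[:, None]) / mu_pred[None, :]           # mixing weights w[j, i] (Eq. (7))
    x_mix = [sum(w[j, i] * x[j] for j in range(n)) for i in range(n)]
    P_mix = [sum(w[j, i] * (P[j] + np.outer(x[j] - x_mix[i], x[j] - x_mix[i]))
                 for j in range(n)) for i in range(n)]  # mixed inputs (Eq. (8))
    L = np.zeros(n)
    for i, (A, Q) in enumerate(models):
        xp = A @ x_mix[i]                               # predicted mean (Eq. (9), no control here)
        Pp = A @ P_mix[i] @ A.T + Q                     # predicted covariance (Eq. (10))
        S = C @ Pp @ C.T + R                            # residual covariance (Eq. (12))
        K = Pp @ C.T @ np.linalg.inv(S)                 # KF gain (Eq. (11))
        r = z - C @ xp                                  # observation residual
        x[i] = xp + K @ r                               # corrected mean and covariance (Eq. (13))
        P[i] = (np.eye(len(xp)) - K @ C) @ Pp
        L[i] = gaussian_likelihood(r, S)                # model likelihood (Eq. (14))
    mu_new = mu_pred * L
    mu_new /= mu_new.sum()                              # model probabilities (Eq. (15))
    x_out = sum(mu_new[i] * x[i] for i in range(n))     # fused mean and covariance (Eqs. (16)-(17))
    P_out = sum(mu_new[i] * (P[i] + np.outer(x[i] - x_out, x[i] - x_out)) for i in range(n))
    return mu_new, x, P, x_out, P_out
```

The mixing before filtering is what distinguishes the IMM from a bank of independent filters: each model's KF restarts from a probability-weighted blend of all models' previous estimates.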

Proposed Approach Using Intention-Pattern Model
As the contributions of this paper are the construction of an intention-pattern model and the state estimation using the intention-pattern model, this section describes each contribution in a subsection. Section 3.1 presents the overview of the construction of an intention-pattern model, followed by the details of the two major components, which are the intention inference and the intention-pattern modeling. The implementation of the constructed intention-pattern models into the state estimation is then detailed in Section 3.2.

Construction of Intention-Pattern Model
Figure 3 shows the construction of the intention-pattern model, where an example illustration is given on the right side. The prior analysis of the target behavior leads to the extraction of a set of human intentions $H = \{\eta^{(i)} \mid \forall i\}$ and the corresponding approximate control terms, $\mathbf{U}^{(i)}$. Each human intention $\eta^{(i)}$ is an expression describing an aim or a plan, such as "moving forward" and "turning right". Each human intention at step $k$ is defined as a function of the recent states of $N_h$ steps:

$$\eta_k = \eta\left( \mathbf{x}_{k-N_h+1:k} \right). \tag{18}$$

Compared with the human actions, which span one step and are associated with the control input $\mathbf{u}_k$, the human intentions are labels of continuous states over multiple steps. The corresponding control term could vary, but it is assumed constant for simplicity. Because the intention is defined for a period, the figure illustratively shows each control term with two steps. Given a sequence of observations $\mathbf{z}_{1:K}$, the human intention at step $k$, $\eta_k = \eta^{(i)} \in H$, is first inferred for all steps, i.e., $\eta_{1:K} = \{\eta_1, \ldots, \eta_K\}$. The states are assumed to be fully observable for simplicity. After smoothing the observations and deriving the state trajectory $\hat{\mathbf{x}}_{1:K}$, the proposed construction technique identifies segments in the state space exhibiting each extracted intention, $\hat{\mathbf{x}}_{k_s(i,j):k_e(i,j)}, j \in \mathbb{N}$, where $k_s(i,j)$ and $k_e(i,j)$ are the starting and the ending steps. Three segments of intention $\eta^{(1)}$ and two segments of intention $\eta^{(2)}$ are shown in the example illustration. The segments of the same intention are aligned to characterize the pattern of motion probabilistically. The intention-pattern model, describing the relationship between the input intention and the output motion pattern, is finally represented by a set of Gaussian distributions.
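To make the notion of an intention as a label over a window of recent states concrete, a rule-based sketch might look as follows. The state layout and the pitch-angle threshold of 0.05 anticipate the multirotor example of Section 4; the function name and window handling are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def label_intention(states, theta_thresh=0.05):
    """Assign an intention label to a window of N_h recent states.
    `states` is an (N_h, 4) array of [p, p_dot, theta, theta_dot] rows;
    the threshold value is illustrative (cf. the ground truth in Section 4)."""
    mean_theta = np.mean(states[:, 2])   # average pitch angle over the window
    if mean_theta > theta_thresh:
        return "accelerating"
    if mean_theta < -theta_thresh:
        return "decelerating"
    return "hovering"
```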

Intention Inference
As the states $\mathbf{x}_{k-N_h+1:k}$ are not directly available, this section proposes the approach to infer the human intention $\eta_k$ from observations. Given observations $\mathbf{z}_{1:k}$ in addition to the control terms corresponding to the extracted intentions, $\mathbf{U}^{(i)}, \forall i$, the first process of the intention inference is to run the IMM estimation. There is only one motion model, but the motion is simulated for each intention control term $\mathbf{U}^{(i)}$:

$$\mathbf{x}_{k+1}^{(i)} = \mathbf{A}_k \mathbf{x}_k + \mathbf{U}^{(i)} + \mathbf{w}_k,$$

where $\mathbf{A}_k$ and $\mathbf{w}_k$ are determined from the analysis of the target motion. Having Equation (4) as an observation model, the KF updates the mean $\hat{\mathbf{x}}_{k|k}^{(i)}$ and the covariance $\boldsymbol{\Sigma}_{k|k}^{(i)}$ similarly to Equations (9)-(13) for each intention:

$$\hat{\mathbf{x}}_{k|k-1}^{(i)} = \mathbf{A}_k \hat{\mathbf{x}}_{k-1|k-1} + \mathbf{U}^{(i)}, \tag{19a}$$
$$\boldsymbol{\Sigma}_{k|k-1}^{(i)} = \mathbf{A}_k \boldsymbol{\Sigma}_{k-1|k-1} \mathbf{A}_k^\top + \mathbf{Q}_k, \tag{19b}$$
$$\mathbf{K}_k^{(i)} = \boldsymbol{\Sigma}_{k|k-1}^{(i)} \mathbf{C}_k^\top \left( \mathbf{S}_k^{(i)} \right)^{-1}, \tag{19c}$$
$$\mathbf{S}_k^{(i)} = \mathbf{C}_k \boldsymbol{\Sigma}_{k|k-1}^{(i)} \mathbf{C}_k^\top + \mathbf{R}_k, \tag{19d}$$
$$\hat{\mathbf{x}}_{k|k}^{(i)} = \hat{\mathbf{x}}_{k|k-1}^{(i)} + \mathbf{K}_k^{(i)} \tilde{\mathbf{z}}_k^{(i)}, \tag{19e}$$
$$\boldsymbol{\Sigma}_{k|k}^{(i)} = \left( \mathbf{I} - \mathbf{K}_k^{(i)} \mathbf{C}_k \right) \boldsymbol{\Sigma}_{k|k-1}^{(i)}, \tag{19f}$$
$$\tilde{\mathbf{z}}_k^{(i)} = \mathbf{z}_k - \mathbf{C}_k \hat{\mathbf{x}}_{k|k-1}^{(i)}. \tag{19g}$$

The likelihood of the motion model at step $k$ is given by Equation (14). As the intention is determined for a period, let the number of steps that defines an intention be $N_h$. The intention likelihood is defined and derived as the joint likelihood of the model likelihoods:

$$\Lambda_k^{(i)} = \prod_{\kappa=k-N_h+1}^{k} L_\kappa^{(i)}. \tag{20}$$

The control term that maximizes the intention likelihood is then selected:

$$i_k = \arg\max_i \Lambda_k^{(i)}, \tag{21}$$

if the intention likelihood is above a threshold $\Lambda_{\min}$. The corresponding intention $\eta_k$ is given by

$$\eta_k = \begin{cases} \eta^{(i_k)} & \Lambda_k^{(i_k)} \geq \Lambda_{\min}, \\ \varnothing & \text{otherwise}, \end{cases} \tag{22}$$

where $\varnothing$, indicating an empty element, means that there is no matching intention. The recursive operation infers the intention for all steps, $\eta_{1:K}$.
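The inference loop just described (one KF per candidate control term, a joint likelihood over the last N_h steps, and an argmax with a threshold) can be sketched as below. The control-term dictionary, threshold value, and all matrices are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def infer_intention(z_hist, x0, P0, A, Q, C, R, controls, N_h=5, lmin=1e-6):
    """Infer the intention over the last N_h observations by running one KF per
    candidate control term U^(i) and comparing joint likelihoods (cf. Eqs. (20)-(22)).
    `controls` maps an intention label to its (hypothetical) control term U^(i);
    None plays the role of the empty element (no matching intention)."""
    best, best_L = None, lmin
    for name, U in controls.items():
        x, P, logL = x0.copy(), P0.copy(), 0.0
        for z in z_hist[-N_h:]:
            x = A @ x + U                               # prediction with intention control term
            P = A @ P @ A.T + Q
            S = C @ P @ C.T + R                         # residual covariance
            r = z - C @ x                               # observation residual
            d = len(r)
            logL += -0.5 * (r @ np.linalg.solve(S, r)
                            + np.log((2 * np.pi) ** d * np.linalg.det(S)))
            K = P @ C.T @ np.linalg.inv(S)              # KF correction
            x = x + K @ r
            P = (np.eye(len(x)) - K @ C) @ P
        L = np.exp(logL)                                # joint intention likelihood
        if L > best_L:
            best, best_L = name, L
    return best
```

Accumulating log-likelihoods rather than multiplying raw densities is a standard safeguard against underflow over long windows.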

Intention-Pattern Modeling
The first process of the intention-pattern modeling, the extraction of the intended motions, checks the intention $\eta_k$ and identifies its period. Let the $j$th segment of the $i$th intention extracted from the smoothed state trajectory $\hat{\mathbf{x}}_{1:K}$ be $\hat{\mathbf{x}}_{k_s(i,j):k_e(i,j)}$. The second process, alignment, aligns the extracted segments by co-locating their origins $\hat{\mathbf{x}}_{k_0(i,j)}$:

$$\bar{\mathbf{x}}_\kappa^{(i,j)} = \hat{\mathbf{x}}_{k_0(i,j)+\kappa} - \hat{\mathbf{x}}_{k_0(i,j)}, \tag{23}$$

where the step of the origin $k_0(i,j) \in \{k_s(i,j), \ldots, k_e(i,j)\}$.
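The alignment, together with the per-step statistics that characterize the motion pattern, might be sketched as follows. Taking the first step of each segment as the origin k_0(i, j) is one admissible choice among those allowed above, and the function name and data layout are assumptions for illustration:

```python
import numpy as np

def build_pattern(segments):
    """Align the extracted segments of one intention at their origins and
    characterize the motion pattern per aligned step kappa as a Gaussian.
    `segments` is a list of (T_j, d) state arrays; the origin of each segment
    is taken as its first step (one possible choice of k_0(i, j))."""
    aligned = [s - s[0] for s in segments]             # co-locate origins at zero
    K = min(len(s) for s in aligned)                   # common number of aligned steps
    means, covs = [], []
    for kappa in range(K):
        pts = np.stack([s[kappa] for s in aligned])    # all segments at step kappa
        means.append(pts.mean(axis=0))                 # pattern mean at this step
        covs.append(np.cov(pts, rowvar=False, ddof=1)  # pattern covariance T at this step
                    if len(pts) > 1 else np.zeros((pts.shape[1], pts.shape[1])))
    return np.array(means), np.array(covs)
```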
The final process, motion pattern characterization, derives the intention-pattern model by probabilistically characterizing the aligned segments. Figure 4 shows the characterization after three segments (green, blue, and purple) are aligned. As the number of segments increases, it is valid to assume that the variation of the motion follows a Gaussian distribution:

$$\bar{\mathbf{x}}_\kappa^{(i)} \sim \mathcal{N}\left( \bar{\mathbf{x}}_\kappa; \boldsymbol{\mu}_\kappa^{(i)}, \mathbf{T}_\kappa^{(i)} \right), \tag{24}$$

where $\kappa$ is a step of the intention-pattern model after alignment, and the mean and the covariance are

$$\boldsymbol{\mu}_\kappa^{(i)} = \frac{1}{J_\kappa^{(i)}} \sum_{j=1}^{J_\kappa^{(i)}} \bar{\mathbf{x}}_\kappa^{(i,j)}, \quad \mathbf{T}_\kappa^{(i)} = \frac{1}{J_\kappa^{(i)} - 1} \sum_{j=1}^{J_\kappa^{(i)}} \left( \bar{\mathbf{x}}_\kappa^{(i,j)} - \boldsymbol{\mu}_\kappa^{(i)} \right) (\cdot)^\top, \tag{25}$$

where $J_\kappa^{(i)}$ is the number of segments at step $\kappa$ for the $i$th intention [30]. The intention-pattern model is finally derived as

$$p\left( \bar{\mathbf{x}} \mid \eta^{(i)}, \kappa \right) = \sum_{\kappa'} \delta(\kappa - \kappa') \, \mathcal{N}\left( \bar{\mathbf{x}}; \boldsymbol{\mu}_{\kappa'}^{(i)}, \mathbf{T}_{\kappa'}^{(i)} \right), \tag{26}$$

where $\delta(\cdot)$ is a Dirac delta function. This means that the intention-pattern model is defined by a set of Gaussian distributions:

$$N^{(i)} = \left\{ \mathcal{N}\left( \boldsymbol{\mu}_\kappa^{(i)}, \mathbf{T}_\kappa^{(i)} \right) \mid \forall \kappa \right\}. \tag{27}$$

Estimation Using Intention-Pattern Model
Figure 5 shows the schematics of the proposed state estimator using the intention-pattern model. Given a new observation $\mathbf{z}_k$, Equations (19)-(22) output the intention of the current step, $\eta_k = \eta^{(i_k)}$. The estimator then checks the corresponding Gaussian distributions $N^{(i_k)}$ to find the matching step $\kappa_k$ with respect to the recent estimated state $\hat{\mathbf{x}}_{k-1|k-1}$. The step $\kappa_k$ of $N^{(i_k)}$ is chosen if the intention-pattern model of the $i_k$th intention matches the recent estimate satisfactorily. The prediction then incorporates the intention-pattern model:

$$\hat{\mathbf{x}}_{k|k-1} = \mathbf{A}_k \hat{\mathbf{x}}_{k-1|k-1} + \left( \boldsymbol{\mu}_{\kappa_k+1}^{(i_k)} - \boldsymbol{\mu}_{\kappa_k}^{(i_k)} \right), \quad \boldsymbol{\Sigma}_{k|k-1} = \mathbf{A}_k \boldsymbol{\Sigma}_{k-1|k-1} \mathbf{A}_k^\top + \mathbf{Q}_k + \mathbf{T}_{\kappa_k}^{(i_k)}. \tag{28}$$

Equation (28) shows that the covariance propagation exceeds that of the conventional KF-based estimation by the addition of $\mathbf{T}_{\kappa_k}^{(i_k)}$. The correction is conducted by Equations (19c)-(19g) of the KF with $i_k$ in place of $i$.
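A minimal sketch of such a prediction step follows, assuming the pattern is advanced one aligned step per prediction; the function name and argument layout are illustrative, and the exact form of the mean update is an assumption consistent with the covariance inflation by T described above:

```python
import numpy as np

def predict_with_pattern(x, P, A, Q, mu_pat, T_pat, kappa):
    """Prediction step incorporating an intention-pattern model: the mean is
    advanced by the pattern increment and the covariance propagation is inflated
    by the pattern covariance T, preventing underestimation of uncertainty."""
    x_pred = A @ x + (mu_pat[kappa + 1] - mu_pat[kappa])   # intention-driven increment
    P_pred = A @ P @ A.T + Q + T_pat[kappa]                # conventional propagation plus T
    return x_pred, P_pred
```

Applying this step recursively, without corrections, is what enables the prediction-driven estimation discussed next.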
Because the proposed approach estimates the state incorporating the human intention, the mean of the estimated state is potentially more accurate than that of the conventional state estimation when observations are not available or reliable. The covariance of the estimated state is also more precise, as it is updated by adding the uncertainty of the human intention, which prevents underestimation. Finally, note that the proposed state estimation allows future prediction with human intention, in addition to the current estimation, by recursively predicting the state with Equation (28).

Numerical Validation
Having the strength of the intention-pattern model identified, it is essential to test the proposed approach numerically and identify its capability and limitations. The approach was evaluated by applying it to the state estimation of a human-maneuvered multirotor, one of the high-demand applications of this class. To identify the capability and limitations in depth, a simulated environment was created and used. Figure 6 shows the controller interface used to create the multirotor motion and the resulting hovering, accelerating, and decelerating motions in the software-in-the-loop (SITL) simulation environment, whereas Table 1 lists the parameters used for the simulation. With the right joystick of the controller interface, the human issues a void command for hovering and a forward or backward command for accelerating or decelerating.
The multirotor dynamics was calculated in Gazebo, which also created motion noise artificially. As the most fundamental and typical motion, the linear horizontal motion of the multirotor was considered. The multirotor's state, $\mathbf{x}$, is given by

$$\mathbf{x} = \left[ p, \dot{p}, \theta, \dot{\theta} \right]^\top, \tag{29}$$

where $p$ is the position in the moving direction, $\dot{p}$ is the linear velocity, $\theta$ is the attitude (pitch angle), and $\dot{\theta}$ is the angular velocity. The estimator was assumed to observe all the state variables of the multirotor, i.e., $\mathbf{z} = [z_p, z_{\dot{p}}, z_\theta, z_{\dot{\theta}}]^\top$. Figure 7 shows the time-varying human command, true state, and observation. The observation was created with $[o_{s1}, o_{s2}] = [1, 0.05]$. The observation noise was set high as the proposed approach is effective when the observation is uncertain or unavailable. The first 100 s was used to construct the intention-pattern model, and the state estimation using the proposed approach was conducted with the observation of the remaining 60 s. The command varies dynamically, and the multirotor motion is seen to reflect the commands of forward, void, and backward. Through the analysis of the multirotor state estimation problem, the motion and observation models used by the proposed approach were linear. The motion matrix $\mathbf{A}_k$ follows from the linear multirotor dynamics, whereas the observation matrix $\mathbf{C}_k$ is a four-dimensional identity matrix. Table 2 lists the parameters of the proposed approach for both the intention-pattern model construction and the state estimation. The number of prediction steps between observations is denoted as $n_p$, as it takes a different value for each process/study.
While the variances of the observation noise are known, those of the motion noise were determined from theoretical and experimental analyses. $\mathbf{U}^{(1)}$, $\mathbf{U}^{(2)}$, and $\mathbf{U}^{(3)}$ were chosen to infer the decelerating intention $\eta^{(1)}$, the hovering intention $\eta^{(2)}$, and the accelerating intention $\eta^{(3)}$, respectively. $\theta_U$ is a parameter to control the value of $\mathbf{U}^{(i)}$ for the parametric study. Section 4.1 investigates the validity of the construction process of the intention-pattern model through the parametric study. Section 4.2 then validates the estimation performance using the intention-pattern model.
Construction of Intention-Pattern Model
Figure 8 shows the inferred intentions and the corresponding smoothed trajectories when $\theta_U$ was 0.2. The smoothed trajectories are segmented based on the inferred intentions. The position is seen to appropriately increase and decrease when the human intention is accelerating and decelerating, respectively. As $\mathbf{U}^{(1)}$, $\mathbf{U}^{(2)}$, and $\mathbf{U}^{(3)}$ differ from each other in the pitch angle $\theta$, the pitch angle plot also shows the intentions clearly: $\theta$ near 0 indicates hovering; positive $\theta$ with large magnitude indicates accelerating; negative $\theta$ with large magnitude indicates decelerating.

Figure 9 shows the aligned segments and the variances of each resulting intention-pattern model when $n_p = 1$. It is first seen that the aligned segments are consistent, which indicates that the proposed intention inference is valid. More consistency is shown in the position than in the pitch angle, partly because the Gaussian assumption is not flexible enough to describe the pitch angle. The derived variances show that the intention-pattern models are modeled probabilistically from observations and could be used to perform state estimation more precisely.

To analyze the dependency of the intention inference, the F1 score [31], evaluating the inference performance, was derived with different levels of observation noises and control terms. The parameters varied were $o_{s2}$ for the observation noise and $\theta_U$ for the control term, as the pitch angle $\theta$ characterizes the intention. The ground-truth intention was defined based on the real $\theta$ value: hovering when $|\theta| \leq 0.05$; accelerating when $\theta > 0.05$; decelerating when $\theta < -0.05$. The F1 score is calculated as

$$F_1 = \frac{2}{\frac{TP+FP}{TP} + \frac{TP+FN}{TP}} = \frac{2\,TP}{2\,TP + FP + FN},$$

where $TP$, $FP$, and $FN$ correspond to the number of steps of true positives, false positives, and false negatives [31]. An F1 score closer to 1 indicates better inference. Figure 10 shows the distribution of the F1 score over $o_{s2}$ and $\theta_U$. As seen from the figure, the smaller the noise, the better the inference. For $\theta_U$, there is a best value in the middle; either too large or too small a value results in poor inference.
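The F1 computation above can be sketched directly from its definition (the helper name is illustrative):

```python
def f1_score(truth, pred, positive):
    """F1 score of one intention class: F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(truth, pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(truth, pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```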

Estimation Using Intention-Pattern Model
Having the intention-pattern model constructed using the first 100 s, Figure 12 shows the result of state estimation incorporating the constructed intention-pattern model in the subsequent 60 s. Unlike the intention-pattern model construction, the state estimation uses $n_p = 5$, as the effect of the proposed approach can be seen with the motion prediction. The ground truth and the result of the conventional KF estimation without intention incorporation are also shown for comparison. The estimation result of the proposed approach is seen to be closer to the ground truth than that of the conventional approach. The estimation of $p$ and $\dot{p}$ particularly shows the responsive estimation of the proposed approach when the target motion is changed by the human, while the conventional estimation exhibits a notable delay.
The faster response is due to the use of the intention-pattern model. The conventional approach could improve estimation through frequent accurate observations, but observations are often uncertain or unavailable. Figure 13 shows the absolute error of the estimated mean of each state variable with respect to time. While less difference is seen in $\theta$ and $\dot{\theta}$, the error of the proposed approach in $p$ and $\dot{p}$ consistently and significantly stays low compared to the conventional approach. The difference is particularly large when the human changes the target motion, as the conventional approach does not take the human intention into account. The maximum error and the mean squared error (MSE), which integrate the absolute errors into a single quantity, are improved by almost three times and 8.7 times, respectively, when the proposed approach is deployed. Figure 14 shows the variance of each state variable estimated by the proposed and the conventional approaches. The result shows that the proposed approach exhibits larger variances than the conventional approach when the error is large. As the proposed approach infers human intentions and adds their uncertainties, its variance is estimated more precisely and adequately. The variance of the conventional approach, on the other hand, is significantly smaller even though the mean estimation is wrong. Having the human control deterministically treated without inferring intentions, the conventional approach markedly underestimates the uncertainty.
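The two summary quantities used above can be computed as in the following sketch (a hypothetical helper, not from the paper):

```python
import numpy as np

def error_metrics(truth, estimate):
    """Maximum absolute error and mean squared error of an estimated trajectory,
    the two summary quantities used to compare the estimators."""
    err = np.abs(np.asarray(truth, dtype=float) - np.asarray(estimate, dtype=float))
    return float(err.max()), float(np.mean(err ** 2))
```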
The performance of the proposed approach in state estimation was further examined through the parametric study. Figure 15 shows the MSE of the proposed approach when $o_{s1}$ and $n_p$ were varied. $o_{s1}$ was varied to examine the effect of the observation noise, as it contributed less to the construction of the intention-pattern model. The result of the conventional approach is also shown for comparison. It is first seen that the MSE of the proposed approach is significantly lower than that of the conventional approach when $o_{s1}$ is large. A large $o_{s1}$ increases the dependency of the state estimation on the prediction. The proposed approach, incorporating human intention and effective in prediction, can thus keep the MSE low. The result also shows that the MSE of the proposed approach remains low even when $n_p$ is large. $n_p$ also increases the dependency of the state estimation on the prediction, so the proposed approach outperforms the conventional approach in accuracy. Meanwhile, the proposed and the conventional approaches exhibit a similar MSE when $o_{s1}$ is low and $n_p$ is one. This is because the estimation becomes correction-driven as the frequency of the accurate correction becomes high.

Conclusions
This paper has presented an approach that estimates the state of a human-maneuvered target incorporating human intention, which consists of a preprocess constructing an intention-pattern model and the main process allowing state estimation using the intention-pattern model. The preprocess constructs the intention-pattern model from prior observations and represents the model probabilistically. The main process then uses standard state estimation such as the KF while extensively leveraging the probabilistically represented intention-pattern model. In the application of the proposed approach to the state estimation of a human-maneuvered multirotor, the numerical result has first shown that the constructed intention-pattern model represents the human intention appropriately. The result of the state estimation of the human-maneuvered multirotor then shows that the proposed approach estimates the state more accurately than the conventional approach, particularly when observations are uncertain or unavailable. The proposed approach has also been demonstrated to estimate the covariance more precisely.
The paper has reported the first progress of the state estimation of a human-maneuvered target using human intention, and much future work is possible. Ongoing work includes the extension of the proposed approach to partially observable problems and to model predictive control. Observations are necessary for the construction of the intention-pattern model, but the state may not be fully observable. The proposed approach is effective in prediction-driven estimation, so the model predictive control of an autonomous robot becomes one of the most effective extensions. The outcomes will be summarized and published as soon as they are ready.