Efficient Reject Options for Particle Filter Object Tracking in Medical Applications

Reliable object tracking that is based on video data constitutes an important challenge in diverse areas, including, among others, assisted surgery. Particle filtering offers a state-of-the-art technology for this challenge. Becaise a particle filter is based on a probabilistic model, it provides explicit likelihood values; in theory, the question of whether an object is reliably tracked can be addressed based on these values, provided that the estimates are correct. In this contribution, we investigate the question of whether these likelihood values are suitable for deciding whether the tracked object has been lost. An immediate strategy uses a simple threshold value to reject settings with a likelihood that is too small. We show in an application from the medical domain—object tracking in assisted surgery in the domain of Robotic Osteotomies—that this simple threshold strategy does not provide a reliable reject option for object tracking, in particular if different settings are considered. However, it is possible to develop reliable and flexible machine learning models that predict a reject based on diverse quantities that are computed by the particle filter. Modeling the task in the form of a regression enables a flexible handling of different demands on the tracking accuracy; modeling the challenge as an ensemble of classification tasks yet surpasses the results, while offering the same flexibility.


Introduction
Novel technological developments have led to great advances in computer assisted surgery, whereby image-guided support of surgeons plays a particularly important role besides direct robotic assistance [1]. Some advances can be observed in recent years, including, among other approaches, dedicated open source platforms to advance and evaluate technical progress [2]. Assistive systems often rely on visual information, such as videos or depth images, and imaging sensors play a crucial role for a wide array of medical domains. One such domain with a potentially high benefit constitutes the field of osteotomy, where the exact position and pose of the bone has to be known precisely. In this context, current practice is to use markers, such as 3D-printed templates, to assist the surgeon [3]. However, the design and production of these templates constitutes a significant overhead that could be avoided by markerless tracking systems. Such intelligent tracking methods enable the reliable identification or segmentation of the images into semantically meaningful parts, e.g., relevant organs, bone structure, or malignant parts, and tracking these over time.
Several approaches address optical tracking in the medical domain, such as models that are based on optical flow [4], or probabilistic frameworks [5]. In the last decade, particle filtering became one of the most prominent methods for reliable positioning and tracking in the medical domain and beyond [6,7]. Particle filtering basically constitutes a technique for an efficient Monte Carlo simulation to solve filtering problems that arise when internal states of dynamic systems, such as the location of a specific constituent in an image, need to be estimated based on partial observations that are subject to sensor noise. Thereby, a crucial part of the model is the design of a suitable probability model that captures the true underlying probability but is sufficiently easy to compute. There exists extensive work regarding appropriate probabilistic models in continuous and discrete settings [8,9] as well as extensive research on the accuracy of the result and how to arrive at unbiased estimators [10][11][12][13]. Usually, real-time performance is required, such that computationally complex models are prohibited and considerable work addresses computation schemes on the edge [14].
Popular models can typically incorporate some form of generic measurement noise, yet major disturbances, such as occlusions, pose a challenge. Hence, it is a matter of ongoing research as to how to design robust tracking schemes that enable a re-detection of the object after it has been lost and realize a larger robustness and invariance to noise, as an example, by means of multiple correlation filters or robust representations, such as hidden layers of deep networks [15][16][17]. However, in realistic scenarios, it is unavoidable that tracked objects are lost in some settings, as an example, in the case of a total occlusion of the tracked object. In critical settings, such as assisted surgery, it is desirable to equip tracking algorithms with the notion of a reject option-meaning that tracking algorithms are capable of detecting on their own if the tracking is no longer reliable and a major discrepancy of the tracked object and the result of the tracking algorithm is to be expected. In such cases, an autonomous or assisted re-initialization of the tracking algorithm could be done and major harm due to a lost tracking be prevented. Work in this direction exists that focuses on rejecting outlier measurements [18], which uses non-parametric kernel density estimation. The latter is well known to be impractical already for a medium number of dimensions [19]. In contrast, the sum of likelihoods of updated particles is used for a decision on rejecting in [20]. The authors of [21] mitigate the effect of corrupted observations by updating only a subset of the particles, where an observation is considered to be corrupted if the largest particle likelihood is below a pre-defined threshold. We consider this detection mechanism, which is also investigated by Chow [22] in a more general setting, as a baseline in our evaluation. In this contribution, we investigate the challenge of how to equip a recent state-of-the-art particle filtering technology for object tracking, which is based on numerically stable representations of particle weights in the logarithmic domain [23], by efficient possibilities to equip the tracking result with a notion of reliability in the form of a reject option, which filters too large deviations of the tracked quantities from their true values. We investigate this, in particular, with respect to changing configuration settings of the particle filter.
The question of how to enhance models with a notion of their reliability and an according reject constitutes an open research challenge for more general models than particle filters, as considered in our approach. For example, in classification tasks, the notion of selective classification or classification with a reject option has already been introduced quite early on [22]. It has been shown that a simple threshold strategy yields an optimum reject in the sense of cost optimization, provided that the true underlying probability is known. The latter is usually not the case, and the models inferred from data provide approximations of the true values only, provided probabilistic, rather than frequentist, modeling takes place. As a consequence, it has been investigated, in how far deterministic models can be extended to incorporate a reject option by suitable loss modeling [24], how confidence values can be approximated by suitable alternatives that are provided by the given model [25], and how reject options can be learned in parallel to the task at hand [26]. Note that statistical frameworks exist-such as conformal prediction [27]-which compute confidence values from a given model and data in such ways that strong mathematical guarantees hold. Yet, these approaches are usually time consuming and they cannot be used in an online tracking task where real-time conditions hold.
In this contribution, we address the question of how to equip a state-of-the art particle filter for tracking by efficient schemes that implement a reliable reject option to detect cases where the tracking is invalid. Because particle filters are based on a probabilistic modeling and, hence, provide likelihood values, a natural baseline builds a reject option on a global threshold strategy: settings are rejected if the likelihood is too small, following the strategy, as investigated by Chow [22]. It turns out that the results of this threshold strategy are not satisfactory for a benchmark problem from the domain of assisted surgery. We propose modelling the problem as a machine learning task, and we propose two possible ways to formalize the problem: as a regression task that predicts the amount of displacement, or as an ensemble of classification tasks that detect settings for which the misplacement is larger than a given threshold. We will formalize these two approaches and evaluate those in a realistic benchmark and different settings of the particle filter.

Experimental Prerequisites
A novel approach to jaw reconstruction surgery utilizes bone cut from the patient's pelvis ( Figure 1a) to create a graft for the jaw (Figure 1b) with low risk of getting rejected by the body [28]. For the cutting procedure a custom 3D model is created as a template for the surgeon. So far, this model was printed and then placed on the bone, which is expensive and time consuming. In a new approach [29], a robotic arm is equipped with a depth camera to track a representation of the template in the depth image while using a particle filter [23], and a projector to display cutting outlines onto the bone ( Figure 2) during surgery. However, this new approach leads to the risk that the projection is off when the tracking is inaccurate. The projection cannot be static, since displacements of the body are unavoidable during surgery. Hence, a tracking of the shape and an according adjustment of the projection is necessary. There exist several reasons why tracking can be wrong-as an example, a major challenge consists in the fact that the tracking target might be partially or fully occluded by the surrounding tissue or a person.

Project graft model
Track in depth image Our goal is to find a method that identifies when the track is inaccurate, so the surgeon can be notified to stop cutting, and a manual reset may be conducted, if needed. Thereby, we want to find a technology that works for different parameterizations of the particle filter that we will introduce in Section 2.2, such that an easy transfer in between different settings becomes possible. Because it might be specific to the situation, which deviation of the tracking from the ground truth is crucial, we also consider different degrees of inaccuracy that will be explained in Section 2.4. This way, our method can also be applied to other setups that utilize particle filter tracking.
For this purpose, we utilize the data and particle filter implementation from the study [29]. In the following, we first summarize the according particle filter methodology before we then provide more details on the used data sequences.

Particle Filter Tracking
Assume a depth image at time t is denoted where m and n are the width and height of the depth image and x i,j the distance to the camera of the pixel at position i, j. The basic objective of the particle filter tracking is to find the poseρ consisting of positionp t as 3D coordinates in the room, and orientationq t as a quaternion, for the 3D model of the graft that has the biggest pixel overlap to the depth image Im t at time t.ν trans t is the current translational velocity of the particle andν rot t the current rotational velocity. The tracking algorithm that has been implemented and evaluated in the approach [29] consists of the following steps: it is manually initialized at approximately the correct position. Subsequently, within an iterative loop, we predict K possible new poses P t+1 = (ρ 1 t+1 , . . . , ρ K t+1 ) (particles) by sampling according to a noise distribution with the parameters for the translational noise n trans , rotational noise n rot , and a velocity factor v f . N refers to the normal distribution. The new velocities are obtained by adding gaussian noise to the old ones. Additionally, the velocity is assumed to be decaying due to a loss of energy that is modeled by the multiplication of the velocity factor. Using the new velocities, the new position and orientation are calculated by applying the backward Euler integration method. Note that the new orientation is the Nlerp between our old orientation and the new orientational velocity. For each new particle, we generate an expected depth image Im ρ k t+1 that is based on simple geometric principles, and we calculate the weight where ll is the log likelihood estimating how likely the observed pixel x o fits the expected pixel x e and it is calculated as where w uni , w gaus , w exp are configurable as uniform, Gaussian, and exponential weight, and the camera intrinsic valuesd,ď as maximum and minimum depth value, d f as depth factor, and bn as the camera's base noise. In the first case, the expected depth at this pixel is invalid, i.e., it belongs to the background, therefore any observation fits. In the second case, occlusion is a possible reason why the observation is closer than the expectation. This possibility is modeled with an additional exponential probability term as compared to the third case. In the case of an invalid observation, the uniform weight is taken as a random measurement (case 4). The tracking poseρ t+1 is then determined by the particle with the maximum log likelihoodρ See the reference [23] for details.

Tracking Scenario
As a concrete tracking scenario that is representative for assisted surgery, we use a rosbag [30] that was recorded during a test surgery on a cadaver at University Hospital RWTH Aachen. This bag contains the depth image data during the whole surgery as a sensor_msgs/PointCloud2 [31] and a tracking output as geometry_msgs/Pose [32] (Figure 3). Because this output was under human supervision and manually reset and corrected, we can utilize it as our ground truthρ t . This allows for us to rerun the tracking while replaying the bag to record different outputs of the tracking algorithm in different configurations.
For each run, we save the parameter configuration X conf ∈ R 7 which contains: translational noise (n trans ), rotational noise (n rot ), velocity factor (v f ), exponential weight (w exp ), gaussian weight (w gaus ), uniform weight (w uni ), and particle count (K). Table 1 shows the chosen value ranges.  Additionally, we record the particle filter output X out ∈ R 4 that contains the maximum log likelihoodll = max k w ρ k t the effective sample size ess = −ln(∑ K i=1 2e w ρ i ) where normalized weights are used for computation and the mean and variance of all weights. Lastly, we save the labels Y ∈ R 2 which contain the distance of our current track to the ground truth δ t = (p t −p t ) 2 and the angular difference between two quaternions α t = 2acos((q t * q −1 t ) 4 ). The two most common problems during tracking are occlusions in the depth image (Figure 4a) by the surgeon and sudden displacements when the body's position is readjusted for easier cutting (Figure 4b). For our data set, we chose two sequences where either an occlusion or displacement takes place that present considerable problems for the tracking algorithm. Both of the sequences are around 450 frames long. We reran the tracking with the different parameter configurations on both sequences by replaying the aforementioned rosbag. An overview of the recorded data set can be seen in Table 2.

Formalization of an Inaccurate Track
We need to define when the tracking is too inaccurate and we consider the track lost in order to train our machine learning models. This is a priori unclear, since suitable choices depend on the current task. The track can be off in its position, which can be measured as the distance to the ground truth, and in its orientation measured as the angle difference to the ground truth. We choose 20 equidistant threshold pairs in order to evaluate different constellations for possible inaccurate tracking, thus enabling a larger flexibility for practical applications where d i represents the tolerance value for the distance (in m) to the ground truth and a i the tolerance value for the according angular difference (in rad). The pairs are ordered from very strict (σ 1 = (0.05, 0.13)) to very tolerant (σ 20 = (0.2, 0.26)). Tracks that are off by more than d i in distance or a i in angle are considered to be lost in regards to the chosen threshold pair, i.e., one such pair defines the ground truth for the subsequent prediction tasks.

Reject Strategies Based on Machine Learning Methods
We investigate three different approaches for determining when the current track is lost. Threshold strategy: because particle filters are based on probabilistic models, we can implement a simple thresholding based approach, as proposed by Chow [22] as a baseline. Intuitively the maximum log likelihood gives confidence for the current track. Accordingly, we utilize a threshold τ to predict a lost track for a chosen pair of parameters σ = (d, a) characterizing the desired accuracy:l l < τ σ (8) τ σ could be chosen according to a desired likelihood, yet this requires a suitable scaling of the values and its relation to the geometric accuracy. In our approach, we optimize the threshold value τ that is based on the training data by evaluating a set of candidate values in the range [−200, 200].
Ensemble of binary classification tasks: we consider two approaches for framing the current problem of deciding whether the current track is lost (as defined by the selected pair σ i ) as a machine learning task. Firstly, we model our problem as a binary classification task for each σ: Here, the input to the machine learning model x is chosen as a vector of representative quantities of the particle filter. It is the four-dimensional vector of the particle filter outputs X out and the seven parameters X conf , whereby the values are averaged over a 20 frame window, and This way, a separate classifier is used for every desired accuracy regarding distance and angle difference. We train our classification models on the same distance and angle difference combinations as in our baseline evaluation and with a 10 frame overlap between windows. We use Support Vector Machine (SVM) [33] with a linear kernel and Random Forests [34] as the classification models. We also apply SMOTE sampling [35] to combat any imbalance in our data set. See [36] for the implementation that we used.
Note that SVM and Random Forests both provide a certainty value of the class by means of the distance to the decision boundary or the percent of trees that predict a certain class, respectively. By shifting the threshold from 0 to a different value, we can extend a model that is optimized for a specific threshold pair σ to neighboring ones. This gives rise to a sparse ensemble of models: instead of using the models f σ for all pairs σ, we can rely on a small number of classifiers, which extend a model f σ to a neighborhood of different thresholds σ by moving the classifier's threshold φ. The range σ contains every σ i for which a decision threshold φ exists, so that classifier f φ σ still yields target values for precision and recall. We then find the minimum number of classifiers, so that their ranges σ contain every σ i at least once. If a threshold σ i is contained in multiple ranges, then the classifier that yields the highest combined precision and recall is chosen. This ensemble of classifiers can then be evaluated on the test set for each threshold combination.
Regression task: ss a second modeling approach, we train a regression model: where x is the same as above and y = d a is a prediction of the distance and angle difference. In this case, only one model is trained for every choice of threshold pairs σ. Given a specific threshold pair σ i = (d i , a i ), a decision is done that is based on the question whether d i ≥ d and a i ≥ a for f (x) = (d, a) being the predictions of the regression model on x. In the following, we evaluate such regression against the same set of distance and angle difference combinations. As candidate models for this regression task, we consider a Linear Regression [37], Support Vector Regression with Gaussian Kernel [38], and Random Forest Regression [39]. All of the models are implemented using the scikit learn library [40].

Results
All methods are evaluated in a 10-fold cross-validation scheme, adjusted for time series data in order to ensure time consistency: In each split, consecutive data samples amounting to 1 11 of a time sequence are used for testing while all the preceding samples are used for training. In the next split, the old test samples are added to the training set and the successive 1 11 of the time series are employed for the next test set, such that every data sample is used for testing at maximum once.
The direct threshold strategy yields a precision larger than 80% but a recall of close to 50% only (see Figure 5). Hence the baseline evaluation shows that the maximum log likelihood is not scaled in such a ways that it allows a robust reject option based on a threshold.
As a comparison, the results obtained by modeling the task as a regression are significantly better. Here, Random Forest regression performs best. It yields both precision and recall close to 90%, which is an acceptable result and much higher than our baseline. Provided that a separate classification model is designed for every pair of threshold values σ, we can further improve, i.e., using a Random Forest classification, we obtain a precision that is close to 1 and recall of around 98% for almost all values. An overview of all trained models can be seen in Table 3 and a comparison between the best performing regression and classification model against the baseline is shown in Figure 5. The current classification models, as shown in Table 3, consist of an ensemble of different classifiers according to every pair of threshold values. Because this can become quite exhaustive, the question occurs as to whether an acceptable accuracy can be achieved by using fewer classification models of the same type with different threshold values of the classifier per instance σ. Therefore, we investigate how to minimize the number of different classifiers required in the ensemble given a desired quality of precision and recall, as described in Section 2.5. Tables 4 and 5 show the results of this method. Interestingly, already a single Random Forest classifier yields precision and recall 0.9, and five models already cover the full space with precision and recalls of 0.98, yielding flexible as well as high performant models for a reject option for particle tracking. For SVM classification, more models are needed for the ensemble. The individual decision boundaries need to be modified, such that target recall values are met, thus resulting in actually lower precision values for a higher number of models.

Discussion
We have investigated efficient possibilities to enhance particle filtering models for tracking by a reject option, which enables practitioners to react to an alert as soon as the tracking precision is lower than a predefined threshold. Although particle filters provide a tracking probability, this quantity is not scaled in such a way that a simple threshold strategy provides satisfactory results. As a consequence, we have investigated the possibility to model reject options as a learning task, more precisely as regression or classification problem. It turns out that an ensemble of classifiers provides the best results, with an accuracy reaching 0.98, depending on the chosen setting.
The proposed approach provides an immediate possibility to enhance particle filters for tracking in assisted surgery by an alert strategy to prevent malfunctioning. Because of the generality of the proposed approach, we expect that the proposed modeling framework is of more general interest to enhance particle filtering from visual sensor data by a reject option. Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ethical concerns since they were obtained in a clinical trial.

Conflicts of Interest:
The authors declare no conflict of interest.