ssMousetrack: Analysing computerized tracking data via Bayesian state-space models in {R}

Recent technological advances have provided new settings to enhance individual-based data collection and computerized-tracking data have became common in many behavioral and social research. By adopting instantaneous tracking devices such as computer-mouse, wii, and joysticks, such data provide new insights for analysing the dynamic unfolding of response process. ssMousetrack is a R package for modeling and analysing computerized-tracking data by means of a Bayesian state-space approach. The package provides a set of functions to prepare data, fit the model, and assess results via simple diagnostic checks. This paper describes the package and illustrates how it can be used to model and analyse computerized-tracking data. A case study is also included to show the use of the package in empirical case studies.


Introduction
Recent technological advances allow the collection of detailed data on ratings, attitudes, and choices during behavioral tasks. Unlike standard surveys and questionnaires, these tools provide a rich source of data as they adopt tracking devices that collect subject-based information about the dynamics involved during the data collection task [10,24]. Examples of such devices include eye-tracking, body movement-tracking, computer mouse-tracking, and electrodermal activity. Among these, computer mouse-tracking has become an important and widely used tool in behavioral sciences, as it provides a valid and cost-effective way to measure the usually unknown processes underlying human ratings and decisions [9]. Mouse-tracking data are often collected by means of standard computer mouse, wii instruments, and joystick devices and consist of collections of real-time trajectories recorded during the behavioral task. In a typical mouse-tracking task, individuals are presented with a computerbased interface showing the stimulus at the bottom of the screen (e.g., the image of a "dolphin") and two labels on the left and right top corners (e.g., the labels "mammal" vs. "fish"). They are asked to decide which of the two labels is appropriate given the task instruction and stimulus (e.g., to decide whether dolphin is mammal or fish). In the meanwhile, the x-y trajectories of the computer device are instantaneously recorded. The real-time trajectories offer an effective way to study the decision process underlying the hand movement behavior by revealing, for instance, the presence of some levels of decisional uncertainty. The applications of mouse-tracking tools spread across many normalized on a common sampling scale such that N is the same over subjects and stimuli. Next, the arrays (x, y) ij ∈ R N × R N are rotated and translated into the quadrant [−1, 1] × [1,1] with (x 0 , y 0 ) ij = (0, 0) and (x N , y N ) ij = (1, 1) by convention. Finally, normalized data are projected onto a (lower) 1-dimension space via atan2 function. In this way, the final ordered data y ij = y (1) , . . . , y (n) , . . . , y (N ) ∈ (0, π] N lie on the arc defined by the union of two disjoint sets, i.e. the set {y ∈ (0, π] : y ≥ π 2 } which represents the right-side section of the screen (usually called target, T) and the set {y ∈ (0, π] : y < π 2 } which instead represents the left-side section (usually called distractor, D). The final data are arranged as an Y N ×JI column-wise stacked matrix.
The state-space model implemented in ssMousetrack is as follows: (1) where Equation (1) is a von Mises measurement equation with µ (n) JI×1 ∈ (0, π] IJ being the mean for the n-th time step and κ (n) JI×1 ∈ R IJ the concentrations around the n-th mean vector, Equation (2) represents the locations on the arc defined in (0, π] from which the data vector y (n) is sampled from and it behaves according to the a real function G : R → (0, π], which maps reals into radians. In the current version of ssMousetrack, G can be defined as: (i) π-scaled logistic function: with β J×1 ∈ R J representing the contribution of the stimuli on y (n) .
(i) π-scaled Gompertz function: where β J×1 ∈ R J + has the same meaning as before. Although they represent two cases of the general family of S-shaped functions, logistic and Gompertz models differ in some respects. For instance, unlike the logistic model, the Gompertz function is not symmetric around its inflection point, with the consequence that its growth rises rapidly to its maximum rate occurring before the fixed inflection point [23]. Moreover, the parameters of the Gompertz function are always positive, a constrain which is often required by applications where the covariates of the model cannot take negative values (e.g., reaction times). These two implementations allow users to choose the type of G function on the basis of the experimental designs they have used in their studies.
In Equation (3) is a Normal states equation which represents a lag-1 autoregressive process with time-fixed variance parameter σ x . In the current version of ssMousetrack, the covariance matrix of the latent processes is set to an identity matrix I without loss of generality (σ x = 1). Equation (4) is the linear term modeling the contribution of the experimental design (e.g., two-bytwo design) and variables involved (e.g., categorical variables, continuous covariates). Note that Z is a design (dummy) matrix of main and high-order effects defined by adopting the dummy coding (e.g., treatment contrasts, sum contrasts) whereas γ is the associated vector of parameters for the columns of Z, with γ 1 being the usual baseline term for the contrasts. Finally, Equation (5) defines the concentrations around the n-th location by using the transformed data: with exp † : (0, π] → [lb, ub] ⊂ R + being the exponential function scaled in the natural range of the concentration parameter (e.g., lb = 10, ub = 200). In the current implementation of the package, the parameter λ is fixed to unity. The interpretation of Equations (2)-(4) is as follows. The n-th mean vector µ (n) is expressed as function of the stimuli-related component β and subject-based component x (n) , which are integrated together to form the conditional sampling y (n) |β, x (n) through the function G. As a result, Equation (3) can be interpreted as the individual latent dynamics that are unaffected by the experimental stimuli whereas Equation (4) represents the experimental effect regardless to individual dynamics. More generally, Equation (3) conveys information about the hand movement process underlying the tracking device and as such it can be used to analyse how much individuals differ in executing the task. By contrast, Equation (4) collects information on how a certain experimental manipulation has an effect or not on the movement responses. Interestingly, when normalized into [0, 1], µ (n) can be interpreted as the probability of the i-th individual at the j-th stimulus to navigate close the distractor cue in the left-side section of the arc. Finally, Equation (5) follows from the fact that hand movements underlying computerized tracking data tend to be smooth over the experimental task, with small changes being more likely close to left (distractor) or right (target) endpoints [3].

Estimation and inference
The state-space model in Equations (1)-(5) requires estimating the array of latent trajectories X ∈ R I×N together with the array of parameters γ K×1 , with γ 1 ∈ R and γ (K−1)×1 ∈ R K−1 (logistic case) or γ (K−1)×1 ∈ [−γ 1 , ∞) K−1 (Gompertz case). The array of unknown quantities Θ = {X, γ} can be estimated in various way, by adopting both a frequentist and Bayesian perspectives [30]. In the ssMousetrack package, the parameters are recovered in Bayesian way by means of a marginal MCMC algorithm through which X and γ are alternately recovered [1,29]. The reason is twofold: (i) MCMC algorithms, as those implemented in rstan package, provide a more efficient and complete solution for sampling from the probability distribution of the parameters. (ii) The Bayesian approach offers an elegant solution for data analysis and inference [13] by means of which the model is adequately assessed by the analysis of (marginal) posterior distributions of the parameters [22].
More in details, the posterior density f (Θ|Y) after factorization of the joint density f (γ, X, Y), is as follows [1]: where f (γ|Y) is the (marginal) likelihood function, f (X|Y) is the filtering density, whereas f (γ) is the prior ascribed on the model parameters. In the current version of ssMousetrack, f (X|Y) is approximated via Kalman filtering/smoothing, with f (γ|Y) being computed as a byproduct of the Kalman theory (see Appendix A).

Model assessment
In the Bayesian context of data analysis, ssMousetrack provides a simulation-based procedure to evaluate the adequacy of the model to reproduce the observed data Y [13]. More technically, given the posteriors of parameters and latent states f (Θ|Y), M new (simulated) datasets Y * 1 , . . . , Y * M are generated according to the estimated model structure and, for each new dataset, two discrepancy measures are considered [18]: which measure the total amount of data reconstruction based on the overall JI × N observed array Y (Equation 7) and the amount of data reconstruction based on the J × N observed matrix Y (i) for each subject i = 1, . . . , I (Equation 8). Both the indices are in the range 0-100%, with 100% indicating optimal fit. Note that the measure PA sbj allows for evaluating the adequacy of the model to reconstruct the individual-based set of data. In addition, the dynamic time warp distance (dtw), as implemented in dtw package, is also computed between Y (i) m and Y (i) . Unlike the PA sbj index, the dtw distance measures the similarity among time series by considering their different dynamics [14].

The ssMousetrack package
The ssMousetrack is distributed via the Comprehensive R Archive Network (CRAN). It is based on rstan [31], the R interface to the probabilistic programming language Stan [6]. The current version of the package allows for (i) simulating data according to a given experimental design, (ii) analysing mouse-tracking data via state-space modeling, and (iii) evaluating the adequacy of model results. The package consists of five main function (generate_data(), run_ssm(), check_prior(), prepare_data(), evaluate_ssm()), two datasets (language, congruency), and three sub-functions (compute_D(), generate_Z(), generate_design()). The main functions generate_data() and run_ssm() are wrappers to previously-compiled Stan codes which implement the model described in Section 2. Table 1 provides an overview of the functions and datasets provided in the ssMousetrack package whereas a description of the usage of the functions is reported in the next subsections.

Generate artificial data
To simulate artificial data we use the function generate_data(), which requires as input the experimental template for the data generation process. More generally, the function works by first sampling the parameters γ from the prior density f (γ) and then generates the latent states X from Equation (3), computes the matrix µ from Equation (2) and the matrix D, drawns the matrix of data Y by simply applying Equation (1). For instance, an experiment with one categorical independent variable and two levels, each with three trials, can be generated via the following syntax: function type description generate_data() main simulate data according to a user-defined experimental design. run_ssm() main run state-space model on a given mouse-tracking dataset. check_prior() main allows users to define a list of priors for f (γ) prior running run_ssm(). prepare_data() main pre-process raw tracking data prior running run_ssm(). evaluate_ssm() main run model evaluation given an output of run_ssm(). The function can plot results if requested by users. compute_D() internal compute the matrix of distances D given the observed data Y (see Equation 5). generate_Z() internal generate the Boolean trial-by-variable (design) matrix Z (see Equation 4). generate_design() internal allows users to specify an experimental design in terms individuals, trials, variables, and design matrix Z. congruency dataset subset of data from [8]. language dataset subset of data from [2].  where M = 100 is the number of data to be generated, N = 61 is the number of time step for the mouse-tracking trajectories, K = 2 means that we have just one variable with two levels, J = 6 indicates the total number of trials such that J/K is the number of trials for each level of the variable, I = 2 is the number of subject, Z.formula indicates the formula for the contrast matrix Z with standard R syntax. Note that selective priors are specified for each level of the experimental design using the Stan syntax (see the help of check_prior() for further details).
The output is a list containing three sublists, as follows: • params, which contains the model parameters generated for the M datasets: Similarly, artificial datasets can be generated using more complex designs. For instance, a bivariate design with two variables is produced by typing: where K = c(2,4) codifies two variables each with two and four levels, Z.formula = " Z1*Z2" indicates that the variables interact whereas Z.type=c("symmetric","random") indicates that trials must be assigned to the first variable using the symmetric method and to the second variable using the random method (see the help of generate_Z() for further details). Figure 1 shows a sample of mouse-tracking data Y generated in the univariate design case with I = 2, K = 2 and J = 6. We report the univariate case only for the sake of simplicity but the same graphical representations can be done for the more complex designs as well.

Run state-space analysis
State-space analysis can be run on both simulated and real data. In the first case, after the datageneration process, the state-space model implemented in the ssMousetrack package can be fit using run_ssm(). For instance, the syntax: runs the state-space analysis on the iid = 2 artificial data datagen2_ssm. Note that niter indicates the number of total samples to be drawn, nwarmup the number of warmup/burnin iterations per chain, and nchains the number of chains to be executed in parallel. The function run_ssm() allows for parallel computing via the parallel package when nchains > 1. In this case, since ncores="AUTO" (default), the function will run two parallel chains using two cores. Unlike for the case of artificial data, the analysis of real datasets requires preparing raw data in a proper format via prepare_data(), the function that implements the steps described in Section 2. Generally, raw datasets need to be organized using the long-format, with information being organized as nested. The dataset language is an example of a typical data structure required by prepare_data(): data("language") str(language,vec.len=2) ## 'data.frame': 6060 obs. of 6 variables: where condition is the categorical variable involved in the study. The pre-processing of raw data is performed by the call: language_proc <-prepare_data(X = language,N = 61,Z.formula = "~condition") where the output language_proc is a data frame containing the pre-processed dataset together with the column-wise stacked matrix Y of angles, the contrast matrix Z, and the matrix of distances D.
Once raw data have been pre-processed, the state-space analysis is performed as for the case of artificial data: The function returns as output a list composed of three sublists, as follows: • params, which contains the posterior samples for the free parameters γ and β: Note that users can also export the stanfit object with all the Stan results by specifying stan_object=TRUE in run_ssm().

Evaluate the model results
The methods described in Section 2.2 for the model evaluation are implemented by the function evaluate_ssm(), which requires as input the output of run_ssm(). For instance, considering the fitted object language_fit, the model evaluation can simply be run via the command: where M = 1000 is the number of replications to compute the indices. The function returns as ouput a list containing the mean values of the indices PA overall , PA sbj , and dtw, as well as the distributions obtained over the M replications. Note that, users can also ask for a graphical representation of the indices by setting plotx = TRUE.

An Illustrative example
In this section we provide a full example about the way ssMousetrack can be used for statespace analysis of real computerized tracking data. Note that the application reported here has an illustrative purpose only. To this end, we will make use of the dataset language, a subset of data originally presented in [2]. In this typical computerized tracking task, participants saw a printed stimulus on the screen (e.g., the word water ) and were requested to perform a dichotomous choice task where stimuli need to be classified as word or non-word. The experimental variable condition was a categorical variables with four levels (HF: High-frequency word; LF: Low-frequency word; PW: Pseudo-word; NW: Non-word). Participants had to classify each stimulus as word vs. non-word by using a computer-mouse tracking device. The dataset contains I = 5 participants, J = 12 trials, one categorical variable with K = 4 levels, each with J/K = 3 trials. From the data-analysis viewpoint, we evaluat the extent to which the parameters of the state-space model γ reflect eventual differences associated with the levels of condition.
The raw computerized tracking trajectories in the dataset consist of Cartesian coordinates with N = 101 (i = 1, . . . , I; j = 1, . . . , J). The dataset is partially pre-processed as raw trajectories have the same length (N = 101). However, we need to run prepare_data() in order to rotate/translate the raw data into the quadrant [−1, 1]× [1,1] and compute the atan2 projections. The pre-processing step is called by the command: data("language") language_proc <-prepare_data(X = language, N = 101, Z.formula = "~condition") Figure 2 shows the trajectories Y associated with the task for all participants and trials.
Next, the state-space model is fit to the pre-processed data by the following call: where, in this case, the prior for γ have been choosen to codify a priori expectations about the effect of the variable condition [2]. Figure 3 shows some MCMC graphical diagnostics for the model parameters γ computed using bayesplot [12] whereas Table 2 reports the posterior quantities for the model parameters. In the Bayesian context of data-analysis, we evaluate the effects of the variable condition by computing the degree of overlapping among marginal posterior densities for each level of the experimental variable (i.e., the more the overlapping, the weaker the evidence supporting the experimental manipulation). Figure 4 shows the results graphically. Overall, the variable condition showed no strong effect, as the densities of the levels are overlapped. In particular, stimuli in HF, LF, and NW conditions showed no activation of the distractor section of the tracking space asγ HF ,γ NW , andγ LF approach zero. By contrast, stimuli in PW condition showed a small effect on activating the target section (γ HF > 0), possibly due to the fact that PW stimuli require less cognitive workload [2].   tigate how individual dynamics differ over the levels of condition, we can make use ofX and ask whether HF, LF, NW, and PW stimuli differ in terms of evidence of mouse-tracking competition.
The idea is that the higher the evidence, the larger the difficulty in categorizing stimuli as word (target) or non-word (distractor).
To do this, we follow the findings of [2] and divide the entire respose process 1, . . . , N into three disjoint windows W 1 = 10 − 35%, W 2 = 45 − 65%, and W 3 = 70 − 85%. Usually it is expected that a higher competition would be observed in W 1 and W 2 rather than W 3 . More formally, let x M ) be the sequence of filtered states for the i-th subject and the generic time window W , with M being equals to the cardinality of W . Next, the probability to select non-word (distractor) responses are computed by normalizing the G function into the domain [0, 1], as follows: whereγ is the array of posterior means of the model parameters. Note that in this example we use the logistic function because we set gfunction="logistic" in run_ssm(). Finally, the evidence measures can be defined in terms of log-odd ratio using the probability matrix P (i) : is the profile probability for HF, LF, NW, and PW. The interpretation of r (i) is as follows. For r (i) > 0 there is a higher competition in categorizing the stimulus as word (target) vs. non-word (distractor). By contrast, for r (i) < 0 there is a lower competition in the response process, as stimuli are easily categorized as word (target). Finally, the case r (i) = 0 indicates that there is no difference in terms of evidence between word (target) and non-word (distractor) responses. Figure 5-B shows the results for the four levels of condition. As expected, the competition in the third phase of the response process W 3 is low, as the probability to select the target is higher. The same applies to W 2 . On the contrary, in the first stage of the process W 1 the competition is higher although the evidence ratio for all the levels of condition approximate zero. Interestingly, the second phase W 2 shows a higher whithin-subject variability of competition, which probably indicates that subjects differ in the categorization process just in the middle phase of the response process. Finally, we assess the adequacy of the model with regards to the observed data by means of evaluate_ssm(), as follows: where language_fit is the fitted object returned by run_ssm() whereas M = 500 is the number of replications used to compute the three fit indices. The output of the function consists of a list containing means and distributions of the fit indices: Overall, in this example the fitted model is adequate to reproduce the observed trajectory data as supported by high values of the indices PA ov , PA sbj , and dtw. Figure 6 shows the results graphically.

Conclusions
In this paper we introduced the R package ssMousetrack that analyses computerized-tracking data using Bayesian state-space modeling. The package provides a set of functions to facilitate the preparation and analysis of tracking data and offers a simple way to assess model fit. The package can be of particular interest to researchers needing tools to analyse computerized-tracking experiments using a complete statistical modeling environment instead of descriptive statistics only. In addition, the package ssMousetrack allows for individual-based analysis of trajectories where latent dynamics are used to obtain richer information which can pave the way to further analyses (e.g., profile analysis). The current version of the package can be extended in several ways. For instance, the inclusion of other state-space representations beyond the simple Gaussian AR(1) model can be a further generalization of the package. Still, model parameters like σ x and λ can be free to allow for multi-group analysis. Similarly, more comprehensive model diagnostics could also be considered in future releases of the package. Finally, we believe our package may be a useful tool supporting researchers and practitioners who want to make analysis of computerized-tracking experiments using a statistical modeling environment. This will surely help them to improve the interpretability of data analysis as well as the reliability of conclusions they can draw from their studies.

A Appendix
Given a candidate sample γ † , the mean x and variance λ of the density f (X|Y) are approximated via the following recursion: (n = 0)x (n) where ⊗ is the Kronecker product, the (element-wise) Hadamard product, the (elementwise) Hadamard division, whereas A = I I×I ⊗ n1 J×1 is a scaling matrix with n = 1/J. As a byproduct of the Kalman filter, the marginal likelihood f (γ † |Y) is multivariate Gaussian with meanŷ and variance diag(σ), with diag() being the linear operator that transforms a vector into a diagonal matrix. Finally, the arrayX I×N contains the filtered latent states implied by the model whereasΛ I×N is the array of variances associated with the filtered states. The smoothing part of the algorithm is implemented using the fixed-interval Kalman smoother [29] where the filtered arrayŝ X I×N andΛ I×N are used as input of the backward recursion.