Attention-Based Variational Autoencoder Models for Human–Human Interaction Recognition via Generation

The remarkable human ability to predict others' intent during physical interactions develops at a very early age and is crucial for development. Intent prediction, defined as the simultaneous recognition and generation of human–human interactions, has many applications, such as assistive robotics, human–robot interaction, video and robotic surveillance, and autonomous driving. However, models for solving this problem are scarce. This paper proposes two attention-based agent models that predict the intent of interacting 3D skeletons by sampling them via a sequence of glimpses. The novelty of these agent models is that they are inherently multimodal, consisting of perceptual and proprioceptive pathways. The action (attention) is driven by the agent's generation error, not by reinforcement. At each sampling instant, the agent completes the partially observed skeletal motion and infers the interaction class. It learns where and what to sample by minimizing the generation and classification errors. Extensive evaluation of our models on benchmark datasets, in comparison to a state-of-the-art model for intent prediction, reveals that the classification and generation accuracies of one of the proposed models are comparable to those of the state of the art even though our model contains fewer trainable parameters. The insights gained from our model designs can inform the development of efficient agents, which we regard as the future of artificial intelligence (AI).


Introduction
Humans possess a remarkable ability to predict the intentions of others during physical interactions, a skill that is crucial for seamless social interactions, collaborative tasks, and competitive scenarios [1][2][3][4]. The ability to perceive others as intentional agents is innate and crucial to development [5]. Humans begin to understand others' intentions during physical interactions within the first year of life. Infants start to attribute intentions to others' actions as they develop their motor skills and engage in social interactions. By around five months of age, infants begin to produce smooth object-directed reaches, which is a milestone in their ability to produce coordinated goal-directed actions [6]. This development in their actions could provide information to structure infants' perception of others' actions, suggesting that as infants become more capable of intentional actions, such as reaching or tool use, they may also start to understand the intentions behind others' actions [6].
In artificial intelligence (AI) and related areas, human intent prediction has been extensively studied in the context of different applications such as assistive robotics (e.g., [7]), human-robot interaction (e.g., [8]), video and robotic surveillance (e.g., [9]), and autonomous driving (e.g., [10]). Following [11], we define "intent prediction" as the problem of simultaneously inferring the action/interaction class and generating the involved persons' future body motions. Models that perform both generation and recognition of human-human interactions are scarce. This paper proposes two attention-based agent models that sample 3D skeleton(s) via a sequence of glimpses for predicting the intent of the skeleton(s). The models implement a perception-action loop to optimize an objective function. At each sampling instant, the models predict the interaction class and complete the partially observed skeletal motion pattern. The action (attention) is modeled as proprioception in a multimodal setting and is guided by perceptual prediction error, not by reinforcement. This kind of embodied agent model was first introduced in [12] and has since been used for handwriting generation from images and videos [13], handwriting recognition [14], human interaction generation [15], human interaction recognition [11], and speech emotion recognition [16]. As in [11], at each sampling instant, our models simultaneously predict the interaction class and the motion of both 3D skeletons. The models are used in both first-person (FP) and third-person (TP) environments. Unlike large AI models, the proposed models actively and selectively sample their environment, which allows them to be efficient in terms of model size (number of trainable parameters), data size (number of skeleton joints sampled at each glimpse on average), and training time. On comparing the proposed models (say, M2 and M3) with that in [11] (say, M1), our findings are as follows:

1. The efficiency, and the generation and classification accuracies on benchmark datasets, of the three models (M1, M2, M3) are analyzed in both FP and TP environments. M1 yields the highest classification accuracy, followed closely by M2. In each environment, the accuracies are correlated with the number of trainable parameters. No model is the clear winner for generation accuracy.

2. Three action selection methods (where to attend to) are analyzed for each of M1, M2, and M3. Classification accuracy is comparable when sampling locations are determined from the prediction error (without any weighting) or from learned weights (without involving the prediction error); however, the latter is less efficient in terms of model size.
The rest of this paper is organized as follows. The next section presents a review of the literature on related work. The proposed agent models are described in Section 3 and evaluated on benchmark datasets in Section 4. The paper concludes in Section 6. Objective function derivations are included in Appendix A.
Related Work

The models in [28,30] generate the 3D pose of one of the skeletons upon observing the motions of the other. Given a sequence of 3D skeletal interactions of two persons, the model in [29] generates their 3D skeletal interaction data for future time steps. Some of these models use attention. For example, temporal attention is used in [21,25], an attention mechanism that weighs different modalities is used in [22,23], and spatiotemporal attention is used in [24].
As noted in [11], the environment in these works is viewed from one of two perspectives: first person (FP), where one of the interacting persons is the observer while the other constitutes his environment (e.g., [11,28,30,42]), or third person (TP), where a person, such as an audience, is the observer and the two interacting persons constitute the observer's environment (e.g., [11,29]).
Very few end-to-end AI/ML models perform both generation and recognition. In a model, generation and recognition can be performed either separately, as in [41], or simultaneously, as in [11,42] and our current work. In [11], both interacting skeletons in both FP and TP are generated by utilizing a variational recurrent neural network (RNN)-based model. In [42], only the reacting skeleton in FP is generated using a generative adversarial network. To the best of our knowledge, the model reported in [11] is the only one that performs simultaneous generation and recognition of two-person interactions.
Some of these models are attention-based. They utilize different attention mechanisms, such as temporal (e.g., [21,25,31,37,42]), spatiotemporal (e.g., [11,24,34]), multimodal (e.g., [22,23]), or multilayer (e.g., [35]). In most models, attention is implemented by strategically introducing additional learnable parameters. For example, a transformer-based attention mechanism is used in [41], and a sequence-to-sequence long short-term memory (LSTM)-based attention layer is used in [42], both of which introduce additional attention parameters learned during training. As a consequence, the model size may increase exorbitantly, to the extent that the execution of its software code for learning and inference requires specialized hardware resources, as in the case of many transformer-based large language models. In [11], as well as in this paper, attention is computed directly from the generation error, which is why generation is necessary. Learnable attention parameters may or may not be used in the models in [11] and in this paper. We show that these models yield state-of-the-art recognition accuracy while being efficient, and that learnable attention parameters to weigh the generation error do not increase the accuracy on benchmark datasets.

Preliminaries
This section defines a few concepts that are well established in the field and form the basis of this paper, so that there is no ambiguity in the meaning of these terms.
Agent: An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators [43].
Perception is the mechanism that allows an agent to interpret sensory signals from the external environment [44].
Proprioception is perception where the environment is the agent's own body [12]. Proprioception allows an agent to internally perceive the location and movement of parts of its body [44].
Generative model: A generative model, p_model, maximizes the log-likelihood L(x; θ) of the data, where θ is a set of parameters and x is a set of data points [45].
Evidence lower bound (ELBO): If z is a latent continuous random variable generating the data x, computing the log-likelihood requires computing the marginal likelihood, ∫ p_model(x, z) dz, which is intractable [46]. Variational inference involves optimization of an approximation of the intractable posterior by defining an evidence lower bound (ELBO) on the log-likelihood, L(x; θ) ≤ log p_model(x; θ).
Variational autoencoder (VAE) is a deep generative model that assumes the data consist of independent and identically distributed samples and the prior, p_θ(z), is an isotropic Gaussian. VAE maximizes the ELBO given by [46]:

L(x; θ, ϕ) = E_{q_ϕ(z|x)}[log p_θ(x|z)] − D_KL(q_ϕ(z|x) ∥ p_θ(z)),

where q_ϕ(z|x) is a recognition model, p_θ(x|z) is a generative model, E denotes expectation, and D_KL denotes Kullback–Leibler divergence.
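For concreteness, the ELBO above can be sketched numerically. The following is a minimal sketch, assuming a diagonal-Gaussian posterior and an isotropic standard-normal prior, with the reconstruction term supplied as a precomputed Monte Carlo estimate (the function names are ours):

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)

def elbo(recon_log_likelihood, mu, log_sigma):
    # ELBO = E_q[log p_theta(x|z)] - D_KL(q_phi(z|x) || p_theta(z)).
    # recon_log_likelihood stands in for a Monte Carlo estimate of the
    # expected reconstruction term.
    return recon_log_likelihood - gaussian_kl_to_standard_normal(mu, log_sigma)
```

When the posterior equals the prior (mu = 0, log_sigma = 0), the KL term vanishes and the ELBO reduces to the reconstruction term alone.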
Saliency is a property of each location in a predictive agent's environment. In a predictive agent, the attention mechanism is a function of the agent's prediction error [47,48].

Problem Statement
Let X = {X^(1), X^(2), ..., X^(n)} be a set of observable variables representing an environment in n modalities (or signal types or sources). The variable representing the i-th modality is a sequence, X^(i) = ⟨x^(i)_1, x^(i)_2, ..., x^(i)_T⟩, where T is the sequence length. Let x_≤t = {x^(1), x^(2), ..., x^(n)} be a partial observation of X such that x^(i) = ⟨x^(i)_1, ..., x^(i)_t⟩. Let y be a variable representing the class labels. Following [11], we define the problem of pattern completion and classification as generating X and y as accurately as possible from the partial observation x_≤t. Given x_≤t and a generative model p_θ with parameters θ, at any time t, the objective is to maximize the joint likelihood of X and y, i.e., arg max_θ p_θ(X, y | x_≤t).

Models
We present two models (M2, M3) for solving this problem and closely compare them with the model (M1) in [11].See Figure 1 for block diagrams of the agent within which these models reside.
Model M1 [11]. The completed pattern and class label are generated from the latent variable z_≤t. Mathematically,

arg max_θ ∫ log(p_θ(X, y | x_<t, z_≤t) p_θ(z_≤t)) dz.

The model is trained end-to-end. See Figure 2a. The pseudocodes, borrowed from [11], are shown in Algorithms 1 and 2.
[Figure 1b caption: Third person (TP) perspective involving only one modality, visual perception; hence, the superscript indicating the modality is not shown.]

Model M2. The class label is inferred directly from the partial observations and then passed as an input to the generative model, which generates the completed pattern. This is similar to the model in [49]. Mathematically,

arg max_θ ∫ log(p_θ(X | x_<t, z_≤t) p_θ(z_≤t)) dz + arg max_ϕ log q_ϕ(y | x_<t),

where q_ϕ is a recognition model. The model is trained end-to-end. See Figure 2b. The pseudocodes are shown in Algorithms 1 and 3.

Model M3. The completed pattern is generated from the latent variable z_≤t. The class label is inferred from the completed pattern. The pattern completion model is pretrained first, and then the classification model is trained; therefore, the model is not end-to-end. See Figure 2c. The pseudocodes are shown in Algorithms 1 and 2.

Agent Architecture
The proposed predictive agent architecture comprises five components: environment, observation, pattern completion and classification, action selection, and learning, each of which is explicated in this section. See the block diagrams in Figure 1, which show the input/output relations between these components. The agent architecture is the same for the three models (M1, M2, M3) and is borrowed from [11].

1. Environment. The environment is the source of sensory data. It is time-varying.

2. Observation. The agent interacts with the environment via a sequence of eye and body movements. The observations, sampled from the environment at each time instant, are in two modalities: perceptual and proprioceptive.

3. Pattern completion. A multimodal variational recurrent neural network (MVRNN) for variable-length sequences is used for completing the pattern for each modality. Recognition and generation are the two processes involved in the operation of an MVRNN.
Recognition (encoder). The recognition models, q_ϕ(z_t | x_≤t, z_<t) for models M1 and M3 and q_ϕ(z_t | x_≤t, z_<t, y_t) for M2, are probabilistic encoders [46]. They produce a Gaussian distribution over the possible values of the code z_t from which the given observations could have been generated.
Model M1 [11]. The MVRNN consists of two recurrent neural networks (RNNs), each with one layer of long short-term memory (LSTM) units. Each RNN generates the parameters for the approximate posterior distribution and the conditional prior distribution for each modality, as in [50].
Model M2. In addition to the perceptual and proprioceptive modalities, the class label is presented as an input modality. A fully connected layer from the class labels generates the parameters for the approximate posterior density for the class modality. The recognition model generates the class label.
The distribution parameters from all modalities are combined using a product of experts (PoE), as in [51], to generate the joint distribution parameters for both the conditional prior, p_θ(z_t | x_<t, z_<t) for M1 and M3 or p_θ(z_t | x_<t, z_<t, y_t) for M2, and the approximate posterior, q_ϕ(z_t | x_≤t, z_<t).
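The PoE combination of Gaussian experts has a closed form: precisions add, and the joint mean is the precision-weighted average of the expert means. A minimal sketch under these standard assumptions (the function name is ours, not from the paper):

```python
import numpy as np

def product_of_gaussian_experts(mus, log_sigmas):
    """Combine per-modality Gaussian parameters into joint parameters.

    mus, log_sigmas: lists of same-shape arrays, one pair per modality.
    Precisions add, and the joint mean is the precision-weighted average,
    as in a standard product of Gaussian experts.
    """
    precisions = [np.exp(-2.0 * ls) for ls in log_sigmas]   # 1 / sigma^2
    joint_precision = np.sum(precisions, axis=0)
    joint_var = 1.0 / joint_precision
    joint_mu = joint_var * np.sum(
        [p * m for p, m in zip(precisions, mus)], axis=0)
    return joint_mu, 0.5 * np.log(joint_var)                # mean, log sigma
```

Combining two unit-variance experts halves the variance, so the joint distribution is sharper than either expert alone.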
The recognition model, similar to that in [50], is mathematically expressed in Lines 3-9 of Algorithm 2 and Lines 6-14 of Algorithm 3. Here, ϕ_prior generates the mean as a linear function of its input, ϕ_enc generates the logarithm of the standard deviation as a nonlinear function of its input, ϕ_prior accepts the hidden state as input, and ϕ_enc accepts the hidden state and the current observation as input.

Generation (decoder). Model M1 [11]. The generative model, p_θ(x_t, y_t | x_<t, z_≤t), generates the perceptual and proprioceptive data and the class label from the latent variables, z_t, at each time step. Models M2 and M3. The generative model, p_θ(x_t | x_<t, z_≤t), generates the perceptual and proprioceptive data from the latent variables, z_t, at each time step.
Each RNN in the MVRNN generates the distribution parameters of the sensory data for a modality. The sensory data are sampled from this distribution. We assume the perceptual and proprioceptive distributions to be multivariate Gaussian, as the skeletal joints are real-valued. We assume the class label distribution to be multivariate Bernoulli.
The pattern, X, is completed at each time using an iterative method. At any time t, the model predicts x̂_{t+1} given the observations x_{k:t} (1 ≤ k < t), then predicts x̂_{t+2} given {x_{k+1:t}, x̂_{t+1}}, then predicts x̂_{t+3} given {x_{k+2:t}, x̂_{t+1:t+2}}, and so on until x̂_T is predicted. This method allows a fixed and finite model to predict a variable- or infinite-length sequence. Since only the next instant is predicted at any iteration, the model can be size-efficient.
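The iterative completion loop can be sketched as follows, with `predict_next` standing in for the trained one-step generative model (a placeholder, not the paper's actual interface):

```python
import numpy as np

def complete_pattern(observed, predict_next, total_length):
    """Autoregressively extend an observed prefix to total_length frames.

    observed: (t, d) array of frames seen so far.
    predict_next: callable mapping a (k, d) history to the next (d,) frame;
    it stands in for the trained generative model (a placeholder assumption).
    """
    frames = list(observed)
    while len(frames) < total_length:
        # Each predicted frame is appended and fed back as context,
        # so only one-step prediction is ever required of the model.
        frames.append(predict_next(np.stack(frames)))
    return np.stack(frames)
```

Because the loop only ever asks for the next frame, the same fixed-size model handles sequences of any length, which is the size-efficiency argument made above.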
The generative model, similar to that in [50], is mathematically expressed in Lines 12-16 of Algorithm 2 and Lines 17-21 of Algorithm 3. Here, RNN_θ represents an LSTM unit, and ϕ_dec is the same function as ϕ_enc.
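A single recognition/generation step of a VRNN-style model can be sketched as below. The linear maps are random placeholders standing in for the learned networks ϕ_prior, ϕ_enc, and ϕ_dec (our assumption), and the LSTM recurrence that updates the hidden state is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
D_X, D_Z, D_H = 45, 20, 256   # obs (15 joints x 3 coords), latent, hidden

# Random matrices stand in for the learned parameters (an assumption).
W_prior = rng.normal(0.0, 0.01, (2 * D_Z, D_H))
W_enc = rng.normal(0.0, 0.01, (2 * D_Z, D_H + D_X))
W_dec = rng.normal(0.0, 0.01, (2 * D_X, D_H + D_Z))

def vrnn_step(x_t, h_prev):
    # Conditional prior p(z_t | x_<t, z_<t), computed from the hidden state.
    mu_p, log_sig_p = np.split(W_prior @ h_prev, 2)
    # Approximate posterior q(z_t | x_<=t, z_<t) also sees the observation.
    mu_q, log_sig_q = np.split(
        np.tanh(W_enc @ np.concatenate([h_prev, x_t])), 2)
    # Reparameterized sample from the posterior.
    z_t = mu_q + np.exp(log_sig_q) * rng.standard_normal(D_Z)
    # Decoder emits Gaussian parameters for the observation.
    mu_x, log_sig_x = np.split(W_dec @ np.concatenate([h_prev, z_t]), 2)
    return z_t, (mu_x, log_sig_x), (mu_p, log_sig_p), (mu_q, log_sig_q)
```

In training, the KL divergence between the posterior and the conditional prior, plus the decoder's log-likelihood, would form the per-step ELBO terms.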

4. Action selection. In the proposed models, action selection decides the weight (attention) given to each location in the environment in order to sample the current observation. At any time t, a saliency map S^(i)_t is computed for modality i, from which the action is determined. The saliency map assigns a salience score S^(i)_{t,l} to each location l. There are 15 locations corresponding to the 15 skeleton joints: head (J1), neck (J2), torso (J3), left shoulder (J4), left elbow (J5), left hand (J6), right shoulder (J7), right elbow (J8), right hand (J9), left hip (J10), left knee (J11), left foot (J12), right hip (J13), right knee (J14), right foot (J15). As in [11], we compute the weights in three ways, as follows.
Weights are determined by thresholding the prediction error (pe). The threshold is statistically estimated on the fly and is not predetermined. The salience of a location is its L1 prediction error,

S^(i)_{t,l} = ∥X^(i)_{t+1,l} − X̂^(i)_{t+1,l}∥₁,

where X^(i)_{t+1}, X̂^(i)_{t+1} are the true and predicted data (skeleton joint coordinates), respectively, ∥.∥₁ denotes the L1 norm, |.| denotes the cardinality of a set, n_r = 5 is the number of regions in the skeleton (J1-J3, J4-J6, J7-J9, J10-J12, J13-J15) (see Figure 3), and S^(i)_{t,r} is the mean saliency over the joints in region r. At any time, at least one region will be salient. Our experiments show that a variable number of salient regions at each time step is more effective. Fixing the number of salient regions to a constant value occasionally leads to the selection of regions with low saliency or to overlooking regions with high saliency. In the proposed models, only the salient joints are sampled. For the nonsalient joints, the observation at time t + 1 is the predicted observation from time t.
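The per-region saliency computation can be sketched as follows; the threshold statistic used here (the mean region saliency) is our assumption, standing in for the paper's on-the-fly estimate:

```python
import numpy as np

# Five skeleton regions over the 15 joints (J1-J3, J4-J6, ...), 0-indexed.
REGIONS = [range(0, 3), range(3, 6), range(6, 9), range(9, 12), range(12, 15)]

def salient_regions(x_true, x_pred):
    """Region saliency from the L1 prediction error over 15 joints.

    x_true, x_pred: (15, 3) arrays of joint coordinates. The threshold is
    taken here as the mean region saliency (an assumption); the exact
    statistic in the paper may differ.
    """
    joint_error = np.abs(x_true - x_pred).sum(axis=1)        # L1 per joint
    region_saliency = np.array(
        [joint_error[list(r)].mean() for r in REGIONS])
    threshold = region_saliency.mean()
    mask = region_saliency >= threshold  # at least one region always passes
    return region_saliency, mask
```

Since the threshold is the mean of the region scores, at least one region always meets it, matching the guarantee stated above.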
Weights are learned as coefficients of the prediction error (lwpe):

S^(i)_t = W_a ∥X^(i)_{t+1} − X̂^(i)_{t+1}∥₁,

where W_a is the weight matrix.

Weights are learned directly (lw): the saliency map S^(i)_t is produced by a learned weight matrix W_a without involving the prediction error.
5. Learning. The objective function (Equations (9)-(11)) combines the generative and discriminative losses, where α controls the relative weight between generative and purely discriminative learning, and q_π(y|X_{1:T}) is the classification model.

Datasets
As in [11], our models are evaluated on two datasets: (1) The SBU Kinect Interaction Dataset [52] is a two-person interaction dataset comprising eight interactions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. The data are recorded from seven participants, forming a total of 21 sets such that each set consists of a unique pair of participants performing all actions. The dataset has approximately 300 interactions of duration 9 to 46 frames. The dataset is divided into five distinct train-test splits as in [52]. (2) The K3HI: Kinect-Based 3D Human Interaction Dataset [53] is a two-person interaction dataset comprising eight interactions: approaching, departing, kicking, punching, pointing, pushing, exchanging an object, and shaking hands. The data are recorded from 15 volunteers. Each pair of participants performs all the actions. The dataset has approximately 320 interactions of duration 20 to 104 frames. The dataset is divided into four distinct train-test splits as in [53].

Experimental Setup
We use a single hidden layer, as in [50], for each modality in the MVRNN. Each modality in the MVRNN has a recurrent hidden layer of 256 units and a latent layer of 20 variables. These parameters are estimated empirically. T is variable, as the interaction videos are of different lengths. Stochastic gradient descent, with a minibatch size of 100, is used to train the model. Adam optimization with a learning rate of 0.001 and default hyperparameters (β1 = 0.9, β2 = 0.999) is used. The objective function parameters β, λ1, and λ2 are fixed to 1, while λ3 and α are fixed to 50. The models are trained until the error converges. To avoid overfitting, we use a dropout probability of 0.8 at the hidden layer for generation for M1 [11], M2, and M3, and of 0.1 at the hidden layer for classification for M1 and M2. All hyperparameters except the defaults are estimated from the training set by cross validation.

Evaluation
In the two benchmark datasets, each skeleton consists of 15 joints. The skeletal data in SBU are normalized, and we do not apply any further preprocessing to them. We standardize the skeletal data in K3HI. Training models on low-level handcrafted features defeats the purpose of learning; hence our inclination towards operating on raw skeletal data.
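The K3HI standardization step can be sketched as a per-coordinate z-score computed from training-set statistics only (the exact preprocessing in the paper may differ; this is a common-practice assumption):

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    # Per-coordinate z-score using training-set statistics only, so that
    # no information from the test split leaks into preprocessing.
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / (sigma + eps), (test - mu) / (sigma + eps)
```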
Our experiments are carried out in two settings:

1. First person: Here we model the agent as the first person (one of the two skeletons). Its body constitutes its internal environment, while the other skeleton constitutes its external (visual) environment. Two modalities are used in our model (see Figure 1a): (i) visual perception, which captures the other skeleton's 3D joint coordinates, and (ii) body proprioception, which captures the first skeleton's 3D joint coordinates. Here, i = 1, 2 in the objective function (ref. Equations (9)-(11)).

2. Third person: Here we model the agent as a third person (e.g., an audience). The two interacting skeletons constitute the agent's external (visual) environment. One modality is used in our model (see Figure 1b): visual perception, which captures both skeletons' 3D joint coordinates. Here, i = 1 in the objective function (ref. Equations (9)-(11)).

Model variations:
For each of the above two settings, we experiment with the three action selection methods (ref. "action selection" in Section 3.4): pe, lwpe, and lw.
Ablation study (baseline, bs, without attention): Due to the lack of end-to-end models that simultaneously generate and classify two-person interactions from 3D skeletal data, our models' performances are evaluated using an ablation study, referred to as the baseline (bs). The goal is to understand the utility of attention in our models. For that, we create a baseline model (bs) where attention (i.e., action selection; ref. Algorithm 1) is eliminated from the models. The MVRNN is modified such that the observation is sampled from all the joints (i.e., the weight distribution is uniform over all joints) of both skeletons at any time. Thus, at any time (video frame), the models observe the entire skeletons.
For a fair comparison, the number of layers and number of neurons in each layer are the same over all model variants, including the baseline.
Evaluation metrics: We evaluate the generation accuracy using the average frame distance (AFD), AFD = (1/T) Σ_{t=1}^{T} ∥X^(i)_t − X̂^(i)_t∥, where X^(i)_t and X̂^(i)_t are the true and predicted skeletal joint coordinates, respectively, at time t for modality i, and T is the sequence length. We evaluate the classification performance using accuracy, recall, precision, and F1 score.
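A sketch of the AFD metric; the per-frame distance is taken here as the mean Euclidean distance over joints, which is our assumption about the exact norm used in the paper:

```python
import numpy as np

def average_frame_distance(x_true, x_pred):
    """Average frame distance between true and predicted sequences.

    x_true, x_pred: (T, 15, 3) arrays of joint coordinates. The per-frame
    distance is the mean Euclidean distance over joints (an assumption
    about the exact norm).
    """
    per_joint = np.linalg.norm(x_true - x_pred, axis=-1)   # (T, 15)
    return float(per_joint.mean())
```

Lower AFD means the generated skeletons track the ground truth more closely; identical sequences score zero.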

Qualitative Evaluation
From qualitative visualization, all three models (M1 [11], M2, M3) can generate realistic predictions over space and time in all cases. As expected, short-term predictions are more accurate than long-term predictions. Even in the long term, there is continuity, and the two predicted skeletons are well synchronized. The proposed models' predicted action/reaction at each time step complies with the actual interactions. See Figures 4-7 for samples of interactions generated using M2 with the pe action selection method.

Evaluation for Generation Accuracy
The AFD from the first-person environment is lower than or comparable to that from the third-person environment in most cases (see Tables 1-4). Modeling the two skeletons as distinct modalities helps in learning a better latent representation, resulting in more accurate generation. First-person models have more parameters than third-person models (see Table 5), which also explains the lower AFD of the first-person models.

First person: AFD is the lowest for lwpe and bs for the SBU Kinect dataset and for bs for the K3HI dataset. AFD is the highest for pe for both datasets.
Third person: AFD is the lowest for bs for the SBU Kinect dataset and for lw and bs for the K3HI dataset. AFD is the highest for pe for both datasets.
Within the same action selection category, we do not observe much variation in AFD across the three models for both datasets (see Tables 1-4). The generation (decoder) of the three models is similar, so their AFDs are comparable for any interaction class and action selection method. The generation process depends more on the action selection method; hence, we observe higher variation in AFD across action selection methods (see Tables 1 and 3). In the proposed models (M2, M3), generation is not the primary goal but is necessary to compute attention from the generation error. That is why such attention-based models (e.g., [11,14,16]) are said to perform recognition via generation. M2 is unique in that recognition influences generation and vice versa, while in M1 and M3, generation influences recognition but not vice versa. The models learn the spatiotemporal relations between joint locations in each skeleton using the VRNN in each modality and between the two skeletons using the PoE. M1 and M2 are learned end-to-end, while M3 is not.

In most cases, the classification accuracy of the three models (M1 [11], M2, M3) in first person is higher than or comparable to that in third person. Also, the number of trainable parameters for first-person models is greater than that of third-person models (see Table 5).
In all experiments, the top-performing attention model yields an accuracy either comparable to or higher than the baseline (bs). The goal of attention in our models is to foster efficiency, as discussed in the next section. Also, the accuracy of bs is higher than the state of the art on both datasets on raw skeletons (see Table 6).
First person: Among the three models, M1 yields the highest classification accuracy for almost all action selection methods for both datasets, followed closely by M2 (see Tables 7 and 8). Among the three action selection methods, for the SBU Kinect dataset, bs, lwpe, and lw yield the highest classification accuracy for M1, M2, and M3, respectively (see Table 7). For the K3HI dataset, bs yields the highest classification accuracy for M1 and M3, while pe yields the highest for M2 (see Table 8).
Third person: Among the three models, M1 yields the highest classification accuracy for all action selection methods for both datasets, followed closely by M2 (see Tables 9 and 10). Among the action selection methods, for the SBU Kinect dataset, bs yields the highest classification accuracy for M1 and M3, while pe yields the highest for M2. For the K3HI dataset, pe yields the highest classification accuracy for M2 and bs yields the highest for M1 and M3, while lwpe yields the lowest classification accuracy for all models.
M1 takes into account the partial observations and the latent variables for predicting the class, while M2 takes into account only the partial observations. Our results show that including the latent variables to predict the class can significantly improve the classification performance. Additionally, in M1, the classification modality shares parameters with the generation modality, whereas in M2, the classification modality does not share parameters with the generation modality, though in both cases the generation modality shares parameters with the classification modality. Thus, it is possible that the generation modality improves the classification results in M1 as compared to M2. M3 uses the generated data to predict the class. As the generated skeletal data contain less discriminative features than the true skeletal data, M3's classification accuracy is low. We did not observe a consistent pattern in the performance accuracy due to different action selection methods for the same model. Thus, no action selection method is superior to the others. Results from pe are comparable to or better than the baseline in all cases for M1 and M2 (see Tables 7-10). Results from lwpe and lw are comparable to the baseline, bs, for M1 and M2 for the K3HI dataset (see Tables 8 and 10).

Table 6. Comparison of classification accuracy. In the table, "our models" refers to the three models (M1, M2, M3) discussed in this paper, even though M1 was proposed in [11]. The other models cited in this table [53-60] perform classification only (no generation). They take both skeletons as input, similar to our models. These works do not distinguish between first- and third-person environments.

Table 6 compares our most accurate models (for different settings and action selection methods) with relevant models reported in the literature. Results show that M2 with lwpe for the SBU dataset, and all models and action selection methods for the K3HI dataset, achieve higher classification accuracy than models that operate on raw skeletal data.

As stated in Section 1, models that perform both generation and recognition of human-human interactions are scarce. As noted in [11], only two models, [41,42], perform both generation and recognition. However, both of them solve the problem of reaction generation, while our models (M1 [11], M2, M3) solve the more challenging problem of interaction generation. Hence, results from [41,42] are not included in any of the comparison tables in this paper. In [41], the classification accuracy is 80% and 46.4% for SBU and K3HI, respectively, which is much lower than that of our models (ref. Table 6). In [42], the classification accuracy is 79.2% for aggressive emotions (kick, push, punch) and 39.97% for neutral emotions (hug, shake hands, exchange objects) for SBU, which is much lower than that of our models (ref. Table 6).

Analysis of Action Selection
We can visualize the distribution of attention weights assigned to the joints or regions as a heatmap (see Figures A1-A12 in Appendix B). For each interaction class, this distribution over the joints/regions is computed from the sum of all weights W_t (ref. Equations (6)-(8)) assigned to each joint/region.
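The per-joint attention distribution underlying such a heatmap can be sketched as a time-summed, normalized weight vector (the function name is ours):

```python
import numpy as np

def attention_distribution(weights):
    # weights: (T, 15) attention weights per frame and joint.
    # Summing over time and normalizing yields the per-joint distribution
    # that the heatmaps visualize.
    totals = weights.sum(axis=0)
    return totals / totals.sum()
```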
The joints whose movements have high variation over time are more difficult to predict and hence are more salient. Thus, the salient regions for punch, exchange objects, push, handshake, and hug are primarily the hands (e.g., punch in Figures A1c and A4b; exchange object in Figure A7a,f; push in Figure A7d; shake hands in Figure A10b,f; hug in Figures A4a,e and A1b,d), while for kicking, they are the legs (ref. Figures A7e and A10f). This is not observed in some cases, such as kicking in Figure A1d, because the same skeleton might be the interaction initiator in some videos and the reactor in others within the same dataset, thereby exhibiting different behaviors for the same interaction class.
We do not observe much variation in the distributions between M1, M2, and M3 for the same action selection method. For any interaction, the weight distributions from lwpe and lw are similar. The attention weights are not very different across the different interactions.

Evaluation for Efficiency
The efficiency of a model is evaluated by the percentage of the scene observed for prediction. Our experiments show that during the first few sampling instants, both generation and classification accuracy improve exponentially. For any percentage p, p% of the actual data is given as input, and the prediction is considered as input for the rest of the time steps. The saturation of the accuracy after the first few instants indicates that our models do not need to sample a larger percentage of the data as ground truth for generation. We compute the average (over all videos for each interaction) number of salient joints sampled by our models at each glimpse (see Tables 11 and 12). We do not observe much variation in the average percentage across models for both datasets and for first- and third-person environments. On average, for any interaction in the two datasets, our models sample less than 49% and 48% of the joints in FP and TP, respectively. For both datasets, the highest sparsity is for kicking. The lowest sparsity is for punching (FP) and punching/pushing (TP) for the SBU Kinect dataset and for approaching/exchange object (FP) and approaching/departing (TP) for the K3HI dataset.

The three models (M1 [11], M2, M3) require true class labels to train for classification. A subset of parameters in each model is shared between the classification and generation pathways, albeit in unique ways (see Figure 2). In M1, the generation (completed pattern) and class label are independent outputs. In M2, the class label is an input to the generative pathway; hence, classification accuracy directly influences generation accuracy. In M3, the completed pattern is the input to the classification model; hence, generation accuracy directly influences classification accuracy.
When class labels are missing, the generative parameters, including the shared parameters, are trained to minimize the generative loss only. All three models continue to infer regardless of whether labels are present, noisy, or missing, which makes them practical for real-world applications. A drawback of M2 is that generation depends on the predicted class label; hence, generation will be poor if the classification pathway is not well trained. An advantage of M1 and M3 is that, because the generation and classification pathways share parameters, even when class labels are missing the shared parameters are still updated by minimizing the generative error alone, which can improve classification accuracy.
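The label-masking behavior described above can be sketched as a joint objective in which the classification term is simply dropped for unlabeled examples (here marked with label -1), so the shared parameters still receive gradients from the generative term alone. The squared-error/cross-entropy form and the weight `beta` are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def joint_loss(recon, target, logits, labels, beta=1.0):
    """Joint generation + classification loss with optional labels."""
    # generative term: mean squared reconstruction error (always active)
    gen = np.mean((recon - target) ** 2)
    # classification term: cross-entropy over labeled examples only
    mask = labels >= 0
    if not mask.any():
        return gen                      # unlabeled batch: generative loss only
    z = logits[mask] - logits[mask].max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(mask.sum()), labels[mask]].mean()
    return gen + beta * ce
```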

Number of Trainable Parameters
The number of trainable parameters for the three models is shown in Table 5. Third-person models have fewer trainable parameters. M1 has the most and M3 the fewest trainable parameters. lwpe and lw have more trainable parameters than pe or bs.
Training Time
Training time is the time required to train a model on the training set until the error converges. The training time for our models is shown in Table 13, where we report the average (over n-fold cross validation) convergence time in hours and the average number of iterations. To identify offline the iteration at which convergence occurs, we smooth the classification accuracy and generation error curves by computing a moving average with a 50-iteration window. For classification, we assume convergence is reached at the iteration when the average accuracy exceeds 90% of the highest accuracy for M1, M2, and M3. When pretraining M3's generative model, convergence is reached at the iteration when the average error falls below 10% of the highest error. M3 is trained separately for generation and classification, while M1 and M2 are trained jointly for both; thus, the model trained for a single task converges faster than the models trained jointly for multiple tasks.
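The offline convergence criterion (50-iteration moving average, threshold at 90% of the peak smoothed accuracy) can be sketched as:

```python
import numpy as np

def convergence_iteration(accuracy, window=50, frac=0.9):
    """Return the iteration at which the moving-average accuracy first
    reaches `frac` of its peak (window and frac follow the criterion
    described in the text)."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(accuracy, kernel, mode="valid")
    threshold = frac * smoothed.max()
    idx = np.argmax(smoothed >= threshold)  # first index meeting the threshold
    return idx + window - 1                 # map back to raw iteration index
```

For the pretrained generative model, the same routine applies to the error curve with the comparison reversed (average error falling below 10% of the highest error).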

End-to-End Training
End-to-end training allows an entire model to be optimized for the given task(s) and dataset. The challenge, however, is searching for the optimal set of parameter values in a very large space. This is often circumvented by pretraining selected components (layers, blocks, functions) in isolation for a number of iterations to provide a good initialization, and then training the entire model end-to-end. In this paper, models M1 and M2 are trained end-to-end without any pretraining, while M3 is not trained end-to-end.
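A minimal sketch of the two training regimes, assuming a hypothetical `step_fn` hook that performs one parameter update: an M3-style schedule pretrains the generative pathway first, while M1/M2-style joint training sets `pretrain_iters=0`:

```python
class TwoStageTrainer:
    """Illustrative two-stage schedule (not the paper's actual code)."""

    def __init__(self, step_fn):
        self.step_fn = step_fn  # step_fn(joint=bool) performs one update

    def run(self, pretrain_iters, joint_iters):
        for _ in range(pretrain_iters):
            self.step_fn(joint=False)   # stage 1: generative loss only
        for _ in range(joint_iters):
            self.step_fn(joint=True)    # stage 2: full objective
```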

Discussion
This section discusses the limitations of the proposed approach for human-human interaction recognition via generation, as well as future work.

Limitations of the Proposed Approach
The limitations stated below apply to the proposed approach and to almost all related works.

Limited Interaction Context
The physical interaction between two humans can be influenced by a wide range of variables such as age, gender, culture, personality, style, mood, relationship, context (e.g., formal vs. informal setting), difference in socioeconomic status, health, disability, past experiences (especially traumatic ones), social norms, and the state of the physical environment (e.g., crowded vs. open). Accounting for these variables is essential for understanding human-human interactions and for developing interactive systems that perform effectively across diverse scenarios. These variables have not been explicitly considered in the proposed approach or in related works. However, approaches that learn by imitation, such as ours, do implicitly account for some of these variables if they are captured in the training data.

Limited Interaction Modalities
Humans interact through the simultaneous use of multiple modalities such as text, speech, non-speech vocalizations (e.g., sigh, laughter, murmur), facial expressions, gaze and eye contact, body movements for gestures and touch, proxemics, and olfactory cues, which convey emotions and intentions. The proposed approach and related works have largely concentrated on body movements alone to infer intent.

Need for Labeled Training Data
The proposed approach and related works on interaction recognition are trained using data annotated with class labels. Given that labeled data are scarce and unlabeled data are abundant, it is imperative to develop models that can learn from unlabeled data.

Future Work
Our future work will address the limitations of the proposed approach stated above and make the approach more accurate and versatile.

Incorporate More Interaction Context
Incorporating interaction context into an AI model requires data about the context. Such data are scarce, primarily because of restrictions on the use of soft and hard sensors for data collection due to the risk of confidentiality breaches and privacy invasion. An alternative is to generate data using a combination of physics-based and generative AI models (see [61], for example).

Incorporate Multiple Interaction Modalities
Incorporating multiple interaction modalities would lead to more robust inference of the interacting human's intentions and emotions, which would help generate more effective reactions. The proposed model is inherently multimodal. It combines multiple modalities using PoE, a scalable approach whose number of parameters increases linearly with the number of modalities m. Not all multimodal models are linearly scalable. For example, the Multimodal Transformer (MulT) [62] learns a mapping from each modality to every other, thereby learning O(m²) mappings; as a result, its number of parameters increases quadratically with the number of modalities. The proposed model can be extended to incorporate multiple modalities in a relatively simple manner and has already been tested on different kinds of signals, such as body/skeletal motion [11,15] (and this current article), images and videos [12][13][14], and speech [16].
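To illustrate why PoE fusion scales linearly, the following sketch combines per-modality Gaussian posteriors by precision weighting (the standard product of Gaussians): each modality contributes one expert, so adding a modality adds one encoder rather than m new pairwise mappings. This is a generic PoE sketch, not the paper's implementation:

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Product-of-Experts fusion of m Gaussian experts over a d-dim latent.
    mus, logvars: array-likes of shape (m, d)."""
    precisions = np.exp(-np.asarray(logvars))   # 1 / sigma^2 per expert
    prec = precisions.sum(axis=0)               # product of Gaussians: precisions add
    mu = (precisions * np.asarray(mus)).sum(axis=0) / prec  # precision-weighted mean
    return mu, -np.log(prec)                    # fused mean and log-variance
```

For two experts with equal variance, the fused mean is their average and the fused variance is halved, reflecting the sharpening effect of combining agreeing modalities.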

Alleviate the Need for Labeled Training Data
There are multiple ways to train a classifier with data that are not labeled with class labels. These include unsupervised learning methods (e.g., clustering, anomaly detection, non-negative matrix factorization, autoencoders), semi-supervised learning methods (which utilize a small amount of labeled data along with a large amount of unlabeled data), and self-supervised learning methods (which learn representations from unlabeled data by solving a pretext task, such as predicting the next word in a sequence or reconstructing the input, followed by fine-tuning on a small amount of labeled data for the target classification task). The proposed model can readily be trained using semi-supervised or self-supervised methods.

Experiment with Other Models
In our earlier works [63,64], a general-purpose predictive agent was proposed that interacts with its environment by relentlessly executing four functions cyclically: Surprise (compute prediction error), Explain (infer the causes of surprise), Learn (update the internal model using the surprise and inferred causes), and Predict the next observation (see Figure 12). In order to Explain, the agent can act, which includes interaction and communication with its own body (sensed via proprioception) and with its environment and other agents (sensed via perception). The proposed agent architecture (ref. Figure 1) is an implementation of the SELP cycle, which is modular and allows experimentation with different generative models in place of the VAE or VRNN, and different fusion methods in place of PoE. It is interesting to note that our earlier works [65,66] proposed an agent model that decides when and with whom to communicate/interact, while the agent model proposed in this current work (and [11]) proposes how to interact, all following the SELP cycle.
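The SELP cycle can be sketched as a loop over its four functions; the hook arguments here are hypothetical stand-ins for the agent's actual components:

```python
def selp_cycle(observe, predict, explain, learn, steps):
    """Minimal SELP loop: Surprise (prediction error), Explain (infer
    causes, possibly by acting), Learn (update model), Predict (next
    observation). Returns the surprise history."""
    x_pred = predict(None)           # initial prediction before any cause
    history = []
    for _ in range(steps):
        x = observe()
        surprise = x - x_pred        # S: prediction error
        cause = explain(surprise)    # E: infer the cause of the surprise
        learn(surprise, cause)       # L: update the internal model
        x_pred = predict(cause)      # P: predict the next observation
        history.append(surprise)
    return history
```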

Conclusions
Two agent models are proposed that sequentially sample and interact with their environment, which is constituted by 3D skeletons. At each instant, they sample a subset of skeleton joints to jointly minimize their classification and sensory prediction (or generation) errors in a greedy manner. The agents operate as closed-loop systems involving perceptual ("what") and proprioceptive ("where") pathways. One of the proposed agent models is learned end-to-end, while the other is not. Extensive experiments on interaction classification and generation on benchmark datasets, in comparison with a state-of-the-art model, reveal that one of the proposed models is more size-efficient while yielding classification and generation accuracy comparable to the state of the art. Interesting insights drawn from the design of these models will be useful for designing efficient generative AI (GenAI) systems. The future of AI is agents; our agent models, consisting of perceptual and proprioceptive pathways in a multimodal setting, contribute a unique idea towards the development of AI agents.
Appendix A.1. Model M1
We assume that the modalities are conditionally independent given the common latent variables [51] and all observations up to the current time. Therefore, where $\lambda_1, \lambda_2, \lambda_3, \beta$ are the weights balancing the terms, and assuming the class label does not change over time, we simplify the above expression so that the KL term becomes $D_{\mathrm{KL}}\left(q_\phi(z_t \mid x_{\le t}, z_{<t}) \,\|\, p_\theta(z_t \mid x_{<t}, z_{<t})\right)$. The pseudocodes are shown in Algorithms 1 and 2.

Appendix A.2. Model M2
Here we derive the objective function in Equation (10). The generative and recognition models are factorized as
$$p_\theta(X_{\le T}, y_{\le T}, z_{\le T} \mid x_{\le T}) = \prod_{t=1}^{T} p_\theta(X_t, y_t \mid z_{\le t}, x_{<t})\, p_\theta(z_t \mid x_{<t}, z_{<t})$$
$$q_\phi(z_{\le T} \mid x_{\le T}, y_{\le T}) = \prod_{t=1}^{T} q_\phi(z_t \mid x_{\le t}, z_{<t}, y_t)$$
The variational lower bound (ELBO) on the joint log-likelihood of the generated data, $\log p_\theta(X_{\le T}, y_{\le T} \mid x_{\le T})$, when the true label is given, is derived as
$$\log p_\theta(X_{\le T}, y_{\le T} \mid x_{\le T}) \ge \mathbb{E}_{q_\phi(z_{\le T} \mid x_{\le T}, y_{\le T})}\left[\log \frac{p_\theta(X_{\le T}, y_{\le T}, z_{\le T} \mid x_{\le T})}{q_\phi(z_{\le T} \mid x_{\le T}, y_{\le T})}\right]$$
The pseudocode is shown in Algorithm 3.

Appendix A.3. Model M3
Here we derive the objective function in Equation (11). The generative and recognition models are factorized as
$$p_\theta(X_{\le T}, z_{\le T} \mid x_{\le T}) = \prod_{t=1}^{T} p_\theta(X_t \mid z_{\le t}, x_{<t})\, p_\theta(z_t \mid x_{<t}, z_{<t})$$
The variational lower bound (ELBO) on the log-likelihood of the generated data, $\log p_\theta(X_{\le T} \mid x_{\le T})$, is derived as
$$\log p_\theta(X_{\le T} \mid x_{\le T}) \ge \mathbb{E}_{q_\phi(z_{\le T} \mid x_{\le T})}\left[\sum_{t=1}^{T} \log p_\theta(X_t \mid z_{\le t}, x_{<t}) - D_{\mathrm{KL}}\left(q_\phi(z_t \mid x_{\le t}, z_{<t}) \,\|\, p_\theta(z_t \mid x_{<t}, z_{<t})\right)\right]$$
In the corresponding pseudocode, the weights $W_0^{(i)}$ are set for the initial sampling (ref. experimental setup in Section 4.2), and the function $F$ generates a sample $x^{(i)}$ from the environment $X^{(i)}$ after assigning weights $W_0^{(i)}$ to modality $i$ (ref. Action selection in Section 3.4).

Figure 1. Block diagrams of the proposed attention-based agent applied to two-person interaction generation and classification. In the benchmark skeleton datasets, there is no information regarding the appearance of joints (shape, color, texture), only their locations. Appearance constitutes visual perception ('what'), while location constitutes visual proprioception ('where'). The mathematical symbols used in the diagrams are defined in Section 3.

Figure 3. Different regions in the skeleton.

Figure 4. The top row shows the true skeletal data at alternate time steps for the SBU Kinect Interaction exchanging object interaction in the first-person environment. Each skeleton in rows 2, 3, and 4 shows the one-step-ahead prediction when 30%, 50%, and 70% of the ground truth is given (highlighted by the grey line), respectively. Beyond that, the model uses its own prediction as input for completing the pattern until the final time step is reached. The salient joints are marked in red.

Figure 5. The top row shows the true skeletal data at alternate time steps for the SBU Kinect Interaction exchanging object interaction in the third-person environment. Each skeleton in rows 2, 3, and 4 shows the one-step-ahead prediction when 30%, 50%, and 70% of the ground truth is given (highlighted by the grey line), respectively. Beyond that, the model uses its own prediction as input for completing the pattern until the final time step is reached. The salient joints are marked in red.

Figure 6. The top row shows the true skeletal data at every third instant for the K3HI Interaction shaking hands interaction in the first-person environment. Each skeleton in rows 2, 3, and 4 shows the one-step-ahead prediction when 30%, 50%, and 70% of the ground truth is given (highlighted by the grey line), respectively. Beyond that, the model uses its own prediction as input for completing the pattern until the final time step is reached. The salient joints are marked in red.

Figure 7. The top row shows the true skeletal data at every third instant for the K3HI Interaction shaking hands interaction in the third-person environment. Each skeleton in rows 2, 3, and 4 shows the one-step-ahead prediction when 30%, 50%, and 70% of the ground truth is given (highlighted by the grey line), respectively. Beyond that, the model uses its own prediction as input for completing the pattern until the final time step is reached. The salient joints are marked in red.

Figure 8. Prediction (AFD) for different percentages of ground truth given as input for the first-person environment. For any percentage p, p% of the actual data is given as input and the model's prediction is used as input for the remaining time steps.

Figure 9. Prediction (AFD) for different percentages of ground truth given as input for the third-person environment. For any percentage p, p% of the actual data is given as input and the model's prediction is used as input for the remaining time steps.

Figure 10. Classification accuracy for different percentages of ground truth given as input for the first-person environment. For any percentage p, p% of the actual data is given as input and the model's prediction is used as input for the remaining time steps.

Figure A2. Salient joint distribution (dist.) over all interactions, shown for skeleton 1 in (a-c) and the other skeleton in (d-f), for the first-person (lwpe) environment using SBU Kinect Interaction data.

Figure A5. Salient joint distribution (dist.) over all interactions, shown for skeleton 1 in (a-c) and the other skeleton in (d-f), for the third-person (lwpe) environment using SBU Kinect Interaction data.

Figure A8. Salient joint distribution (dist.) over all interactions, shown for skeleton 1 in (a-c) and the other skeleton in (d-f), for the first-person (lwpe) environment using K3HI Interaction data.

Figure A11. Salient joint distribution (dist.) over all interactions, shown for skeleton 1 in (a-c) and the other skeleton in (d-f), for the third-person (lwpe) environment using K3HI Interaction data.

Table 2. Generation accuracy (AFD) averaged over all examples for each interaction in the test set and all train-test splits (mean, std dev) for

Table 3. Generation accuracy (AFD) averaged over all examples for each interaction in the test set and all train-test splits (mean, std dev) for the first-person environment for the K3HI Interaction Dataset. (bs), (pe), (lwpe), and (lw) are different action selection methods (ref. Section 4, action selection); the interactions approach, shake hands, and exchange object are abbreviated as Appr., Sh. Hands, and Exc. ob., respectively; the metric average AFD is abbreviated as Avg. AFD.

Table 4. Generation accuracy (AFD) averaged over all examples for each interaction in the test set and all train-test splits (mean, std dev) for the third-person environment for the K3HI Interaction Dataset. (bs), (pe), (lwpe), and (lw) are different action selection methods (ref. Section 4, action selection); the interactions approach, shake hands, and exchange object are abbreviated as Appr., Sh. Hands, and Exc. ob., respectively; the metric average AFD is abbreviated as Avg. AFD.

Table 5. Number of trainable parameters.

Table 7. Class prediction results using

Table 10. Class prediction results using

Table 11. Percentage of salient joints (mean, std dev) sampled by different variants of our models from the ground truth using the first-person environment, shown for (pe); (bs), (lwpe), and (lw) do not have sparsity. The interactions shake hands and exchange object are abbreviated as Sh. Hands and Exc. obj., respectively.

Table 12. Percentage of salient joints (mean, std dev) sampled by different variants of our models from the ground truth using the third-person environment, shown for (pe); (bs), (lwpe), and (lw) do not have sparsity. The interactions shake hands and exchange object are abbreviated as Sh. Hands and Exc. obj., respectively.

Table 13. Training time required (hours, iterations). For the SBU Kinect dataset and both first-person and third-person environments, M3 and M2 require the least and the most training time, respectively, for all action selection methods. For the K3HI dataset, M3 requires the least training time for all action selection methods in both environments, and M2 requires the most for all action selection methods except (pe) for first person, where M1 does.