Group Leader vs. Remaining Group—Whose Data Should Be Used for Prediction of Team Performance?

Humans are considered to be communicative, usually interacting in dyads or groups. In this paper, we investigate group interactions regarding performance in a rather formal gathering. In particular, a collection of ten performance indicators used in the social group sciences is used to assess the outcomes of the meetings in an automatic, machine learning-based way. For this, the Parking Lot Corpus, comprising 70 meetings in total, is analysed. At first, we obtain baseline results for the automatic prediction of performance results on the corpus; this is the first time the Parking Lot Corpus is tapped in this sense. Additionally, we compare the baseline values to those obtained utilising bidirectional long short-term memories. For multiple performance indicators, improvements over the baseline results were achieved. Furthermore, the experiments showed a trend that the acoustic material of the remaining group should be used for the prediction of team performance.


Introduction
Humans are considered to be communicative, usually interacting in dyads or groups. Billions of interactions and hundreds of millions of meetings take place every day around the world. Such gatherings can be quite short, for a brief exchange of gossip, or be rather long coordination meetings, defining the current state and future plans of entire countries. However, the evolution of such meetings and, thus, the creation and development of groups or teams is a dynamic process [1], indicating the flexibility in everyday interactions. In the social sciences, these aspects have been discussed for a long time (e.g., [2][3][4][5]), but the computer sciences also need to address these issues, especially when it comes to multiagent interactions (e.g., [6,7] for an overview), where the agents can be either human beings, technical devices, or virtual agents. Further, these millions of meetings should ideally have a purpose and an intended outcome [1,2,8]. Moreover, the goal as well as the gathering's context (cf. e.g., [9]) influence the way of communicating in the group (e.g., [6,7,10]). As already stated in [11]: "An informal family gathering is mainly related to fun aspects and the (social) feeling of closeness. In contrast, formal business meetings put the (efficient) exchange of information and solutions in focus (cf. e.g., [12])." Regarding the meetings' outcome, two different interpretations can be seen: (1) the classical way, where the outcome relates to specific goals or results and, thus, can be assessed in terms like effectiveness; (2) the interpretation discussed in [11], where a more socialising perspective is considered.
Regarding, at first, the second aspect, mainly related to longer-lasting meetings that are considered rather ineffective, the authors of [11] argue: "These meetings were called "stimulating meetings", meetings being perceived as effective in terms of outcomes and the way of interaction but are not necessarily short in the sense of measured absolute meeting time[, but] are considered as interactions where the communication partners and their interaction are propelled (by each other) such that the entire group is able to perform better." Keeping this in mind, [13] further argues that such meetings are often used for familiarisation within the group (cf. also [9]), especially when the group members do not know each other yet. Therefore, even a longer-lasting meeting can be perceived as effective and considered a kind of "bet" on a perhaps more efficient interaction later on [13]. However, the term "stimulating" also has another perspective: in brainstorming-like meetings, the discussion is usually allowed to be more interactive and lively as well as open-minded, to generate a (large) number of ideas. Therefore, the ongoing interaction can be fertilised by the communication partners, resulting in stimulating conversations. Ideally, such meetings are either open-ended or repetitive, avoiding restrictions and limitations.
Coming back to the first interpretation, the classical view on meetings and their outcomes is given. In each meeting, there is an inherent expectation that it should be effective, which also relates to the performance of the group. Ideally, the effectiveness, often put on the same level as having a good meeting performance (cf. [14,15]), should be measured already during the gathering, in an objective way. Therefore, in this manuscript, we aim for an automatic prediction of meeting performance, represented by performance indicators (cf. Section 2.1.2), to provide an objective statement. In this sense, the current work investigates the groups' performance results for the meetings in the Parking Lot Corpus (cf. Section 2.1.1). For this, particular performance indicators from the social sciences were considered, explained in Section 2.1.2. Furthermore, this is the first time that baseline prediction results on the corpus are provided, following the research questions stated in Section 1.2. Although we investigate the Parking Lot Corpus in the rather classical way, we also consider the "stimulating meeting" aspect in parallel (in terms of discussions).

Related Work
There is already work which refers to the Parking Lot Corpus, introduced in Section 2.1. In particular, this refers to the sound of the meeting in [11], where the prosodic-acoustic characteristics of the spoken samples were analysed. The authors show that there are indeed acoustic differences, mainly based on pitch-related features, between (perceived) successful meetings and those considered not as efficient. These investigations are the foundation of the current manuscript, which investigates acoustic samples (utilising extracted features) for the prediction of performance indicators.
Furthermore, the already discussed concept of "stimulating meetings" [11] was introduced there. In a second paper, the corpus was used to investigate the influence of speech duration and number of turns on the perceived meeting effectiveness [13]. In particular, these analyses support the observations, leading to the more detailed interpretation of effectiveness discussed in the introduction above.
In [6], an overview of communication and interaction in multi-party settings is presented, considering social and technical perspectives. This is complemented in [7], where a respective overview as well as a listing of current challenges are provided, focusing on multimodal and acoustic investigations.
As stated in [7], group analyses can be approached multimodally. For an overview using rather visual cues to assess engagement in groups, we refer to [26]. In terms of acoustic analyses, [7] provides specific overviews, highlighting current achievements in analysing groups and their behaviour (e.g., [27][28][29]).
Regarding acoustic features, also utilised in this manuscript, a huge variety of collections or feature sets is available in the speech recognition and affective computing communities (cf. e.g., [30]). Often, rather large feature sets or respective sub-sets are used, for instance, 6552 features in [31,32] or 2832 features in [33]. Rather small feature sets can be seen in, for instance, [34,35], with fewer than 300 features. One of the most prominent feature sets is the emobase feature set [33] and its derivatives, as used in the paralinguistic challenges (cf. [36]). In our experiments, we rely on emobase features, introduced in Section 2.2.3.
Finally, we mention the work in [37], where a combination of acoustic and linguistic cues is used for performance prediction in groups. The authors achieve good results using domain adaptation to overcome the drawback of a low number of samples. In contrast, the current manuscript applies machine learning techniques to the Parking Lot Corpus which are parameterised in such a way that they are able to handle training on only one modality (acoustics).

Research Questions
Following the results presented in [11,13], we extend the investigations focusing on the central question: What are the particular influences of acoustic contributions by the leader of a group compared to the remaining group members?
For this, the overall goal of this manuscript is the automatic prediction of the group's performance results regarding a variety of performance indicators (cf. Section 2.1.2), i.e., the prediction of objective and subjective performance indicators, especially in relation to the acoustic samples provided by the group's leader vs. the remaining group. Taking this into consideration, we focus on two research questions in the manuscript: RQ1: What are the particular baselines for the prediction of performance indicators, based on the acoustic material in the Parking Lot Corpus? RQ2: What is the prediction capability of neural networks that are tailored to handle temporal dependencies?
Given these research questions, we state the following hypothesis: Hypothesis 1. The use of neural networks capable of processing temporal dependencies improves the prediction performance for the indicators.

Corpus Description
The investigations are based on the Parking Lot Corpus [11], comprising, in total, 70 meetings recorded at a public Midwestern United States university. The general setting of the recordings is visualised in Figure 1, wherein the particular arrangement slightly varies depending on the utilised location within the university. In the experiments, the group size ranges from three to six participants (mean group size: 3.6). For recruitment, in total, 245 undergraduate students from the psychology department were invited, obtaining class credits as compensation for participation; no further selection of participants was made. As stated in [11]: "Each meeting occurred independently, and the different sized groups resulted from intentional non-specification of the number of participants per group." The corpus comprises audio-visual recordings of discussions aiming to produce recommendations for improving the university's parking situation. Each group was instructed accordingly, also being provided with a list of questions to be considered during the discussion. However, the group's interaction can still be considered non-scripted and spontaneous in the sense of [38]. For each group, a leader was determined by rolling a die. For the group leader, the experimenter introduced typical tasks to perform during the interaction, for instance, guiding the discussion, keeping attendees on track, etc. The total time of discussion was controlled by the experimenter to allow comparable conditions, which is important for some of the performance indicators (e.g., number of recommendations). Therefore, each group was given 20 min for interacting and filing the discussion results. After the meeting, each participant filled out several questionnaires (cf. [11]), providing self-annotations of the meeting using different performance indicators (cf. Section 2.1.2 for details on indicators and references to questionnaires). Furthermore, the meetings were assessed by trained annotators, who provided annotations on additional performance indicators.
The corpus' data were recorded using a (simple) video camera and an internal microphone (cf. Figure 1). Therefore, in the analyses, we have to deal with naturalistic material (cf. [38]) in a quality that is, however, still fine for investigations. The data contributors did not ensure occlusion-free or mask-free data, which limits video-based analyses. The acoustic quality is adequate, and no notable noise is included, owing to the conference room-like environment. In our experiments, we rely only on the acoustic samples, extracted from the video material. Details on the limitations of the corpus are discussed in Section 2.1.3, and restrictions in the experiments are stated in Section 2.2.1.
Figure 1. Sketch of the recording setting. The recordings take place in a conference room-like environment, providing a conference table and multiple seats (dashed seats indicate variations in group sizes; cf. Section 2.1.1). The orientation of the camera and microphone is also visualised. The figure is taken from [11].

Performance Indicators
As already mentioned in the corpus description (cf. Section 2.1.1), performance indicators are based either on individual assessments of the participants or on external validation by qualified raters. For the individual ratings, questionnaires (we refer to the particular performance indicator for references) are used, utilising a varying number of items, indicated on a Likert scale, to request and guide the assessment. These ratings are provided along with the recordings and were pre-processed by the team of Joseph A. Allen, University of Utah. In particular, the self-assessments allow an internal view of the perceived performance of the group, which enabled discussions in relation to aspects of "stimulating meetings" [11]. This issue might be of interest in the discussion of our experiments in Section 4 as well.
Following the suggestion in [11], we distinguish two categories of performance indicators, namely, subjective and objective indicators.These are briefly introduced in the following.
Subjective performance indicators provide an individually perceived impression of the meeting situation and are further used to obtain an internal view of the group itself.Mean indicates that the indicator is finally averaged across all group members to achieve an evaluation of the entire group.
Participants rated the meeting's achievements on a 5-point scale from "extremely ineffective" to "extremely effective".

•	MSA_mean. Four items measured satisfaction with meeting processes following Briggs et al. [40]. Participants rated mainly how the meeting was conducted on a 5-point scale from "strongly disagree" to "strongly agree".
•	MSO_mean. Four items measured satisfaction with meeting outcomes following Briggs et al. [40]. Participants rated their satisfaction with the overall meeting results on a 7-point scale from "strongly disagree" to "strongly agree".
•	MSP_mean. Five items measured satisfaction with process following Briggs et al. [40]. Participants rated the perceived satisfaction with the current process on a 7-point scale from "strongly disagree" to "strongly agree".
•	MSBS_mean. Twenty-seven items measured boredom in the meeting following Fahlman et al. [41]. Participants rated the level of boredom in the current meeting on a 7-point scale from "strongly disagree" to "strongly agree".
•	ANX_mean. Five items measured anxiety in the meeting. The scale was generated in the group of Joseph A. Allen based on general and social anxiety scales (e.g., [42,43]). Participants rated the anxiety in the current meeting on a 5-point scale from "strongly disagree" to "strongly agree".
Objective performance indicators measure the objective and countable outcomes of the meeting, based mainly on the provided written recommendations. Thus, an external view on the meeting and its "productiveness" is already (somehow) given.

•	TS_Rec. Total recommendations is the counting indicator stating the number of written recommendations provided by the group. The higher the number, the more ideas or recommendations were generated.
•	Mean_F_Rec. Each recommendation was assessed by two independent raters regarding both feasibility and quality on a scale from "extremely low" to "extremely high". As stated in [11], "[f]or feasibility, the raters had an agreement [. . . of] Cohen's κ = 0.86". The individual scores per recommendation were accumulated and finally averaged across all recommendations per group.
•	Mean_Q_Rec. Each recommendation was assessed by two independent raters regarding both feasibility and quality on a scale from "extremely low" to "extremely high". As stated in [11], "[f]or quality, [the] two independent raters had an agreement [. . . of] Cohen's κ = 0.83". The individual scores per recommendation were accumulated and finally averaged across all recommendations per group.
•	High_Rec. Based on the scores in F_Rec and Q_Rec, the recommendations were classified as highly feasible and of high quality for each group. The current indicator sums the number of recommendations achieving a score of four or five on either the feasibility or quality ratings.
Regarding the introduced indicators, it is to be noted "that preceding data analyses showed that the subjective and objective performance indicators are correlated. However, correlations are weak (below r = 0.3) [, but nevertheless statistically significant,] and indicate relationships, but no clear directional effects" [11].
In the experiments (at least in the beginning), all indicators were used for the development of prediction models (cf. Section 2.2). However, for the indicators MSBS_mean and ANX_mean, no, or only fragmentary, assessments were provided for various groups, resulting in a large number of outliers. Therefore, the results achieved on these indicators have limited significance and should be considered with caution. Further details on this aspect are given in Sections 2.1.3 and 3.1.

Limitation of the Data
Given the perspective of [44], in our experiments, the Parking Lot Corpus data can be considered secondary data since they were not self-collected, and the inherent variables of the data collection were not influenced by the manuscript's authors. From discussions with the data provider, we can summarise that participants were not pre-selected according to any characteristics; the only restriction was that they were enrolled in the Midwestern United States university, as already explained in Section 2.1.1. There were only two interventions by the experimenter: (1) the group leader was not elected by the group itself but by rolling a die, and (2) the discussion was limited in time for comparison reasons (cf. Section 2.1.1). For our experiments, the Parking Lot Corpus comprises appropriate conditions from an experimental design perspective. The corpus provides acoustic samples from group interactions that have the characteristics of naturalistic interactions (as in [38]): no scripted interaction, no pre-defined wording, but rather spontaneous interaction of the group members. Further, we have direct access to a wide-spread collection of performance indicators (cf. Section 2.1.2) used for the assessment of groups. Finally, given the recorded 70 group interactions, a suitable number of acoustic samples is available (for details, we refer to Section 2.2.1), allowing the training of (rather shallow) machine learning-based prediction models (cf. Sections 2.2.4 and 2.2.5). It is to be noted that we contributed to neither the experimental design nor the data collection. Therefore, we used the provided material as such, except for the exclusion of "outliers" as explained in Section 2.2.1.
Regarding the original intention of the data collection, studying performance indicators in group interactions from a work psychology or social science perspective, the experiments presented in the manuscript are established in a post hoc manner. We focus on an acoustic-based perspective on the groups' performance results, posing the research questions in Section 1.2 or, as stated in the manuscript's title, "Whose Data Should Be Used for Prediction of Team Performance?". Given the definition of [45], our experiments can be seen as a kind of ex post facto experiment since we interpret given material in a novel sense, especially in an acoustic way. We use the data to tap them for automatic analyses and (maybe) for future use in the human-machine interaction domain, as the "value of co-relational [ex post facto . . .] studies lies chiefly in their exploratory [. . .] character" [45].
However, the Parking Lot Corpus also has limitations; we focus on those affecting the current experiments. The manuscript's authors were able to influence neither the recording conditions nor the parameters of the data collection. Given the underlying task, namely coming up with ideas on the current parking situation on the campus, the used terms and phrases are mainly focused on this issue. This might restrict the variation in terms of the used vocabulary and the possible acoustic variations seen in spontaneous interactions. However, the situation reflects a common, task-oriented group interaction. Finally, only the pre-defined performance indicators can be considered in the experiments and, thus, the prediction models are constructed specifically for those indicators.
Generally, we were not involved in the data collection and design process, which results in a rather retrospective and external view of the collection setting, which can be seen as ex post facto [45]. We nevertheless state that the experimental design as well as the provided data fit our needs for the training and testing of prediction models. Also regarding the number of available acoustic samples, which is usually a question during the training of machine learning approaches, we argue that a suitable amount of data is provided. In particular, we use rather shallow networks and methods (cf. Sections 2.2.4 and 2.2.5) and, thus, the need for data is reduced. For the investigations on group leaders, the Parking Lot Corpus contributes in total 5225 acoustic samples, and 5222 acoustic samples with respect to the self-restrictions (cf. Section 2.2.1). Regarding acoustic samples for the remaining groups, 9076 samples can be used, already taking into account the self-restrictions.

Experimental Setting
Based on the research questions stated in Section 1.2, we select an approach to handle the corpus' material and further choose prediction methods to assess the performance indicators (cf. Section 2.1.2). For this, we distinguish two perspectives: (1) the data flow and how the data are generally processed to achieve a prediction, which is visualised in Figure 2 and discussed in Section 2.2.1; (2) the in-depth handling of the data to obtain a prediction, for which an overview of the experimental workflow is presented in Figure 3.

Separation of Data
At first, we consider the material available in the Parking Lot Corpus. The corpus comprises audio-visual recordings of 70 group sessions, where for each group, a group leader was chosen randomly. Since the experiments rely on the acoustic material, each group contributes a different number of speech acts/samples (i.e., the number of spoken statements). We neglect those groups with fewer than ten speech acts (self-restriction), since this can be assumed to be a minimum amount of material for making any adaptation in the prediction models. Therefore, the total number of available groups reduces to 68, comprising a total number of 14,298 acoustic samples. As suggested and used in [13], the entire material can be split into data from the leading person and data from the remaining participants (usually called the remaining group in the manuscript). The flow of the data within the experiments is visualised in Figure 2.
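The described separation can be sketched as follows (a minimal illustration, not the authors' code; the record layout with `group`, `speaker`, and `leader` fields is an assumption for the example):

```python
# Hypothetical sketch of the data separation: drop groups with fewer than
# ten speech acts (self-restriction) and split the remaining samples into
# leader data and remaining-group data.

def separate_corpus(samples, min_speech_acts=10):
    """samples: list of dicts with keys 'group', 'speaker', 'leader'."""
    # Count speech acts per group and keep only groups above the threshold.
    counts = {}
    for s in samples:
        counts[s["group"]] = counts.get(s["group"], 0) + 1
    kept = [s for s in samples if counts[s["group"]] >= min_speech_acts]
    # Split by whether the speaker is the (randomly determined) group leader.
    leader_data = [s for s in kept if s["speaker"] == s["leader"]]
    remaining_data = [s for s in kept if s["speaker"] != s["leader"]]
    return leader_data, remaining_data
```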
Leader Data: Since the leading person has a distinct position in the group ([1] pp. 91-111 or [46], e.g., pp. 5-7), as she/he has the option of influencing and accentuating the behaviour and also the acoustics of this specific group, this member is of special interest. Therefore, we decided to analyse the acoustic statements of the group leader separately, applying, however, the same methods as for the remaining group (cf. Sections 2.2.4 and 2.2.5). To conduct the investigations, we split the entire Parking Lot Corpus' material, selecting those acoustic samples which were assigned to the respective group leader. This annotation was provided by the corpus' distributor and cross-checked for an arbitrary selection of samples of the subset. Since we had already cleaned the data set as mentioned above, the number of leaders remained at 68, providing a range of acoustic samples from 17 to 156 statements per speaker.
Remaining Group: The remaining group part contains the acoustic material of those participants who collaborate within the current group but were not randomly selected as the group's leader. As discussed in, for instance, [46], the forming of a group is a dynamic process which is usually ongoing during the discussion. Although we are not focused on this interesting issue, we see that the forming process should not be neglected, as already discussed in Section 1 and in [11,13]. However, in the current analyses, we focus rather on the influence of the acoustic statements on the group performance indicators, using prediction experiments (cf. Sections 2.2.4 and 2.2.5) to obtain (ideally) objective assessments of group performance results in the future. Again, we cross-checked the split material on a random basis and applied the same self-imposed restrictions as already described. This led to the same 68 remaining groups, contributing a range of acoustic samples from 55 to 274 statements. The reader should keep in mind that the absolute number is not directly related to any details of the group, since the group's size varies between three and six participants, resulting in a remaining group size range of two to five. Further, some groups are more interactive than others, which also influences the number of statements, additionally reflecting the discussion on "stimulating meetings" [11].

Validation Paradigm and Measures
Based on the grouping on meta-level (cf.Section 2.2.1) and given the group information comprised by the corpus, we applied two validation strategies in our experiments, aiming for conclusions on generalisation or generalised perspectives.
Leader Data: For the group's leader, the validation strategy is based on an individual level since only the acoustic samples of one participant are used. As we split the corpus' material accordingly, we were able to apply a Leave-One-Speaker-Out (LOSO) paradigm. This means that all acoustic samples of one group leader are used only in the test set; the remaining material of the other group leaders is employed for the training. To ensure a smooth training of the models, the training set was further split into a real training set and a validation set (10% of the training material), which allows an assessment of the training process. The whole process was repeated 68 times, utilising each group leader once as the test person.
Remaining Group Data: Regarding the validation of the groups' investigations, we decided on a similar approach as for the leader. Again, we opted for a general perspective (generalisation) in the prediction experiments. Given the material, the test and training sets were constructed as follows: for testing, the acoustic statements of a particular group were applied; the samples of the rest of the groups were merged to form the training set. To assess and control the training process, the training set was split into 10% validation and 90% training material. The whole process was repeated 68 times, utilising each remaining group once for testing. This results in a Leave-One-Group-Out (LOGO) validation paradigm.
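Both paradigms reduce to a leave-one-unit-out scheme over 68 units (leaders for LOSO, remaining groups for LOGO). A minimal sketch, with the simplifying assumption that the 10% validation split is taken over whole units rather than over the raw training material:

```python
import random

def leave_one_out_splits(unit_ids, val_fraction=0.1, seed=0):
    """Yield (train, validation, test) unit-id lists; each unit serves
    once as the test unit (LOSO: group leaders, LOGO: remaining groups)."""
    rng = random.Random(seed)
    for test_id in unit_ids:
        rest = [u for u in unit_ids if u != test_id]
        rng.shuffle(rest)
        # Hold out roughly 10% of the training units for validation.
        n_val = max(1, round(val_fraction * len(rest)))
        yield rest[n_val:], rest[:n_val], [test_id]
```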
Validation Measure: Although we predicted the performance indicator values (either point-scale or counting values, the latter being used for objective indicators like the number of recommendations), we applied the Root Mean Square Error (RMSE) according to Equation (1) for the validation of the prediction experiments:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} (1)

where ŷ is the predicted output, y is the expected output, and n is the number of observations. The RMSE was calculated for each leader and group, respectively. This allows an assessment of the differences between the predicted performance indicator values and the respective human assessment (cf. Section 3.1), which is considered a kind of ground truth. For an overall assessment of the current experiments, the individual RMSE values were further averaged across all leaders or groups, which allows a generalised discussion of the results achieved with a particular parameter setting for the predictors and further enables a ranking of the achievements (the best performing parameter settings are highlighted in the tables visualising experimental results).
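Equation (1) and the subsequent averaging across leaders or groups can be computed as follows (a plain-Python sketch):

```python
import math

def rmse(predicted, expected):
    """Root Mean Square Error, Equation (1), between the predicted
    performance indicator values and the human assessments."""
    n = len(predicted)
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(predicted, expected)) / n)

def overall_rmse(per_unit_rmse):
    """Average the per-leader or per-group RMSE values to obtain the
    overall score used for ranking parameter settings."""
    return sum(per_unit_rmse) / len(per_unit_rmse)
```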
Statistical Significance: The achieved results were also analysed regarding statistical significance. For this, we used the Kruskal-Wallis test (cf. [47]) with internal Bonferroni correction. We selected p < 0.05 as the significance level. The analyses showed that, whenever statistical significance was obtained, even p < 0.01 was reached (cf. Section 3.3.3).
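Such a test can be run, for instance, with scipy. Since the exact form of the "internal Bonferroni correction" is not specified above, dividing the significance level by the number of compared settings is an assumption in this sketch:

```python
from scipy.stats import kruskal

def significant_difference(result_groups, alpha=0.05):
    """Kruskal-Wallis H-test across per-setting result samples.
    result_groups: list of lists, one list of values per parameter setting.
    The Bonferroni step (alpha divided by the number of settings) is an
    assumption about how the correction was applied."""
    statistic, p_value = kruskal(*result_groups)
    corrected_alpha = alpha / len(result_groups)
    return p_value, p_value < corrected_alpha
```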

Extracted Features
Nowadays, two approaches to the development of features are distinguished, namely, hand-crafted and learnt features (e.g., [7,48]). In our experiments, we focused on the hand-crafted features since they provide an option to interpret the respective input used for prediction. In future research, these investigations can be extended to learnt features, and both approaches might be compared.
Features derived from the raw signal by the application of functions and operations are usually called hand-crafted, since "manual" effort and additional (expert) knowledge are used to obtain those features. There is a long tradition of using such features, which is currently gaining attention again in the sense of explainability. Since the raw data are processed "manually", the obtained values are (usually) more easily interpreted than in the case of automatically learnt features. For hand-crafted features, we decided on the emobase feature set (cf. [33]), which is a well-established set in the community of affective computing from speech. In our experiments, this feature set is applied to the acoustic samples of either the group leader or the remaining group. Given the wide range of covered speech characteristics, comprising a balanced set of spectral and prosodic features, as well as its use in multiple speech-related investigations (cf. e.g., [30,36,49]), we assume that emobase is also applicable to the current task (supported by the review in [7]).
The feature extraction is handled as follows: the features are extracted on the utterance level using a common windowing approach (usually a Hamming window), applying the openSMILE toolkit [36]. In total, emobase contains 988 features, constructed from 52 Low Level Descriptors (LLDs) as well as 19 functionals applied to each LLD. The set of LLDs contains, for instance, Mel-Frequency Cepstral Coefficients, intensity, loudness, etc., as well as their respective delta values. The list of functionals includes, amongst others, mean, minimum, maximum, various quantiles, ranges, etc.
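The construction principle, functionals applied to each LLD contour, can be illustrated with a simplified sketch; the real emobase set applies 19 functionals to 52 LLDs (52 × 19 = 988 features), whereas only a few representative functionals are shown here:

```python
import statistics

# Simplified illustration of the emobase construction: a fixed set of
# functionals is applied to every frame-level LLD contour, yielding one
# fixed-length feature vector per utterance.
FUNCTIONALS = {
    "mean": statistics.fmean,
    "min": min,
    "max": max,
    "range": lambda xs: max(xs) - min(xs),
}

def utterance_features(llds):
    """llds: dict mapping LLD name -> list of per-frame values.
    Returns a flat dict of '<lld>_<functional>' features."""
    return {
        f"{name}_{fname}": func(values)
        for name, values in llds.items()
        for fname, func in FUNCTIONALS.items()
    }
```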

Baseline Setting
The Parking Lot Corpus is a data set which is novel to both communities, social group investigations as well as computational group analysis, and thus is not yet frequently used. Recent analyses (cf. [11,13]) focused on the qualitative and quantitative acoustic-based assessments of group characteristics in relation to performance indicators. An experimental prediction of such indicators, based on acoustic material, has not yet been performed on the data. Therefore, we first conducted experiments which allowed us to define a baseline for further comparison. The overall workflow of the baseline experiments was adapted from the one visualised in Figure 3, using an alternative machine learning approach.
For the baseline experiments, we applied a rather simple approach using Support Vector Regression (SVR), using the implementation of the Python library sklearn. Besides the default parameters, the kernel function was set to the Radial Basis Function. For the evaluation, the respective LOSO and LOGO paradigms (cf. Section 2.2.2) were applied, using the hand-crafted features introduced in Section 2.2.3. To fit the requirements of SVR, we "compressed" the features as follows: for each feature of the emobase feature set [33], we calculated the respective mean values across all samples per leader or remaining group. This results in a representation of the leader or the remaining group in an n-dimensional space, which can be used for SVR. The baseline results are presented and discussed in Section 3.2.
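A sketch of this baseline (assuming the per-unit emobase features are available as NumPy arrays; the kernel choice follows the description, everything else stays at sklearn defaults):

```python
import numpy as np
from sklearn.svm import SVR

def baseline_predict(train_feats, train_targets, test_feats):
    """train_feats/test_feats: lists of per-unit feature matrices of shape
    (n_samples, n_features), one per leader or remaining group (assumed
    layout). Each unit is 'compressed' to its per-feature mean before
    fitting the SVR with an RBF kernel."""
    X_train = np.stack([f.mean(axis=0) for f in train_feats])
    X_test = np.stack([f.mean(axis=0) for f in test_feats])
    model = SVR(kernel="rbf")  # remaining parameters at sklearn defaults
    model.fit(X_train, train_targets)
    return model.predict(X_test)
```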

Prediction Models
In the following, we introduce the methods used in the paper's main experiments as well as the respective parameters. An overview of the entire workflow is given in Figure 3. We do our best to communicate the most relevant aspects to also provide an option for reproducibility, already highlighting that parameters not mentioned in the description are kept at their defaults.
In contrast to the baseline experiments, we applied neural networks for the prediction experiments on hand-crafted features. In general, since, on the one hand, the creation of a team is a dynamic process ([1] and discussion in Section 1), and, on the other hand, the interaction within a group is based on a sequence of interactions between group members (cf. Figure 3), we need an approach that is able to handle such dynamics. In the current experiments, we focused on the sequential characteristics of an interaction; thus, the interplay of communication partners needs to be tackled. Therefore, a recurrent neural network approach appears appropriate for this task. Currently, multiple state-of-the-art approaches are available which might solve the issue, albeit often requiring large datasets. Given the Parking Lot Corpus with 14,298 acoustic samples, we decided to use a rather shallow realisation of recurrent neural networks. For this, in particular, we relied on Long Short-Term Memory (LSTM)-based networks (e.g., [50,51]), specifically BLSTMs. Preliminary experiments showed that especially the bidirectional characteristic of BLSTMs provides advantages for the prediction task, since the context in the interaction (BLSTMs preserve past and future information) is beneficial for an assessment of the entire discussion. From our point of view, this allows a better ranking of the entire group's performance.
BLSTMs were implemented and trained using the keras framework [52] in Python. For the fitting process, the Adam optimiser was utilised based on the following parameters: learning rate η = 0.01, β1 = 0.9, β2 = 0.999, and ε = 1 × 10^−7. As the neurons' activation function, we selected the sigmoid function across all neurons. Regarding the number of layers, we used a shallow model, keeping it simple, and fixed the setting to two hidden layers (a BLSTM and a dense layer). Network parameters neither mentioned here nor varied in the experiments were kept at the default settings provided by the keras framework and are thus not reported in the manuscript. The parameters which are varied in the experiments are (1) the number of units (in LSTMs and BLSTMs, a unit is the fundamental processing (memory) cell, being a composition of basic (artificial) neural structures and trainable gates controlling the cell; for details, we refer to, for instance, [51]), where the range is indicated in Tables 3 and 4; and (2) the number of training epochs (in the current experiments, either 50,000 or 100,000 epochs). Furthermore, we implemented early stopping on the validation loss. Although, in general, a LOSO or LOGO validation paradigm is applied for the training of the models, for internal evaluation during the individual training process, an internal validation set (size: six leaders or remaining groups, respectively) is randomly selected. All input values are normalised using the L2-norm.
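A minimal Keras sketch of such a model, assuming TensorFlow/Keras is available, could look as follows. This is an illustrative reconstruction from the parameters stated above (single output per model, matching the per-indicator training), not the authors' actual code.

```python
import numpy as np
import tensorflow as tf

def build_blstm(units: int) -> tf.keras.Model:
    """Two hidden layers (one BLSTM, one dense), sigmoid activations,
    Adam optimiser with the hyperparameters reported in the paper."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, 988)),  # sequence of emobase vectors
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, activation="sigmoid")),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # one indicator per model
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
        loss="mse")
    return model

# Early stopping monitors the loss on the internal validation set.
stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss")
model = build_blstm(units=750)
```

In actual training, `model.fit(..., callbacks=[stopper], validation_data=...)` would be called per leader or remaining group under the LOSO/LOGO paradigm, with L2-normalised inputs.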
Given the validation paradigm, we trained individual models for each leader or remaining group as well as for each performance indicator to be predicted. Finally, the average performance predictions were calculated for a general discussion of the results.

Results
In this section, we present the results and give a first discussion of the achievements. A discussion in a broader sense is given in Section 4, especially with respect to external results and stimulating meetings. We divide the presentation into baseline (cf. Section 3.2) and prediction results (cf. Section 3.3). This allows approaching the Parking Lot Corpus with respect to an automatic prediction of performance indicators as well as a more detailed investigation with state-of-the-art neural predictors.

Human Assessments of Performance Indicators
Before we dive into the particular results, we briefly introduce the human annotations (also referred to as human assessments) for the performance indicators. This refers to the aspect that the group meetings were assessed by at least two qualified raters [11], providing mainly an external view of the groups and their performance. The average human-assessed performance indicators across all groups are given in Table 1. These are used as the ground truth both for the training of the prediction models and for the comparison in our experiments, where no distinction is made as to whether the leader or the remaining group is investigated, since the respective performance is evaluated for the entire group (cf. also [11]).
Hint: Regarding the individual values in Table 1, the results for ANX (mean) and MSBS (mean) are considered outliers. During the human assessment, at least one human expert did not rate either ANX (seven groups are affected) or MSBS (four groups are affected). To highlight this issue, the corpus distributor marked these events with a value of −99. Computing the plain mean across all groups per performance indicator thus results in negative average values for ANX and MSBS (cf. Table 1). Neglecting the affected groups for all performance indicators across all prediction experiments would further reduce the number of samples for those indicators not affected by annotation issues. Therefore, we decided to discard only these two performance indicators in further experiments, although human assessments would be available for the majority of groups.

Baseline Results
As already mentioned in the introduction (cf. Section 1) and the description of the baseline architecture (cf. Section 2.2.4), the Parking Lot Corpus is a rather novel data set. Therefore, baseline experiments to which the current achievements could be compared are still lacking. This is resolved now, giving first indications for predictions using a rather simple statistical approach and further providing results utilising state-of-the-art models (cf. Section 3.3).
Table 2 presents the baseline results, relying on SVR experiments estimating the respective prediction values, providing the predicted mean performance per indicator (either point scale or counting value). Therefore, the results can be directly compared to the human assessments in Table 1.
Table 2. Baseline results per performance indicator using SVR, distinguishing the performance estimated on the acoustic material of either the leader or the remaining group (Rem_Group). The results are averaged across all groups, representing the automatically obtained prediction values in terms of either point scale or counting value (cf. Section 2.1.2). Regarding the achieved results and comparing them to the human gold standard (cf. Table 1), we already see good results for both experimental settings. This is also supported by the RMSE values, measuring the differences between the human-assessed performance indicators and the predicted ones. Note that we decided to avoid stating individual RMSE values for clarity of presentation; however, if needed, the values can be calculated according to Equation (1).

The indicators can be grouped by low, medium, and high RMSEs. Comparing Tables 1 and 2, we see that low RMSE values appear for MSA and ME; medium values are given for TS_Rec, F_Rec, Q_Rec, High_Rec, and MSO; and high values are seen for MSP. Given these baseline results, we see options for improvement, especially in the medium and high RMSE indicators. Therefore, we ran additional experiments, utilising higher-level prediction methods as introduced in Section 2.2.5.
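For reference, the RMSE of Equation (1) is the standard root-mean-square error between human-assessed and predicted indicator values; a minimal computation sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between human ratings and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ≈ 1.1547
```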

Results of Prediction Models
For more advanced prediction models, we relied on BLSTM networks as introduced in Section 2.2.5. This approach is usually known as a good option to predict sequences, especially when exploiting long-term dependencies. From the experimental results, we learnt that investigations on leader and remaining group samples are highly different; thus, separate observations are recommended.

Leader
In Table 3, the prediction results using BLSTM-based networks are visualised, providing the predicted mean performance per indicator (either point scale or counting value; cf. Section 2.1.2), varying the number of units being used in the networks and the number of training epochs. The highlighted cells indicate those settings which show the best results, relying on the RMSE, throughout the prediction experiment per performance indicator. Given the results, we see a spread across the networks' parameters, obtaining the best performance per indicator (respective column in Table 3). However, there are two settings, namely 750 and 1000 units, that provided the best performance results for half of the indicators. Nevertheless, no clear effect of the network setting could be identified in relation to subjective or objective performance indicators. There are two interesting observations. At first, regarding Table 3, we saw that the prediction performance varies with the number of BLSTM units, which is somewhat expected. In the first instance, we assumed that rather small networks might already be able to solve the task. However, we saw that this depends on the number of epochs (cf. rows 1 and 2 in Table 3), which indicates that the systems need quite a while to learn the necessary dependencies. To provide the network with more flexibility to learn the characteristics of group interactions, we increased the number of BLSTM units, showing, by contrast, that already a limited range is appropriate to handle the prediction task per performance indicator. All indicators could be tackled using network settings in the range of 500 to 1250 units. With more than 1500 units, we observed a drastic decline in prediction power, worsening as the number of units increased.
Given these achievements, we additionally ran similar experiments, also varying the number of epochs and even changing the architecture to LSTMs. No evidence for improvement or difference in the prediction could be seen. Therefore, we decided to omit these results from this manuscript.
Second, comparing the prediction performance in Tables 2 and 3 to the human annotations, no improvement could be achieved across all indicators, except for MSP. For this specific performance indicator, a significant improvement could be achieved. Taking into account the argumentation of [11] that especially MSP is related to short-term speech characteristics, we saw the benefit of BLSTMs handling short-term dependencies in relation to the statistical SVR approach in the baseline. For the remaining indicators, the variation in spoken acoustic/prosodic characteristics is rather a long-term indication, which can already be captured by the statistical model using fewer parameters. Additionally, it seems that especially for MSA and ME, BLSTMs are also able to handle the larger variations in prosody (cf. respective analyses in [11]), although they currently do not beat the baseline achievements. However, this is a matter of further research, investigating the particularly affected network parameters in the sense of explainable AI.

Remaining Group
Regarding the results predicting the performance using the material of the remaining group only (cf. Table 4, also presented as predicted mean performances per indicator (either point scale or counting value); cf. Section 2.1.2), we saw a more diverse spread across the parameters (in this case, number of units and number of training epochs). The results vary more than those for the leader-only experiments (cf. Table 3), since a broader spectrum of speaker characteristics needs to be covered. Interestingly, the predictions for F_Rec, Q_Rec, and High_Rec are condensed in a particular setting, namely 750 units. In general, the BLSTM models show performance peaks different from those seen in the leader setting in Table 3. This is related to the diversity in the samples, since the models need to learn a higher variation in acoustics: the variation across the members of each remaining group has to be modelled. Given these results and also the analyses in [11], the complexity of the specific performance indicators can be assumed.
However, comparing the human annotations (cf. Table 1) and baseline results (cf. Table 2) to the achievements of the current models, we see an improvement in the performance for the indicators MSA, ME, TS_Rec, and MSP. Therefore, using material of the remaining group members could be a benefit for the prediction. Driven by the improvements achieved by the models on the remaining groups' material, we also calculated the statistical significance of the results. For this, we compared the predictions of leader and remaining groups across performance indicators. Using the Kruskal–Wallis test [47] (further details in Section 2.2.2), applying a significance level of p < 0.05, we saw two results: on the one hand, if statistical significance was achieved, it was obtained at the p < 0.001 level; in particular, this is the case for TS_Rec, F_Rec, Q_Rec, and High_Rec. On the other hand, for the other performance indicators, no statistical significance can be seen; the calculated p-values are far from the pre-defined significance level.
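Such a significance check can be sketched with SciPy's implementation of the Kruskal–Wallis test. The data below are synthetic, purely for illustration; the paper's actual prediction values and p-values are not reproduced here.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Hypothetical per-group predictions for one performance indicator,
# once from the leader's material, once from the remaining group's.
leader_preds = rng.normal(loc=3.0, scale=0.4, size=70)
remaining_preds = rng.normal(loc=3.6, scale=0.4, size=70)

# Kruskal-Wallis: non-parametric test whether the samples stem from
# distributions with the same location.
statistic, p_value = kruskal(leader_preds, remaining_preds)
if p_value < 0.05:
    print(f"significant difference (p = {p_value:.2e})")
else:
    print(f"no significant difference (p = {p_value:.3f})")
```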

Discussion
The following discussion is tailored to the research questions and the hypothesis stated in Section 1.2. Furthermore, we embed our findings in a broader discussion already started in [7,11].
Research Question RQ1: Regarding the first research question, we established the baseline results on the Parking Lot Corpus. Since this is a rather novel data collection, these results (cf. Table 2) are the first automatic prediction values using acoustic material. So far, the data were analysed rather from psychological and social perspectives, providing human-based assessments. The human annotations in Table 1 are average values across at least two certified annotators (cf. Sections 2.1.2 and 2.2.4, and [11]), which can be seen as ground truth for the upcoming investigations.
Given a rather simple SVR approach, good predictions could already be achieved. These results indicate that predictions of performance per indicator, based on acoustics, can be set up. Further, compared to the results of higher-level neural approaches, reasonable results were obtained. This shows that already "simple" models are able to retrieve the necessary details, providing first-level predictions which also relate well to the human assessments.
Research Question RQ2: To answer the second research question, we selected specific networks, namely BLSTMs, which fit the aspect of temporal handling (further details are given in Section 2.2.5). Similar to the baseline experiments, the human annotations are used as ground truth during the evaluation of the models. As shown in Section 3.3, no clear preferences for particular network settings could be identified. In both scenarios, leader or remaining group, a variation across the network settings is given for the performance indicators. Some benefits can be seen, especially in the recommendation-related indicators. However, this is not conclusive and is rather a pointer towards further investigations.
The results achieved by using acoustic material from the remaining groups are more promising. We see that for half of the indicators, an improvement over the baseline is achieved. Comparing those achievements also to the leaders' results, statistical significance is reached for some indicators. However, the current results are the first findings unlocking the treasure of the Parking Lot Corpus. Detailed analyses need to be conducted to show why particular performance indicators gained improvements with the neural approach. A first interpretation might be that the recommendation part directly corresponds to specific acoustic characteristics. This is in line with the findings of [11], showing that this particular characteristic is spread across the group rather than being established in the leader's acoustics. In contrast, the decrease for the remaining performance indicators shows that neural approaches need well-selected tasks; the interplay of statistical features (given in the emobase feature set [33]) and a quite robust prediction method already covers the characteristics. A more detailed investigation of the specific acoustic variation responsible in the neural approaches, together with the analyses in [11], is necessary in future research.
Hypothesis 1: In general, we expected an improvement in the prediction performance using networks with the capability to model temporal characteristics and incorporate context. It should be possible to model a development in a contextual sense, especially for the remaining group; for the leader, without any group context, this is much harder. However, given the Parking Lot Corpus, this was not the case for BLSTM models across all performance indicators. We obtained improvements for some indicators, especially those related to the recommendations. Given these results, further investigations are necessary to clarify the reasons for the achievements. These can be related to either the data itself, the underlying course of interaction, the models' fine-tuning, or the utilised features, directly asking for interdisciplinary collaborations to assess the respective characteristics and issues.
General Discussion: In general, we provide the first prediction results for the performance of groups in the Parking Lot Corpus based on performance indicators (cf. Section 2.1.2). In this sense, the paper taps the corpus and also provides additional insights into an automatic assessment of performance predictions, aiming for an objective evaluation in the future. What we see from the data and, thus, from the results points in two different directions.
On the one hand, the leader usually has an important role in the meeting (e.g., [1,46]), but given the acoustic analyses, this is currently not reflected in the results; in fact, the prediction approaches are rather confused by this particular material, especially in the neural-based approach. On the other hand, the remaining group provides reasonable acoustic evidence for a prediction of performance, in both cases, baseline and BLSTMs. In this sense, the "holistic" view on the group enables the system to establish an understanding of the communication, leading to an option of performance assessment. The combination of understanding the ongoing communication and a link to contextual ("holistic") information enables not only the evaluation of the current performance but is also related to aspects like familiarisation. The authors in [9] see familiarisation as an essential part of communication (amongst humans only or in human–machine interaction), especially in "open-world" scenarios (cf. naturalistic interaction in [38]), helping to improve task solving in further interactions. This is also in line with the arguments presented in [11] and Section 1, where the benefit of a "stimulating meeting" was already introduced. Such meetings use the option to familiarise and thus invest in upcoming meetings. However, this calls for more fine-grained analyses in both the social and the computer sciences, longing for respective interdisciplinary research. The current work provides some steps toward such an interpretation and combined/interdisciplinary investigations.

Conclusions
The current manuscript analysed the Parking Lot Corpus and the performance indicators assigned to the group interactions. We presented the first acoustic-based prediction results on the corpus, divided into baseline results and achievements using neural approaches. In particular, the baseline setting used statistical features derived from the emobase feature set [33] and SVR for prediction. In addition, in further experiments, BLSTMs were applied, utilising the emobase feature set [33], and were compared to the baseline results. Improvements in the prediction performance could be achieved only for parts of the indicators (cf. Section 3.3), showing some benefit of higher-level neural networks. The achievements are discussed in Section 4, also highlighting the relation to the investigations of [11].
In particular, we summarise the findings in terms of takeaway messages:
• Baseline results of performance indicators for the Parking Lot Corpus based on acoustic samples and statistical features were provided.
• Performance prediction per indicator was contributed based on acoustic samples, utilising the emobase feature set [33], and BLSTMs.
• A comparison of baseline and BLSTM-based results was conducted.
• BLSTM prediction on the remaining group level is beneficial for particular objective performance indicators.
• Regarding whose acoustic samples (leader or remaining group) might be used for prediction tasks, currently a (slight) trend towards the remaining group's samples is given.
• Contextual information (as in the remaining group) is beneficial for an improvement in the performance predictions (related to (theoretical) discussions in [9,11,13]).
In future research, additional features might be tested in relation to neural approaches, but also linking those results to the meta-discussions already started in Section 4 (General Discussion). In particular, the familiarisation process, being part of task-solving interactions (cf. [9]), is an important aspect of establishing a better understanding of contextual relationships during an interaction and how this is linked to the assessment of group performance results. This establishes further collaborations with and relations to the social sciences.
Flow of data in the prediction experiments, whereas the approaches are discussed especially in Sections 2.2.4 and 2.2.5. Data per group (taken from the Parking Lot Corpus (PLC); indicated by dotted arrow) are split into acoustic samples of the group's leader and the remaining group, which are processed separately. Details on the workflow are visualised in Figure 3. Finally, a comparison of the performance indicators is conducted, currently on a manual basis.
Figure 3. Workflow of the prediction. For each remaining group or leader from the Parking Lot Corpus (PLC; indicated by dotted arrow), a respective sequence of acoustic samples is used for prediction. In total, 988 features (emobase feature set) are extracted per sample i and fed to the bidirectional LSTM (BLSTM). The performance indicators are predicted simultaneously.

Table 1. Human rater assessments for each performance indicator based either on a point scale or on counting values (cf. Section 2.1.2). For this table, the respective values are averaged across all groups. For details on the values of ANX and MSBS, we refer to the explanations in Section 3.1.

Table 3. Results per performance indicator of the group's leader using BLSTMs, varying the number (#) of units utilised in the respective network. The results are presented as either point scale values or counting values (cf. Section 2.1.2). The grey cells highlight the best result per performance indicator, using the RMSE as the decision's foundation. † represents the network being trained for 50,000 epochs instead of 100,000 (default).

Table 4. Results per performance indicator of the remaining group using BLSTMs, varying the number (#) of units utilised in the respective network. The results are presented as either point scale values or counting values (cf. Section 2.1.2). The grey cells highlight the best result per performance indicator, using the RMSE as the decision's foundation. † represents the network being trained for 50,000 epochs instead of 100,000 (default).