1. Introduction
The goal of developing effective, ‘human-like’ chatbots is wide-ranging and touches on many broad fields of research. In this study, we focus on detecting interruptions, one of the less studied areas. Historically, approaches to interruption detection were either not data-driven or relied on high-latency natural language processing [1]. The emergence of novel self-supervised learning techniques for speech [2] has opened up a new avenue of research with the promise of higher accuracy and lower latency [3]. While this has been a subject of active research, participation has been limited to those with access to proprietary interruption datasets, as there has been a lack of publicly available, audio-based interruption datasets. In this study, we derive a new dataset for interruption classification from the Group Affect and Performance (GAP) dataset, opening up this topic for broader research. A full audio transcript is inherited from the source dataset and can be leveraged for multi-modal approaches. The dataset aids in the development of chatbots that discriminate interruptions from backchannels and background noise. Backchannels share some characteristics with interruptions: they are affirmations by listeners in response to what a speaker is saying, expressed through words (for example, ‘agreed’, ‘sure’) or noises (for example, ‘mhmm’), and it is widely agreed that such utterances do not constitute interruptions [4]. Ultimately, distinguishing the two allows a chatbot to pause in a timely manner before finishing its response in the case of an interruption, thereby creating a more natural and human-like interaction.
Although we focus mainly on the task of interruption classification, the value of this dataset extends beyond this. The dataset can be used to study interruptions in language and, more broadly, within group interactions.
The rest of this paper is organised as follows. Section 2 summarises the pertinent aspects of the GAP dataset. Section 3 describes the dataset, while Section 4 outlines the definitions used for this dataset and the procedure of its creation. Section 5 presents the characteristics of the resulting dataset. Finally, Section 6 demonstrates the usefulness of the dataset, showing results from an interruption classification experiment.
2. Source Dataset
The GAP dataset [5], from which our interruption dataset is derived, was curated by the University of the Fraser Valley, Canada, and consists of 28 group meetings, totalling 252 min of conversational audio. The corpus was collected to study computational approaches to small group interaction, and each group had 2 to 4 participants. Groups completed a winter survival task in which the objective was to rank 15 items by usefulness in a hypothetical plane crash scenario. Participants completed the ranking task individually and were subsequently recorded performing the same task jointly.
Since each group member had previously completed the task independently, discussions were enriched by comparing answers. The nature of these discussions invited frequent interruptions, with participants readily sharing their perspectives on the ranking of items. This aligned well with our objective and made the setup of the GAP dataset ideal for our purposes.
2.1. Data Collection Protocol
In recording the dataset, each group was seated around a table with a Zoom H1 Handy Recorder in the centre. The recordings were collected at a sample rate of 22 kHz, which, while lower than the typical professional standard, still provides adequate sound quality. The audio was recorded in mono; therefore, instances of overlapped speech are not split into separate channels corresponding to each interlocutor, which poses challenges for isolating the audio of the interrupter from that of the interruptee.
Human annotators transcribed the audio data using the ELAN annotation software (version 65), as depicted in Figure 1. In transcribing, the research team observed that spontaneously produced speech does not divide neatly into structured, sentence-based language. Consequently, they chose to segment the speech into intention-based utterances: the GAP dataset defines an utterance as a segment of speech in which a speaker intends to communicate one piece of information. This approach significantly aids the task of identifying interruptions. Rather than interleaving words from different speakers during overlapping speech, it preserves each speaker’s contribution. Moreover, detecting overlapped speech becomes more straightforward, as one can examine utterances’ start and end timestamps for potential overlaps. These timestamps are human-annotated and, while broadly accurate, contain minor imperfections.
2.2. Dataset Characteristics
Among the 28 groups, there were 84 speakers, 76% of whom were female and 24% male. Participants were recruited through the University of the Fraser Valley; consequently, there were broad demographic similarities among participants regarding age, educational level, and socioeconomic status. The majority of participants were native English speakers, with only 12% being non-native.
3. Dataset Description
This dataset consists of overlapped utterances, split into the following two classes: true interruptions and false interruptions. False interruptions are cases of overlapping speech that are not deemed interruptions.
The term ‘interruption’ is not used consistently in the linguistics literature, and in certain instances within the spoken dialogue system literature, it appears without a clear definition or appropriate reference. When a sufficiently precise meaning is not provided, the term becomes subjective, which this study must avoid. In this section, we therefore provide a clear definition of an interruption that is used consistently throughout this study.
The linguistics literature sees interruptions as violations of the turn-taking rule set [6]. This rule set describes how an addressee receives verbal and non-verbal cues from a speaker and determines an appropriate place to take over a turn with minimal or no overlap. In this dataset, we align with Crook et al. (2012) [4] in requiring an overlap in speech for an interruption, and we draw from Lin et al. (2022) [7] in centring our definition on the intention of the interrupter. Our definition is as follows:
Definition 1. An interruption is an instance where an interrupting party intentionally attempts to take over a turn of the conversation from an interruptee and, in doing so, creates an overlap in speech.
Clearly, as we define it, the set of interruptions is a subset of the set of overlapped utterances. A rigorous definition of what we consider overlapped utterances can be found in Section 4.
False interruptions can be of many types; a common type is backchannelling, in which a listener affirms what a speaker is saying through words (for example, ‘agreed’, ‘sure’) or noises (for example, ‘mhmm’). It is widely agreed that this does not constitute an interruption [4,8]. Our definition of an interruption aligns with this view, given the lack of intent to take over the turn of the conversation.
The folder structure of the dataset is shown in Listing 1.
Listing 1. Folder structure of the interruption dataset.
|-- data.json
|-- audio-and-transcripts
|   |-- audio
|   |   |-- Group 1: 00:23.9 - 00:25.0.wav
|   |   |-- Group 1: 00:24.3 - 00:25.4.wav
|   |   |-- Group 1: 00:35.4 - 00:37.8.wav
|   |   …
|   |   `-- Group 14: 09:39.3 - 09:39.7.wav
|   `-- original-dataset
|       |-- group-audio
|       |   |-- MP4 Group 1 Feb 8 429.mp4.wav
|       |   |-- MP4 Group 2 Feb 8 553.mp4.wav
|       |   …
|       |   `-- MP4 Group 14 Oct 15 1234.mp4.wav
|       `-- transcripts
|           |-- Transcript Group 1 Feb 8 429.txt
|           |-- Transcript Group 2 Feb 8 553.txt
|           …
|           `-- Transcript Group 14 Oct 15 1234.txt
`-- LICENSE.txt
Details of each true and false interruption data point are found in the root directory inside data.json. These instances of overlapped utterances are uniquely identified across the GAP dataset by the combination of groupNumber, speakerId, and startTime. An item of overlapped speech is displayed in Figure 2.
The corresponding overlapped utterance audio snippets are stored in the audio-and-transcripts/audio directory. They follow a naming convention that indicates the portion of the GAP dataset audio they represent, namely [Group number]: [Start time] - [End time].wav. The start and end times refer to the overlapping utterance; in the case of a true interruption, this is the interrupter’s utterance.
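As a minimal sketch of how the dataset can be consumed, the following Python snippet loads data.json and reconstructs the path of each audio snippet from the fields described above. It assumes data.json holds a list of records and that each record also carries an endTime field formatted like the transcript timestamps; only groupNumber, speakerId, and startTime are documented here, so the remaining field names and the dataset root path are illustrative.

import json
from pathlib import Path

DATASET_ROOT = Path("interruption-dataset")  # assumed local path to the dataset root

# Load every annotated overlapped utterance from data.json (assumed to be a list of records).
with open(DATASET_ROOT / "data.json", "r", encoding="utf-8") as f:
    records = json.load(f)

def audio_path(record):
    """Build the path of an audio snippet from its record.

    Assumes the record exposes groupNumber, startTime, and endTime formatted as in
    the transcripts (e.g. "00:24.3"); only groupNumber, speakerId, and startTime
    are documented as the unique key.
    """
    name = f"Group {record['groupNumber']}: {record['startTime']} - {record['endTime']}.wav"
    return DATASET_ROOT / "audio-and-transcripts" / "audio" / name

# Example: index records by their documented unique key.
index = {(r["groupNumber"], r["speakerId"], r["startTime"]): r for r in records}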
4. Methods
Our methodology for creating an interruption classification dataset consists of the following three main steps: (1) sifting through the GAP dataset’s transcripts to identify instances of overlapping utterances; (2) audibly reviewing these instances and, where they fit our definitions, manually labelling them as either true interruptions or false interruptions; and (3) extracting the audio of the overlapping utterance in each case included in the dataset. We adhered to a strict protocol to ensure data accuracy, maintain consistency, and avoid unwanted edge cases.
4.1. Building out Robust Definitions
The definitions of an overlapped utterance and interruption are paramount to our process. These definitions directly impact the dataset; overlapped utterances inform which data points are included in our dataset, and interruptions inform the classes assigned to each sample.
Our definition of an interruption from Section 3 is used, where the intent of the interrupter and overlapped speech are the two necessary conditions. As a consequence of intent being a core factor of an interruption, the following two important caveats are introduced:
Cases of misplaced speech: Cases with a near-instant overlap are discarded. Since the average human reaction time to articulate a vowel in response to a speech stimulus is 213 ms, we follow Yang and Heeman [9] and treat any overlap that begins within 300 ms of the start of the interrupted utterance as a case of misplaced speech. This is based on our definition’s requirement that an interruption be intentional: any overlapped speech occurring within the minimum reaction threshold is deemed accidental.
Early-onset responses: Instances where the overlap begins at the very end of an utterance are not intentional attempts to take over the turn of the conversation, as acknowledged by Selfridge et al. (2013) [10]; hence, these are not included. Following Selfridge et al., we define these as instances where the overlap begins in the final 10% of an interlocutor’s utterance.
In defining an overlapped utterance, our primary aim is to create an overarching class encompassing interruptions together with all instances that could plausibly be deemed interruptions. For this reason, we omit cases whose timing indicates misplaced speech or an early-onset response.
Definition 2. An overlapped utterance is an instance where one interlocutor provides speech or noise during another interlocutor’s speech, creating an overlap that may be deemed a possible interruption when considering its timing alone.
When constructing the dataset, each occurrence of an overlapped utterance must be distinctly identified. This involves recording each instance’s group number, start timestamp, and speaker ID. By doing this, we remove ambiguity should two speakers in a group begin an utterance simultaneously.
Since interruptions are a subset of overlapping utterances, at a high level, our dataset creation process involves two key steps. First, we identify instances of overlapped speech to populate the dataset. Then, we label the interruptions by identifying instances of overlapped speech where there is an intent to take over the conversation. For example, we would not label cases of backchannelling or coughing as interruptions.
4.2. Methodology Step 1: Parse Transcripts
Our first step is to extract relevant instances of overlapped utterances. The GAP dataset transcripts are formatted as .txt files, with utterances separated by newline characters (see Listing 2 for a snippet of the transcript of Group 1’s audio). The Participant column takes the form [GroupNumber].[Speaker ID].[Speaker-specific utterance count], where each speaker in the group is uniquely identified by a colour. The transcripts use special characters to represent non-speech information, with ‘$’ for laughter, ‘%’ for coughing, and ‘#’ for other noises.
Listing 2. Group Affect Performance Dataset (snippet of Group 1 transcript).
Participant   Start     End       Sentence
1.Pink.1      00:02.0   00:03.5   "So, what did everyone do as one?"
1.Blue.1      00:04.0   00:05.7   "I did, uh, cigarette lighter."
1.Blue.2      00:06.4   00:07.2   "For one."
1.Pink.2      00:07.3   00:09.3   "Mm okay, I did knife."
To extract instances of overlapped speech, we parse each utterance in the transcript. For an utterance to be retained, the following three conditions must be met: (1) there is, indeed, an overlap with a previous utterance; (2) it is not misplaced speech (i.e., the overlap begins more than 300 ms after the start of the previous utterance); and (3) it is not an early-onset response (i.e., the overlap does not begin within the final 10% of the previous utterance).
For each group’s transcript, our parsing script outputs an intermediate transcript containing each potential instance of overlapped speech, allowing us to begin the next step.
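The sketch below illustrates this parsing step under stated assumptions: transcript lines are whitespace-separated as in Listing 2, timestamps use the MM:SS.s format, and, for simplicity, each utterance is only compared against the immediately preceding one. The helper names (to_seconds, parse_line, candidate_overlaps) are ours, not part of any released code.

import re

def to_seconds(ts):
    """Convert an 'MM:SS.s' timestamp (as in Listing 2) to seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

def parse_line(line):
    """Split one transcript line into (group, speaker, start, end, text)."""
    participant, start, end, text = re.split(r"\s+", line.strip(), maxsplit=3)
    group, speaker, _count = participant.split(".")
    return group, speaker, to_seconds(start), to_seconds(end), text

def candidate_overlaps(lines, misplaced_s=0.3, early_onset_fraction=0.10):
    """Yield utterances that overlap the previous utterance by another speaker
    and survive the misplaced-speech and early-onset filters."""
    # Skip the header row; utterance rows begin with the group number.
    rows = [parse_line(l) for l in lines if l.strip() and l.strip()[0].isdigit()]
    for prev, curr in zip(rows, rows[1:]):
        _, p_speaker, p_start, p_end, _ = prev
        _, c_speaker, c_start, _, _ = curr
        if c_speaker == p_speaker or c_start >= p_end:
            continue  # same speaker, or no temporal overlap at all
        if c_start - p_start < misplaced_s:
            continue  # misplaced speech: both utterances start near-simultaneously
        if c_start > p_end - early_onset_fraction * (p_end - p_start):
            continue  # early-onset response: overlap only within the final 10%
        yield curr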
4.3. Methodology Step 2: Manual Audio Annotations
Although many overlapped utterances seem straightforward to classify from the text alone, it is important to listen to the audio, as many cases are ambiguous in text form. Take, for example, the following passage:
Speaker 1: And do we want to do the ball of steel wool cause that’s like the fire starter if you have the batteries.
Speaker 2: Okay.
Audio is needed to confirm whether Speaker 2 is backchannelling to encourage Speaker 1 to explain their reasoning for selecting ‘ball of steel wool’, or whether the ‘Okay’ is a confirmation to select ‘ball of steel wool’; in the latter case, it constitutes an interruption in the form of an early answer. The timing of Speaker 2’s speech relative to Speaker 1’s is another essential factor gained from listening to the audio: if it came before ‘ball of steel wool’, it could not be an answer to the question.
Although many instances of misplaced speech are filtered out by our parsing step, some were not caught because of slightly misaligned utterance-level timestamps, and others because the two utterances began more than 300 ms apart. This issue was most prevalent in groups of 3 or 4 people; it often occurred when one participant posed a question and two other participants rushed to provide an answer, ending up speaking simultaneously. Such instances were duly excluded from the dataset during manual review. Further cases were excluded because they did not fit the definition of an overlapped utterance that could be deemed a possible interruption; these mostly comprised non-speech, such as laughter, that was later overlapped by speech. As such cases could never be interruptions, they do not merit inclusion in the dataset.
Annotations were completed for Groups 1 to 14 (inclusive) of the GAP dataset, yielding 355 data points. A single annotator carefully carried out the annotations following a rigorous, non-subjective protocol, so there was no need to cross-check labels between different annotators.
4.4. Methodology Step 3: Extract Audio
Once we have our manually annotated data points in data.json, we need to extract and process the audio for each instance in our dataset. For audio extraction, we use pydub [11], an open-source, Python-based audio processing library built on top of the popular open-source ffmpeg library [12]. This leaves us with our final dataset: audio snippets of all instances of overlapped speech, coupled with references to their place in the original dataset transcripts. Since the dataset references the source GAP dataset, we also include the original transcripts and full group audio recordings for Groups 1 to 14.
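A minimal sketch of this extraction step is given below, assuming the utterance timestamps have already been converted to milliseconds; pydub slices audio in milliseconds and delegates decoding to ffmpeg. The example file names follow the conventions of Listing 1, and the helper name extract_snippet is ours.

from pydub import AudioSegment

def extract_snippet(group_wav, start_ms, end_ms, out_wav):
    """Cut the [start_ms, end_ms] span out of a full group recording and save it."""
    audio = AudioSegment.from_wav(group_wav)  # load the mono 22 kHz group recording
    snippet = audio[start_ms:end_ms]          # pydub slices are indexed in milliseconds
    snippet.export(out_wav, format="wav")

# Illustrative call for the first snippet of Group 1 (00:23.9 - 00:25.0).
extract_snippet(
    "audio-and-transcripts/original-dataset/group-audio/MP4 Group 1 Feb 8 429.mp4.wav",
    23_900,
    25_000,
    "audio-and-transcripts/audio/Group 1: 00:23.9 - 00:25.0.wav",
)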
5. Dataset Characteristics
The total length of audio from the 14 groups is 115 min, with individual meeting durations ranging from 3 min to 13 min. The interruption dataset includes 41 participants from the 14 contributing groups, 28 of whom are female. This results in a higher relative male representation of 32% in this subset, compared to 24% in the entire GAP dataset. Six of the participants are non-native speakers, a proportion similar to that of the full GAP dataset. Among the 355 data points, 200 represent true interruptions, a proportion of 56%, giving the dataset a relatively balanced composition. In Figure 3 and Figure 4, we provide the number of true interruption and false interruption entries contributed by participants, split by gender.
Analysis of Figure 3 and Figure 4 shows that the interruption dataset contains a diverse set of speakers. Of the 41 participants, only 3 did not contribute a true interruption, and 2 did not contribute a false interruption. Because there are more female participants, there are more female contributions to the dataset. In addition, the participants who contributed the most true interruptions (between 11 and 14 each) were all female. This may introduce a bias in the dataset towards female true interruptions.
Another potential bias is introduced by the fact that participants were recruited from the University of the Fraser Valley. All of the participants were undergraduate students, and thus, there is demographic similarity in terms of age, education level, and socioeconomic status.
6. Results
This study explores the dataset’s utility by performing an interruption classification task. We randomly split the dataset into 70%/15%/15% train/validation/test sets. As a pre-processing step, we used HuBERT to produce a sequence of 1024-dimensional embeddings for each data point’s audio. Each sequence was then averaged across the time dimension, which allowed us to evaluate the audio independently of its length, as the average length of a true interruption is much greater than that of a false interruption. Our binary classifier is a neural network with hidden layers of sizes [768, 512, 256], each followed by a ReLU activation function. The model was trained for 10 epochs with the Adam optimiser.
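The following sketch reproduces this setup under stated assumptions: the text does not name the exact HuBERT checkpoint, learning rate, batch size, or data-loading details, so the facebook/hubert-large-ll60k checkpoint (whose hidden states are 1024-dimensional), the learning rate, and the loader interface below are illustrative. HuBERT expects 16 kHz input, so the 22 kHz dataset audio would first need resampling.

import torch
import torch.nn as nn
from transformers import HubertModel

# Illustrative checkpoint; any HuBERT variant with 1024-dimensional hidden states
# matches the description in the text.
hubert = HubertModel.from_pretrained("facebook/hubert-large-ll60k").eval()

@torch.no_grad()
def embed(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Mean-pool HuBERT hidden states into a single 1024-dimensional vector.

    waveform_16k: 1-D float tensor at 16 kHz (resampled from the 22 kHz source
    and roughly zero-mean/unit-variance).
    """
    hidden = hubert(waveform_16k.unsqueeze(0)).last_hidden_state  # (1, frames, 1024)
    return hidden.mean(dim=1).squeeze(0)                          # (1024,)

# Binary classifier with hidden layers [768, 512, 256] and ReLU activations.
classifier = nn.Sequential(
    nn.Linear(1024, 768), nn.ReLU(),
    nn.Linear(768, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 1),  # logit for the 'true interruption' class
)

optimiser = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # lr is illustrative
loss_fn = nn.BCEWithLogitsLoss()

def train(train_loader, epochs=10):
    """train_loader yields batches of (pooled_embedding, binary_label)."""
    for _ in range(epochs):
        for x, y in train_loader:
            optimiser.zero_grad()
            loss = loss_fn(classifier(x).squeeze(-1), y.float())
            loss.backward()
            optimiser.step()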
Table 1 presents the results of applying this approach, using 0.5 as the threshold on the model’s output probability to classify an interruption as true or false. In binary classification tasks, metrics are conventionally reported for the dominant class. However, in our study, the ‘false interruption’ class is of particular interest: given the imbalance in our dataset, a model that indiscriminately favours the positive class would yield a misleadingly high accuracy score. Therefore, we also report metrics for the false interruption class to evaluate the model’s performance across both classes. The reported scores demonstrate the value of the dataset in allowing a model to learn features that differentiate genuine interruptions from other sounds, such as laughter, coughing, and backchannelling. This is a simple approach to the problem; we encourage researchers to experiment with multi-modal approaches and with approaches that identify an interruption without needing the complete audio, in which case a model could detect an interruption before it has finished.
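Per-class metrics of this kind can be computed as sketched below; the snippet assumes arrays of gold labels and predicted probabilities for the test split, and applies the 0.5 threshold mentioned above.

import numpy as np
from sklearn.metrics import classification_report

def report_both_classes(y_true, y_prob, threshold=0.5):
    """Report precision, recall, and F1 for both classes, not only the dominant one.

    y_true: 1 for a true interruption, 0 for a false interruption.
    y_prob: the classifier's output probability of a true interruption.
    """
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return classification_report(
        y_true,
        y_pred,
        target_names=["false interruption", "true interruption"],
    )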
7. Conclusions and Future Work
In conclusion, we successfully created an audio dataset of interruptions and developed a classification model that accurately identifies interruptions based on this dataset. The results demonstrate the model’s ability to generalise well, confirming the robustness of our dataset for interruption classification. We employed specific definitions of interruptions as delineated by Crook et al. (2012) [4] and Lin et al. (2022) [7], and our findings validate these definitions, indicating their clarity and applicability in real-world scenarios.
We encourage several directions for future research. One avenue is to develop multi-modal approaches that leverage the provided text modality, which may enhance model performance. Another direction involves creating innovative methods to identify interruptions without requiring complete audio segments, enabling real-time interruption detection and improving the model’s utility in a chatbot environment. Additionally, integrating the interruption dataset with the GAP dataset could help explore the broader implications of interruptions, such as their impact on task performance. Finally, incorporating sentiment analysis into the dataset would allow for a more detailed examination of the frequency or likelihood of interruptions based on the interlocutor’s sentiment.
Author Contributions
Conceptualization, D.D. and O.Ş.; methodology, D.D. and O.Ş.; software, D.D.; validation, D.D. and O.Ş.; data curation, D.D.; writing—original draft preparation, D.D.; writing—review and editing, O.Ş.; supervision, O.Ş. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Informed consent was obtained from all subjects during the Group Affect and Performance study.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Ström, N.; Seneff, S. Intelligent barge-in in conversational systems. In Proceedings of the INTERSPEECH, Beijing, China, 16–20 October 2000; pp. 652–655. [Google Scholar]
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Bekal, D.; Srinivasan, S.; Bodapati, S.; Ronanki, S.; Kirchhoff, K. Device Directedness with Contextual Cues for Spoken Dialog Systems. arXiv 2022, arXiv:2211.13280. [Google Scholar]
- Crook, N.; Field, D.; Smith, C.; Harding, S.; Pulman, S.; Cavazza, M.; Charlton, D.; Moore, R.; Boye, J. Generating context-sensitive ECA responses to user barge-in interruptions. J. Multimodal User Interfaces 2012, 6, 13–25. [Google Scholar] [CrossRef]
- Braley, M.; Murray, G. The Group Affect and Performance (GAP) Corpus. In Proceedings of the Group Interaction Frontiers in Technology, GIFT’18, New York, NY, USA, 16 October 2018. [Google Scholar] [CrossRef]
- Sacks, H.; Schegloff, E.A.; Jefferson, G. A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language 1974, 50, 696–735. [Google Scholar] [CrossRef]
- Lin, T.E.; Wu, Y.; Huang, F.; Si, L.; Sun, J.; Li, Y. Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 3299–3308. [Google Scholar]
- Schegloff, E.A. Accounts of Conduct in Interaction: Interruption, Overlap, and Turn-Taking. In Handbook of Sociological Theory; Turner, J.H., Ed.; Springer: Boston, MA, USA, 2001; pp. 287–321. [Google Scholar] [CrossRef]
- Yang, F.; Heeman, P.A. Initiative conflicts in task-oriented dialogue. Comput. Speech Lang. 2010, 24, 175–189. [Google Scholar] [CrossRef]
- Selfridge, E.; Arizmendi, I.; Heeman, P.A.; Williams, J.D. Continuously predicting and processing barge-in during a live spoken dialogue task. In Proceedings of the SIGDIAL 2013 Conference, Metz, France, 22–24 August 2013; pp. 384–393. [Google Scholar]
- Robert, J. Pydub. 2011. Available online: https://github.com/jiaaro/pydub (accessed on 10 July 2023).
- FFmpeg: A Complete, Cross-Platform Solution to Record, Convert and Stream Audio and Video. 2000. Available online: https://www.ffmpeg.org/ (accessed on 5 July 2023).