1. Introduction
The goal of developing effective, ‘human-like’ chatbots is wide-ranging and touches on many broad fields of research. In this study, we focus on detecting interruptions, one of the less studied areas. Historically, approaches to interruption detection were either not data-driven or relied on high-latency natural language processing [1]. The emergence of novel self-supervised learning techniques for speech [2] has opened up a new avenue of research with the promise of higher accuracy and lower latency [3]. While this has been a subject of active research, participation has been limited to those with access to proprietary interruption datasets, as there has been a lack of publicly available, audio-based interruption datasets. In this study, we derive a new dataset for interruption classification from the Group Affect and Performance (GAP) dataset, opening up this topic for broader research. A full audio transcript is inherited from the source dataset and can be leveraged for multi-modal approaches. The dataset aids in the development of chatbots that discriminate interruptions from backchannels and background noise. Backchannels share some characteristics with interruptions: they are affirmations by listeners in response to what a speaker is saying, expressed through words (for example, ‘agreed’, ‘sure’) or noises (for example, ‘mhmm’), and it is widely agreed that such utterances do not constitute interruptions [4]. Ultimately, distinguishing the two allows a chatbot to pause in a timely manner before finishing its response in the case of an interruption, thereby creating a more natural and human-like interaction.
Although we focus mainly on the task of interruption classification, the value of this dataset extends beyond this. The dataset can be used to study interruptions in language and, more broadly, within group interactions.
The rest of this paper is organised as follows. Section 2 summarises the pertinent aspects of the GAP dataset. Section 3 describes the dataset, while Section 4 outlines the definitions used for this dataset and the procedure of its creation. Section 5 presents the characteristics of the resulting dataset. Finally, Section 6 demonstrates the usefulness of the dataset, showing results from an interruption classification experiment.
2. Source Dataset
The GAP dataset [5], from which our interruption dataset is derived, was curated by the University of the Fraser Valley, Canada, and consists of 28 group meetings, totalling 252 min of conversational audio. The corpus was collected to study computational approaches to small group interaction, and each group had 2 to 4 participants. Groups completed a winter survival task in which the objective was to rank 15 items by usefulness in a hypothetical plane crash scenario. Participants completed the ranking task individually and were subsequently recorded performing the same task jointly.
Since each group member had previously completed the task independently, discussions were enriched by comparing answers. The nature of these discussions invited frequent interruptions, with participants readily sharing their perspectives on the ranking of items. This aligned well with our objective and made the setup of the GAP dataset ideal for our purposes.
2.1. Data Collection Protocol
In recording the dataset, each group was seated around a table with a Zoom H1 Handy Recorder in the centre. The recordings were collected at a sample rate of 22 kHz, which, while lower than the typical professional standard, still provides adequate sound quality. The audio was recorded in mono; therefore, instances of overlapped speech are not split into separate channels corresponding to each interlocutor, which poses challenges for isolating the audio of the interrupter from that of the interruptee.
Human annotators transcribed the audio data using the ELAN annotation software (version 65), as depicted in Figure 1. In transcribing, the research team observed that spontaneously produced speech does not divide neatly into structured, sentence-based language. Consequently, they chose to segment the speech into intention-based utterances: the GAP dataset defines an utterance as a segment of speech in which a speaker intends to communicate one piece of information. This approach significantly aids the task of identifying interruptions. Rather than interleaving words from different speakers during overlapping speech, it preserves each speaker’s contribution. Moreover, detecting overlapped speech becomes more straightforward, as one can examine utterances’ start and end timestamps for potential overlaps. These timestamps are human-annotated and, while broadly accurate, contain minor imperfections.
2.2. Dataset Characteristics
Among the 28 groups, there were 84 speakers, 76% of whom were female and 24% male. Participants were recruited through the University of the Fraser Valley; consequently, there were broad demographic similarities among participants regarding age, educational level, and socioeconomic status. The majority of participants were native English speakers, with only 12% being non-native.
3. Dataset Description
This dataset consists of overlapped utterances, split into the following two classes: true interruptions and false interruptions. False interruptions are cases of overlapping speech that are not deemed interruptions.
The term ‘interruption’ is not used consistently in the linguistics literature, and in certain instances within the spoken dialogue system literature, it appears without a clear definition or appropriate reference. When a sufficiently precise meaning is not provided, the term becomes subjective, which this study must avoid. In this section, we therefore provide a clear definition of an interruption that is used consistently throughout this study.
The linguistics literature sees interruptions as violations of the turn-taking rule set [6]. This rule set describes how an addressee receives verbal and non-verbal cues from a speaker and determines an appropriate place to take over a turn with minimal or no overlap. In this dataset, we align with Crook et al. (2012) [4] in requiring an overlap in speech for an interruption, and we draw from Lin et al. (2022) [7] in centring our definition on the intention of the interrupter. Our definition is as follows:
Definition 1. An interruption is an instance where an interrupting party intentionally attempts to take over a turn of the conversation from an interruptee and, in doing so, creates an overlap in speech.
Clearly, as we define it, the set of interruptions is a subset of the set of overlapped utterances. A rigorous definition of what we consider overlapped utterances can be found in Section 4.
False interruptions can be of many types; a common type is backchannelling, in which a listener affirms what a speaker is saying through words (for example, ‘agreed’, ‘sure’) or noises (for example, ‘mhmm’). It is widely agreed that this does not constitute an interruption [4,8]. Our definition of an interruption aligns with this view, given the lack of intent to take over the turn of the conversation.
The folder structure of the dataset is shown in Listing 1.
Listing 1. Folder structure of the interruption dataset.
|-- data.json
|-- audio-and-transcripts
|   |-- audio
|   |   |-- Group 1: 00:23.9 - 00:25.0.wav
|   |   |-- Group 1: 00:24.3 - 00:25.4.wav
|   |   |-- Group 1: 00:35.4 - 00:37.8.wav
|   |   …
|   |   `-- Group 14: 09:39.3 - 09:39.7.wav
|   `-- original-dataset
|       |-- group-audio
|       |   |-- MP4 Group 1 Feb 8 429.mp4.wav
|       |   |-- MP4 Group 2 Feb 8 553.mp4.wav
|       |   …
|       |   `-- MP4 Group 14 Oct 15 1234.mp4.wav
|       `-- transcripts
|           |-- Transcript Group 1 Feb 8 429.txt
|           |-- Transcript Group 2 Feb 8 553.txt
|           …
|           `-- Transcript Group 14 Oct 15 1234.txt
`-- LICENSE.txt
Details of each true and false interruption data point are found in the root directory inside data.json. These instances of overlapped utterances are uniquely identified across the GAP dataset by the combination of groupNumber, speakerId, and startTime. An item of overlapped speech is displayed in Figure 2.
The corresponding overlapped utterance audio snippets are stored in the audio-and-transcripts/audio directory. They follow a naming convention that indicates the portion of the GAP dataset audio they represent, namely [Group number]: [Start time] - [End time].wav. The start and end times refer to the overlapping utterance; in the case of a true interruption, this is the interrupter’s utterance.
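As a minimal sketch of how the dataset can be consumed, the following Python snippet loads data.json and reconstructs the path of each audio snippet from the fields described above. It assumes data.json holds a list of records and that each record also carries an endTime field formatted like the transcript timestamps; only groupNumber, speakerId, and startTime are documented here, so the remaining field names and the dataset root path are illustrative.

import json
from pathlib import Path

DATASET_ROOT = Path("interruption-dataset")  # assumed local path to the dataset root

# Load every annotated overlapped utterance from data.json (assumed to be a list of records).
with open(DATASET_ROOT / "data.json", "r", encoding="utf-8") as f:
    records = json.load(f)

def audio_path(record):
    """Build the path of an audio snippet from its record.

    Assumes the record exposes groupNumber, startTime, and endTime formatted as in
    the transcripts (e.g. "00:24.3"); only groupNumber, speakerId, and startTime
    are documented as the unique key.
    """
    name = f"Group {record['groupNumber']}: {record['startTime']} - {record['endTime']}.wav"
    return DATASET_ROOT / "audio-and-transcripts" / "audio" / name

# Example: index records by their documented unique key.
index = {(r["groupNumber"], r["speakerId"], r["startTime"]): r for r in records}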
4. Methods
Our methodology for creating an interruption classification dataset consists of the following three main steps: (1) sifting through the GAP dataset’s transcripts to identify instances of overlapping utterances; (2) audibly reviewing these instances and, where they fit our definitions, manually labelling them as either true interruptions or false interruptions; and (3) extracting the audio of the overlapping utterance in each case included in the dataset. We adhered to a strict protocol to ensure data accuracy, maintain consistency, and avoid unwanted edge cases.
4.1. Building out Robust Definitions
The definitions of an overlapped utterance and interruption are paramount to our process. These definitions directly impact the dataset; overlapped utterances inform which data points are included in our dataset, and interruptions inform the classes assigned to each sample.
Our definition of an interruption from Section 3 is used, where the intent of the interrupter and overlapped speech are the two necessary conditions. As a consequence of intent being a core factor of an interruption, the following two important caveats are introduced:
Cases of misplaced speech: Cases with a near-instant overlap are discarded. Since the average human reaction time to articulate a vowel in response to a speech stimulus is 213 ms, we follow Yang and Heeman [9] and treat any overlap that begins within 300 ms of the start of the interrupted utterance as a case of misplaced speech. This is based on our definition’s requirement that an interruption be intentional: any overlapped speech occurring within the minimum reaction threshold is deemed accidental.
Early-onset responses: Instances where the overlap begins at the very end of an utterance are not intentional attempts to take over the turn of the conversation, as acknowledged by Selfridge et al. (2013) [10]; hence, these are not included. Following Selfridge et al., we define these as instances where the overlap begins in the final 10% of an interlocutor’s utterance.
In defining an overlapped utterance, our primary aim is to create an overarching class encompassing interruptions together with all instances that could plausibly be deemed interruptions. For this reason, we omit cases whose timing indicates misplaced speech or an early-onset response.
Definition 2. An overlapped utterance is an instance where one interlocutor provides speech or noise during another interlocutor’s speech, creating an overlap that may be deemed a possible interruption when considering its timing alone.
When constructing the dataset, each occurrence of an overlapped utterance must be distinctly identified. This involves recording each instance’s group number, start timestamp, and speaker ID. By doing this, we remove ambiguity should two speakers in a group begin an utterance simultaneously.
Since interruptions are a subset of overlapping utterances, at a high level, our dataset creation process involves two key steps. First, we identify instances of overlapped speech to populate the dataset. Then, we label the interruptions by identifying instances of overlapped speech where there is an intent to take over the conversation. For example, we would not label cases of backchannelling or coughing as interruptions.
4.2. Methodology Step 1: Parse Transcripts
Our first step is to extract relevant instances of overlapped utterances. The GAP dataset transcripts are formatted as .txt files, with utterances separated by newline characters (see Listing 2 for a snippet of the transcript of Group 1’s audio). The Participant column takes the form [GroupNumber].[Speaker ID].[Speaker-specific utterance count], where each speaker in the group is uniquely identified by a colour. The transcripts use special characters to represent non-speech information, with ‘$’ for laughter, ‘%’ for coughing, and ‘#’ for other noises.
Listing 2. Group Affect Performance Dataset (snippet of Group 1 transcript).
Participant   Start     End       Sentence
1.Pink.1      00:02.0   00:03.5   "So, what did everyone do as one?"
1.Blue.1      00:04.0   00:05.7   "I did, uh, cigarette lighter."
1.Blue.2      00:06.4   00:07.2   "For one."
1.Pink.2      00:07.3   00:09.3   "Mm okay, I did knife."
To extract instances of overlapped speech, we parse each utterance in the transcript. For an utterance to be retained, the following three conditions must be met: (1) there is, indeed, an overlap with a previous utterance; (2) it is not misplaced speech (i.e., the overlap begins more than 300 ms after the start of the previous utterance); and (3) it is not an early-onset response (i.e., the overlap does not begin within the final 10% of the previous utterance).
For each group’s transcript, our parsing script outputs an intermediate transcript containing each potential instance of overlapped speech, allowing us to begin the next step.
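The sketch below illustrates this parsing step under stated assumptions: transcript lines are whitespace-separated as in Listing 2, timestamps use the MM:SS.s format, and, for simplicity, each utterance is only compared against the immediately preceding one. The helper names (to_seconds, parse_line, candidate_overlaps) are ours, not part of any released code.

import re

def to_seconds(ts):
    """Convert an 'MM:SS.s' timestamp (as in Listing 2) to seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

def parse_line(line):
    """Split one transcript line into (group, speaker, start, end, text)."""
    participant, start, end, text = re.split(r"\s+", line.strip(), maxsplit=3)
    group, speaker, _count = participant.split(".")
    return group, speaker, to_seconds(start), to_seconds(end), text

def candidate_overlaps(lines, misplaced_s=0.3, early_onset_fraction=0.10):
    """Yield utterances that overlap the previous utterance by another speaker
    and survive the misplaced-speech and early-onset filters."""
    # Skip the header row; utterance rows begin with the group number.
    rows = [parse_line(l) for l in lines if l.strip() and l.strip()[0].isdigit()]
    for prev, curr in zip(rows, rows[1:]):
        _, p_speaker, p_start, p_end, _ = prev
        _, c_speaker, c_start, _, _ = curr
        if c_speaker == p_speaker or c_start >= p_end:
            continue  # same speaker, or no temporal overlap at all
        if c_start - p_start < misplaced_s:
            continue  # misplaced speech: both utterances start near-simultaneously
        if c_start > p_end - early_onset_fraction * (p_end - p_start):
            continue  # early-onset response: overlap only within the final 10%
        yield curr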
4.3. Methodology Step 2: Manual Audio Annotations
Although many overlapped utterances seem straightforward to classify from the text alone, it is important to listen to the audio, as many cases are ambiguous in text form. Take, for example, the following passage:
Speaker 1: And do we want to do the ball of steel wool cause that’s like the fire starter if you have the batteries.
Speaker 2: Okay.
Audio is needed to confirm whether Speaker 2 is backchannelling to encourage Speaker 1 to explain their reasoning for selecting ‘ball of steel wool’, or whether the ‘Okay’ is a confirmation to select ‘ball of steel wool’; in the latter case, it constitutes an interruption in the form of an early answer. The timing of Speaker 2’s speech relative to Speaker 1’s is another essential factor gained from listening to the audio: if it came before ‘ball of steel wool’, it could not be an answer to the question.
Although many instances of misplaced speech are filtered out by our parsing step, some were not caught because of slightly misaligned utterance-level timestamps, and others because the two utterances began more than 300 ms apart. This issue was most prevalent in groups of 3 or 4 people; it often occurred when one participant posed a question and two other participants rushed to provide an answer, ending up speaking simultaneously. Such instances were duly excluded from the dataset during manual review. Further cases were excluded because they did not fit the definition of an overlapped utterance that could be deemed a possible interruption; these mostly comprised non-speech, such as laughter, that was later overlapped by speech. As such cases could never be interruptions, they do not merit inclusion in the dataset.
Annotations were completed for Groups 1 to 14 (inclusive) of the GAP dataset, yielding 355 data points. A single annotator carefully carried out the annotations following a rigorous, non-subjective protocol, so there was no need to cross-check labels between different annotators.
4.4. Methodology Step 3: Extract Audio
Once we have our manually annotated data points in data.json, we need to extract and process the audio for each instance in our dataset. For audio extraction, we use pydub [11], an open-source, Python-based audio processing library built on top of the popular open-source ffmpeg library [12]. This leaves us with our final dataset: audio snippets of all instances of overlapped speech, coupled with references to their place in the original dataset transcripts. Since the dataset references the source GAP dataset, we also include the original transcripts and full group audio recordings for Groups 1 to 14.
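A minimal sketch of this extraction step is given below, assuming the utterance timestamps have already been converted to milliseconds; pydub slices audio in milliseconds and delegates decoding to ffmpeg. The example file names follow the conventions of Listing 1, and the helper name extract_snippet is ours.

from pydub import AudioSegment

def extract_snippet(group_wav, start_ms, end_ms, out_wav):
    """Cut the [start_ms, end_ms] span out of a full group recording and save it."""
    audio = AudioSegment.from_wav(group_wav)  # load the mono 22 kHz group recording
    snippet = audio[start_ms:end_ms]          # pydub slices are indexed in milliseconds
    snippet.export(out_wav, format="wav")

# Illustrative call for the first snippet of Group 1 (00:23.9 - 00:25.0).
extract_snippet(
    "audio-and-transcripts/original-dataset/group-audio/MP4 Group 1 Feb 8 429.mp4.wav",
    23_900,
    25_000,
    "audio-and-transcripts/audio/Group 1: 00:23.9 - 00:25.0.wav",
)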
5. Dataset Characteristics
The total length of audio from the 14 groups is 115 min, with individual meeting durations ranging from 3 min to 13 min. The interruption dataset includes 41 participants from the 14 contributing groups, 28 of whom are female. This results in a higher relative male representation of 32% in this subset, compared to 24% in the entire GAP dataset. Six of the participants are non-native speakers, a proportion similar to that of the full GAP dataset. Among the 355 data points, 200 represent true interruptions, a proportion of 56%, giving the dataset a relatively balanced composition. In Figure 3 and Figure 4, we provide the number of true interruption and false interruption entries contributed by participants, split by gender.
Analysis of Figure 3 and Figure 4 shows that the interruption dataset contains a diverse set of speakers. Of the 41 participants, only 3 did not contribute a true interruption, and 2 did not contribute a false interruption. Because there are more female participants, there are more female contributions to the dataset. In addition, the participants who contributed the most true interruptions (between 11 and 14 each) were all female. This may introduce a bias in the dataset towards female true interruptions.
Another potential bias is introduced by the fact that participants were recruited from the University of the Fraser Valley. All of the participants were undergraduate students, and thus, there is demographic similarity in terms of age, education level, and socioeconomic status.
6. Results
This study explores the dataset’s utility by performing an interruption classification task. We randomly split the dataset into 70%/15%/15% train/validation/test sets. As a pre-processing step, we used HuBERT to produce a sequence of 1024-dimensional embeddings for each data point’s audio. Each sequence was then averaged across the time dimension, which allowed us to evaluate the audio independently of its length, as the average length of a true interruption is much greater than that of a false interruption. Our binary classifier is a neural network with hidden layers of sizes [768, 512, 256], each followed by a ReLU activation function. The model was trained for 10 epochs with the Adam optimiser.
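The following sketch reproduces this setup under stated assumptions: the text does not name the exact HuBERT checkpoint, learning rate, batch size, or data-loading details, so the facebook/hubert-large-ll60k checkpoint (whose hidden states are 1024-dimensional), the learning rate, and the loader interface below are illustrative. HuBERT expects 16 kHz input, so the 22 kHz dataset audio would first need resampling.

import torch
import torch.nn as nn
from transformers import HubertModel

# Illustrative checkpoint; any HuBERT variant with 1024-dimensional hidden states
# matches the description in the text.
hubert = HubertModel.from_pretrained("facebook/hubert-large-ll60k").eval()

@torch.no_grad()
def embed(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Mean-pool HuBERT hidden states into a single 1024-dimensional vector.

    waveform_16k: 1-D float tensor at 16 kHz (resampled from the 22 kHz source
    and roughly zero-mean/unit-variance).
    """
    hidden = hubert(waveform_16k.unsqueeze(0)).last_hidden_state  # (1, frames, 1024)
    return hidden.mean(dim=1).squeeze(0)                          # (1024,)

# Binary classifier with hidden layers [768, 512, 256] and ReLU activations.
classifier = nn.Sequential(
    nn.Linear(1024, 768), nn.ReLU(),
    nn.Linear(768, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 1),  # logit for the 'true interruption' class
)

optimiser = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # lr is illustrative
loss_fn = nn.BCEWithLogitsLoss()

def train(train_loader, epochs=10):
    """train_loader yields batches of (pooled_embedding, binary_label)."""
    for _ in range(epochs):
        for x, y in train_loader:
            optimiser.zero_grad()
            loss = loss_fn(classifier(x).squeeze(-1), y.float())
            loss.backward()
            optimiser.step()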
Table 1 presents the results of applying this approach, using 0.5 as the threshold on the model’s output probability to classify an interruption as true or false. In binary classification tasks, metrics are conventionally reported for the dominant class. However, in our study, the ‘false interruption’ class is of particular interest: given the imbalance in our dataset, a model that indiscriminately favours the positive class would yield a misleadingly high accuracy score. Therefore, we also report metrics for the false interruption class to evaluate the model’s performance across both classes. The reported scores demonstrate the value of the dataset in allowing a model to learn features that differentiate genuine interruptions from other sounds, such as laughter, coughing, and backchannelling. This is a simple approach to the problem; we encourage researchers to experiment with multi-modal approaches and with approaches that identify an interruption without needing the complete audio, in which case a model could detect an interruption before it has finished.
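Per-class metrics of this kind can be computed as sketched below; the snippet assumes arrays of gold labels and predicted probabilities for the test split, and applies the 0.5 threshold mentioned above.

import numpy as np
from sklearn.metrics import classification_report

def report_both_classes(y_true, y_prob, threshold=0.5):
    """Report precision, recall, and F1 for both classes, not only the dominant one.

    y_true: 1 for a true interruption, 0 for a false interruption.
    y_prob: the classifier's output probability of a true interruption.
    """
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return classification_report(
        y_true,
        y_pred,
        target_names=["false interruption", "true interruption"],
    )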
7. Conclusions and Future Work
In conclusion, we successfully created an audio dataset of interruptions and developed a classification model that accurately identifies interruptions based on this dataset. The results demonstrate the model’s ability to generalise well, confirming the robustness of our dataset for interruption classification. We employed specific definitions of interruptions as delineated by Crook et al. (2012) [4] and Lin et al. (2022) [7], and our findings validate these definitions, indicating their clarity and applicability in real-world scenarios.
We encourage several directions for future research. One avenue is to develop multi-modal approaches that leverage the provided text modality, which may enhance model performance. Another direction involves creating innovative methods to identify interruptions without requiring complete audio segments, enabling real-time interruption detection and improving the model’s utility in a chatbot environment. Additionally, integrating the interruption dataset with the GAP dataset could help explore the broader implications of interruptions, such as their impact on task performance. Finally, incorporating sentiment analysis into the dataset would allow for a more detailed examination of the frequency or likelihood of interruptions based on the interlocutor’s sentiment.
Author Contributions
Conceptualization, D.D. and O.Ş.; methodology, D.D. and O.Ş.; software, D.D.; validation, D.D. and O.Ş.; data curation, D.D.; writing—original draft preparation, D.D.; writing—review and editing, O.Ş.; supervision, O.Ş. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Informed consent was obtained from all subjects during the Group Affect and Performance study.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Ström, N.; Seneff, S. Intelligent barge-in in conversational systems. In Proceedings of the INTERSPEECH, Beijing, China, 16–20 October 2000; pp. 652–655. [Google Scholar]
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Bekal, D.; Srinivasan, S.; Bodapati, S.; Ronanki, S.; Kirchhoff, K. Device Directedness with Contextual Cues for Spoken Dialog Systems. arXiv 2022, arXiv:2211.13280. [Google Scholar]
- Crook, N.; Field, D.; Smith, C.; Harding, S.; Pulman, S.; Cavazza, M.; Charlton, D.; Moore, R.; Boye, J. Generating context-sensitive ECA responses to user barge-in interruptions. J. Multimodal User Interfaces 2012, 6, 13–25. [Google Scholar] [CrossRef]
- Braley, M.; Murray, G. The Group Affect and Performance (GAP) Corpus. In Proceedings of the Group Interaction Frontiers in Technology, GIFT’18, New York, NY, USA, 16 October 2018. [Google Scholar] [CrossRef]
- Sacks, H.; Schegloff, E.A.; Jefferson, G. A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language 1974, 50, 696–735. [Google Scholar] [CrossRef]
- Lin, T.E.; Wu, Y.; Huang, F.; Si, L.; Sun, J.; Li, Y. Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 3299–3308. [Google Scholar]
- Schegloff, E.A. Accounts of Conduct in Interaction: Interruption, Overlap, and Turn-Taking. In Handbook of Sociological Theory; Turner, J.H., Ed.; Springer: Boston, MA, USA, 2001; pp. 287–321. [Google Scholar] [CrossRef]
- Yang, F.; Heeman, P.A. Initiative conflicts in task-oriented dialogue. Comput. Speech Lang. 2010, 24, 175–189. [Google Scholar] [CrossRef]
- Selfridge, E.; Arizmendi, I.; Heeman, P.A.; Williams, J.D. Continuously predicting and processing barge-in during a live spoken dialogue task. In Proceedings of the SIGDIAL 2013 Conference, Metz, France, 22–24 August 2013; pp. 384–393. [Google Scholar]
- Robert, J. Pydub. 2011. Available online: https://github.com/jiaaro/pydub (accessed on 10 July 2023).
- FFmpeg: A Complete, Cross-Platform Solution to Record, Convert and Stream Audio and Video. 2000. Available online: https://www.ffmpeg.org/ (accessed on 5 July 2023).