HearIt: Auditory-Cue-Based Audio Playback Control to Facilitate Information Browsing in Lecture Audio

Abstract: Students often utilize audio media during online or offline courses. However, lecture audio data are mostly unstructured and extensive, so they are more challenging for information browsing (i.e., chaining, linking, extraction, and evaluation of relevant information). Conventional time-level skip control is limited in auditory information browsing because it is hard to identify the current position and context. This paper presents HearIt, which provides semantic-level skip control with auditory cues for auditory information browsing. With HearIt, users can efficiently change the playback position at the paragraph level. Furthermore, two auditory cues (positional cue and topical cue) help users grasp the current playback and its context without additional visual support. We conducted a pilot study with the prototype of HearIt, and the results show its feasibility and design implications for future research.


Introduction
Audio content carries a wide range of useful information. For example, in many classrooms, students record lectures and replay them for active learning. Such audio recording and replaying are especially essential for visually-impaired people. Many people with visual impairment benefit from auditory guidance in daily activities such as studying [1], filling out a form [2], and taking pictures [3,4]. There have been many studies to support information access for visually-impaired people. Some studies have proposed tools that enable users to process information by converting written text to spoken text. For example, screen readers provide an audio-based interface to navigate a screen [5][6][7]. FingerReader [8] reads printed text aloud to make blind users aware of the information.
As audio content has become prevalent and its accessibility has improved, there have been studies on processing auditory information effectively. For example, some studies have proposed audio skimming based on the structure of original texts [1], text summarization [9], and audio processing [10]. However, auditory information browsing has been relatively under-studied. Information browsing is a kind of information exploration and refers to behaviors that combine (chain or link) relevant information and then extract and evaluate it. Information browsing is also related to active reading [11], which frequently involves seeking, highlighting, comparison, and non-sequential navigation. Conventionally, an audio seekbar helps users explore information in audio by providing playback time and a progress bar. However, it basically supports only time-level playback control (e.g., skip 10 s), so the user needs to know the specific playback times where the relevant information is located. This is especially challenging for lecture audio because lectures are usually long and contain several parts that explain similar but distinct topics.
This work proposes HearIt, a tool for auditory information browsing based on semantic-level playback control with auditory cues. HearIt is designed to support key components in the behavioral model of information browsing: (1) linking and chaining and (2) extraction and evaluation. First, to facilitate users' linking and chaining of data, it provides paragraph-level skip control that repositions the current audio playback based on semantic chunks. Such semantic-based audio playback control can help users find relevant contexts for linking and chaining more effectively than existing time-level skip controls (e.g., skip forward by 10 s). Second, to help users quickly grasp the context of the current playback and evaluate it without any visual support, HearIt provides two auditory cues: positional cue and topical cue. The positional cue is the paragraph number, which conveys the current position. The topical cue is a set of spoken keywords representing a paragraph, which helps users determine whether to keep listening to it.
We conducted a user study to evaluate the effectiveness of HearIt in a within-subject experiment. Twelve participants were asked to perform an auditory information browsing task in which they explored a lecture audio and pointed out the specific positions relevant to given contexts. The participants used three browsing methods to perform the task: (1) HearIt, (2) Partial-HearIt, which provides the same function as HearIt, except for the topical cue, and (3) Baseline, without the paragraph-level skip and auditory cues. We statistically compared the efficiency (task completion time and accuracy) and the usability of the three methods. We also conducted a survey that contained open-ended questions about the experiences with the browsing methods.
The results show that the proposed method makes auditory information browsing without visual support significantly faster while maintaining accuracy. The participants with HearIt mostly found the correct answer in a shorter time compared to those with Baseline. Most of the participants said that the paragraph-level skip is very efficient for reaching the targeted position by skipping irrelevant parts of the audio. Furthermore, the two auditory cues play a crucial role in enabling users to quickly grasp the current context. Our study addresses the need to study auditory information browsing and provides design implications for further research.

Audio-Based Information Behavior
Audio data are closely related to information behaviors in online learning. Many online courses provide lecture audio, and students often record offline lectures by themselves. There have been many studies about the use of lecture audio data. O'Callaghan et al. [12] addressed the advantages and necessity of recorded lecture audio. Lecture audio can effectively supplement the conventional offline learning process. For example, with lecture audio, the students can overcome constraints due to space and time, and lecturers can improve their teaching methods by monitoring the lecture and analyzing their vocabulary selection [13]. In addition, audio media is often used for note-taking. Nakayama et al. [14] examined the effectiveness of audio-based note-taking during the online course. Similarly, Audio Notebook [15] aimed at capturing knowledge directly from conversations as a physical device to facilitate note-taking by audio structuring techniques coupled with the note-taking behavior.
Earlier studies revealed that audio media could improve general asynchronous communication [16]. For example, Voicelist was designed to overcome widespread communication challenges such as cost, textual literacy, and data connections through an interactive voice response [17]. In particular, studies have shown that auditory information can help academic interactions during learning. Auditory information can support interaction and enhance learning opportunities [18]. Merry et al. [19] investigated the modality of academic feedback in e-learning environments and found that students prefer auditory feedback to written text. That was because auditory feedback is mostly more detailed and personalized, so it is easier to understand. In addition, auditory feedback can support more versatile communication, so the instructors can reduce their cognitive and physical loads when recording the feedback [20].
Finally, audio-only media can create spaces for information sharing. Ackerman et al. [21] argued that audio-only media have much potential for creating a useful social space. From an audio usage perspective in a collaborative space, Metatla et al. [22] found empirical evidence that audio can support non-visual collaboration and collaborators' interactions by helping them extract information about others' actions and current positions. Wang et al. [23] used audio as the interaction medium for a wiki. The audio version helps users navigate the linear structure of the wiki.

Accessibility to Auditory Information
Many studies have improved information accessibility via audio media. For example, Text-to-Speech (TTS) [1] increases the accessibility to textual content in smart devices. Screen readers convert digital text on the screen into spoken text, enabling users to navigate the web interface and access the text in it, such as documents, menus, icons, and web pages without sight [1,7,24]. BlindReader [25] is designed to help a visually-impaired reader understand the materials effectively by using haptic feedback that provides a sense of touch.
In addition, much attention has been paid to the accessibility of printed textual materials such as books and newspapers. Many studies have proposed user interfaces for delivering printed information to visually-impaired people. For example, finger-worn designs for blind users have been proposed to control reading in comfortable ways [8,26,27]. FingerReader [8], a wearable device with a small finger-worn form, helps blind users read printed text by scanning a single line and reading out the words as synthesized speech as the finger moves along. Furthermore, there have been studies to interpret other types of visual materials (e.g., graph, map, and picture). OrCam [28] is a voice-activated device that attaches to virtually any glasses. It helps blind users live a more independent life by reading book text and smartphone screens and by recognizing faces. Access Lens [29] harnesses computer vision technologies to enable users to use accessible gestures on paper documents and other physical objects, such as product packages. Given the printed image material, Access Lens locates the text and reads specific content where the fingertips touch.

Auditory Information Processing
People often use skimming, which quickly identifies the gist or general idea of a large volume of contents, and it has been known to be helpful in learning [30][31][32]. However, unlike the visual contents, skimming on the audio contents is more challenging because information is processed linearly and sequentially. Combining and extracting relevant information is very limited [33].
There have been computational tools for efficient information processing. For example, some studies proposed audio skip controls based on the structures of the audio content. Tyflos [34] utilizes the pyramid structure of the document for guiding users from overview to the details. In addition, Digital Accessible Information SYstem (DAISY) [1] uses the structure of the documents like paragraphs, headings, or sections. With DAISY, users can control the current playback position by the level of the document components, as long as the input audio can be formatted. Similarly, Job Access With Speech by Freedom Scientific (JAWS) is a screen reader that reads out the first sentence of each paragraph sequentially to understand the entire text quickly [35,36]. However, according to the study results [35,36], each paragraph's first sentence is not sufficient for understanding the context.
Alternatively, other studies have summarized audio content by highlighting essential parts. Summate [9] is a Firefox-based tool that summarizes web pages and presents the summary in an alert box for blind individuals. AcceSS [37] removes the clutter and retains the important sections to give the user a preview of the page. Some studies proposed a skimming method based on audio processing; for example, Imai et al. [10] proposed a method of extracting essential parts of audio content based on voice pitch. However, from the perspective of information browsing, the prior studies have some limitations. For information browsing (e.g., studying with an audio lecture), it is important to find the relevant details and link them to a specific context by exploring the audio content. Because information browsing is a dynamic process, the existing methods, which mainly focus on quickly delivering the entire content, could be limited. In addition, relying on the text structure is limited because not all audio content can easily be put into a specific format. Furthermore, summarizing the entire text may not be effective because a summary cannot preserve the mood and style of the original text [36] and does not by itself offer playback controls that respond to the users' dynamic information needs. This work focuses on auditory information browsing and explores the supportive design for it.

Design Space for Auditory Information Browsing
Waterworth and Chignell [38] explained information exploration behavior by searching and browsing. Searching is a behavior that starts with target identification followed by query formulation, search, extraction, and evaluation. On the other hand, browsing is a behavior that begins with context, followed by chaining or linking, and then by extraction and evaluation. The existence of the explicit query (specificity of information needs) distinguishes searching and browsing.
This work aims to support information browsing behavior on lecture audio data. Figure 1 presents the design space for an auditory information browser. We considered four design subspaces according to two dimensions: (1) the level of the playback skip control and (2) the type of cue about the current playback. First, the design for the auditory information browser can be specified by the level of the playback skip control. The user should control the audio playback position to find and link relevant information. The time-level skip control (e.g., skip 10 s) has been widely used in conventional audio seekbar, but it does not consider the current playback contexts. Some studies have proposed semantic-level skip control. For example, Yang [39] presented the segment-level keyword search function by the timeline representing the linear-structure of the audio content and visualizing relevant keywords.
Second, providing appropriate cues is important in the design to increase awareness of the current playback. Audio content has diverse advantages, but it is often hard to identify the current playback position and its surrounding context because the user depends on hearing only and processes the content sequentially. Therefore, the auditory information browser should provide appropriate feedback about the audio playback.
We classified the awareness cue into two modality types: (1) visual cue and (2) auditory cue. There are many graphical user interfaces for audio playback. The most common one is an audio seekbar. It can help users monitor the current playback and easily move to the targeted position using its slider bar. In addition, there have been graphical interfaces for visualizing keywords in audio [39]. Most of the studies in this design subspace utilize texts corresponding to the audio content.
On the other hand, the auditory cue can be helpful in particular situations. The auditory cue can be delivered easily, even on a simple device without a screen, and it also allows users to multitask. It would be especially helpful for the information behavior of visually-impaired people. However, providing auditory cues for information browsing is more challenging. Information browsing behavior requires more dynamic processes, and the limitations become more serious when audio content is not well structured or does not have available metadata.
In this design space, HearIt aims to provide semantic-level audio skip control with auditory cues. The following section describes the details about the implementation of HearIt. Figure 2 represents the prototype of HearIt. HearIt has a pen shape to reduce the interface's complexity for audio controls by utilizing intuitive gestures instead of more buttons. HearIt has a play/pause button and a scroll wheel for the playback control. In addition, its pen point is used to recognize gestures of drawing lines. HearIt is designed to support key components of information browsing behavior [38]: (1) linking and chaining and (2) extraction and evaluation. First, HearIt provides paragraph-level skip control to facilitate efficient chaining and linking by enabling the user to skip forward or backward in semantic chunks. In addition, it provides auditory cues that enable the user to extract and evaluate the information based on a quick understanding of its positional and topical context.

Paragraph-Level Skip Control
The audio seekbar is a standard interface for playback control. With the seekbar, the user can monitor the current position and quickly change it to a specific position by placing the slider there. However, such a GUI-based audio seekbar is not suitable for visually-impaired people because they have limited ability to control the slider bar. Therefore, users with visual impairment mostly depend on the time-level skip control (e.g., skip forward by five seconds). With the time-level skip control, it takes a long time to move the playback position to the context the user wants to reach, and this can cause difficulty in the chaining and linking process. Figure 3 shows the time-level skip control and paragraph-level skip control in HearIt. HearIt provides the paragraph-level skip control that allows the user to move to the beginning of another paragraph, so the user can quickly reach the targeted playback position by skipping irrelevant audio parts. Specifically, a HearIt user can begin to play the audio with the play button. Next, scrolling the wheel changes the playback position to the beginning of the previous or next paragraph. It is also possible to skip multiple paragraphs by holding the scroll wheel in a direction for two or more seconds. HearIt also supports time-level skipping for fine-grained control like a typical seekbar. The gesture of drawing a line controls the time-level skip. The direction of the line determines whether to skip forward or backward, and its length determines the length of the skip interval.
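The two skip controls described above can be illustrated with a minimal sketch. It assumes that paragraph start times are available as a sorted list of seconds; the function names, the sample timestamps, and the clamping at the first and last paragraph are illustrative assumptions rather than details of the HearIt prototype.

```python
from bisect import bisect_right


def skip_paragraph(starts, position, step):
    """Return the start time of the paragraph `step` paragraphs away.

    starts   -- sorted paragraph start times in seconds (hypothetical data)
    position -- current playback time in seconds
    step     -- +1 for the next paragraph, -1 for the previous one;
                larger magnitudes model holding the scroll wheel
    """
    # Index of the paragraph that contains the current position.
    current = bisect_right(starts, position) - 1
    # Clamp so that skipping never leaves the audio (an assumption).
    target = max(0, min(len(starts) - 1, current + step))
    return starts[target]


def skip_time(position, seconds, duration):
    """Time-level skip: move by `seconds` (negative = backward), clamped."""
    return max(0.0, min(duration, position + seconds))
```

For instance, with paragraph starts at 0, 95, 240, and 410 s and playback at 130 s, one forward scroll would move to 240 s, whereas a 10 s time-level skip would only reach 140 s.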

Auditory Cues
Without a visual audio seekbar, the awareness of the current playback position is limited. It is also hard to know how far the targeted position that the user wants to check is from the current position. Moreover, unlike visual processing, nearby information cannot be simultaneously processed, and this is not effective for information browsing, especially in extracting key ideas and evaluating their relevance.
In HearIt, two auditory cues, (1) positional cue and (2) topical cue, give hints about the context of the current position. The positional cue is a paragraph number that tells the user where the current playback is, and the topical cue is a set of keywords extracted from the current paragraph. The topical cue helps the user overview the current paragraph, evaluate its relevance, and determine whether to hear more. The keywords for each paragraph are extracted following [40]. First, all the nouns are extracted by morphological analysis with the konlpy module. Next, each term is weighted by TF-IDF (term frequency and inverse document frequency), and the seven terms with the highest weights are selected for each paragraph. The weight $w_{t,p}$ of the term $t$ in the paragraph $p$ is as follows:

$$w_{t,p} = TF_{t,p} \times \log\left(\frac{N}{DF_t}\right)$$

where $TF_{t,p}$ is the frequency of $t$ in $p$, $DF_t$ is the number of paragraphs containing $t$, and $N$ is the total number of paragraphs. Note that a term is excluded if it occurs only once. When the paragraph-level skip control moves the playback position, HearIt reads out the two auditory cues before the main audio. The positional cue is spoken aloud first, followed by the set of keywords as the topical cue. The user hears the topical cue's keywords one by one, starting with the highest weighted term. It is also possible to stop the cue playback and jump to the main audio by drawing a line from left to right.
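The keyword selection can be sketched as follows. This is only an illustration under stated assumptions: it takes the common TF × log(N/DF) form of TF-IDF, it assumes that nouns have already been extracted for each paragraph (the morphological analysis step is not reproduced here), and it interprets the "occurs only once" rule as applying within a paragraph. The function names and the spoken "Paragraph n" wording are hypothetical.

```python
import math
from collections import Counter


def topical_cues(paragraph_nouns, top_k=7):
    """Rank each paragraph's nouns by TF-IDF and keep the top_k terms.

    paragraph_nouns -- list of noun lists, one per paragraph (assumed to be
                       the output of a separate noun-extraction step)
    """
    n = len(paragraph_nouns)
    # DF: number of paragraphs that contain each term.
    df = Counter()
    for nouns in paragraph_nouns:
        df.update(set(nouns))

    cues = []
    for nouns in paragraph_nouns:
        tf = Counter(nouns)
        # Assumption: terms occurring only once in the paragraph are dropped.
        weights = {t: c * math.log(n / df[t]) for t, c in tf.items() if c > 1}
        # Highest weight first; ties broken alphabetically for determinism.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        cues.append(ranked[:top_k])
    return cues


def cue_sequence(paragraph_index, keywords):
    """Order of spoken cues: positional cue first, then keywords by weight."""
    return [f"Paragraph {paragraph_index + 1}"] + list(keywords)
```

Under this weighting, a term that appears in every paragraph receives a weight of zero and therefore ranks last among the retained terms.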

Methodology
We conducted a pilot study to evaluate the feasibility of HearIt. In this study, we compared three variants: (1) HearIt, (2) Partial-HearIt, which provides the same function as HearIt, except for the topical cue, and (3) Baseline, without the paragraph-level skip and auditory cues. We designed a repeated measures experiment in which each participant experienced the three methods. The participant was asked to explore lecture audio and find the specific positions related to the given contexts (searching without explicit queries). We prepared three audio files of a graduate lecture on AI (about 20 min each). Because we simply recorded actual slide-based lectures at a university, the audio did not have an explicit structure at first. To structure the audio, we first partitioned it by the page numbers of the slides that the audio covers. Next, because several partitioned audio intervals were longer than four minutes, we further divided them based on the context. Finally, each partitioned audio interval was regarded as a paragraph, and there were 31 paragraphs in total from the three audio lectures. Under the lecturer's supervision, we selected three contexts to be browsed per lecture audio and the playback positions for each context (ground truth). An example of the contexts is "Differences between rule-based systems and machine learning".
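The structuring step can be sketched in a few lines. The sketch assumes that slide-change timestamps are known and that the context-based subdivision of long intervals is supplied by hand as extra split points; the function name, its parameters, and the sample timestamps are hypothetical.

```python
def build_paragraphs(slide_starts, duration, manual_splits=(), max_len=240.0):
    """Turn slide-change times into paragraph intervals.

    slide_starts  -- times (s) at which each slide begins (assumed known)
    duration      -- total audio length in seconds
    manual_splits -- extra context-based split points chosen by a person
    max_len       -- intervals longer than this (four minutes) are flagged
    Returns (paragraphs, too_long), both lists of (start, end) tuples.
    """
    bounds = sorted(set(slide_starts) | set(manual_splits))
    ends = bounds[1:] + [duration]
    paragraphs = list(zip(bounds, ends))
    # Intervals that still exceed the limit and need further manual division.
    too_long = [(s, e) for s, e in paragraphs if e - s > max_len]
    return paragraphs, too_long
```

Running it once without manual splits lists the intervals that exceed four minutes; adding a split point inside each flagged interval and rerunning yields the final paragraphs.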

Procedure
First, we held an orientation to explain the procedure and the methods. The participants had time to get used to controlling audio playback with the three methods. After the orientation, the participants performed three sessions, and in each session, they were asked to use one of the methods for auditory information browsing. At the beginning of the session, the participant listened to one of the audio files without any interruption. Next, the participant browsed the audio with the given method and was asked to find the specific playback positions corresponding to each context within five minutes. We blocked the participants' sight with an eyepatch so that they performed auditory information browsing without any visual support. A session finished when the participant had pointed out all the positions or the time was up. After the three sessions, each with a different method and a different lecture audio, the participant was asked to respond to the survey.

Participants
We recruited 12 participants via the school bulletin board, as shown in Table 1. The participants were 2 males and 10 females, and their age was 22.9 on average. They were undergraduate students majoring in IT-related fields who had never taken the course we used for this study. We randomized the assignment of audio files and the order of the browsing methods to control for differences in audio difficulty and for order effects. We note that the number of participants in our pilot study still satisfied the minimum requirement for the statistical test. G*Power (http://www.gpower.hhu.de/, accessed on 10 September 2020), a power analysis application, indicated that a sample size of 12 is enough to detect a large effect ($f^2 > 0.4$) with 80% power ($1 - \beta$) at an alpha level of 0.05. In addition, earlier HCI studies [41,42] used 12 participants in within-subject (repeated measures) experiments to evaluate task performance with an apparatus.

Data Analysis
We analyzed experimental data in quantitative and qualitative ways. First, we quantitatively compared the following.

• Task completion time: The seconds to point out all playback positions corresponding to three given contexts.
• Accuracy: The percentage of the correct playback positions against the ground truth. We regard a position marked by the participant as correct when it is within five sentences from the ground truth.
• Usability: System Usability Scale (SUS) score on ten 5-point Likert scale items [43].
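The accuracy and usability measures can be made concrete with a short sketch. It assumes that positions are expressed as sentence indices and that an unanswered context is recorded as None; the SUS computation follows the standard scoring rule for the scale (odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5). The function names are illustrative.

```python
def accuracy(marked, truth, tolerance=5):
    """Percentage of marked positions within `tolerance` sentences of truth.

    marked -- sentence index chosen for each context, or None if unanswered
    truth  -- ground-truth sentence index for each context
    """
    correct = sum(
        1
        for m, t in zip(marked, truth)
        if m is not None and abs(m - t) <= tolerance
    )
    return 100.0 * correct / len(truth)


def sus_score(responses):
    """Standard SUS score (0-100) from ten responses on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten item responses")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, ... are positively worded; 2, 4, 6, ... negatively.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5
```

With three contexts per session, each participant's accuracy under this definition takes one of four values: 0, 33.3, 66.7, or 100%.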
We conducted a one-way repeated measures ANOVA (RM ANOVA) to compare the three methods on the task completion time, accuracy, and usability. Levene's test indicated that none of the variables violates the assumption of homogeneity of variance. When the ANOVA result was significant, we conducted a series of two-tailed paired t-tests with Bonferroni correction as a post hoc test. All the hypothesis testing was performed with a significance level of 5%.
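For readers unfamiliar with the procedure, the following sketch shows how the F statistic of a one-way RM ANOVA is obtained and how a Bonferroni correction is applied to post hoc p-values. It is a textbook formulation written with the standard library only, not the analysis script used in the study; obtaining a p-value from F would additionally require the F distribution from a statistics package.

```python
def rm_anova_f(data):
    """One-way repeated measures ANOVA F statistic.

    data[i][j] -- measurement of participant i under condition j
    Returns (F, df_conditions, df_error).
    """
    n, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Between-subject variability is removed from the error term.
    ss_err = ss_total - ss_cond - ss_subj

    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    f = (ss_cond / df_cond) / (ss_err / df_err)
    return f, df_cond, df_err


def bonferroni(p_values, alpha=0.05):
    """Flag which comparisons stay significant after Bonferroni correction."""
    m = len(p_values)
    return [min(1.0, p * m) < alpha for p in p_values]
```

With three methods there are three pairwise comparisons, so each raw p-value is multiplied by three before being compared with the 5% level.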
On the other hand, the survey responses to the open-ended questions were qualitatively analyzed. In the survey, the participants were asked to describe limitations in auditory information browsing without visual supports and compare their browsing experience with the browsing methods.

Auditory Information Browsing without Visual Supports
In the survey, all the participants mentioned that they felt considerable limitations in auditory information browsing without visual support. Many participants reported that tracking the current playback is difficult. For example, P7 said, "It was hard to figure out how much the playback position needs to be moved to where I wanted to go." Another participant (P10) mentioned, "(When skipping multiple intervals) It was difficult to identify how far the position was moved from the start". In addition, it tended to be challenging to grasp the current context quickly. For example, P8 commented, "It was confusing because I had to deal with both what I was hearing and what I wanted to go". P11 said, "I had to be highly concentrated because it requires to process the parts of the lectures sequentially". P4 also said, "It was hard to memorize audio contexts because I could not map each context to the slider bar's position (in an audio seekbar)".

Efficiency: Task Completion Time and Accuracy
RM ANOVA revealed a significant and considerable effect of the browsing method on task completion time. As shown in Figure 4, the task completion time with Baseline is significantly longer than Partial-HearIt and HearIt. This indicates that paragraph-level skip control increases the speed in browsing. However, there was no significant difference between Partial-HearIt and HearIt, possibly meaning that the type of auditory cue we considered does not affect the time. HearIt achieved the highest accuracy on average (91.7%), followed by Partial-HearIt (88.9%) and Baseline (80.6%), even though there was no significant difference.
Our qualitative data supplement the results. First, all the participants preferred the paragraph-level skip control over the time-level skip control. There were many responses that the paragraph-level skip control is convenient to explore the information quickly. For example, P1 said, "It helped me skip the unnecessary information when I remember the content to some extent". P11 also commented, "I could organize the content of information easily in my head when I use paragraph-level skip control". One participant (P7) mentioned that the time-level skip is still needed by saying, "For the more sophisticate playback control, the time-level skip control is still useful".

Usability: SUS Score and Survey Responses
Second, all the participants liked to use the auditory cues in their auditory information browsing. The positional cue is helpful when the user approximately remembers the targeted position related to the given context. P4 said "I could skip the unnecessary paragraph quickly based on the positional cue". Similarly, P8 commented, "Because I listened to the entire audio at first, the paragraph number is enough for me to know what topics were covered here".
However, most of the participants (91.6%) preferred using both additional cues. The topical cue helped the participant quickly evaluate the relevance of the current playback position. P12 said, "With the topical cue, I could be more confident in guessing the topics of the following part". P5 commented, "Topical cue helped me explore what I forgot". Finally, the participants mostly heard the top three keywords in the topical cues and decided whether to skip or play its main audio.

Discussion
The study results reveal the limitations in auditory information browsing without visual support in terms of monitoring the playback position and understanding the current context. HearIt helps the users overcome these limitations by reducing the task completion time while maintaining accuracy. Furthermore, the users felt more comfortable controlling the audio playback as they wanted. The users also became more confident in their guesses about the following audio content when deciding whether to keep playing.
In this study, we focused on supporting information browsing, while information exploration should be a flexible combination of browsing and querying [38]. Further studies for supporting and balancing both browsing and querying would be helpful. For example, using the audio acceleration and bookmarking functions in [1] can be considered with the paragraph-level skip and the auditory cues. In addition, the current HearIt uses auditory feedback, but it is possible to utilize other modalities, such as leveraging haptic or tactile interaction. For example, [44] proposed wearable devices utilizing vibration that helps visually-impaired people communicate with information. We think that there is much potential for utilizing multimodal interaction in information exploration.
In addition, the current topical cue can be improved in several ways. First of all, it is possible to examine the ideal number and structure of keywords for topical cue to save the time to play them. The keywords' extraction can also be improved, for example, utilizing text features with audio features (e.g., voice pitch [10]). Lastly, the current term weighting is based on the original text's static structure (e.g., paragraph), but it would be possible to use semantic structures by advanced text analysis techniques [45,46].
Finally, we note that the results of this study should be carefully interpreted. The number of participants in the pilot study may not be sufficient to prove the general effectiveness. In addition, we conducted a lab-controlled experiment, so further observation of in-situ behavior with the methods would be necessary. Nevertheless, we believe that our study results show the feasibility of the proposed method and directions for further improvement.

Conclusions
This paper presents HearIt, which supports auditory information browsing by providing the paragraph-level skip control and two auditory cues. Our pilot study with 12 participants showed the potential of the proposed browsing methods and possible research directions. In recent years, auditory information access has been considerably improved, thanks to the advances in smart handheld devices and AI technologies. However, there is still much room for improvement in fostering specific information behavior. We believe that it will be helpful to pay attention to understanding detailed types of information behavior and designing specialized supports for each.

Conflicts of Interest:
The authors declare no conflict of interest.