The Novice, the Expert, and the Algorithm: A Comparative Analysis of Human Expertise Transfer and AI Performance in Audio-Only Gaming Environments

Khan, Ibrahim; Nguyen, Thai Van; Juraj, Cvetković Tijan; Thawonmas, Ruck

doi:10.3390/app152111594

Open AccessArticle

The Novice, the Expert, and the Algorithm: A Comparative Analysis of Human Expertise Transfer and AI Performance in Audio-Only Gaming Environments

¹

Graduate School of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Japan

²

Laboratory of Agent Based Social Simulation, Institute of Cognitive Science and Technology, Consiglio Nazionale delle Ricerche Roma, 00185 Roma, Italy

³

College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Japan

^*

Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(21), 11594; https://doi.org/10.3390/app152111594

Submission received: 30 September 2025 / Revised: 26 October 2025 / Accepted: 28 October 2025 / Published: 30 October 2025

(This article belongs to the Special Issue Applications of Artificial Intelligence and Machine Learning in Games: 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

This study provides a symmetrical, cross-genre comparison of human expertise transfer and “blind” artificial intelligence (AI) performance in audio-only gaming environments. Although previous research has focused on human performance in audio games and the feasibility of blind agents trained on auditory inputs separately, a direct comparison of these two forms of expertise is missing. We fill this gap with a robust experimental design, involving 37 human players (aged 18–44), grouped by gaming experience and specialized blind AI agents. We measured key performance variables, including win ratios, health differences, and task completion times across two genres: a fighting game (DareFightingICE) and a first-person shooter (SonicDoom). Our findings show a complex, task-dependent relationship. In DareFightingICE, expert humans (73.0% win ratio) significantly outperformed the AI (54.0% win ratio), demonstrating effective cognitive transfer. Meanwhile, the AI’s performance matched the overall human average (54.0% vs. 53.0%). Conversely, in SonicDoom, AI achieved superhuman speed in simple tasks (1.55 s vs. 5.35 s) but underperformed compared to expert humans in complex scenarios, highlighting that the AI’s proficiency is specialized but fragile, whereas human expertise is more robust and adaptable. The results provide practical insights for audio-rich game design and highlight the crucial need for AI models beyond reactive policies.

Keywords:

audio-only gaming; game accessibility; reinforcement learning; human–AI comparison; expertise transfer; auditory display; player performance

1. Introduction

The auditory dimension of interactive entertainment has evolved from a supplementary feature to a critical element of gameplay, immersion, and accessibility [1,2]. Most players find that sound improves their visual experience, but for a growing community of players with visual impairments and in the early stages of audio-only games, sound becomes the main way they perceive and navigate [3]. This reliance on non-visual modalities creates a unique problem space that not only drives innovation in accessible design [3], but also offers a powerful, constrained environment for investigating the fundamental principles of auditory cognition, skill acquisition, and intelligent decision making in both humans and machines.

In this audio-centric paradigm, two distinct forms of expertise emerge. The first is human cognitive transfer, where experienced video game players leverage visually honed skills, such as spatial awareness, timing, and strategic planning, and adapt them to an audio-only context. Research shows that expert players develop robust cross-modal representations of game mechanics, allowing them to retain effectiveness even when deprived of vision [4,5,6]. However, the robustness and generalizability of this transfer remain open questions. It is unclear to what extent the cognitive advantages of gaming experience are universal or tightly coupled to the visual modality, a critical distinction for designing audio systems that are intuitive for players with pre-existing gaming skill sets.

The second form of expertise is that of artificial intelligence, specifically “blind” reinforcement learning (RL) agents trained exclusively on auditory input. While RL has achieved superhuman performance in visually complex games, like Atari and StarCraft II [7,8], the development of agents operating on raw audio represents a specialized frontier [9]. These agents learn auditory policies natively, without prior visual experience, offering a fundamentally different model of skill acquisition. While such AIs are expected to excel at tasks requiring frame-perfect reactions and pattern recognition, their proficiency in tasks demanding abstract spatial reasoning and adaptive strategy is less certain [10,11].

A significant gap exists in the literature at the confluence of these two domains. While studies have explored human performance in audio games and separate research has demonstrated the feasibility of blind AI agents, a direct, symmetrical comparison is conspicuously absent. We lack a holistic understanding of how the transferred skills of human experts compare to the natively learned policies of specialized AIs, particularly across genres demanding different cognitive and reflexive abilities. This paper addresses this gap by positing a core question: Do humans and AI excel at the same types of audio-based tasks, and where do the strengths of human cognitive transfer and machine-learned policies diverge?

In this sense, we introduce a fully symmetrical, two-genre comparative framework. We evaluate the performance of human players, stratified into Somewhat familiar, Familiar, and Very familiar groups, and two specialized blind AI agents in both a fighting game (DareFightingICE [12]) and a first-person shooter (FPS) (ViZDoom [13]). These genres were purposefully selected to probe distinct skill sets: the fighting game emphasizes precise timing and reaction to discrete audio cues. At the same time, the FPS demands continuous spatial reasoning and navigation within a complex soundscape. This dual-genre design enables us to move beyond simply asking if an AI can play with audio to examining how its performance profile compares to that of humans and whether these profiles are genre-dependent.

This study makes the following primary contributions to the fields of game accessibility, human–computer interaction, and artificial intelligence:

We present the first symmetrical, cross-genre experimental design that directly compares the performance of blind AI agents against human players of varying skill levels in audio-only gaming environments.
We test the generalizability of human gaming expertise to the auditory domain, providing evidence on how high-level skills transfer across sensory modalities.
We offer insights into a set of actionable design strategies to improve the effectiveness and accessibility of audio-first games.

To investigate this, our study’s primary independent variables (those manipulated by the researchers) are the Player Type (Human vs. Artificial Intelligence) and the Game Genre (a fighting game, DareFightingICE, vs. a first-person shooter, SonicDoom). For the human component of this study, we also manipulated the sensory modality as a within-subjects factor (a full vision condition vs. an audio-only condition). A key quasi-independent variable we measured was the self-reported gaming experience of our human participants, categorizing them as ‘Somewhat familiar,’ ‘Familiar,’ or ‘Very familiar’ with the genres. The primary dependent variables (those measured to assess performance) were objective in-game metrics, specifically the win ratio and average health point difference in the fighting game, as well as the task completion time, survival time, and kill/death counts in the first-person shooter. These were supplemented by subjective data from the Game User Experience Satisfaction Scale questionnaire.

2. Related Work

2.1. Sound in Video Games

Video games are increasingly accepted as complex artistic creations that incorporate a variety of sensory and design elements [14]. Fundamentally, they operate as audio–visual systems that form interactive relationships with players [15,16]. Although visuals are usually the first element encountered, games rely on various mechanisms to maintain engagement, such as narrative, level design, and progression systems [17]. Notably, “sound” has emerged as an important element in sustaining engagement. The role of sound design in gaming extends far beyond simply providing background music [18]. It actively shapes the experience by reinforcing what is happening in the moment, hinting at what is to come, and guiding the player’s focus. This dual role strengthens immersion and facilitates a personal connection between the player and the world of the game [19].

Game audio’s role in creating a sense of being immersed in a virtual world is a widely recognized contribution. This is commonly defined as a state of deep mental involvement, where players lose awareness of the real world and experience a heightened sense of presence in the fictional one [20]. Moreover, studies show that audio plays a crucial role in sensory immersion, whereby the combination of sound and visuals overwhelms real-world stimuli and directs player focus toward the game [20,21]. When designed effectively, audio can make the virtual environment seem more realistic, giving players the feeling that they are genuinely “inside” the game.

Beyond immersion, sound has a considerable impact on the broader player experience, which includes enjoyment, engagement, flow, and social interaction [22]. Designers have significantly advanced the quality and complexity of game audio in recent years, recognizing that a richer auditory environment contributes directly to a more compelling and memorable gaming experience. A key way in which audio contributes to this experience is through its impact on enjoyment and engagement. Sound effects heighten excitement, satisfaction, and memorability during gameplay. In addition, the literature shows that background sound effects, in particular, enhance immersion and elicit positive emotions [22,23]. Thus, an engaging soundtrack or precisely timed sound effects can transform a routine sequence into an emotionally resonant highlight, amplifying the immediacy and long-term satisfaction of play.

2.2. Sound and Performance in First-Person Shooter and Fighting Games

In high-intensity genres, such as first-person shooter (FPS) and fighting games, sound serves to enhance performance. In FPS games, for example, sounds such as footsteps, weapon fire, and ambient noise are central to spatial awareness and situational judgment. Spatial audio techniques, including the Head-Related Transfer Function (HRTF), have been shown to enhance the ability to localize sounds in competitive games, such as Overwatch and Counter-Strike: Global Offensive [24]. Further research highlights that spatialized sound enhances player ability to detect threats and improves situational perception [25]. As FPS performance relies on rapid decision making, reaction time, and stress management, even subtle improvements in perceptual clarity can impact outcomes. Studies also show that physiological responses, such as elevated heart rate and increased cortisol levels, affect performance during competitive play [26], suggesting that future integration of adaptive audio and biofeedback systems could optimize skill performance. In addition, the performance of players in FPS games is closely related to cognitive and perceptual demands. Real-time decision making, rapid reaction speed, and environmental monitoring are crucial to success [27]. Even minor improvements in perceptual clarity through enhanced audio and visual feedback can influence gameplay outcomes [28]. This parallels findings in traditional sports, where high performers rely on multisensory integration and practiced responses to optimize performance under pressure [29].

Fighting games also make extensive use of sound to reinforce gameplay dynamics and enhance player performance. Sounds are often used to distinguish between characters, highlight attack sequences, or indicate successful hits. Games such as Tekken 7 [30], Mortal Kombat 11 [31], and Killer Instinct [32] incorporate a wide range of sound effects to enhance player responsiveness, including distinct character-based audio features and stage completion indicators. More recently, Street Fighter 6 has introduced advanced audio design elements, such as distance indicators and sound variations that correspond to attack strength. This feature helps players to gauge combat situations more precisely. These auditory layers improve clarity and add depth to the competitive experience by reinforcing timing, spacing, and tactical decision making.

Beyond static sound effects, recent studies have explored adaptive approaches to sound in fighting games. For example, Khan et al. [33] investigated the use of adaptive background music (BGM) as a dynamic gameplay element, demonstrating that a deep learning agent performed significantly better when supported by adaptive BGM than when relying on static sound design. Such findings highlight the potential of responsive, context-sensitive audio systems to influence performance outcomes and facilitate deeper engagement.

Together, FPS and fighting games demonstrate how sound functions as one of the main gameplay mechanics rather than as a supplementary enhancement. Although FPS research emphasizes situational awareness and perceptual clarity, fighting game design showcases the role of sound in timing, spacing, and feedback.

2.3. Accessibility in Video Games

The implementation of sound as a crucial component in video games has long been seen as a way of making them more accessible, particularly for players with visual impairments [34]. As well as improving immersion, sound can also be used for navigation and providing feedback, enabling players to engage with game environments without relying purely on visual cues. For many visually impaired players (VIPs), audio is the main way information about the spatial layout, how to interact with objects, and the mechanics of the game is conveyed. This effectively transforms sound from a supportive element into a fundamental part of how the game is played (REF). Early accessible games, such as Shades of Doom [35], AudioQuake [36], and Terraformers [37], led the way in demonstrating how auditory design could effectively communicate spatial orientation, object distance, and environmental details. They relied on techniques such as stereo panning, directional tones, sonar-like pulses and musical earcons to translate three-dimensional virtual spaces into auditory experiences. These experiments demonstrated that gameplay could be fully realized through audio alone, challenging the long-standing assumption that video games are primarily visual media.

As accessible design practices matured, researchers and developers introduced more advanced systems to refine these auditory strategies. Tools such as NavStick [38] and Surveyor [39] built on earlier methods by providing real-time spatial feedback and line-of-sight guidance to help players identify unexplored areas within virtual environments. These innovations reduced cognitive load for VIPs and supported more efficient navigation for sighted players, demonstrating that inclusive design can enhance the overall gaming experience [40]. Similarly, the integration of voice-over narration and layered audio cues in mainstream titles has made menus, tutorials and complex environments more navigable for a wider audience.

2.4. Audio and Player Performance

The literature has shown that sound design extends beyond aesthetic immersion to directly impact player performance. Audio cues can take the role of reaction triggers, navigational aids, or timing mechanisms that enable faster and more accurate responses to in-game events. In rhythm action games, for instance, players rely heavily on musical synchronization to optimize timing and performance [41]. In fast-paced genres, such as fighting games and first-person shooters, auditory feedback plays a similar role by helping players anticipate attacks, locate opponents, and make split-second tactical decisions [42]. The effectiveness of audio in performance contexts is further supported by work on training and expertise. Experienced players demonstrate improved auditory recognition and response times compared to novices, suggesting that auditory expertise develops through repeated exposure and practice [43]. Moreover, studies have found that variations in audio presentation, such as distinctiveness of the cue or adaptivity, can influence both reaction speed and decision accuracy [44].

These findings highlight that audio design is not only an accessibility consideration, but also a performance-critical factor. However, although previous studies have demonstrated the importance of sound for human players, the potential of auditory channels for artificial agents has received significantly less attention. Our study addresses this gap by comparing human and AI performance when constrained to auditory information, revealing the role of auditory expertise in shaping effectiveness and immersion across different game genres.

3. Methodology

In this study, we investigated the performance of human players and AI agents in audio-only gaming environments. The experimental setup followed a 2 (Player Type: Human, AI) × 2 (Genre: Fighting, FPS) comparative design. For the human component, a within-subjects factor of 2 (Modality: Vision, Audio-Only) was used to directly measure the effect of visual deprivation. This design facilitates a rigorous, symmetrical comparison of natively learned AI policies against transferred human skills across two distinct gaming contexts.

3.1. Experimental Platforms and Tasks

In order to probe a diverse range of auditory skills, we selected two research platforms from genres that emphasize fundamentally different abilities: an FPS that demands spatial navigation and reasoning, and a fighting game that prioritizes timing and reaction.

3.1.1. FPS Testbed: SonicDoom

An FPS is a game genre centered around combat from a first-person perspective. The player navigates a 3D environment, viewing the world from the protagonist’s own eyes. The core gameplay typically involves spatial navigation, aiming, and reacting to enemies and events within this 3D space.

For the FPS genre, we employed the SonicDoom [45] framework, an auditory enhanced version of the research platform VizDoom [13]. SonicDoom features sound source localization, distinct audio cues for enemies and items, footsteps sounds, and wall collision sounds. The specific tasks for this study were the “Basic” and “Deadly Corridor” scenarios from the VizDoom environment. Crucially, the game’s audio is 3D, and the ‘audio listener’ is attached directly to the player’s camera. This means the player can differentiate sound sources from all directions (e.g., in front, behind, left, or right) based on standard stereo panning and volume attenuation, which is essential for navigation and combat without vision.

The Basic scenario involves a straightforward setup wherein the agent or player is situated within a square room, facing a monster positioned on the opposite side. The player’s objective is to eliminate the monster before the allotted time expires. The Deadly Corridor scenario requires the agent or player to navigate a narrow, linear hallway while avoiding or neutralizing enemies positioned on both sides of the corridor to retrieve a green vest at the end. A round of the Basic scenario lasted 8 s, while a round of Deadly Corridor lasted 60 s, which is the default length of these scenarios. Figure 1 shows an example of both scenarios.

3.1.2. Fighting Game Testbed: DareFightingICE

A fighting game is a genre typically defined by close-quarters combat between two characters in a 2D or 2.5D plane. Players use a variety of attacks and defensive maneuvers, and success relies on precise timing, spacing, and the ability to predict opponents’ actions—all of which must be conveyed through sound in an audio-only version.

Our fighting game platform is DareFightingICE [12], a Java-based framework with a sound design focusing on VIPs. The platform is open source and provides a controlled environment for AI and human–computer interaction research. The task for all participants and agents was to compete in a best-of-three match against a standardized MCTS (Monte Carlo Tree Search) opponent [46]. Figure 2 shows an example of the DareFightingICE platform.

The audio system was a critical component of this testbed. We used the winning sound design from the 2022 DareFightingICE Sound Design and AI competition [47]. This sound design features 52 distinct, high-clarity sound effects that encode gameplay information, including character movement (walking, jumping, etc.), attack types (light, heavy, and special), character state (damage taken, stun, etc.), and the relative horizontal distance to the opponent, which is conveyed through stereo panning. The sound design also has three special sound effects designed to help the non-visual play.

Heartbeat: This sound plays when the player’s health drops below 50. For Player One, the sound plays through the left speaker, and for Player Two, it plays through the right.
Energy Increase: This sound plays when the player’s energy rises by 50 from the previous value. For Player One, the sound is heard on the left speaker, and for Player Two, it is on the right.
Border Alert: This sound plays when a player reaches the end of the stage on either side. The sound plays on the left side when a player reaches the left end and on the right side when they reach the right end.

This sound design is essential for non-visual play. The system uses a fixed ‘audio-listener’ positioned in the absolute center of the 2D map. This means all sounds are ‘world-centric’; a sound event happening on the left side of the stage is always heard in the left speaker, and an event on the right is heard in the right speaker. This simple, effective model allows the player to use stereo panning to track their opponent’s absolute position on the stage, and it provides unambiguous absolute cues, such as the ‘Border Alert’ playing on the far left when the player reaches the left edge of the stage.

3.2. AI Agent Implementation

A unified reinforcement learning (RL) approach was used to develop blind AI agents for both platforms, ensuring that performance differences could be attributed to the demands of the genre rather than the agent’s core architecture.

Agents were trained using the Proximal Policy Optimization (PPO) algorithm [48], a widely adopted policy gradient method in recent years. PPO extends previous approaches by incorporating a trust region-like objective that stabilizes policy updates. It remains one of the state-of-the-art reinforcement learning algorithms. For implementation, we used PyTorch 2.3.0 to train PPO agents on the DareFightingICE platform and sample-factory 2.1.1 (https://pypi.org/project/sample-factory/, accessed on 27 October 2025) for VizDoom (as it offers high-throughput training optimized for that environment). Python 3.12 was used to train the agents for both games.

3.2.1. VizDoom

For VizDoom, we used a deep reinforcement learning blind AI agent introduced in [45], and we trained it using only sound as the input. For the sake of a self-contained presentation, an overview is given below, summarized from previous work.

Audio observation: VizDoom operates at 35 frames per second, so each frame spans approximately 29 ms in real time. The audio signal is sampled at 22,050 Hz, and we aggregated four consecutive frames, resulting in 2520 samples per both left and right channels. Consequently, at each timestep, the sound waveform provided by VizDoom can be represented as a vector

s \in R^{2 \times 2520}

, where each element

s_{i}

lies within the range

[- 1, 1]

. A Fast Fourier Transform (FFT) encoder, similar to the approach in [45], was employed. This encoder comprises an FFT block, followed by a max-pooling layer and two fully connected layers. Both channels were processed separately before being concatenated into a single vector, which was then fed into a policy network, consisting of a fully connected layer network and a Gated Recurrent Unit. The overall architecture of the agent and the audio encoder is shown in Figure 3.

Training: We trained AI agents using Sample Factory [49], a reinforcement learning framework used in a previous work [45]. To process the in-game audio, we used the Torchaudio 2.6.0 (https://docs.pytorch.org/audio/stable/index.html, accessed on 27 October 2025) library. We used the same hyperparameter set as in [49]. All experiments were run on a single server, which had a 116-core CPU with an NVIDIA A100 GPU. The scenarios in use were Basic and Deadly Corridor.

3.2.2. DareFightingICE

For DareFightingICE, we employed the deep reinforcement learning Blind AI [9] to perform the evaluation. This agent provides three alternative methods for processing the audio data from the DareFightingICE platform, corresponding to three distinct audio encoders: a one-dimensional Convolutional Neural Network (1D-CNN), a Fast Fourier Transform, and a Mel spectrogram (Mel).

1D-CNN: The input audio is first downsampled by selecting every eighth sample, thereby reducing computational complexity. The downsampled signal is then processed through two one-dimensional convolutional layers. This procedure yields a 32 × 5 audio-feature vector.

Fast Fourier Transform: An FFT transforms the input audio signal into the frequency domain, after which the magnitudes are converted to their natural logarithms. The resulting data are subsequently downsampled, producing a one-dimensional audio feature vector with 256 elements.

Mel: A short-time Fourier transform (STFT) converts the input audio into a frequency-domain spectrogram by applying a series of Fourier transforms to overlapping, windowed segments of the signal. The resulting spectrogram is then mapped onto the Mel scale. For our configuration, we used a 25 ms window, a 10 ms hop size, and 80 Mel-frequency components. Finally, the Mel spectrogram is passed through two two-dimensional convolutional layers, yielding a 32 × 40 × 1 audio-feature vector.

Network architecture: The blind AI model begins with an encoder chosen from the three aforementioned options. Its output is passed to a Gated Recurrent Unit (GRU), which captures temporal dependencies in the audio sequence. Finally, three fully connected layers produce 40 output units, each corresponding to a possible action. The three encoders’ architecture are shown in Figure 4.

Training: The Blind AI was trained for 900 rounds on DareFightingICE (with each game consisting of three rounds) against MCTSAI23i, a weakened Monte Carlo Tree Search (MCTS) agent [46]. The Torchaudio 2.3.0 library was used to process in-game audio. We used the same hyperparameter as in a previous study [9]. Blind AI’s performance was then assessed over 90 evaluation rounds against the same opponent. We reported the win ratio and the average hit-point (HP) difference at the end of each round.

3.3. Human Participant Study

3.3.1. Procedure

All participants conducted the study on a standard desktop PC. For game inputs, participants used a keyboard to control DareFightingICE and a keyboard and mouse to control SonicDoom, which is consistent with the standard controls for these genres. To ensure a consistent and immersive audio experience, which is crucial for the audio-only condition, all participants used a set of closed-back, over-ear headphones for audio feedback.

In this study, we selected to utilize sighted participants as proxies for visually impaired individuals to collect data on our game. Employing sighted participants under conditions that simulate blindness is a widely accepted practice in the field of game accessibility research. Previous studies have consistently used blindfolded sighted users to assess early versions of accessible games, as recruiting a substantial number of VIPs can pose significant challenges [50,51]. Furthermore, sound and game design should be refined before testing with such an audience. Nonetheless, we recognize the limitation of our study in not incorporating VIPs at this stage. However, research indicates that visually impaired individuals tend to perform better than sighted individuals when blindfolded, implying that any performance-related findings are likely to improve when tested with visually impaired individuals [50,52]. This study was conducted in accordance with the principles of the Declaration of Helsinki and its subsequent amendments, as well as with the research guidelines of the American Psychological Association. Ethical committee approval was not required under Japan’s national regulations regarding privacy and informed consent, specifically the Act on the Protection of Personal Information (APPI).

This study followed the same structure for both games. At the start, participants were given a tutorial period to familiarize themselves with the platform and controls. After the tutorial phase, participants played in vision mode and then in non-vision mode. In the non-vision mode, the device screen was turned black to simulate the absence of visuals. The flowchart of both games is illustrated in Figure 5. This procedure follows the within-subjects design, where each participant served as their own control, and their ‘vision mode’ performance acted as a baseline to directly measure the impact of visual deprivation in ‘non-vision mode’.

3.3.2. Participants

We recruited 37 participants (5 females, 31 males, and 1 non-binary individual, aged 18–44), who were categorized into three groups based on a self-reported gaming experience: Somewhat familiar, Familiar, and Very familiar. This study followed a counterbalanced design, where each participant played both games. Within each game, they completed two blocks: one with full vision and one in an audio-only condition. The order of games and conditions was randomized to mitigate learning effects. Participants received a brief training session before each audio-only block to familiarize themselves with the sound design. Out of 37 participants, 40% were Very familiar, 38% were Familiar, and 22% were Somewhat familiar with FPS games; additionally, 27% were Very familiar, 35% were Familiar, and 38% were Somewhat familiar with fighting games. Due to the modest number of participants, we combined the familiarity groups into the above three.

3.3.3. Measures

Participants completed a Game User Experience Satisfaction Scale (GUESS) [53] questionnaire after each game type. Each participant played both game genres in a randomized order to ensure fairness. The selected GUESS factors were Audio Aesthetics, Usability/Playability, Play Engrossment, and Enjoyment. We chose three relevant questions for each of these factors. GUESS uses a 7-point Likert scale, with Cronbach’s alpha values ranging from 0.75 to 0.88, indicating strong reliability across the subscales. Moreover, objective performance data were logged for all sessions. The evaluation metrics for DareFightingICE included the win rate and the average health difference between the two players, while those for SonicDoom comprised the average time-to-completion, average kills, total deaths, and average remaining health. Subjective experience was assessed using the GUESS questionnaire [53].

3.4. Data Analysis

In order to compare the participants’ evaluations across the three expertise groups, we used non-parametric tests, as the Shapiro–Wilk normality tests indicated several significant deviations from normality and the group sizes were unbalanced. Although averaged Likert-type scores are often treated as continuous, non-parametric methods provide a more conservative and robust choice. Accordingly, we ran Kruskal–Wallis tests separately for each GUESS subscale, and this was followed by Dunn post hoc tests with Holm correction [54]. We reported

η^{2} H

as an omnibus effect size and Cliff’s

δ

for pairwise contrasts. For the cross-game comparison, where both game and group were included as factors, we applied an Aligned Rank Transform (ART) ANOVA, which is specifically designed for factorial designs with non-normal data. We used the conventional

α

= 0.05 for all the statistical tests. The analysis was conducted in R version 4.3.2.

4. Results

This section presents the empirical findings from our study. We detail the objective performance metrics for both human participants and blind AI agents across the two genres of games, followed by an analysis of the subjective player experience data from the GUESS questionnaires. All comparisons between AI and human players were made against the human performance in the audio-only (blind) condition.

4.1. DareFightingICE

Performance in the DareFightingICE environment was evaluated using two primary quantitative metrics: the (I) Win Ratio, defined as the proportion of matches won by participants; and the (II) average Health Point (The maximum HP in DareFightingICE is 400) (HP) difference, which is calculated as the mean disparity in the remaining health values across all completed matches against the standardized MCTS opponent. The individual performances of each player are given in Appendix A.

4.1.1. Human Performance Across Modalities

The analysis of the human performance in DareFightingICE highlights the impact of expertise and sensory conditions. According to Table 1, player expertise was a significant performance factor, and the removal of vision uniformly decreased player effectiveness. In the vision condition, win ratios were high across all groups, ranging from 54.0% (Somewhat familiar) to 74.0% (Very familiar). In the Blind condition, a clear performance drop was observed for all players. Very familiar players demonstrated the most successful skill transfer, maintaining a 73.0% win ratio, while the Somewhat familiar players’ win ratios fell to 36.0%. The Very familiar and Familiar group performed the same in the vision condition; however, the performance of Familiar players declined significantly from 74% to 56% in blind mode.

4.1.2. Human vs. AI Comparison

The comparative analysis between the human players and blind AI provides further insight into the performance differences across expertise levels. As shown in Table 2, specifically Sections A and B, the performance of the blind AI in DareFightingICE was comparable to the overall average of human players in the blind condition. The overall human average win ratio was 53.0%. However, a more detailed comparison of Table 1 and Table 2 reveals a key insight: the Very familiar (73.0%) human player group individually achieved a higher win ratio than the AI’s average performance (53.0%). The Familiar (56.0%) group performed similarly to average AI performance (53.0%) across all encoders. Only the Somewhat familiar group (36.0%) performed worse.

4.2. SonicDoom

Performance in SonicDoom was quantified through separate measurements across the scenarios. For the Basic scenario, the average completion time and average kill count were employed as metrics, given that it is a straightforward task where speed correlates with better performance. In the Deadly Corridor scenario, we utilized the average survival time, average health, average kills, and average deaths as indicators of performance. The shift from the average completion time to average survival time was motivated by observations that, even with increased HP in non-vision mode, nearly all participants were unable to reach the green vest at the endpoint. Consequently, the Deadly Corridor scenario evolved into a survival challenge, where longer survival indicated greater progress. This, along with average kills and deaths, provides nuanced insights into adaptability within the non-vision mode. The individual performances of each player are given in Appendix A.

Human Performance Across Modalities

In order to measure the impact of sensory deprivation, we analyzed the human performance across both blind and non-blind conditions. It is critical to note that players were given a much larger health pool in the blind “Deadly Corridor” scenario (500 HP) compared to the vision condition (100 HP). Therefore, a direct comparison of raw survival times is misleading; instead, we focused on combat effectiveness, as measured by kills and deaths.

As shown in Table 3 and Table 4, in the complex “Deadly Corridor” scenario, the loss of vision clearly degraded the combat effectiveness of experienced players. The average kill count for Expert (‘Very familiar’) players dropped significantly from 4.48 with vision to 3.19 in the blind condition. A similar decline was observed for Intermediate (‘Familiar’) players, whose kills fell from 3.30 to 1.98. The same was observed for Novice (‘Somewhat familiar’) players, who achieved more kills in non-blind mode (2.44) than blind mode (0.78).

In the “Basic” scenario, where starting health was not a factor, the impact of removing vision was unambiguous: it uniformly hindered performance, as illustrated in Table 3 and Table 4. Expert players were over three times slower in the blind condition, with their completion time increasing from 1.56 s with vision to 5.35 s with audio alone. This trend of significantly slower completion times was consistent across all skill levels. Overall, the findings indicate that, while a larger health pool can offset some immediate risks in audio-only combat, the loss of vision demonstrably reduces offensive performance for skilled players and severely hampers navigational efficiency for everyone.

4.3. Human vs. AI Comparison in SonicDoom

A comparative analysis of the AI agent and human participants reveals a stark divergence in strategy and effectiveness, with the AI’s performance profile shifting dramatically depending on the availability of visual data and the complexity of the task.

4.3.1. Blind (Audio-Only) Condition Analysis

In the audio-only condition, the data clearly shows that the experienced humans significantly outperformed the blind AI in complex, high-threat situations. This information is illustrated in Table 3. In the “Deadly Corridor” scenario, the blind AI’s performance was critically deficient. It survived for an average of only 5.57 s, which was less than a fifth of the survival time of the expert (“Very familiar”) human players (29.27 s). While the AI’s kill count (2.5) was higher than that of the intermediate (“Familiar”) and novice (“Somewhat familiar”) players, it was substantially lower than the experts’ 3.19 kills. This demonstrates that the AI’s reactive audio-based policy is insufficient for navigating dynamic combat, whereas expert humans leverage superior cognitive skills to survive longer and perform more effectively.

Conversely, in the simpler “Basic” scenario, the blind AI showcased its specialized efficiency. The agent completed the level in an average of 1.55 s, a time that is over three times faster than the quickest human experts (5.35 s). This suggests that the AI has developed a highly optimized policy for less complex tasks but lacks the adaptive and strategic capabilities required for more complex combat scenarios.

4.3.2. Non-Blind (Vision) Condition Analysis

When provided with visual data, the AI’s performance surpassed that of all human players, dramatically exceeding their capabilities. This information is illustrated in Table 4. In the “Deadly Corridor,” the Blind AI demonstrated near-invincibility. Its average health of +154.78 was an order of magnitude higher than the expert players’ +16.51, and its death rate was a mere 0.20. Interestingly, expert humans still achieved a higher kill count (4.48) compared to the AI (3.02), suggesting the AI adopted a strategy of perfect evasion and defense. In contrast, humans employed a more aggressive, high-risk/high-reward approach. Nonetheless, the AI’s ability to preserve its health showcases a level of mastery unattainable by humans.

This superhuman capability was most evident in the “Basic” scenario. The Blind AI completed the level in just 0.15 s. This is ten times faster than the expert human players (1.56 s), highlighting the AI’s profound superiority in reaction speed and execution efficiency when unconstrained by sensory limitations.

4.4. Subjective User Experience (GUESS Scores)

As seen in Table 5, across both games, none of the subscales showed significant group differences after correction. For SonicDoom, Usability/Playability came closest to significance compared to the other subscales (

H (2) = 3.42

,

p = 0.181

, Holm-adjusted

p = 0.724

), but it still did not reach the threshold of

α

set to 0.05; all other subscales were clearly non-significant. Effect sizes were small (

η^{2} H \leq 0.04

), and Cliff’s

δ

values were at most small-to-moderate, with wide confidence intervals crossing zero. For DareFightingICE, the overall pattern was similar: no significant Kruskal–Wallis tests, with the largest trend in Audio Aesthetics (

H (2) = 3.45

,

p = 0.178

, Holm-adjusted

p = 0.712

). Again, effect sizes were small and pairwise contrasts non-significant.

In the cross-game ART ANOVA, the results confirm that Audio Aesthetics significantly distinguished the two games (

p = 0.003

), with one game being rated higher overall across groups. This information is illustrated in Figure 6. Usability/Playability showed a trend toward a game × group interaction (

p = 0.093

), suggesting that familiarity might differentially shape perceptions of usability/playability across games, although this did not reach significance. No other main or interaction effects were detected.

In plain terms, participants rated the two games similarly in terms of usability, engrossment, and enjoyment, but they differed systematically in their evaluation of Audio Aesthetics. The raw scores and descriptive statistics of GUESS subscales are reported in Appendix B.

5. Discussion

The following central research question guided this study: Do humans and AI excel at the same types of audio-based tasks, and where do the strengths of human cognitive transfer and machine-learned policies diverge? Our findings provide a clear, nuanced answer: No, they do not excel at the same tasks; the superiority of either is fundamentally task-dependent. Specifically, our results show that human expertise, particularly from ‘Very familiar’ players, is more robust and adaptable, allowing them to significantly outperform the AI in complex, strategic scenarios like the SonicDoom ‘Deadly Corridor’. Conversely, the AI’s strength lies in its specialized, reactive efficiency, allowing it to achieve superhuman performance on simple, well-defined tasks like the ‘Basic’ scenario. This divergence highlights that human expertise translates to robust abstract reasoning, while the AI’s machine-learned policies are highly optimized but fragile.

The main contribution of this study is a symmetrical comparison of human expertise transfer and “blind” AI performance in audio-only gaming environments, revealing a nuanced, genre-dependent relationship between cognitive adaptability and computational efficiency. Our findings move beyond the simple question of whether AI can play games with audio, instead interrogating how its performance profile diverges from that of human players and what this implies for the future of AI development and accessible game design. The finding is that the superiority of either human or AI is not absolute but is contingent on the nature of the task, specifically the balance between reactive execution and abstract, strategic reasoning.

5.1. The Divergence of Human and AI Expertise

In the DareFightingICE setup, highly experienced human players significantly outperformed the reinforcement learning agent. This result suggests that the abstract skills honed through years of visual gameplay, such as understanding spacing, predicting opponent actions, and managing resources, are effectively transferable to the auditory domain. Expert players are not merely reacting to audio cues; they are integrating these cues into pre-existing, sophisticated mental models of fighting game dynamics. This finding aligns with [55], which notes that, while AI excels at managing explicit knowledge, humans retain an advantage in tacit, ‘human-centered judgment’ and predictive know-how. This also aligns with research on predictive processing in expert gameplay, where top players rely on internal models to anticipate events rather than simply reacting to them [56]. The AI, trained via PPO on only audio, developed a competent reactive policy, but it lacked this deeper, predictive understanding, leading to a performance ceiling that seasoned humans could surpass.

Conversely, the SonicDoom FPS setup starkly illustrated the AI’s brittle but powerful expertise. In the Basic scenario, the AI achieved superhuman speed, executing a straightforward task with an efficiency that no human could match. This reflects the core strength of RL: optimizing a policy for a well-defined problem space through millions of iterations [7]. However, this strength became a critical weakness in the complex “Deadly Corridor” scenario. The AI’s catastrophic failure to navigate this dynamic, high-threat environment highlights its inability to form a robust cognitive map or engage in the kind of spatial reasoning that humans perform implicitly. While expert humans could leverage sound to navigate, prioritize threats, and survive, the AI’s policy collapsed, unable to generalize from its training to a task requiring long-term planning and situational awareness. This finding corroborates recent studies in auditory navigation, which show that standard RL agents struggle to build coherent spatial representations from raw audio streams without specialized architectures, such as those incorporating attention mechanisms or external memory [57].

A noteworthy finding is the performance parity between the blind AI and the overall human average in DareFightingICE, with both achieving a win ratio of approximately 53–54%. This result is significant because it establishes a clear benchmark for AI competence in this audio-only task. It suggests that the AI’s natively learned policy is as effective as the general skill set that the average human player can transfer to an audio-only context. However, this average masks the full story; the AI’s performance was comparable to the “Familiar” player group but was decisively surpassed by “Very familiar” experts, who achieved a 73% win rate. This implies that while the AI successfully mastered a baseline reactive strategy sufficient to match an average player, it could not replicate the advanced predictive and abstract reasoning that distinguishes expert human performance. The AI learned to react to the sounds, but the experts understood the fight. This stark contrast demonstrates that the AI’s policy is purely reactive, lacking the adaptive, long-term strategic planning that a more sophisticated memory-based architecture might provide.

5.2. Implications for Accessible Game Design and AI Development

These findings carry significant implications for the design of audio-first games. The success of human players in DareFightingICE underscores the efficacy of a semantically rich sound design, where discrete, high-clarity audio cues map directly to specific gameplay events and states. This “semantic sonification” appears more conducive to human learning and skill transfer than the more naturalistic, but also more ambiguous, continuous soundscape of an FPS. For designers aiming to create accessible experiences, particularly for VIPs, this suggests prioritizing clarity and informativeness over pure realism. In practice, this means designers should focus on creating an unambiguous auditory ‘language’ for critical gameplay events, even if that comes at the cost of a less naturalistic soundscape. As our study also showed that novice players struggled significantly in the audio-only condition across both genres, there is a clear need for structured audio-centric tutorials that explicitly teach players how to interpret the game’s soundscape.

For AI researchers, our results serve as a benchmark. The performance gap between the AI’s success in simple tasks and its failure in complex ones indicates that current “blind” agents possess reactive intelligence but lack abstract reasoning. To create agents that can truly master complex, audio-based environments, future work must move beyond standard policy optimization. Promising avenues include memory-augmented networks that can build and maintain an internal state of the environment over time, curriculum learning that trains agents on progressively more difficult tasks, and integrating world models that allow the agent to “imagine” and plan for future outcomes based on auditory input [58]. Furthermore, integrating other methods to enhance adaptive strategy, such as evolutionary game–theoretical approaches combined with deep reinforcement learning [59], could help bridge the gap between the AI’s reactive policy and the adaptive, predictive strategies of human experts.

5.3. Limitations and Future Directions

This study, while providing a robust framework, has several limitations that open doors for future research. First, we utilized blindfolded sighted participants as a proxy for VIPs, a common methodology in the field. While necessary for early-stage research, VIPs often possess heightened auditory processing skills and unique cognitive strategies. Future studies should aim to include VIPs to validate and extend our findings. Second, our AI was based on a PPO architecture; more advanced models, such as those leveraging Transformers for audio processing, which are particularly well suited for capturing the complex temporal dependencies in an audio stream, might yield superior performance and narrow the gap with human experts. Third, our human study was conducted in a controlled environment following a set procedure. Real-world gaming is often subject to external factors, such as player fatigue, stress, and distractions, which were not measured in this study. Future work could investigate the robustness of both human expertise transfer and AI performance under these more variable conditions, which would be crucial for real-world applications.

Our demographic data for participant age was collected in categorical ranges rather than as a continuous variable. This prevented the calculation of a precise mean and standard deviation for age, limiting the granularity of our demographic analysis. Finally, our investigation was confined to two specific genres and single-player scenarios. The scalability of these ‘blind’ AI models to larger, more dynamic environments, such as open-world or multiplayer games, remains an open question. Such tasks would require agents to process a much denser and more complex audio soundscape, posing significant computational and architectural challenges.

Finally, our investigation was confined to two specific genres. Future studies should expand this comparative framework to other domains, such as real-time strategy or puzzle games, which require different cognitive skills, including long-term planning and logical deduction. Such research would further extend the distinct contours of human and artificial intelligence, pushing us closer to creating both more capable AI and more universally accessible virtual worlds. Moreover, future studies can integrate various body signals, such as the electroencephalogram [60], electrocardiography [61], and eye tracking [2], to strengthen the capacity of AI agents to adapt in real time to the complex interplay of human emotions, behaviors, and physiological states, thereby enabling more personalized and context-aware interactions.

6. Conclusions

Our symmetrical, cross-genre comparison reveals that the performance relationship between human players and AI agents in audio-only environments is fundamentally task-dependent. In scenarios requiring complex, adaptive strategy, such as the SonicDoom “Deadly Corridor,” the predictive and cognitive skills of expert humans significantly surpassed the reactive policy of the blind AI (e.g., achieving an average survival time of 29.27 s vs. the AI’s 5.57 s). Conversely, in tasks demanding optimized navigational efficiency or pure reaction speed, such as the SonicDoom “Basic” scenario, or when granted full vision, the AI demonstrated superhuman capabilities. This divergence confirms that while visually honed gaming skills are transferable to the auditory domain, they manifest as a different, more robust form of expertise compared to the AI’s specialized but brittle proficiency. Furthermore, the performance parity in DareFightingICE, where the blind AI’s win rate (54.0%) matched the overall human average (53.0%), serves as a crucial benchmark. It demonstrates that current reinforcement learning methods can produce an agent with a level of competence equivalent to the average human player in a reaction-based audio-only task. Yet, the AI’s failure to challenge expert players (who achieved a 73.0% win rate) underscores that its policy, while effective, is less sophisticated than the deeply ingrained, predictive mental models of seasoned humans.

Our statistical analysis of the subjective user experience (GUESS scores) revealed that player expertise did not significantly impact perceived usability, engrossment, or enjoyment for either game. This suggests that the audio-only experience was equally (un)intuitive for both novices and experts. However, the analysis did show a statistically significant difference in perceived ‘Audio Aesthetics’ between the two games, with DareFightingICE’s clear, semantic audio cues being rated more favorably, reinforcing the importance of sound design clarity.

These findings yield actionable insights for both accessible game design and AI research. For developers, our work validates the use of blind AI agents as powerful tools for benchmarking the clarity of discrete audio cues and simple navigation paths, while underscoring the irreplaceable role of expert human testers in validating complex, strategic gameplay. For the AI community, this study highlights the critical need to advance reinforcement learning beyond reactive policies toward models that incorporate predictive processing and internal spatial mapping. Ultimately, bridging the gap between the AI’s narrow proficiency and the adaptability of human expertise represents the next frontier for creating truly intelligent agents for complex, real-world interactive tasks [62,63,64].

Author Contributions

I.K.: Conceptualization, Methodology, Software, Formal Analysis, Investigation, Writing—Original Draft Preparation, Review and Editing, Data Curation, and Visualization. I.K. is the main author responsible for overall coordination and execution of the research. He designed the methodology, developed the software, performed formal analyses, and conducted investigations. He also curated the dataset, created visualizations, and prepared the original draft. T.V.N.: Methodology, Software, Writing—Review and Editing, and Formal Analysis. T.V.N. supported the methodological design and validated the results of this research to ensure that the findings were accurately represented. In addition, he contributed to reviewing and editing the manuscript to improve its clarity and quality. C.T.J.: Validation, Writing—Review and Editing, and Formal Analysis: C.T.J. validated the results of this research to ensure that the findings were accurately represented. In addition, C.T.J. contributed to reviewing and editing the manuscript to improve its clarity and quality. R.T.: Conceptualization, Methodology, Resources, Writing—Review and Editing, Supervision, and Funding Acquisition. R.T. supervised the research to ensure it remained on track with academic standards and best practices. He supported the conceptual and methodological design and provided necessary resources for the study. He also offered guidance throughout the research process and contributed to reviewing and editing the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the principles of the Declaration of Helsinki and its subsequent amendments, as well as the research guidelines of the American Psychological Association. Ethical committee approval was not required under Japan’s national regulations regarding privacy and informed consent, specifically the Act on the Protection of Personal Information (APPI).

Informed Consent Statement

Informed consent was obtained from all the subjects involved in this study.

Data Availability Statement

The data presented in this study are not publicly available due to ethical and privacy restrictions. The informed consent agreement provided to all participants stipulated that the data collected would be used for research purposes only and would not be shared publicly to protect their privacy. Therefore, the dataset generated and analyzed during the current study cannot be deposited in a public repository.

Acknowledgments

The authors wish to express their sincere gratitude to all the individuals who participated in this study; their time and effort were essential to this research. We would also like to extend our special thanks to Mustafa Can Gursesli for his valuable contributions and insightful feedback throughout the course of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

1D-CNN	One-Dimensional Convolutional Neural Network
AI	Artificial Intelligence
ANOVA	Analysis of Variance
APPI	Act on the Protection of Personal Information
ART	Aligned Rank Transform
BGM	Background Music
FFT	Fast Fourier Transform
FPS	First-person shooter
GRU	Gated Recurrent Unit
GUESS	Game User Experience Satisfaction Scale
HP	Health Point
HRTF	Head-Related Transfer Function
MCTS	Monte Carlo Tree Search
Mel	Mel Spectrogram
PPO	Proximal Policy Optimization
RL	Reinforcement Learning
STFT	Short-Time Fourier Transform
VIPs	Visually Impaired Players

Appendix A. Individual Player Performance Data

This appendix provides a detailed breakdown of the performance of each participant across both the DareFightingICE and SonicDoom experiments. The data is presented by player familiarity (“Very familiar,” “Familiar,” and “Somewhat familiar”) and the experimental condition.

Appendix A.1. DareFightingICE: Individual Performance Metrics

Table A1. Very familiar: Blind vs. Non-Blind.

Player No.	Blind		Non-Blind
Player No.	Win Rate (%)	HP Diff.	Win Rate (%)	HP Diff.
1	78	52.78	78	68.33
2	56	20.78	67	28.89
3	100	151.22	100	219.00
4	78	115.89	89	100.33
5	67	14.56	67	13.11
6	78	93.11	100	145.56
7	89	64.33	100	128.67
8	67	18.67	33	−32.89
9	44	8.67	33	−35.00
10	78	60.89	78	61.78
Average	73	60.09	74	69.78

Table A2. Familiar: Blind vs. Non-Blind.

Player No.	Blind		Non-Blind
Player No.	Win Rate (%)	HP Diff.	Win Rate (%)	HP Diff.
1	56	6.22	100	130.11
2	67	44.44	78	99.00
3	44	14.11	67	44.78
4	89	83.56	100	89.67
5	44	1.33	67	63.22
6	67	66.56	56	26.89
7	33	−12.67	67	94.00
8	44	21.56	56	24.33
9	78	50.89	78	23.11
10	56	12.78	89	49.44
11	33	−35.67	78	42.11
12	56	−6.89	67	63.56
13	67	38.22	67	45.67
Average	56	21.88	74	61.22

Table A3. Somewhat familiar: Blind vs. Non-Blind.

Player No.	Blind		Non-Blind
Player No.	Win Rate (%)	HP Diff.	Win Rate (%)	HP Diff.
1	44	−28.11	73	88.31
2	22	−22.00	89	112.89
3	44	8.44	56	20.89
4	33	−36.44	89	140.22
5	11	−38.44	56	35.67
6	67	38.22	67	45.67
7	11	−40.00	89	147.67
8	56	25.78	56	53.44
9	22	−30.89	40	−50.00
10	44	−17.56	56	30.22
11	22	−27.00	22	−59.78
12	22	−20.89	0	−100.22
13	57	−6.89	33	−32.89
14	44	1.33	33	−32.89
Average	36	−13.89	54	28.51

Appendix A.2. SonicDoom: Individual Performance Metrics

Table A4. Very familiar: Blind Mode.

Player No.	Basic		Deadly Corridor
Player No.	Time Taken	Enemies Killed	Time Taken	Health	Enemies Killed	Deaths
1	2.350	1.000	32.340	−8.000	3.670	1.00
2	11.230	1.000	34.730	34.000	5.330	0.67
3	3.410	1.000	27.810	−6.000	3.330	1.00
4	7.950	0.330	18.650	−14.000	1.000	1.00
5	8.820	0.000	12.830	36.000	2.670	0.67
6	8.820	0.000	20.170	−14.000	2.670	1.00
7	4.820	0.670	34.810	−14.000	4.000	1.00
8	5.950	1.000	46.740	10.000	4.330	0.67
9	1.540	1.000	37.800	−12.000	2.670	1.00
10	2.540	1.000	35.800	−12.000	2.670	0.67
11	6.740	0.330	15.070	66.000	3.670	0.33
12	4.393	0.593	30.217	15.055	3.021	1.00
13	4.152	0.581	30.454	16.564	2.979	0.67
14	3.912	0.569	30.691	18.073	2.937	1.00
15	3.672	0.557	30.928	19.582	2.895	1.00
Average	5.353	0.642	29.269	9.018	3.189	0.84

Table A5. Familiar: Blind Mode.

Player No.	Basic		Deadly Corridor
Player No.	Time Taken	Enemies Killed	Time Taken	Health	Enemies Killed	Deaths
1	8.820	0.000	12.150	−10.000	0.330	1.00
2	3.720	1.000	30.430	−12.000	4.670	1.00
3	7.950	0.333	18.650	−14.000	1.000	1.00
4	5.390	1.000	13.120	−16.000	2.000	1.00
5	5.340	0.667	36.820	−20.000	2.000	1.00
6	5.950	1.000	36.740	10.000	4.330	1.00
7	8.950	0.333	12.650	−14.000	1.000	1.00
8	5.080	0.333	24.550	−10.000	0.330	1.00
9	5.810	0.667	39.700	−10.000	2.670	1.00
10	5.884	0.648	35.605	−8.000	1.926	1.00
11	5.794	0.659	37.508	−7.467	1.904	1.00
12	5.704	0.670	39.411	−6.933	1.882	1.00
13	5.614	0.681	31.314	−6.400	1.859	1.00
14	5.524	0.693	33.217	−5.867	1.837	1.00
Average	6.109	0.620	28.705	−9.333	1.981	1.00

Table A6. Somewhat familiar: Blind Mode.

Player No.	Basic		Deadly Corridor
Player No.	Time Taken	Enemies Killed	Time Taken	Health	Enemies Killed	Deaths
1	3.880	0.670	23.880	−14.000	2.330	1.00
2	5.290	1.000	16.480	−22.000	2.000	1.00
3	7.410	0.330	18.130	−12.000	2.670	1.00
4	8.250	0.330	14.760	−14.000	0.330	1.00
5	10.015	0.160	11.885	−13.000	0.500	1.00
6	11.538	0.000	9.314	−12.000	0.000	1.00
7	13.061	0.172	6.743	−11.000	0.544	1.00
8	14.584	0.340	4.172	−10.000	1.072	1.00
Average	9.254	0.375	13.171	−13.500	1.181	1.00

Table A7. Very familiar: Non-Blind Mode.

Player No.	Basic		Deadly Corridor
Player No.	Time Taken	Enemies Killed	Time Taken	Health	Enemies Killed	Deaths
1	1.120	1.000	10.310	34.000	5.330	0.33
2	1.270	1.000	4.810	−16.000	3.330	1.00
3	1.410	1.000	11.660	22.000	6.000	0.00
4	4.570	1.000	3.110	−12.000	1.000	1.00
5	1.000	1.000	7.860	24.000	5.000	0.33
6	1.370	1.000	8.780	−12.000	3.670	1.00
7	2.230	1.000	8.810	2.000	3.330	0.67
8	1.960	1.000	7.140	−12.000	3.000	0.00
9	0.950	1.000	11.470	46.000	6.000	1.00
10	0.950	1.000	15.800	56.000	6.000	0.67
11	1.390	1.000	12.130	8.000	4.670	0.33
12	1.367	1.000	19.660	24.073	4.832	0.33
13	1.319	1.000	11.763	25.964	4.921	1.00
14	1.271	1.000	9.957	27.855	5.009	0.00
15	1.223	1.000	10.150	29.745	5.097	1.00
Average	1.560	1.000	10.227	16.509	4.479	0.58

Table A8. Familiar: Non-Blind Mode.

Player No.	Basic		Deadly Corridor
Player No.	Time Taken	Enemies Killed	Time Taken	Health	Enemies Killed	Deaths
1	2.590	1.000	14.240	−4.000	4.670	0.67
2	1.480	1.000	8.020	−12.000	4.330	1.00
3	4.570	1.000	3.110	−12.000	1.000	1.00
4	1.390	1.000	3.120	6.000	3.670	0.67
5	5.820	1.000	3.280	−10.000	1.000	1.00
6	1.960	1.000	7.140	−12.000	3.000	1.00
7	4.570	1.000	3.110	−12.000	1.000	0.33
8	1.750	1.000	13.130	24.000	5.000	0.33
9	1.390	1.000	10.970	34.000	5.000	0.67
10	2.551	1.000	7.869	20.389	3.407	0.67
11	2.494	1.000	7.974	24.422	3.452	1.00
12	2.437	1.000	8.078	28.456	3.496	1.00
13	2.380	1.000	8.183	32.489	3.540	0.67
14	2.323	1.000	8.287	36.522	3.585	1.00
Average	2.690	1.000	7.610	10.310	3.300	0.79

Table A9. Somewhat familiar: Non-Blind Mode.

Player No.	Basic		Deadly Corridor
Player No.	Time Taken	Enemies Killed	Time Taken	Health	Enemies Killed	Deaths
1	1.580	1.000	7.810	20.000	5.000	1.00
2	1.260	1.000	10.870	16.000	4.670	1.00
3	1.780	1.000	6.220	−4.000	2.330	1.00
4	8.820	0.000	3.200	−16.000	0.670	1.00
5	8.920	1.000	2.405	−28.000	0.665	0.67
6	11.144	1.000	0.557	−40.800	2.198	1.00
7	13.368	0.733	1.291	−53.600	1.731	1.00
8	15.592	0.705	3.139	−66.400	2.264	1.00
Average	7.810	0.800	4.440	−21.600	2.440	0.96

Appendix A.2.1. Insights from Individual Performance

A closer examination of the individual player data reveals several key insights into skill transfer, adaptability, and the inherent variability of player performance. The averages presented in the main body of this study are clarified by these individual results, highlighting specific trends and noteworthy outliers.

Expertise Directly Correlates with Adaptability in DareFightingICE

Resilience of Experts: The “Very familiar” group’s average win rate barely changed between conditions (74% Non-Blind vs. 73% Blind). Player 3 exemplifies this, achieving a perfect 100% win rate in both modes, indicating a near-perfect transfer of fighting game fundamentals to the auditory domain.
Performance Cliff for Non-Experts: In contrast, the “Familiar” group’s average win rate dropped sharply from 74% to 56% when deprived of vision. The “Somewhat familiar” group saw their performance decline from a winning average (54%) to a losing one (36%), with their average HP difference shifting from +28.51 to −13.89. This suggests that, while basic competence can be achieved with vision, adapting to audio-only gameplay is a distinct skill that relies heavily on deep-seated expertise.
Anomalous Performances: Interestingly, two players in the “Very familiar” group (Players 8 and 9) achieved a higher win rate in the blind condition than in the non-blind condition. This suggests that, for some individuals, focusing solely on audio cues may be more effective than processing combined audio–visual information, potentially reducing cognitive load or eliminating visual distractions.

Visual Deprivation Severely Hampers Navigational and Combat Efficiency in SonicDoom

Drastic Slowdown in Simple Tasks: In the “Basic” scenario, completion times increased dramatically for all groups in blind mode. “Very familiar” players, who averaged a lightning-fast 1.56 s with vision, slowed to 5.35 s when relying on audio alone. This highlights a severe reduction in target acquisition and movement speed.
Reduced Combat Effectiveness: While survival times in the “Deadly Corridor” were longer in blind mode, this is a reflection of the increased health pool provided in the experiment. A more accurate measure of performance, the average number of enemies killed, decreased across the board. “Very familiar” players’ kill count dropped from 4.48 to 3.19, and “Familiar” players’ from 3.30 to 1.98. This demonstrates that players were less capable of engaging threats effectively without visual cues.

High Degree of Performance Variability Within Experience Groups

Standout Performers and Struggling Players: In the “Somewhat familiar” DareFightingICE group, Player 6 performed exceptionally well, achieving a 67% win rate in both conditions—on par with the “Familiar” group average. Conversely, Player 12 from the same group had a 0% win rate in the non-blind condition, and Player 9 struggled with a negative HP difference even with vision.
Navigational Disparities in SonicDoom: The blind “Deadly Corridor” scenario revealed a wide gap in navigational aptitude. Within the “Somewhat familiar” group, Player 1 survived for an average of 23.88 s, approaching the average of more experienced groups, while Player 8 survived for only 4.17 s. This indicates that the ability to interpret spatial audio for navigation and survival varies greatly from person to person, independent of their general FPS experience.

Appendix B. GUESS Scores

The raw scores and descriptive statistics of the GUESS subscales are reported below.

Table A10. Descriptive statistics of the GUESS subscale scores (all participants).

Subscale	n	Mean ± SD	Median [IQR]	Min–Max
Usability/Playability	74	15.18 ± 3.34	15.00 [13.00–18.00]	4.00–21.00
Play Engrossment	74	13.65 ± 3.49	14.00 [11.00–16.00]	6.00–21.00
Enjoyment	74	14.74 ± 3.78	15.00 [13.00–18.00]	4.00–21.00
Audio Aesthetics	74	16.22 ± 2.73	17.00 [15.00–18.00]	7.00–21.00

Table A11. Descriptive statistics of the GUESS subscales by game and group (median [IQR]).

Game	Subscale	VeryFamiliar	Familiar	SomewhatFamiliar
SonicDoom	Usability/Playability	15.50 [11.00–17.00]	14.00 [13.00–16.00]	16.00 [15.00–18.00]
SonicDoom	Play Engrossment	14.00 [12.50–15.00]	12.50 [10.25–16.00]	12.00 [10.00–15.00]
SonicDoom	Enjoyment	16.50 [15.25–18.00]	15.50 [14.00–17.75]	16.00 [13.00–18.00]
SonicDoom	Audio Aesthetics	16.00 [15.00–18.00]	14.50 [13.25–15.75]	16.00 [15.00–18.00]
DareFightingICE	Usability/Playability	17.00 [14.50–19.00]	15.00 [13.00–17.50]	13.00 [13.00–14.25]
DareFightingICE	Play Engrossment	15.00 [12.00–17.00]	13.00 [12.25–14.00]	16.00 [11.00–18.25]
DareFightingICE	Enjoyment	15.00 [13.50–19.00]	15.00 [10.00–16.00]	15.00 [14.25–15.75]
DareFightingICE	Audio Aesthetics	17.00 [16.00–20.00]	17.00 [16.25–18.00]	18.00 [15.00–18.00]

References

Collins, K. From Pac-Man to Pop Music: Interactive Audio in Games and New Media; Routledge: Abingdon, UK, 2017. [Google Scholar]
Gursesli, M.C.; Tarchi, P.; Guazzini, A.; Duradoni, M.; Yaras, A.T.; Akinci, E.; Yurdakul, U.; Calà, F.; Tonacci, A.; Vilone, D.; et al. Silent vs. Sound: The Impact of Uniform Auditory Stimuli on Eye Movements and Game Performance. In Proceedings of the 2025 IEEE Gaming, Entertainment, and Media Conference (GEM), Kaohsiung, Taiwan, 16–18 July 2025; IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar]
Andrade, R.; Rogerson, M.J.; Waycott, J.; Baker, S.; Vetere, F. Playing Blind: Revealing the World of Gamers with Visual Impairment. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–14. [Google Scholar] [CrossRef]
Donohue, S.E.; Woldorff, M.G.; Mitroff, S.R. Video game players show more precise multisensory temporal processing abilities. Atten. Percept. Psychophys. 2010, 72, 1120–1129. [Google Scholar] [CrossRef] [PubMed]
Massiceti, D.; Hicks, S.L.; van Rheede, J.J. Stereosonic vision: Exploring visual-to-auditory sensory substitution mappings in an immersive virtual reality navigation paradigm. PLoS ONE 2018, 13, e0199389. [Google Scholar] [CrossRef] [PubMed]
Connors, E.C.; Yazzolino, L.A.; Sánchez, J.; Merabet, L.B. Development of an audio-based virtual gaming environment to assist with navigation skills in the blind. J. Vis. Exp. JoVE 2013, 50272. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef]
Van Nguyen, T.; Dai, X.; Khan, I.; Thawonmas, R.; Pham, H.V. A deep reinforcement learning blind AI in DareFightingICE. In Proceedings of the 2022 IEEE Conference on Games (CoG), Beijing, China, 21–24 August 2022; IEEE: New York, NY, USA, 2022; pp. 632–637. [Google Scholar]
Lake, B.M.; Ullman, T.D.; Tenenbaum, J.B.; Gershman, S.J. Building machines that learn and think like people. Behav. Brain Sci. 2017, 40, e253. [Google Scholar] [CrossRef]
Gursesli, M.C.; Lombardi, S.; Duradoni, M.; Bocchi, L.; Guazzini, A.; Lanata, A. Facial emotion recognition (FER) through custom lightweight CNN model: Performance evaluation in public datasets. IEEE Access 2024, 12, 45543–45559. [Google Scholar] [CrossRef]
Khan, I.; Van Nguyen, T.; Dai, X.; Thawonmas, R. Darefightingice competition: A fighting game sound design and ai competition. In Proceedings of the 2022 IEEE Conference on Games (CoG), Beijing, China, 21–24 August 2022; IEEE: New York, NY, USA, 2022; pp. 478–485. [Google Scholar]
Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; Jaśkowski, W. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Proceedings of the 2016 IEEE Conference on Computational Intelligence and Games (CIG), Santorini, Greece, 20–23 September 2016; IEEE: New York, NY, USA, 2016; pp. 1–8. [Google Scholar]
Smuts, A. Are video games art? Contemp. Aesthet. (J. Arch.) 2005, 3, 6. [Google Scholar]
Gursesli, M.C.; Martucci, A.; Mattiassi, A.D.; Duradoni, M.; Guazzini, A. Development and validation of the psychological motivations for playing video games scale (PMPVGs). Simul. Gaming 2024, 55, 856–885. [Google Scholar] [CrossRef]
Gursesli, M.C.; Guazzini, A.; Thawonmas, R.; Valenti, C.; Duradoni, M.; Thawonmas, R. Internet Gaming Disorder and Psychological Distress: A PRISMA systematic review. Heliyon 2025, 11, e43518. [Google Scholar] [CrossRef]
Bostan, B. Player motivations: A psychological perspective. Comput. Entertain. (CIE) 2009, 7, 1–26. [Google Scholar] [CrossRef]
Przybylski, A.K.; Rigby, C.S.; Ryan, R.M. A motivational model of video game engagement. Rev. Gen. Psychol. 2010, 14, 154–166. [Google Scholar] [CrossRef]
Thiparpakul, P.; Mokekhaow, S.; Supabanpot, K. How can video game atmosphere affect audience emotion with sound. In Proceedings of the 2021 9th International Conference on Information and Education Technology (ICIET), Okayama, Japan, 27–29 March 2021; IEEE: New York, NY, USA, 2021; pp. 480–484. [Google Scholar]
Guillen, G.; Jylha, H.; Hassan, L. The role sound plays in games: A thematic literature study on immersion, inclusivity and accessibility in game sound research. In Proceedings of the 24th International Academic Mindtrek Conference, Virtual, Finland, 1–3 June 2021; pp. 12–20. [Google Scholar]
Paterson, N.; Naliuka, K.; Jensen, S.K.; Carrigy, T.; Haahr, M.; Conway, F. Spatial audio and reverberation in an augmented reality game sound design. In Proceedings of the 40th AES Conference: Spatial Audio, Tokyo, Japan, 8–10 October 2010. [Google Scholar]
Haehn, L.; Schlittmeier, S.J.; Böffel, C. Exploring the impact of ambient and character sounds on player experience in video games. Appl. Sci. 2024, 14, 583. [Google Scholar] [CrossRef]
Nacke, L.E.; Grimshaw, M.N.; Lindley, C.A. More than a feeling: Measurement of sonic user experience and psychophysiology in a first-person shooter game. Interact. Comput. 2010, 22, 336–343. [Google Scholar] [CrossRef]
Broderick, J.; Duggan, J.; Redfern, S. The importance of spatial audio in modern games and virtual environments. In Proceedings of the 2018 IEEE Games, Entertainment, Media Conference (GEM), Galway, Ireland, 15–17 August 2018; IEEE: New York, NY, USA, 2018; pp. 1–9. [Google Scholar]
Semionov, K.; McGregor, I. Effect of various spatial auditory cues on the perception of threat in a first-person shooter video game. In Proceedings of the 15th International Audio Mostly Conference, Graz, Austria, 15–17 September 2020; pp. 22–29. [Google Scholar]
Sadowska, D.; Sacewicz, T.; Rębiś, K.; Kowalski, T.; Krzepota, J. Examining physiological changes during counter-strike: Global offensive (CS: GO) performance in recreational male esports players. Appl. Sci. 2023, 13, 11526. [Google Scholar] [CrossRef]
Ng, P.; Nesbitt, K.; Blackmore, K. Sound improves player performance in a multiplayer online battle arena game. In Proceedings of the Australasian Conference on Artificial Life and Computational Intelligence, Newcastle, NSW, Australia, 5–7 February 2015; Springer: Cham, Switzerland, 2015; pp. 166–174. [Google Scholar]
Ng, P.; Nesbitt, K. Informative sound design in video games. In Proceedings of the 9th Australasian Conference on Interactive Entertainment: Matters of Life and Death, Melbourne, Australia, 30 September–1 October 2013; pp. 1–9. [Google Scholar]
Schaffert, N.; Mattes, K.; Effenberg, A.O. A sound design for acoustic feedback in elite sports. In International Symposium on Computer Music Modeling and Retrieval, Proceedings of the 6th International Symposium, CMMR/ICAD 2009, Copenhagen, Denmark, 18–22 May 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 143–165. [Google Scholar]
Bandai Namco Entertainment. Tekken 7, Bandai Namco Entertainment: Minato, Tokyo, 2015.
NetherRealm Studios. Mortal Kombat 11, NetherRealm Studios: Chicago, IL, USA, 2019.
Double Helix Games. Killer Instinct, Double Helix Games: Irvine, CA, USA, 2013.
Khan, I.; Van Nguyen, T.; Thawonmas, R. Fighting to the beat: Multi-instrumental adaptive background music approach. Entertain. Comput. 2025, 55, 100985. [Google Scholar] [CrossRef]
Agrimi, E.; Battaglini, C.; Bottari, D.; Gnecco, G.; Leporini, B. Game accessibility for visually impaired people: A review. Soft Comput. 2024, 28, 10475–10489. [Google Scholar] [CrossRef]
Games, G. Shades of Doom. 2001. Available online: https://www.gmagames.com/sod.html (accessed on 27 October 2025).
AGRIP. AudioQuake. 2003. Available online: https://www.igdb.com/games/audioquake (accessed on 27 October 2025).
The Terraformers Team. Terraformers. 2003. Available online: https://terraformers.nu/ (accessed on 27 October 2025).
Nair, V.; Karp, J.L.; Silverman, S.; Kalra, M.; Lehv, H.; Jamil, F.; Smith, B.A. Navstick: Making video games blind-accessible via the ability to look around. In Proceedings of the 34th Annual ACM Symposium on User Interface Software and Technology, Virtual, 10–14 October 2021; pp. 538–551. [Google Scholar]
Nair, V.; Zhu, H.; Song, P.; Wang, J.; Smith, B.A. Surveyor: Facilitating Discovery Within Video Games for Blind and Low Vision Players. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–15. [Google Scholar]
Marples, D.; Gledhill, D.; Carter, P. The effect of lighting, landmarks and auditory cues on human performance in navigating a virtual maze. In Proceedings of the Symposium on Interactive 3D Graphics and Games, San Francisco, CA, USA, 5–7 May 2020; pp. 1–9. [Google Scholar]
Tan, S.L.; Baxa, J.; Spackman, M. The role of music in video games. In Playing Video Games: Motives, Responses, and Consequences; Routledge: Abingdon, UK, 2010; pp. 141–157. [Google Scholar]
Nacke, L.E.; Grimshaw, M.N.; Lindley, C.A. The effects of sound and graphics on immersion in video games. In Proceedings of the Audio Mostly Conference, Piteå, Sweden, 15–17 September 2010; pp. 1–7. [Google Scholar]
Mirza, B.; Hoque, M.E.; Mahmud, H. The impact of audio-visual stimuli on gaming performance. Entertain. Comput. 2019, 31, 100308. [Google Scholar]
Lipscomb, S.D.; Zehnder, S.M. Interactive music: Crossing boundaries. In Proceedings of the 2005 International Computer Music Conference, Barcelona, Spain, 4–10 September 2005; pp. 747–750. [Google Scholar]
Khan, I.; Van Nguyen, T.; Gursesli, M.C.; Thawonmas, R. Sonic Doom: Enhanced Sound Design and Accessibility in a First-Person Shooter Game. In Proceedings of the 2025 IEEE Conference on Games (CoG), Lisbon, Portugal, 26–29 August 2025; IEEE: New York, NY, USA, 2025; pp. 1–8. [Google Scholar]
Ishihara, M.; Miyazaki, T.; Chu, C.Y.; Harada, T.; Thawonmas, R. Applying and improving Monte-Carlo Tree Search in a fighting game AI. In Proceedings of the 13th International Conference on Advances in Computer Entertainment Technology, Osaka, Japan, 9–12 November 2016; pp. 1–6. [Google Scholar]
Organizers, D.C. The DareFightingICE Sound Design and AI Competition Winners Details, R22. Results of a Competition Held at the IEEE Conference on Games (CoG) 2022. Available online: https://www.ice.ci.ritsumei.ac.jp/~ftgaic/index-R22.html (accessed on 9 September 2025).
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
Petrenko, A.; Huang, Z.; Kumar, T.; Sukhatme, G.; Koltun, V. Sample factory: Egocentric 3d control from pixels at 100000 fps with asynchronous reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 13–18 July 2020; pp. 7652–7662. [Google Scholar]
Cheiran, J.F.; Nedel, L.; Pimenta, M.S. Inclusive games: A multimodal experience for blind players. In Proceedings of the 2011 Brazilian Symposium on Games and Digital Entertainment, Salvador, Brazil, 7–9 November 2011; IEEE: New York, NY, USA, 2011; pp. 164–172. [Google Scholar]
Smith, B.A.; Nayar, S.K. The RAD: Making racing games equivalently accessible to people who are blind. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; pp. 1–12. [Google Scholar]
Zeng, L.; Weber, G. A pilot study of collaborative accessibility: How blind people find an entrance. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services, Copenhagen, Denmark, 24–27 August 2015; pp. 347–356. [Google Scholar]
Phan, M.H.; Keebler, J.R.; Chaparro, B.S. The development and validation of the game user experience satisfaction scale (GUESS). Hum. Factors 2016, 58, 1217–1247. [Google Scholar] [CrossRef]
Ruxton, G.D.; Beauchamp, G. Time for some a priori thinking about post hoc testing. Behav. Ecol. 2008, 19, 690–693. [Google Scholar] [CrossRef]
He, X.; Burger-Helmchen, T. Evolving knowledge management: Artificial intelligence and the dynamics of social interactions. IEEE Eng. Manag. Rev. 2024, 53, 215–231. [Google Scholar] [CrossRef]
Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 2013, 36, 181–204. [Google Scholar] [CrossRef] [PubMed]
Chen, C.; Schissler, C.; Garg, S.; Kobernik, P.; Clegg, A.; Calamia, P.; Batra, D.; Robinson, P.; Grauman, K. Soundspaces 2.0: A simulation platform for visual-acoustic learning. Adv. Neural Inf. Process. Syst. 2022, 35, 8896–8911. [Google Scholar]
Ha, D.; Schmidhuber, J. World models. arXiv 2018, arXiv:1803.10122. [Google Scholar] [PubMed]
Cheng, L.; Wei, X.; Li, M.; Tan, C.; Yin, M.; Shen, T.; Zou, T. Integrating Evolutionary Game-Theoretical Methods and Deep Reinforcement Learning for Adaptive Strategy Optimization in User-Side Electricity Markets: A Comprehensive Review. Mathematics 2024, 12, 3241. [Google Scholar] [CrossRef]
Stein, A.; Yotam, Y.; Puzis, R.; Shani, G.; Taieb-Maimon, M. EEG-triggered dynamic difficulty adjustment for multiplayer games. Entertain. Comput. 2018, 25, 14–25. [Google Scholar] [CrossRef]
Soares, R.T.; Sarmanho, E.; Miura, M.; Barros, T.; Jacobi, R.; Castanho, C. Biofeedback sensors in electronic games: A practical evaluation. In Proceedings of the 2017 16th Brazilian symposium on computer games and digital entertainment (SBGames), Curitiba, Brazil, 2–4 November 2017; IEEE: New York, NY, USA, 2017; pp. 56–65. [Google Scholar]
Hunicke, R. The case for dynamic difficulty adjustment in games. In Proceedings of the 2005 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, Valencia, Spain, 15–17 June 2005; pp. 429–433. [Google Scholar]
Gursesli, M.C.; Selek, M.E.; Samur, M.O.; Duradoni, M.; Park, K.; Guazzini, A.; Lanatà, A. Design of cloud-based real-time eye-tracking monitoring and storage system. Algorithms 2023, 16, 355. [Google Scholar] [CrossRef]
Gursesli, M.C.; Calà, F.; Tarchi, P.; Frassineti, L.; Guazzini, A.; Lanatà, A. Eyetracking correlated in the matching pairs game. In Proceedings of the 2023 IEEE Gaming, Entertainment, and Media Conference (GEM), Bridgetown, Barbados, 19–22 November 2023; IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]

Figure 1. Basic and Deadly Corridor scenarios in VizDoom.

Figure 2. An example of the DareFightingICE platform.

Figure 3. Illustration of (A) the audio encoder and (B) the agent architecture.

Figure 4. DareFightingICE’s blind AI encoders: 1D-CNN (top), Fast Fourier Transform (middle), and Mel (bottom).

Figure 5. The diagram presents the flow of the user study for both games.

Figure 6. Audio Aesthetics: the game effect from the ART model (

p = 0.003

).

Figure 6. Audio Aesthetics: the game effect from the ART model (

p = 0.003

).

Table 1. DareFightingICE: Human performance by expertise and modality (values shown as the mean ± SD).

	Non-Vision Mode (Audio-Only)		Vision Mode (Vision + Audio)
Expertise	Win Ratio (%)	HP Difference	Win Ratio (%)	HP Difference
Very familiar	73.0 ± 0.15	+60.09 ± 45.34	74.0 ± 0.24	+69.78 ± 76.73
Familiar	56.0 ± 0.16	+21.88 + 37.00	74.0 ± 0.14	+61.22 ± 31.71
Somewhat familiar	36.0 ± 0.17	−13.89 ± 33.63	54.0 ± 0.26	+28.51 ± 73.91
Overall Avg.	53.0 ± 0.22	+18.13 ± 52.45	67.0 ± 0.24	+51.31 ± 66.74

Table 2. Comparison of the performances (A: Blind AI vs. Human and B: Blind AI with Sound Design, using the winner sound design from the 2022 DareFightingICE Sound Design and AI Competition [47]).

A. Blind AI vs. Human			B. Blind AI on All Encoders
Type	Win (%)	HP Diff.	Type	Win (%)	HP Diff.
Blind AI	54.0	24.64	1D-CNN	54.0	12.66
Human	53.0	18.13	FFT	53.0	27.03
			Mel-spec	56.0	34.23

Table 3. SonicDoom performance metrics: AI vs. Human (Blind Condition) (values shown as the mean ± SD).

Deadly Corridor Scenario
Player Type	Avg. Survival Time (s) ↑	Avg. Health ↑	Avg. Kills ↑	Avg. Deaths ↓
Very familiar	29.27 ± 8.82	9.02 ± 22.91	3.19 ± 0.94	0.84 ± 0.21
Familiar	28.70 ± 10.05	−9.33 ± 6.63	1.98 ± 1.21	1.0 ± 0.0
Somewhat familar	13.17 ± 6.04	−13.50 ± 3.46	0.78 ± 1.30	1.0 ± 0.0
Overall	25.57 ±10.96	−2.79 ± 18.15	2.21 ± 1.46	0.94 ± 1.16
Blind AI Agent	5.57	−1.02	2.5	0.92
Basic Scenario
Player Type	Avg. Completion Time (s) ↓	-	Avg. Kills ↑	-
Very familiar	5.35 ± 2.71	-	0.64 ± 0.35	-
Familiar	6.11 ± 1.41	-	0.62 ± 0.28	-
Somewhat familar	9.25 ± 3.49	-	0.37 ± 0.30	-
Overall	6.48 ± 2.93	-	0.58 ± 0.33	-
Blind AI Agent	1.55	-	0.31	-

Table 4. SonicDoom performance metrics: AI vs. Human (Non-Blind Condition) (values shown as the mean ± SD).

Deadly Corridor Scenario
Player Type	Avg. Survival Time (s) ↑	Avg. Health ↑	Avg. Kills ↑	Avg. Deaths ↓
Very familiar	10.23 ± 3.91	16.51 ± 21.68	4.48 ± 1.33	0.58 ± 0.39
Familiar	7.61 ± 3.45	10.36 ± 19.24	3.30 ± 1.34	0.79 ± 0.24
Somewhat familar	3.31 ± 4.43	−21.60 ± 29.42	2.44 ± 1.52	0.96 ± 0.22
Overall	7.74 ± 4.65	5.92 ± 27.10	3.59 ± 1.59	0.74 ± 0.33
Blind AI Agent	1.22	154.78	3.02	0.20
Basic Scenario
Player Type	Avg. Completion Time (s) ↓	-	Avg. Kills ↑	-
Very familiar	1.56 ± 0.87	-	1.0 ± 0.0	-
Familiar	2.69 ± 1.30	-	1.0 ± 0.0	-
Somewhat familar	7.81 ± 5.28	-	0.80 ± 0.33	-
Overall	3.34 ± 3.57	-	0.96 ± 0.17	-
Blind AI Agent	0.15	-	1.0	-

Table 5. GUESS group comparisons per game and subscale (Kruskal–Wallis with Holm correction).

	Kruskal–Wallis		Effect Size
Subscale	KW H(df)	p (Holm)	eta²[H] (95% CI)
SonicDoom
Usability/Playability	2.59 (2)	0.274 (0.822)	0.02 [−0.05, 0.29]
Play Engrossment	0.67 (2)	0.715 (1.000)	−0.04 [−0.06, 0.22]
Enjoyment	0.83 (2)	0.661 (1.000)	−0.03 [−0.06, 0.22]
Audio Aesthetics	3.45 (2)	0.178 (0.712)	0.04 [−0.05, 0.34]
DareFightingICE
Usability/Playability	3.42 (2)	0.181 (0.724)	0.04 [−0.05, 0.36]
Play Engrossment	0.81 (2)	0.666 (1.000)	−0.03 [−0.06, 0.26]
Enjoyment	1.32 (2)	0.518 (1.000)	−0.02 [−0.06, 0.25]
Audio Aesthetics	0.79 (2)	0.674 (1.000)	−0.04 [−0.06, 0.23]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Khan, I.; Nguyen, T.V.; Juraj, C.T.; Thawonmas, R. The Novice, the Expert, and the Algorithm: A Comparative Analysis of Human Expertise Transfer and AI Performance in Audio-Only Gaming Environments. Appl. Sci. 2025, 15, 11594. https://doi.org/10.3390/app152111594

AMA Style

Khan I, Nguyen TV, Juraj CT, Thawonmas R. The Novice, the Expert, and the Algorithm: A Comparative Analysis of Human Expertise Transfer and AI Performance in Audio-Only Gaming Environments. Applied Sciences. 2025; 15(21):11594. https://doi.org/10.3390/app152111594

Chicago/Turabian Style

Khan, Ibrahim, Thai Van Nguyen, Cvetković Tijan Juraj, and Ruck Thawonmas. 2025. "The Novice, the Expert, and the Algorithm: A Comparative Analysis of Human Expertise Transfer and AI Performance in Audio-Only Gaming Environments" Applied Sciences 15, no. 21: 11594. https://doi.org/10.3390/app152111594

APA Style

Khan, I., Nguyen, T. V., Juraj, C. T., & Thawonmas, R. (2025). The Novice, the Expert, and the Algorithm: A Comparative Analysis of Human Expertise Transfer and AI Performance in Audio-Only Gaming Environments. Applied Sciences, 15(21), 11594. https://doi.org/10.3390/app152111594

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

The Novice, the Expert, and the Algorithm: A Comparative Analysis of Human Expertise Transfer and AI Performance in Audio-Only Gaming Environments

Abstract

1. Introduction

2. Related Work

2.1. Sound in Video Games

2.2. Sound and Performance in First-Person Shooter and Fighting Games

2.3. Accessibility in Video Games

2.4. Audio and Player Performance

3. Methodology

3.1. Experimental Platforms and Tasks

3.1.1. FPS Testbed: SonicDoom

3.1.2. Fighting Game Testbed: DareFightingICE

3.2. AI Agent Implementation

3.2.1. VizDoom

3.2.2. DareFightingICE

3.3. Human Participant Study

3.3.1. Procedure

3.3.2. Participants

3.3.3. Measures

3.4. Data Analysis

4. Results

4.1. DareFightingICE

4.1.1. Human Performance Across Modalities

4.1.2. Human vs. AI Comparison

4.2. SonicDoom

Human Performance Across Modalities

4.3. Human vs. AI Comparison in SonicDoom

4.3.1. Blind (Audio-Only) Condition Analysis

4.3.2. Non-Blind (Vision) Condition Analysis

4.4. Subjective User Experience (GUESS Scores)

5. Discussion

5.1. The Divergence of Human and AI Expertise

5.2. Implications for Accessible Game Design and AI Development

5.3. Limitations and Future Directions

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A. Individual Player Performance Data

Appendix A.1. DareFightingICE: Individual Performance Metrics

Appendix A.2. SonicDoom: Individual Performance Metrics

Appendix A.2.1. Insights from Individual Performance

Expertise Directly Correlates with Adaptability in DareFightingICE

Visual Deprivation Severely Hampers Navigational and Combat Efficiency in SonicDoom

High Degree of Performance Variability Within Experience Groups

Appendix B. GUESS Scores

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI