4.1. Spatial Perception Differences and Coexistence Mechanisms of Natural–Cultural Soundscapes: The Role of Spatial Expectations
This study collected data on visitors’ sound source recognition, subjective experiences, and sound intensity using questionnaires and sound level meters. We propose that the influence of sound source type and sound intensity on subjective perception is significantly shaped by spatial context and cultural expectations, a dimension underemphasized in existing soundscape research on historic areas [49,55]. Tourists’ expectations regarding different spatial functions can amplify or buffer the impact of the acoustic environment on their subjective evaluations. For example, in industrial heritage sites, tourists’ expectations of mechanical sounds can enhance their sense of immersion [57]. This is consistent with the “Expectation–Perception Matching Theory” [49], which we extend to the context of cultural heritage. Based on empirical evidence from the historical–natural interface zone, this study reveals two soundscape perception mechanisms that supplement and refine existing theories.
First, visitors’ context-specific expectations (quietness for forests, liveliness for historic streets) influence their subjective evaluations, potentially overriding objective sound intensity levels. For example, in our measurements, although LAeq was higher in the forest than in the street area, visitors rated the street as noisier than the forest. This contradiction can be explained by the expectation mechanism in environmental psychology, which refers to the positive impact of people’s expectations on their perceptions through “cognitive–emotional consistency” [58]. In the study area, tourists form a dual reference framework based on spatial functions. On entering a forest, they take “natural tranquility” as the reference (for example, expecting the sound of wind blowing through bamboo and distant water flows), so even high-intensity natural sounds are perceived as “consistent with the natural environment” and are not labeled as “noise”. In historical streets, by contrast, the reference shifts to “orderly cultural vitality” (for example, expecting folk performances and the cries of vendors), so sudden peak sounds (such as sharp vendor broadcasts) are regarded as “disrupting the expected order” and amplified as “noise”. This finding is similar to the results of the traffic noise reduction experiment conducted by Levenhagen et al. (2021) in U.S. national parks: reducing traffic noise levels did not significantly improve visitors’ soundscape pleasantness, and visitors’ evaluation of the soundscape depended on their pre-existing psychological expectations of natural sounds rather than on the physical sound pressure level [59]. Liu et al. (2017) found in their study of natural park soundscapes that tourists’ tolerance for aircraft noise increased by 15% when they were informed in advance that “aircraft often pass by here” [50]. Our study further specifies that this effect is strengthened in historic–natural interface areas, where natural sounds (e.g., wind through bamboo in Wuhou Shrine’s forest) are culturally framed as “complementary to historical ambiance”, whereas anthropogenic sounds (e.g., vendor calls in Jinli Street) are perceived as “disruptive to historical immersion” despite lower LAeq. In other words, while the street was objectively quieter, it was perceived as noisier, likely due to a mismatch between acoustic features and spatial expectations.
Second, the difference in tourists’ perceptual evaluations between forests and urban blocks may be attributed to the composition of the dominant sound sources. According to the survey results, forest soundscapes are dominated by geophysical and biological sounds, while street soundscapes are dominated by biological and anthropogenic sounds. Consistent with this difference, the survey shows that people perceive a higher degree of pleasure, comfort, and harmony in forests. This indicates that in historical areas, people’s tolerance for high-intensity natural sounds may be higher than that for high-intensity cultural sounds. This finding is similar to the results of recent studies in two respects. On the one hand, researchers generally agree that natural soundscapes lead to higher satisfaction: a cross-cultural study by Purves and Wartmann (2023) shows that natural sounds are more often regarded as environmental backgrounds, while cultural sounds are to some extent considered invasive activities [53]; a study by Ye et al. (2024) also found that even natural sound sources with a high sound pressure level (>60 dBA), such as flowing water, remain acceptable because they meet people’s expectations for the natural characteristics of historical environments [3]; and natural environments can reduce the annoyance caused by high sound pressure levels, thereby enhancing the pleasantness of soundscapes [60,61]. Vegetation in urban green spaces can also soften sound perception and reduce negative emotions caused by loudness, and the reduction of human-related sounds in natural spaces can improve people’s evaluation of soundscapes [3,62,63,64]. On the other hand, researchers report that crowd noise is more likely to cause aversion, because such sounds damage the spirit of the place [3]. In historical environments, human sounds, which are classified as biological sounds in this study but are essentially anthropogenic, are generally considered to have a negative impact on pleasure [44] because they undermine the authenticity of the historical soundscape that tourists seek [49]. In addition, the density of street crowds may increase subjective loudness, which is consistent with the study by Meng et al. (2015) [65]. The dominant sound also affects tourists’ overall perception of the loudness of a space, and its impact is usually greater than that of background or insignificant sounds [44]. For example, tourists in street areas regard conversation and broadcast sounds as the dominant sound sources, while in forests, geophysical sounds (such as wind and water) are often regarded as dominant. Therefore, even though the sound pressure level in forests is higher than that in urban blocks, tourists still consider forests quieter than urban blocks.
In conclusion, people’s expectations of soundscapes are shaped by their previous experiences in similar environments. When the actual soundscape aligns with these expectations, people are less likely to form negative impressions and may even be less aware of the existence of the sound environment [49,57]. For example, in studies, tourists rated the harmony, comfort, and pleasantness of street areas higher than their quietness. This indicates that despite the relatively high noise levels, tourists have certain expectations about the liveliness of the area, which helps to mitigate negative perceptions. Our research extends this insight to historical areas, showing that the “expected sense of liveliness” stems from cultural associations related to heritage (for instance, Jinli Street’s reputation as a “traditional commercial block”), making it a more important predictor of subjective experience than objective quietness.
4.2. Soundscape Conflict at the Historic–Forest Interface: Characteristics, Mechanisms, and Theoretical Implications
The Wuhou Shrine Museum’s forest (Cultural Relics and Western Areas) and historic district (Jinli Ancient Street) are directly adjacent, with multiple pathways on the north side facilitating sound interaction between the two spaces. Kernel density analysis of sound intensity, sound source recognition, and subjective experience revealed a critical phenomenon: the interface zone (Points 1, 3, 4, 8) exhibited higher soundscape diversity but lower subjective evaluations (comfort score = 3.1) than both the forest and the historic street. This high-diversity, low-satisfaction paradox defines the core of soundscape conflict at the historic–forest interface, a pattern not fully addressed in existing research on urban interface soundscapes [66].
In terms of sound intensity, the interface zone had lower LAeq than both the forest and the street, yet visitors reported more perceived noise (quietness score = 2.7, vs. forest 3.43 and street 2.41; Table 7). This discrepancy stems from sound masking by dominant anthropogenic sounds: peak sounds from the street (e.g., sudden vendor broadcasts, tourist chatter) masked subtle natural sounds in the forest (e.g., wind through bamboo, distant water flow), reducing visitors’ ability to perceive coherent soundscapes. This is consistent with recent studies showing that anthropogenic noise can impair humans’ ability to recognize biological sounds and low-frequency sounds through spectral masking and cognitive interference [67,68]. It also aligns with Liu et al. (2014), who demonstrated that dominant sound sources in interface zones disrupt the detectability of complementary sounds, leading to perceived acoustic disorder [66]. Notably, the interface lacked shared activity nodes (e.g., plazas, green pavilions) to buffer sound transmission, unlike the forest’s water features or the street’s performance areas, which further exacerbated sound incoherence.
Regarding sound source recognition, the interface’s high SDI (driven by mixed geophony, biophony, and anthropophony) did not enhance perceptual richness but instead increased confusion. This finding is also consistent with Liu et al. (2014) [66], who found that a higher SDI impairs the perception of bird songs (biological sounds). This is because anthropogenic sounds (e.g., street broadcasts, PO = 4.2) frequently overlapped with natural sounds (e.g., birdcalls, PO = 3.9) at the interface, creating “acoustic competition” that violated visitors’ context-specific expectations [66,67]. This finding extends the research of Schreckenberg et al. (2010) on landscape–soundscape interactions, which noted that “visual cues shape soundscape expectations, and mismatches intensify dissatisfaction” in heritage sites [69].
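How a mixed source composition raises a diversity index can be sketched in a few lines. The sketch assumes that SDI behaves like a Shannon-type index computed from the relative share of each sound category; the study’s own definition in its methods section takes precedence if it differs. The function name and the category weights are hypothetical and are not values from the study.

```python
import math

def shannon_diversity(weights):
    """Shannon index H = -sum(p * ln p) over categories with positive weight."""
    total = sum(weights.values())
    shares = [w / total for w in weights.values() if w > 0]
    return -sum(p * math.log(p) for p in shares)

# Hypothetical perceived-occurrence weights, NOT values from the study.
forest = {"geophony": 6.0, "biophony": 3.0, "anthropophony": 1.0}
interface = {"geophony": 3.0, "biophony": 3.0, "anthropophony": 3.0}

# An evenly mixed composition yields the larger index.
print(round(shannon_diversity(forest), 3))
print(round(shannon_diversity(interface), 3))
```

Under this assumption, the index peaks when no category dominates, which is the situation described for the interface: higher diversity reflects competing sources rather than a richer, more coherent soundscape.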
Subjective evaluations at the boundary are also lower than those on both sides. This may be related to the influence of tourists’ visual perception on soundscape evaluation, and its underlying causes can be further explained through the expectation mechanism [49,58]. Specifically, when the soundscape matches tourists’ cultural or experiential emotional memories, it triggers positive emotional responses; conversely, mismatched sounds break this connection, leading to negative evaluations. In the context of Wuhou Temple, tourists form two core emotional connections based on cultural cognition: (1) For the forest area (cultural relics area), the emotional connection is “historical tranquility”: tourists associate the forest with the cultural background of the Three Kingdoms, so natural sounds (such as the rustling of bamboo leaves in the wind) strengthen this connection and enhance comfort. (2) For the street area (Jinli Ancient Street), the emotional connection is “traditional liveliness”: tourists associate the street with “ancient commercial scenes” (such as folk performances and traditional vendors’ cries), so moderate cultural sounds strengthen this connection and offset the negative impact of low quietness. In the interface area, however, mixed sounds (such as overlapping street broadcasts and bird songs) break both emotional connections: artificial sounds disrupt the “historical tranquility” of the forest, while faint natural sounds cannot support the “traditional liveliness” of the street, ultimately resulting in a low harmony score. In addition, the high crowd density at path intersections also reduces tourists’ subjective evaluations [34]. Notably, unlike in the street area, where crowd noise is often regarded as “vibrant cultural vitality” [70], the crowd sounds at the junction conflict with tourists’ expectation of a “peaceful transition between history and nature”, further exacerbating the evaluation contradiction.
Existing studies on historic or forest soundscapes often treat the two as isolated systems [31,45], but this study highlights that their interface is not a “neutral transition” but a dynamic zone of conflict shaped by three interrelated mechanisms: sound masking, audiovisual mismatch, and transient crowding. This finding specifies that in historic–natural contexts, conflict arises not from high sound intensity alone, but from the “misalignment between sound composition, visual cues, and cultural expectations.” For example, the interface’s mixed sounds failed to align with either the street’s “cultural liveliness” or the forest’s “natural tranquility”, leaving visitors without a coherent perceptual frame. Identifying how soundscapes integrate in this special type of space not only addresses the limitation of existing studies, which mostly focus on single spaces and ignore the complexity of boundary zones, but also provides a supplementary perspective of “cultural–natural composite scenes” for mainstream soundscape theories. For example, existing international studies are mostly based on the single logic of “natural soundscapes promoting restoration” or “cultural soundscapes enhancing identity”. This study, through its case study of boundary zones, indicates that integrating these two types of soundscapes is not a simple choice between “natural sounds dominating” and “cultural sounds leading”, but requires an integration model based on the “dual historical–ecological attributes” of the space itself. For heritage sites like Wuhou Temple, which are centered on the Three Kingdoms culture, the optimal state of soundscape integration in the boundary zone is “taking natural sounds (flowing water, bamboo forest sounds) as the base, and low-intensity cultural sounds (traditional musical instrument performances, soft explanations) as embellishments”. This retains the restorative value of the natural environment without weakening the cultural immersion of historical scenes.
This pattern not only provides a reusable theoretical framework for soundscape planning in similar “forest–historical boundary zones” (such as Suzhou gardens and surrounding urban green spaces, or Kyoto’s historical districts and suburban mountains and forests) but also contributes empirical evidence from the Chinese context to international soundscape research on heritage sites. It helps refine and deepen the theory of “cultural–natural soundscape integration” in cross-cultural contexts and enhances the practical reference value of this study.
4.3. Factors Influencing Soundscape Perception: Sound Intensity, Sound Source Recognition, and Individual Characteristics
To explore how sound intensity, sound source recognition, and subjective experience interact, this study constructed an SEM to examine the effects of these variables on visitors’ perceptual outcomes. The results of the SEM (after modification: χ²/df = 1.62, GFI = 0.9, CFI = 0.921, RMSEA = 0.075; Table 6) confirm that the three factors of sound intensity, sound source identification, and individual characteristics jointly explain subjective perception, each through a different influence mechanism. Hypotheses Ha (sound intensity has a positive impact on subjective experience), Hb (sound source identification has a negative impact on subjective experience), and Hc (individual characteristics have a negative impact on subjective experience) all passed the significance test (Table 8). This result provides key empirical support for multi-factor interaction research on soundscape perception, and both echoes and expands upon previous studies.
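As a reading aid, the reported fit indices can be compared against rule-of-thumb cut-offs. The cut-offs below are conventions often quoted in the SEM literature and are an assumption of this sketch, not thresholds stated by the study; the function and dictionary names are illustrative.

```python
# Rule-of-thumb cut-offs often quoted for SEM fit; an assumption of this
# sketch, not thresholds stated by the study.
CUTOFFS = {
    "chi2_df": ("max", 3.0),   # chi-square / degrees of freedom
    "GFI": ("min", 0.90),
    "CFI": ("min", 0.90),
    "RMSEA": ("max", 0.08),
}

def check_fit(indices, cutoffs=CUTOFFS):
    """Return {index name: True if the value meets its cut-off}."""
    result = {}
    for name, value in indices.items():
        kind, limit = cutoffs[name]
        result[name] = value <= limit if kind == "max" else value >= limit
    return result

# Values reported for the modified model.
reported = {"chi2_df": 1.62, "GFI": 0.9, "CFI": 0.921, "RMSEA": 0.075}
print(check_fit(reported))
```

Under these assumed cut-offs, all four reported indices fall on the acceptable side, with GFI and RMSEA close to their limits. Stricter conventions exist and would give a less favorable reading.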
Hypothesis Ha (sound intensity has a positive effect on subjective experience) was supported. Among the sound intensity metrics (L10, LAeq, L50, L90), the SEM regression coefficients showed a clear hierarchy: L10 > LAeq > L50 > L90 (Table 6). Because L10 is the level exceeded for 10% of the measurement time and L90 the level exceeded for 90% of it, this indicates that peak sounds have the strongest impact on subjective experience, while background sounds have minimal influence. This finding contrasts with Axelsson et al. (2010), who identified LAeq as the primary predictor of comfort in general urban spaces [71], yet aligns with the observation that “peak sounds dominate perceptual evaluations in heritage sites” because they disrupt the “historical immersion” visitors seek [3,44]. In addition, previous studies have shown that loud mechanical noises (which occur frequently on the street) can affect human auditory capacity, impede stress recovery, and lead to negative emotions [72,73,74]. This also explains why visitors perceive the street as noisier, as commercial streets often feature harsh, disruptive sounds such as vendor calls. For instance, even though the forest had higher LAeq than the street, the street’s more frequent peak sounds led visitors to rate it as “noisier”, highlighting that peak sound control, not just average intensity reduction, is critical for historic soundscape optimization.
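The difference between average and peak metrics can be illustrated with a short sketch. It assumes equally spaced short-term A-weighted levels, uses the standard energy average for LAeq and a nearest-rank rule for the percentile levels, and runs on synthetic numbers that are not the study’s measurements. The function names are illustrative.

```python
import math

def laeq(levels_db):
    """Energy-equivalent level of equally spaced short-term levels (dB)."""
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)

def exceedance_level(levels_db, percent):
    """L_N: level equalled or exceeded for `percent` % of samples (nearest rank)."""
    ordered = sorted(levels_db, reverse=True)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic series, NOT measurements from the study.
steady = [60.0] * 10           # constant sound, e.g. flowing water
peaky = [50.0] * 9 + [68.0]    # quiet background with one loud event

for name, series in (("steady", steady), ("peaky", peaky)):
    print(name,
          "LAeq:", round(laeq(series), 1),
          "L10:", exceedance_level(series, 10),
          "L90:", exceedance_level(series, 90))
```

In this synthetic case the peaky series has the lower LAeq but the higher L10, the same pattern as a street that is quieter on average yet punctuated by loud events.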
Hypothesis Hab (there is a significant interaction between sound intensity and sound source identification) was not supported: the covariance coefficient between sound intensity and sound source identification was not significant (Table 8), which reflects the spatial segmentation of soundscapes in the study area. Historical streets are dominated by medium-to-low intensity anthropogenic sounds, while forests are characterized by high-intensity natural sounds. As found by Liu et al. (2014), when soundscapes are spatially separated (for example, by the bamboo forest buffer zone of Wuhou Temple), intensity and sound source type become “decoupled”, which explains why no direct interaction was observed [66]. This finding adds nuance to the model by Aletta et al. (2016), which assumes a universal interaction between intensity and sound source identification [17]; this study shows instead that the relationship depends on the spatial context. The value of the non-significant result for Hab therefore lies in revealing the spatial boundary conditions of the “sound intensity–sound source identification association”: when soundscapes are clearly separated in space, the interaction between the two no longer holds, which has implications for revising universal models of soundscape perception.
In addition, the test of Hypothesis Hb, which concerns the effect of sound source recognition (including anthropogenic and biological sounds) on subjective experience, shows that sound source recognition (PO) has a negative impact on subjective experience, with biological sounds having the most significant effect. This result differs from our initial assumption that biological sounds, particularly non-human biophony such as bird songs, would exert a neutral or positive influence on subjective experience, as supported by prior studies linking natural biophony to restorative effects [32,33,34,35,36]. The key explanation lies in the composition of “biological sounds” in our dataset: while bird songs were present, they were outnumbered by human conversations (classified as biological sounds in our framework but inherently anthropogenic in nature). Biological sounds such as conversations have a notable negative impact on visitors’ subjective experience when they occur frequently [3,70]. Studies have found that human-dominated biological sounds undermine “historical authenticity” [70]. The kernel density analysis (Figure 5) also reveals that areas with higher perception of biological sounds tend to have lower subjective ratings. In the forest, although there are positive sound sources such as birdsong, human conversation has a strong masking effect, which may be why a high occurrence of biological sounds lowers people’s subjective experience. Therefore, in forests located in historical areas, it is necessary to control visitor density to a certain extent and set up quiet visiting areas [38], so that biological sounds such as bird calls can be more distinct.
Hypothesis Hc (individual characteristics have a significant effect on subjective experience) was supported: individual characteristics exerted a significant negative overall effect on subjective perception (Table 8), but subgroup differences revealed critical nuances tied to the study’s questionnaire data (Section 2.4). Age correlated positively with negative perceptions: older visitors were more sensitive to peak sounds; as Schreckenberg et al. (2010) noted, “noise sensitivity increases with age due to auditory system changes” [69]. Visit frequency also mattered: frequent visitors reported lower comfort, likely because they had higher expectations for “acoustic consistency”. In contrast, education and place of origin had positive associations: visitors with graduate degrees or above showed greater tolerance for “controlled cultural noise”, while non-local tourists perceived street liveliness as a “cultural attraction” rather than a disturbance. Income also had a weak positive effect, with higher-income visitors more likely to overlook minor noise disruptions.
Furthermore, this study’s regional, site, and cultural backgrounds enable its findings to supplement international soundscape theory and offer differentiated contributions in three respects: (1) Regional dimension: Unlike studies on low-density European heritage sites [44,45,46], Wuhou Temple (in the core of Chengdu, a western Chinese megacity) features “high overlap between historical space and urban life”. High visitor density in Chinese scenic spots creates frequent peak sounds (e.g., folk performances, tourist conversations), making L10 (not the commonly used LAeq) the core indicator of subjective experience, which informs research on high-density Asian urban heritage sites. (2) Site dimension: Unlike the open European urban parks studied by Aletta et al. (2016) [70], Wuhou Temple adopts a “closed-separated” layout via “bamboo buffers + low walls”. Soundscape overlap concentrates at the junction of historical streets and urban forests, leading to an insignificant “sound intensity–sound source identification” interaction, which contradicts the “universal positive interaction” assumption in international models and provides a classification reference for site-specific soundscape studies. (3) Cultural dimension: In contrast to natural sound protection in nature reserves [72], Wuhou Temple (a core Three Kingdoms cultural heritage site) endows forest sounds (e.g., bamboo wind, water) with cultural meanings, giving them dual “ecological restoration–cultural carrying” functions. This expands soundscape theory and offers a non-Western perspective for natural soundscape protection in cultural heritage sites. In summary, this study verifies the applicability of international soundscape theory in specific scenarios, reveals soundscape patterns of high-density, culturally overlapping East Asian heritage sites, and provides empirical support for the regional and cultural diversity of international soundscape research.
To contextualize the study’s methodological and theoretical contributions, this research develops a tripartite soundscape framework tailored explicitly to historic districts, a tool designed to address the interplay of culture, nature, and human perception in heritage settings. This framework consists of three core components: (1) sound source typology, (2) perceptual dimensions (pleasantness, quietness, harmony, comfort), and (3) socio-demographic profiling. It differs from standardized soundscape scales by prioritizing context-specificity over universality. For instance, Axelsson’s model focuses on a universal Pleasure-Eventfulness-Familiarity triad [71], and ISO/TS 12913-2 emphasizes general spatial metrics [75], yet both tools overlook the “cultural compatibility” of sounds in heritage sites. In contrast, our framework explicitly includes “harmony” as a key perceptual dimension, defined as the alignment between acoustic elements and a site’s historical identity. In practice, “harmony” is operationalized through survey questions such as “Does the sound of this area match its historical characteristics?” and measured on a 5-point Likert scale. This dimension was identified via exploratory factor analysis (Section 2.4), which showed that harmony loaded separately from pleasure or comfort, capturing unique variance related to perceived congruence between acoustic elements and the site’s cultural identity. This allows us to quantify how sound combinations align with heritage expectations. In addition, compared to objective bioacoustic indices (e.g., NDSI for biodiversity quantification) [76], our subjective classification captures visitor-perceived sound dominance. Socio-economic variables (income, origin) extend beyond the SSID protocol’s basic demographics [77], revealing subgroup variations (e.g., income-correlated quietness preferences). Future studies could integrate ISO-compliant acoustic indices (e.g., NDSI/ACI) with our perceptual framework to enhance ecological validity while retaining contextual strengths, or adopt machine learning for cross-cultural predictive modeling.