Sound Water Masking to Match a Waterfront Soundscape with The Users’ Expectations: The Case Study of The Seafront in Naples, Italy

In recent decades, the soundscape approach has attracted the attention of architects and urban planners, leading them to incorporate acoustic features into the enjoyment of their creations. One of the key aspects of an appreciated urban environment is that it matches the expectations of its users. In this study, the matching of the waterfront soundscape with the users’ expectations is evaluated by laboratory tests using semantic differential scales applied to reproduced virtual scenarios obtained by adding different water sound pressure levels (SPLs) to the original in-situ setting. The tests were carried out with an immersive virtual reality (IVR) device, using 360° videos and spatial audio recorded at two sites on the waterfront in Naples, Italy. The scenarios were presented to the participants according to three experimental protocols, namely audio-only (A), video-only (V), and simultaneous audio-video (AV) reproduction. The acoustic scenarios examined were the original one recorded in situ and others obtained by adding seawater sounds at SPL increments of 5 dB. The results show that, under the audio-only protocol, all the scenarios with added water sounds are rated more pleasant than the original one. When video and audio are displayed simultaneously, two scenarios are more pleasant than the original one, likely because there is a need for coherence between the water sound SPL heard and the visible noise sources. Sounds coherent with the type of shore show a higher matching with expectations and higher pleasantness appraisals than those that are incoherent with the layout of the scenario.


Introduction
Understanding the complex phenomena involved in environmental noise perception is becoming a priority in strategies to fight against outdoor noise, especially because noise exposure represents a significant risk to physical and mental health; among other effects, it can cause sleep disturbances, cardiovascular problems, and cognitive impairment [1][2][3][4][5][6][7][8][9]. Environmental perception can be defined as a multi-dimensional phenomenon catching objects and events by the senses, understanding and identifying them in the environment and, eventually, creating a reaction [10,11]. Thus, gathering sensorial information is a three-stage process: Sensation (input), organization of the perceived elements and their identification/recognition (interpretation), and the response (outcome) [12]. The context of the acoustic environment, including non-auditory factors (e.g., the location features, individual differences, socio-cultural aspects, etc.) [13][14][15][16], can also determine how the auditory sensation is interpreted. The dynamism of the places can also affect the soundscape perception, for instance preferring the acoustic environments that best match the type of activities occurring in the place [17][18][19]. Dealing with people activity, Jo and Jeon have recently found that sounds made by people decreased the perceived tranquility of the soundscape, but made it more dynamic [16]. Other contextual factors, such as the visual stimuli and the landscape features, can strongly determine the perception of an acoustic environment [20][21][22][23][24]. For instance, Menzel et al. [25] found differences in the loudness perception of a car pass-by depending on its color. Liu et al. studied two types of landscape factors, visual and functional, evidencing that two of them are highly correlated with the overall soundscape preference [26]. Visual aspects may influence the perception of the acoustic environment [27], and vice-versa [28][29][30][31]. 
In that regard, Anderson et al. evidenced that the perception of an outdoor setting was conditioned by the sounds heard there [29], and Gan et al., in a study performed in natural and urban areas, found out that the acoustic preference played a much more important role in landscape evaluation than visual preferences [28]. The urban morphology (street patterns, low or high building density, etc.) is another aspect to take into account when studying the influence of the context on the acoustic perception [32][33][34]. A study by Liu et al. evidenced that the composition and structure of the landscape characterized through spatial patterns could influence the soundscape perception of urban areas [35]. In previous studies carried out by the research group of this paper in the waterfront of Naples, landscape factors related to the shape of the waterfront were found to be related with the acoustic perception of the environment [36,37].
The individual characteristics of respondents should also be considered when evaluating the factors influencing soundscape perception. In top-down processing, also called conceptually-driven processing, the assimilation of sensory information is influenced by expectation, stored knowledge, interests, beliefs, and context [38,39]. The expectations and previous experiences that subjects have of a specific environment can influence its perception [40] through the codification of a complex cognitive structure. Several studies have evaluated the influence of expectations on certain aspects of the perception of an acoustic environment [40][41][42][43]. Brambilla and Maffei suggested that the annoyance caused by a sound source is influenced by the expectation of hearing it and by its congruence with the context [41]. Expectations from context were included as one of the components in a model of human auditory perception developed by Botteldooren and De Coensel [44]. Bruce and Davies developed a model in which the expectations of an acoustic environment depend on the expectations of the different elements composing it [40]. Bruce et al. used sense walks to analyze the influence of expectations about individual factors, such as the soundscape and the smellscape, on environmental perception [45]. Some research indicates that certain types of expectations, such as those generated by negative information about certain noise sources, can increase the perceived noise nuisance. This was observed in the study carried out by Crichton et al. on the perception of wind farm noise [46], in which participants receiving positive information were less annoyed by noise than those receiving negative information. They also found a relation between noise sensitivity and predicted noise annoyance in those receiving negative information about wind farms.

Water Masking Sounds
There is clear evidence in the literature that in open spaces people prefer to hear natural sounds over artificial man-made sounds [47,48]. Sound masking techniques based on natural sounds are sometimes used to make the users' stay more pleasant in public spaces where technological sounds, such as road traffic noise, are predominant. One of the most used masking sounds is that of water, generally coming from water fountains in their different forms (interactive floor water fountains, wall fountains, water curtain walls, water sculptures, water games for playgrounds, etc.). The widespread use of this sound source is due in part to experimental studies showing the benefits of water sounds: Verena-Thoma et al. evidenced that water sounds reduce stress in individuals with somatic complaints, an effect not observed when listening to music or in silence [49]; Jeon et al. stated that water sounds are the preferred type of natural sound masking, compared to bird twittering, wind, or church bells [50]; likewise, Hao and Nilsson et al. suggested that water sound reduces the perceived loudness of road traffic noise [34,51], and You et al. found that the preferred water sound level for masking is 3 dB below the background noise level [52]. In that regard, the feeling of improved tranquility occurs even with low sound pressure levels (SPLs) of water sounds [53].

Immersive Virtual Reality as a Tool for the Appraisal of Soundscape
The validity and reliability of immersive virtual reality (IVR) applied to the soundscape appraisal compared to on-site surveys have been proven successfully in the literature: This technique combines spatial audio recording and playback (e.g., Ambisonic coding) with visual scenarios generated entirely by computer [18,54], like 360° stereoscopic videos [55]. The results of applying these techniques evidenced that, comparing on-site surveys and IVR test results, there was no statistically significant difference in the responses given on the soundscape quality. In the last years, different studies can be found in the literature using 360° videos and spatial audio [55]; the flexibility and affordability of these audio-visual systems encouraged their application for different research purposes, such as soundscape classification [56], evaluation of the pleasantness of natural sounds in residential areas [57], or studying the effect of social interaction on the perception of the soundscape [16].
A recent study has explored the perception of loudness and annoyance of different sound sources, evaluated with the images of the anechoic chamber, and with panoramic videos of the original sound measurement locations [57]. The outcomes of the study reveal that the appraisals of the loudness and annoyance of the sound sources were significantly lower when panoramic videos of the original sound measurement locations were shown, suggesting that visual information of the real environment may influence the results with respect to those obtained for audio-only scenarios.

Motivation and Objectives of the Study
The sound masking studies listed above were carried out by adding water sources visually and acoustically to the scenarios, or by only adding water sound to a reproduced acoustic environment. However, masking studies on urban environments that include water (seafronts, riverfronts, etc.) which is not heard because it is masked by road traffic noise or people's activities have not been undertaken so far. Likewise, how water sounds at different SPLs match people's expectations of the soundscape has not yet been analyzed.
Unfortunately, the important aspect of the users' expectations of different typologies of outdoor spaces is not taken into account in urban design. Thus, it was deemed necessary to investigate new techniques to address this need, and also to obtain tools to improve the acoustic design of spaces. The present study deals with the performance of water sound masking in seafront scenarios and assesses how well the seafront soundscape matches the users' expectations. Two areas of the seafront in Naples have been considered, and water sounds at different SPLs have been added to the original sound recordings, also taking into account the appraisals of expectations, the artificiality of the added water sounds and their relevance to the context, and the sound sources that could be heard.
The objective of the study was to test the following hypotheses, named H1 to H5:

Hypothesis 1 (H1). The SPL of the seawater sounds added to the original audio-visual environment does not affect the degree of satisfaction regarding the expectations on the acoustic environment (masking water sound SPLs and expectations).

Hypothesis 2 (H2). Two different audio-visual scenarios of the seafront with the same seawater SPL added will satisfy the participants' expectations to the same degree (locations and expectations).

Hypothesis 3 (H3). Semantic differential scale (SDS) ratings on the soundscape are the same for two different sound masking conditions (SDS and water sound SPLs).

Hypothesis 4 (H4). Considering equal acoustic environment conditions, SDS ratings are the same for audio-visual (AV) scenarios as for audio-only (A) scenarios (SDS and experimental stimuli).

Hypothesis 5 (H5). The addition of water sounds not coherent with the layout of the promenade does not influence the degree of satisfaction of the users' expectations of the acoustic environment (coherence and expectations).

Materials and Methods
In Naples, despite the proximity of the sea, the characteristics of the places are such that the sound of the sea waves can be heard almost nowhere along the seafront. The absence of this sound is due to the sea normally being calm and to the fact that, in most areas, there is a certain distance (vertical or horizontal) between the seashore or the breakwater and the seafront promenade. This setting prompted an evaluation of which water sound levels would match people's expectations and improve their ratings of the acoustic environment.
Two sites were selected as scenarios for the IVR surveys, based on the type of sounds that could be heard and the topological relationship between the promenade and the water. Site 1 (L1, Acton street) is a street dominated by traffic noise, where natural sound sources are seldom perceived, except for the wind. There is a considerable vertical distance between sea level at the breakwater and the promenade. Site 2 (L2, beach) is a pedestrian zone next to an artificial seashore; a busy street is nearby, but the background traffic noise is not so high (Figure 1). During the sound recordings, the sea was very calm, and the sound of the sea waves was hardly heard. The visual and acoustic environments were recorded simultaneously using Go-Pro Hero4 cameras and an Ambisonic microphone (Soundfield SPS 200) at 1.65 m height, according to the recommendations of ISO 12913-2 [58]. The duration of the recordings was 120 s. A class 1 sound level meter (SOLO 01DB) was used to measure the SPLs and the spectrum of the acoustic environment. The Surround Zone VST Plugin was used to convert the recorded A-format to the B-format required for reproduction in the laboratory [59]. Images taken from six camera positions, arranged to capture the visual field of a sphere, were edited and merged to create the 360° videos. The images of the microphone, the sound level meter, and the tripods that held the devices were removed by means of video editing. Audio and visual scenarios were embedded using the Vizard 5.2 software.
The listening tests were carried out in the anechoic chamber of the University of Campania. Five speakers (DynaudioBM5AMKII) and one subwoofer (DynaudioBM9S) arranged symmetrically in a 5.1 configuration were located in a 3 m diameter circle. An external sound card Motu 828 MKII drove the loudspeakers. The reproduction chain was calibrated in order that the equivalent sound levels measured at the participant's position were equal to those obtained in the field measurements, with a tolerance of 1.5 dBA. The experiment was split into a pilot and a main phase.

Pilot Study
This study, formed by two listening sessions, was undertaken to select the appropriate SPL steps of the seawater sounds to be added to the in situ recordings. The study was carried out in the anechoic chamber of the University of Campania. Each session was attended by a different group of 10 students. In both sessions, the assessment of the stimuli was collected with four semantic differential scales (SDS), namely unpleasant-pleasant, chaotic-calm, eventful-uneventful, and boring-exciting [60], shown on the display of the Oculus headset used for the stimuli reproduction. Hereinafter, each scale is named SDS followed by the left-hand side attribute of the scale itself. For the first session, 8 different audio-visual scenarios were played back, formed by the 360° video taken at site L1 and 8 different acoustic environments, namely the original audio recording in situ (N0, with LAeq set to 65 dBA) and seven others obtained by adding seawater sounds to N0 at SPLs varying from 56 to 74 dBA in 3 dB steps (−9, −6, −3, 0, +3, +6, or +9 dB referred to the LAeq of N0). The criterion used by Jeon et al. and Galbrun et al. in [61,62] was applied for the selection of the SPL steps. During this session, audio-visual (AV) scenarios were shown to participants in randomized order, following a balanced Latin square design. Once the participants had rated all the scenarios, they were asked to evaluate again the audio-visual (AV) scenario to which they had given the highest rating on the SDS-Unpleasant, as a control question. The second session differed in the SPL increments: a 5 dB step was chosen, as in a test by Jeon et al. in which semantic differential scales were used [61]. Six acoustic settings were considered, formed by N0 again and five others obtained by adding seawater sounds to N0 at SPLs varying from 55 to 75 dBA in 5 dB steps. A summary of the experimental design is reported in Table 1.
Table 1. Experimental design of the pilot study: Session number, group of participants (A or B, with 10 participants each), location (L1-Acton street), audio-visual stimuli (AV) and control question. Acoustic environments used at session 1: The original audio recording N065 (with LAeq = 65 dBA) and seven acoustic stimuli obtained by adding seawater sounds to N065 at SPLs varying from 56 to 74 dBA in 3 dB steps (N0 + SW56, N0 + SW59, N0 + SW62, N0 + SW65, N0 + SW68, N0 + SW71, N0 + SW74). Acoustic environments at session 2: N065, and five acoustic stimuli obtained by adding seawater sounds to N065 at SPLs varying from 55 to 75 dBA in 5 dB steps (N0 + SW55, N0 + SW60, N0 + SW65, N0 + SW70, N0 + SW75).
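The balanced Latin square ordering mentioned for the pilot sessions can be illustrated with a minimal sketch. It uses the standard Williams construction for an even number of conditions; the function name `balanced_latin_square` is hypothetical, and the paper does not state which construction or software the authors used, nor how the resulting orders were assigned to the 10 participants of each session.

```python
def balanced_latin_square(n):
    """Return n presentation orders (rows) over conditions 0..n-1.

    Williams construction for even n: each condition appears once in every
    position, and each ordered pair of distinct conditions appears exactly
    once as immediate neighbours.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("this construction requires an even n >= 2")
    # First row: 0, 1, n-1, 2, n-2, ... (alternating low and high indices).
    first, low, high = [0], 1, n - 1
    for k in range(1, n):
        if k % 2 == 1:
            first.append(low)
            low += 1
        else:
            first.append(high)
            high -= 1
    # Remaining rows are cyclic shifts of the first row.
    return [[(c + shift) % n for c in first] for shift in range(n)]


# Session 2 of the pilot: N0 plus seawater sounds at 55..75 dBA in 5 dB steps.
conditions = ["N0"] + [f"N0 + SW{level}" for level in range(55, 80, 5)]
orders = balanced_latin_square(len(conditions))
for row_number, order in enumerate(orders, start=1):
    print(row_number, [conditions[i] for i in order])
```

With six conditions the square yields six orders, so with 10 participants some orders would have to be reused; how that was handled in the study is not reported.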

Main Study
The main study was split into two sessions separated by 15 days and involved 30 subjects (14 women, 16 men), aged from 20 to 47. Each session lasted approximately 30 min. Participants were tested one by one in the anechoic chamber and, before starting the test, they were asked to read the instructions for filling in the questionnaire. General questions (age, gender, education level, and number of visits to the waterfront of Naples) were asked to the participants before starting the first IVR session. Afterwards, they were asked to put on the head-mounted display in order to explore the IVR scenarios. Regarding the stimuli, three experimental settings were arranged and presented to the participants in the following order: Audio-only (A), video-only (V), and audio-video (AV). For each experimental setting, the scenarios were ordered and shown to the participants based on a randomly generated Latin square. The questions were shown on the head-mounted display, and the participants had to answer them out loud so that they were aware of the noise level. An 11-item questionnaire was designed to evaluate the acoustic and visual environment, the expectations of the participants regarding the perceived sound sources, the SPL of water sounds (SW), qualitative features of the soundscape (through SDS), and the artificiality of the added water sounds (Table 2). Each item was rated on a 7-point Likert scale, except for the questions on the appraisal of the acoustic and visual environment, which were rated on a 4-point Likert scale.
Table 2. Questionnaire items, including: "How do you consider this acoustic environment in relation to the following attributes?" and "To what extent does this acoustic environment meet your expectations of the seafront?"
To shorten the duration of the test, and considering the results of the pilot study (reported in Section 2.1), it was decided to evaluate the semantic differential scales, expectations, and artificiality of the sound considering four acoustic environments (N0 + three acoustic scenarios with masking water sounds), and the questions about noise sources, which required more time to be answered, considering only two scenarios. The LAeq of N0 was set at 65 and 55 dBA for the L1 and L2 sites, respectively. The LAeq values of the seawater sound (SW) added in N1, N2, and N3 were 60, 65, and 70 dBA at L1, and 50, 55, and 60 dBA at L2, respectively. The experimental design is summarized in Table 3.
Table 3. Experimental design of the main study: Session number, group of participants (C, with 30 participants), location (L1-Acton street and L2-beach), type of sensorial stimuli (Audio-only, A, video-only, V, and audio-video, AV) and type of questions (semantic differential scale SDS, artificiality, expectations, soundscape quality SQ, landscape quality LQ, and overall environmental quality EQ). Acoustic environment for location L1: The original audio recording N065, and three acoustic stimuli obtained by adding seawater sounds to N065 at SPLs varying from 60 to 70 dBA in 5 dB steps (N1, N2, and N3). Acoustic environment for location L2: The original audio recording N055, and three acoustic stimuli obtained by adding seawater sounds to N055 at SPLs varying from 50 to 60 dBA in 5 dB steps (N1, N2, and N3).
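As a worked illustration of the levels just listed, the following sketch estimates the overall level that results when a seawater sound is added to the original recording. It assumes the two signals are uncorrelated, so that their energies add; the paper reports only the component levels, so the totals computed here are illustrative and not values taken from the study. The function name `combine_levels` is hypothetical.

```python
import math


def combine_levels(*levels_db):
    """Energetic sum of sound levels in dB, assuming uncorrelated sources."""
    return 10 * math.log10(sum(10 ** (level / 10) for level in levels_db))


# Component levels from the main study: (LAeq of N0, LAeq of SW for N1..N3).
main_study = {"L1": (65, [60, 65, 70]), "L2": (55, [50, 55, 60])}

for site, (n0_level, sw_levels) in main_study.items():
    for name, sw_level in zip(("N1", "N2", "N3"), sw_levels):
        total = combine_levels(n0_level, sw_level)
        print(f"{site} {name}: N0 {n0_level} dBA + SW {sw_level} dBA "
              f"-> about {total:.1f} dBA")
```

Under this assumption, adding a seawater sound at the same level as N0 raises the overall level by about 3 dB, while a seawater sound 5 dB below N0 raises it by roughly 1 dB.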

Figure 2 shows the spectrograms (in dB) at L1 and L2 of the original audio recording N0, and of the seawater sounds (SW) added at the N2 SPL. The levels were adjusted in amplitude, resulting in spectral characteristics similar to those of the original sample at different SPLs [61]. The high low-frequency levels typical of traffic noise can be observed at 63 Hz in the original audio recordings, even at L2, where the traffic noise was distant. Coherent and incoherent seawater sound sources were also evaluated through SDS. At location L1, the sound of waves lapping against the breakwater (SW) was added as the coherent sound, and seashore sound (SS) as the incoherent one; the reverse applied at L2. According to the hypotheses of the study, the data obtained from the surveys were analyzed to assess whether there are significant differences between pairs of variables, or to obtain the degree of association between them. For this purpose, the Wilcoxon signed rank test and Spearman's correlations were performed, respectively. Cohen's classification was used to evaluate the effect size.
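The two statistical procedures named above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' analysis: it uses the normal approximation of the Wilcoxon signed rank test with a tie correction and no continuity correction, which may differ slightly from the statistical package used in the study (not named in the text). The function names and the ratings in the example are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(a, b):
    """Two-sided paired test; returns (z, p, n_a_higher, n_b_higher, n_equal)."""
    diffs = [x - y for x, y in zip(a, b)]
    nonzero = [d for d in diffs if d != 0]  # equal ratings are discarded
    n = len(nonzero)
    if n == 0:
        raise ValueError("all paired ratings are equal")
    ranks = average_ranks([abs(d) for d in nonzero])
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    tie_sizes = {}
    for r in ranks:
        tie_sizes[r] = tie_sizes.get(r, 0) + 1
    mean = n * (n + 1) / 4
    variance = (n * (n + 1) * (2 * n + 1) / 24
                - sum(t ** 3 - t for t in tie_sizes.values()) / 48)
    z = (w_plus - mean) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    higher_a = sum(d > 0 for d in diffs)
    higher_b = sum(d < 0 for d in diffs)
    return z, p, higher_a, higher_b, len(diffs) - n


def spearman_rho(x, y):
    """Spearman's correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((u - mx) * (v - my) for u, v in zip(rx, ry))
    var_x = sum((u - mx) ** 2 for u in rx)
    var_y = sum((v - my) ** 2 for v in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 7-point ratings from eight participants for two scenarios.
ratings_n0 = [3, 4, 2, 5, 3, 4, 2, 3]
ratings_n1 = [5, 4, 4, 6, 5, 5, 3, 4]
print(wilcoxon_signed_rank(ratings_n0, ratings_n1))
print(spearman_rho(ratings_n0, ratings_n1))
```

The counts of higher, lower, and equal ratings returned alongside z and p correspond to the "comparison of ratings between paired variables" that the result tables report.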

Pilot Study
Comparing the responses given to the repeated questions (Table 4), the consistency of the results was more satisfactory for the second session, where the masking sound changed in 5 dB steps. In particular, the ratings on SDS-Unpleasant were the same for all subjects except for two, who showed a difference of ±1 point; for SDS-Chaotic and SDS-Boring, the maximum differences were also ±1 point, and for SDS-Eventful the maximum difference was ±3, but this occurred for only one participant. For that reason, it was decided to use 5 dB steps for the acoustic conditions in the main study.
For session 2, higher appraisals were observed for the 60 dBA audio-visual scenario; furthermore, most participants reported equal or similar responses to the scenarios N0 + SW65 and N0 + SW60. In particular, 31 responses out of 40 (10 participants × 4 SDS) were equal for both acoustic conditions, or differed by ±1. The same occurred with the SDS ratings given to the scenarios N0 + SW70 and N0 + SW75 (28 responses out of 40).
Considering the results of the pilot study, and the planned long duration of the main study, it was decided not to include the acoustic conditions N0 + SW55 and N0 + SW75.
Table 4. The difference of ratings given to the repeated condition between session 1 (3 dB steps) and 2 (5 dB steps).

Main Study
The different analyses undertaken to test the hypotheses of the study, ordered from H1 to H5, are described below together with the results.

Hypothesis 1 (H1). Masking water sound SPLs and expectations.
This hypothesis deals with evaluating whether adding seawater sounds to real audio-visual environments has an effect on matching the expectations placed on the soundscape. The Wilcoxon signed rank test can be used to compare two repeated measurements on a single sample, assessing whether their population mean ranks differ, and to understand whether the ranked values of one group are consistently higher or lower than those of the other. Here, the test was used to assess whether the mean ranks of the expectations differ between two acoustic environments, each with a different SPL of water sound masking the real acoustic environment but with the same visual scenario. The test was conducted on all six possible pair combinations of acoustic scenarios, at locations L1 and L2.
As reported in Table 5, at location L2, four of the six paired combinations show significant differences at the 95% confidence level. Comparing the medians, 84% of the participants who gave different ratings to the two compared scenarios consider that the sea sound level N1 meets their expectations better than the real acoustic environment (N0) at position L2. The same trend is observed for the N2 level versus N0. Seventy-seven percent of participants also consider that N3 matches their expectations worse than N1. The N2-N3 pair shows the same trend. Sound N3 satisfies the expectations worse than N1 or N2, and it does not show a significant difference with N0. It should also be noted that, when the non-significant differences are also considered, the masking conditions N1, N2, and N3 satisfy the expectations of participants more widely than the original acoustic environment N0.
Table 5. Wilcoxon signed rank test (p-Value and comparison of ratings between paired-variables), and effect size results for all possible pair combinations between N0, N1, N2, and N3 at locations L1 and L2. For the analysis of the participants' ratings, "a" is the first, and "b" the second variable of the pair.
The effect size is normally used to report the magnitude of differences between two groups (a large effect size means that the difference is important [63]), and it is a useful tool when interpreting the effectiveness of the results [64]. The effect size was calculated using Rosenthal's expression for non-parametric tests [65], by dividing the absolute standardized test statistic z by the square root of the number of subjects (last column in Table 5). According to Cohen's classification, the effect sizes at L2 are considered moderate (range 0.3-0.5) for the paired groups N0-N2 (0.424) and N1-N3 (0.395), and large (>0.5) for N0-N1 (0.555) and N2-N3 (0.586).
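The effect size computation described above, r = |z| / sqrt(N), can be written as a short sketch. The moderate and large thresholds follow the ranges quoted in the text; the 0.1 boundary between negligible and small effects is Cohen's conventional value and is an assumption here, since the text does not state it. The function names are hypothetical, and the z value in the example is back-calculated for illustration only, not taken from Table 5.

```python
import math


def effect_size_r(z, n_subjects):
    """Rosenthal's r: absolute standardized statistic over sqrt(N)."""
    return abs(z) / math.sqrt(n_subjects)


def cohen_label(r):
    """Classify r; 0.3 and 0.5 follow the text, 0.1 is an assumed boundary."""
    if r > 0.5:
        return "large"
    if r >= 0.3:
        return "moderate"
    if r >= 0.1:
        return "small"
    return "negligible"


# Illustration with N = 30 subjects: z = -3.04 gives r of about 0.555.
r = effect_size_r(-3.04, 30)
print(round(r, 3), cohen_label(r))

# Effect sizes reported in the text for location L2.
for pair, value in {"N0-N1": 0.555, "N0-N2": 0.424,
                    "N1-N3": 0.395, "N2-N3": 0.586}.items():
    print(pair, value, cohen_label(value))
```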
At location L1, although N1, N2, and N3 match the expectations of more participants better than the original acoustic environment N0, the difference between groups is not significant at 95% confidence level.

Hypothesis 2 (H2). Locations and expectations.
This hypothesis deals with evaluating whether there are statistically significant differences in matching the participants' expectations on the soundscape of two locations when adding the same SPL of seawater sound. Again, the Wilcoxon signed rank test was used to calculate significant differences between the repeated measurements of the users' expectations at locations L1 and L2 using three masking conditions (N1, N2, and N3) of the original setting N0. Table 6 shows the results for each acoustic environment and considering all environments together as one group. Participants' expectations are different at both locations for N1 and N2 with a significance level below 0.05, and for N0 with a significance level below 0.1. In particular, the acoustic environment satisfies participants' expectations better at L2 than at L1 at all levels (e.g., for the N2 level, 18 participants gave higher ratings on the expectations at L2 than at L1, and only 3 gave higher ratings to L1).
Table 6. Wilcoxon signed rank test (p-Value and comparison of ratings between paired-variables), and effect size results between the expectations at L1 and L2, for each masking condition (N1, N2, and N3) and original setting N0. For the analysis of the participants' ratings, "a" is the first, and "b" the second variable of the pair.
Considering only the statistically significant differences, for all the paired variables on the expectations the effect sizes are large (ee_N1 = 0.609), moderate (ee_N0 = 0.355, ee_N2 = 0.474, and ee_all = 0.398), and low (ee_N3 = 0.111).

Hypothesis 3 (H3). SDS and water sounds SPLs.
This hypothesis was formulated to understand whether there are differences in the SDS ratings for audio-only (A) and for audio-video (AV) scenarios. Figure 3 shows a comparison of the ratings given on the four SDS, that is unpleasant-pleasant (top-left), chaotic-calm (top-right), eventful-uneventful (bottom-left), and boring-exciting (bottom-right), with and without video, at locations L1 (red line) and L2 (green line). The horizontal axis was divided into lower, equal, and higher ratings to A than to AV experimental settings. The vertical axis gathers the results according to the four acoustic stimuli. Significant (continuous line) and not significant differences (dashed line) between paired variables using the Wilcoxon signed rank test are also shown. For the SDS-Unpleasant, at both locations there are significant differences (at the 95% confidence level) between the ratings given to the scenarios with and without video for the N0 and N3 sounds; furthermore, excluding equal ratings, for the paired variables at N0 a higher number of participants gave lower ratings to audio-only (A) than to audio-video (AV) at both locations. This trend changes as the masking noise level increases, until stimulus N3, for which the opposite behavior is observed.
For the SDS-Chaotic and the N0 acoustic environment, there are only statistically significant differences at L2 (p-valueL2_N0 = 0.019). For these paired variables, a higher number of participants perceived the acoustic environment as more calm with audio-only than with audio-video.
SDS-Eventful results also show significant differences only at L2, at stimuli N0 and N1 (p-valueL2_N0 = 0.001; p-valueL2_N1 = 0.002); in particular, more participants gave higher ratings to audio-only than to audio-video. At L2, this trend is reversed for high masking SPL.
SDS-Boring paired analysis at L1 location shows that most people considered the acoustic environment more exciting when video images are added to the scenarios (in particular, 17 participants felt the soundscape was more exciting with audio-video scenarios than with audio-only, and only six respondents felt the opposite).

Hypothesis 4 (H4). SDS and experimental stimuli.
This hypothesis was formulated to assess whether there are differences in the SDS ratings for paired acoustic conditions. Figure 4 shows a comparison of the ratings for the SDS-Unpleasant without (left) and with video images (right), evaluated at locations L1 (top) and L2 (bottom). The horizontal axis was divided into higher, equal, and lower ratings to the first variable than to the second one. The vertical axis gathers the results for N0, N1, and N2. For both locations and audio-only conditions, results show that there are statistically significant differences between the ratings given to the paired variables N0-N1 (p-valueL1 = 0.001; p-valueL2 = 0.000), N0-N2 (p-valueL1 = p-valueL2 = 0.000), and N0-N3 (p-valueL1 = 0.001; p-valueL2 = 0.000); for those paired groups, participants gave lower ratings to N0 than to the rest of the sound masking levels. However, the results are not as conclusive for audio-video scenarios: significant differences are found only for the N0-N1 pair at L1 (p-valueL1 = 0.010), and for the N0-N1 (p-valueL2 = 0.000) and N0-N2 (p-valueL2 = 0.000) pairs at L2. Results show statistically significant differences in the pleasantness of N1 and N3 when images are displayed simultaneously with the audio at locations L1 and L2 (p-valueL1 = 0.004; p-valueL2 = 0.000). In both cases, most people gave higher ratings to N1 than to N3. Although the scenarios with masking sound are more pleasant than the original acoustic environment, N1 is preferred over N3, and N2 over N3. Thus, it can be concluded that lower sound masking SPLs are more pleasant than the higher ones, and that N1 was rated the most pleasant acoustic environment.

Hypothesis 5 (H5). Coherence and expectations.
A Wilcoxon signed rank test was performed between coherent and not coherent masking sounds (for the acoustic condition N1) added to the original acoustic environment, for each of the four semantic scales (Table 7). Results show that the soundscape is considered significantly more pleasant when sounds are coherent with the visual scenario than when they are uncoherent, for both locations (p-valueL1 = 0.009; p-valueL2 = 0.013). It is also rated calmer (less chaotic) when sounds are coherent than when they are uncoherent (p-valueL1 = 0.033; p-valueL2 = 0.003). Furthermore, at L2, when coherent sounds of water are added to the original acoustic environment, the soundscape is considered more uneventful (p-valueL2 = 0.001) and exciting (p-valueL2 = 0.001) than when the sounds added are uncoherent with the environment. The differences between groups appear to be more marked at L2 than at L1 (with medium and high effect sizes for all the SDS at L2, and only medium effect sizes for SDS-Unpleasant and SDS-Chaotic at L1). The correlation coefficients between the artificiality of the sounds and the SDS are very low and considered not significant.
Table 7. Wilcoxon signed rank test (p-value and comparison of ratings between paired variables), and effect size results between coherent and not coherent sounds at L1 and L2. For the analysis of the participants' ratings, "a" is the first, and "b" the second variable of the pair.
Spearman's correlation coefficient between pleasantness and expectations when added sounds are coherent with the visual environment is higher (high association: rL1 = 0.710, p-valueL1 = 0.000; rL2 = 0.551, p-valueL2 = 0.002) than when they are not (medium association: rL1 = 0.387, p-valueL1 = 0.035; rL2 = 0.470, p-valueL2 = 0.009) at both locations (Table 8). For the other SDS variables analyzed, significant correlation coefficients were obtained only for location L2.
Table 8. Spearman correlation coefficients between the expectations and the semantic differential scales (SDS-Unpleasant, SDS-Chaotic, SDS-Eventful, and SDS-Boring) evaluated with coherent and uncoherent water sound masking at N1 level. * The correlation is significant at the 0.05 level. ** The correlation is significant at the 0.01 level.
Figure 5 shows the mean and standard deviation of the appraisals given on the SDS-Unpleasant for N0 and for the audio-video scenarios in which coherent and not coherent water sounds were added, at locations L1 and L2. The soundscape is more pleasant when coherent water sounds are added (meanL1 = 0.93, sdL1 = 1.60; meanL2 = 1.77, sdL2 = 1.19) than when only the real audio-video environment is displayed at both locations (meanL1 = 0.03, sdL1 = 1.23; meanL2 = 1.37, sdL2 = 0.77). This pleasantness decreases when uncoherent water sounds are added (meanL1 = 0.30, sdL1 = 1.51; meanL2 = 1.30, sdL2 = 1.15).
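The Spearman coefficients between expectations and the SDS ratings can be reproduced with a rank-based calculation. The sketch below is a minimal illustration that assumes tied ratings receive averaged ranks and that the coefficient is the Pearson correlation of those ranks. The function names and the example values are hypothetical and are not the study data.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expectation and pleasantness ratings (not study data)
expectations = [1, 2, 2, 3]
pleasantness = [1, 3, 2, 4]
print(round(spearman_rho(expectations, pleasantness), 3))
```

Computing the coefficient on averaged ranks handles the ties that are common with discrete rating scales, where the shortcut formula based on squared rank differences is no longer exact.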

Discussion
The degree to which the soundscape meets the users' expectations has been evaluated in different acoustic and visual scenarios using sound masking. The acoustic conditions of the scenarios vary both in the coherence of the added sound sources and in the SPL of the masking water sound.

Hypothesis 1 (H1).
All the masking water sound SPLs satisfy the expectations of a larger number of participants than the original sound of the area does, although the differences were significant only for the pairs of variables N0-N1 and N1-N2 at L2 (at the 0.05 level), and for N0-N1 at L1 only at the 0.1 level. Thus, the hypothesis H1 can be rejected. The lack of statistically significant differences at L1 (for any of the six possible combinations of audio-video scenarios) may be because traffic noise levels are high and, therefore, difficult to mask. Consequently, adding seawater sound does not involve a change in meeting participants' expectations. From the comparison between the acoustic environments N1, N2, and N3, it can be inferred that the lowest masking level matches the expectations of the users better than the higher ones in the two locations, possibly because when the SPL is high, there is no coherence between the noise level and the sound sources shown in the video.

Hypothesis 2 (H2).
Comparing the two locations under the same audio condition, location L2 (beach) matches the expectations of more participants than L1 (Acton street) for all the noise masking conditions, probably because the background noise at L2 contains fewer artificial sounds, and its noise levels are lower and easier to mask. Consequently, the findings do not meet the baseline assumption H2, as the acoustic environments of one location satisfy the participants' expectations better than those of the other.

Hypothesis 3 (H3).
Comparing the SDS ratings of the audio-only (A) and audio-video (AV) scenarios, for the SDS-Unpleasant a greater number of participants gave higher ratings to AV than to A for scenarios without masking at both locations, and with low masking SPLs at L1. The sounds of seawater include broadband frequency components that make the sound similar to white or pink noise [57]. When no images are displayed (A scenarios), the participants may not recognize which noise source generates the sound and may have identified the first masking condition N1 as traffic noise, and may therefore have rated the acoustic environment as unpleasant.
This trend is reversed for high levels of masking noise, with the acoustic environment being more pleasant with A than with AV. This may be because, in the A setting, participants are only paying attention to what they are hearing: as listening to the sounds of water is a pleasant experience [50,52,53,61], even with a high SPL of water, more participants report the acoustic environment as pleasant. For the original acoustic environment (N0), the addition of images seems to have a significantly positive effect on the appraisals of the participants. However, when sound levels of water are very high, there is no coherence between what people see and hear, and probably for this reason participants gave lower ratings to the acoustic environment in the AV than in the A scenarios. A similar effect regarding high and low levels of masking noise also occurs for the SDS-Chaotic and SDS-Uneventful, but only at L2. The results for the SDS-Chaotic at L2 may be because the video shows a street with traffic activity in the distance; people can, therefore, associate these images with the background traffic noise that is heard which, given its nature, may not have been identified in the A scenario, or may have been mistaken for electrical noise from the speakers. When considering SDS-Uneventful, similarly to what happened with the chaotic-calm appraisals, visual images may allow participants to identify elements that generate sound events, and therefore to rate the acoustic environment as more eventful in the AV than in the A scenarios.
Consequently, hypothesis H3 is not met, as SDS ratings on the soundscape are not the same for at least two different sound masking conditions.

Hypothesis 4 (H4).
When evaluating the differences between the appraisals given to pairs of acoustic environment conditions, most participants gave higher ratings to N3, N2, and N1 than to N0 in the A scenarios; consequently, the results do not confirm hypothesis H4. Thus, all masking SPLs led to a more pleasant acoustic environment than the real one. This is because when A is reproduced, the participants' attention focuses on the acoustic environment only. When video images are displayed simultaneously with the audio, the AV scenarios contain much more information, which distracts the attention of participants. Regarding the AV scenarios, the appraisals are also higher for N2 and N1 than for N0, although when comparing N3 and N0, most participants give equal or worse ratings to N3; this implies that the video stimulus is capable of altering the opinion about the acoustic environment. A similar conclusion was reached by Asakura et al. when evaluating loudness and annoyance perception with images of the anechoic chamber and images of the original scenarios [57].

Hypothesis 5 (H5).
The correlation coefficient between the SDS-Unpleasant and expectations is higher when the sounds heard are coherent, and therefore, the original H5 hypothesis is not met. Furthermore, a greater number of participants give higher ratings when the sounds are coherent than when they are not coherent. This happens even though both sounds correspond to seashore sounds, one of water lapping softly against the breakwater, and the other from a sandy shore. The differences between the appraisals are more marked at L2 than at L1, probably because, at L2, the noise levels of the original scenario are lower and the acoustic environment is composed of more natural sounds, which allows a greater symbiosis between real and added sounds.

Conclusions
The present study evaluates the relationship of the expectations with qualitative scales of the soundscape appraisal and with different masking noise levels. In the laboratory test, 360° videos and spatial audio recorded on the waterfront of Naples were shown to the participants using the Oculus headset. The study was conducted using audio-only (A), video-only (V), and audio-video (AV) scenarios. The results show that, for A scenarios, all the added masking water SPLs make the acoustic environment more pleasant. For AV scenarios, only the first two levels of masking sounds are more pleasant. Similarly, when using two types of seawater sounds, the ones consistent with the morphology of the shore match the expectations better than non-coherent ones. The results of this study show that evaluating the degree to which expectations are satisfied can be a very useful tool when designing urban soundscapes. The role of expectations and coherence with the environment has been shown to have a great influence on the soundscape perception. Thus, it would be important to consider these aspects as key factors in the analysis of soundscapes and how they are perceived, in order to improve the quality of recreational areas in cities.

Data Availability Statement:
The data presented in this study are available on justified request from the corresponding author. The data are not publicly available due to privacy reasons.