The Soundscape Indices (SSID) Protocol: A Method for Urban Soundscape Surveys—Questionnaires with Acoustical and Contextual Information

A protocol for characterizing urban soundscapes for use in the design of Soundscape Indices (SSID) and general urban research as implemented under the European Research Council (ERC)-funded SSID project is described in detail. The protocol consists of two stages: (1) a Recording Stage to collect audio-visual recordings for further analysis and for use in laboratory experiments, and (2) a Questionnaire Stage to collect in situ soundscape assessments via a questionnaire method paired with acoustic data collection. Key adjustments and improvements to previous methodologies for soundscape characterization have been made to enable the collation of data gathered from research groups around the world. The data collected under this protocol will form a large-scale, international soundscape database.


Introduction
Soundscape studies strive to understand the perception of a sound environment, in context, including acoustic, (non-acoustic) environmental, contextual, and personal factors. These factors combine to form a person's soundscape in complex interacting ways [1]. In order to predict how people would perceive an acoustic environment, it is essential to identify the underlying acoustic and non-acoustic properties of soundscape.
The soundscape community is undergoing a period of increased methodological standardization in order to better coordinate and communicate the findings of the field. This process has resulted in many operational tools designed to assess and understand how sound environments are perceived and to apply this to shape modern noise control engineering approaches. Important topics which have been identified throughout this process are soundscape 'descriptors', 'indicators', and 'indices'. Aletta et al. [2] defined soundscape descriptors as "measures of how people perceive the acoustic environment" and soundscape indicators as "measures used to predict the value of a soundscape descriptor"; soundscape indices can then be defined as "single value scales derived from either descriptors or indicators that allow for comparison across soundscapes" [3]. (Please find the link to the Soundscape Indices (SSID) ERC-funded website in the Supplementary Materials.) This conception has recently been formalized and expanded upon with the adoption of the ISO 12913 standard series [4][5][6]. ISO 12913 Part 1 sets out the definition and conception of soundscape, defining it as the "acoustic environment as perceived or experienced and/or understood by a person or people, in context". Here, the soundscape is separated from the idea of an acoustic environment, which encompasses all of the sound which is experienced by the receiver, including any acoustically modifying effects of the environment. In contrast, the soundscape considers the acoustic environment, but also considers the impact of non-acoustic elements, such as the listener's context and the visual setting, and how these interact with the acoustic environment to influence the listener's perception.
ISO/TS 12913-2:2018 is the current reference document addressing data collection and reporting requirements in soundscape studies. In terms of methods, the ISO document covers two main approaches, namely soundwalks combined with questionnaires (Methods A and B) and narrative interviews (Method C) [5], which relate to on-site and off-site data collection, respectively. Part 3 of the ISO 12913 series builds on Part 2 and provides guidelines for analyzing data gathered using only those methods [6]. However, the range of possible methodological approaches to soundscape data collection is much broader and includes, for instance, laboratory experiments [2,7,8], pseudo-randomized experience sampling [9], and even non-participatory studies [10]. The protocol described in this paper was designed with the need for a relatively large soundscape dataset in mind, one that could be used for design and modeling purposes, thus expanding the scope of soundwalks, which typically deal with much smaller samples of participants [11]. For the sake of comparability and standardization with these methods, we chose to refer to the soundscape attributes reported in ISO Part 2 (Method A).
Several studies prior to the formalization of the ISO standards on soundscape demonstrated the general, but inadequate, relationship between traditional acoustic metrics, such as LAeq, and the subjective evaluation of the soundscape [1,12-15]. These have typically aimed to address the existing gap between traditional environmental acoustics metrics and the experience of the sound environment. Yang and Kang (2005) showed that, when the sound level is 'lower than a certain value, say 70 dBA', there is no longer a significant change in the evaluation of acoustic comfort as the sound level changes. However, the perceived sound level does continue to change along with the measured sound level, showing that (1) measured sound level is not enough to predict soundscape descriptors such as 'acoustic comfort', and (2) there is a complex relationship between perceived sound level and soundscape descriptors which is mediated by other factors.
Subsequent studies have shown that, even with large data sets and several possible acoustic indicators examined, models that are based on objective/measurable metrics under-perform in predicting soundscape assessment when compared to models based on perceptual responses. Ricciardi et al. [16], with a methodology based on smartphone recordings, achieved R² = 0.21 with acoustic input factors L50 and L10 − L90, whereas the same dataset and model building method achieved R² = 0.52 with perceptual input factors overall loudness (OL), visual amenity (VA), traffic (T), voice (V), and birds (B). This indicates that merely examining the acoustic level is not sufficient for predicting the assessed soundscape quality, and that additional objective factors and a more holistic and involved method of characterizing the environment are required. This protocol extends the scope of objective measurements that are collected in conjunction with perceptual responses by including other environmental and visual data. These previous studies have generally been limited by one or more of the following factors: limited number or types of locations, limited response sample size, and no non-acoustic factors, generally limiting the generalizability of their results beyond the investigated locations.
The ability to predict the likely soundscape assessment of a space is crucial to implementing the soundscape concept in practical design. Current methods of assessing soundscapes are generally limited to a post-hoc assessment of the existing environment, where users of the space in question are surveyed regarding their experience of the acoustic environment [11,17]. While this approach has proved useful in identifying the impacts of an existing environment, designers require the ability to predict how a change or proposed design will impact the soundscape of the space. To this end, a model that is built upon measurable or estimate-able quantities of the environment would represent a leap forward in the ability to design soundscapes.
Developing soundscape indices is a process that requires consideration of how people perceive, experience, and understand the surrounding sound environment. For the purpose of modeling and comparisons, it is important that such indices are numerical entities and that these quantities are collected consistently across all investigated spaces and soundscapes. Although the soundscape approach taken in this protocol represents a step-change away from existing methods of noise exposure measurements, strong cues, particularly in the realm of acoustic measurement methods, should be taken from existing standards, both to make use of the significant knowledge and experience that has gone into the creation of these standards and to facilitate compatibility between soundscape and traditional measurements. In general, the measurement methods and best practice given in environmental noise standards such as ISO 1996-1:2016 and ISO 1996-2:2017 [18,19] should be followed wherever possible, including the use of standardized acoustic equipment such as standard sound level meters.
A European Research Council (ERC) Advanced Grant project is ongoing to develop the proposed "Soundscape Indices" (SSID), which adequately reflect levels of human comfort and preference while integrating measurable and observable quantities. The framework proposed for the SSID project is laid out in detail by Kang et al. [3], the first step of which is generating a large-scale and coherent database of the required soundscape characterization data. Given the already recognized differences in soundscape assessment across various countries and cultures [20,21] and the success of existing international soundscape efforts such as the Soundscapes of the World project [22], the collection of soundscapes from many different countries and in many different contexts is an important component of the SSID project.
Therefore, the following protocol has been conceived and implemented within the SSID framework to collect data about urban soundscapes for use in general soundscape research and toward the design of Soundscape Indices. Thus far, the collected database includes nearly 4000 participants' responses from 59 locations in 10 cities and provinces across the UK, China, Spain, and Italy. This protocol has been refined and adjusted as needed during this extensive data collection process to arrive at this final version. This work was conducted by nine associated research groups and coordinated by the SSID group based at University College London and has already produced several pieces of published work towards the creation of Soundscape Indices [23][24][25][26][27][28][29][30]. Additional collaborations and data collection efforts are currently underway in France, the Netherlands, and Croatia.

Purpose
This protocol was designed to achieve two primary goals: (1) gather in situ soundscape assessments from the public, which can be further analyzed and utilized in designing a soundscape index; (2) conduct recordings needed to reproduce the audio-visual environment of a location in a laboratory setting for conducting controlled experiments on soundscape. These two goals represent two levels of data required for developing a general soundscape model. The first enables large-scale data collection, resulting in a database with thousands of perceptual responses and their corresponding quantitative data, which can be statistically analyzed on a large scale or used for training in machine learning modeling. In situ assessments also represent the most holistic assessment, ensuring all factors that influence the soundscape are present, including those which cannot be reproduced elsewhere.
However, there are questions that cannot be practically addressed in situ, such as soundscape assessment of less- or un-populated areas, the influence of mismatched acoustic and visual cues, physiological and neural responses to various soundscapes, and so on [31]. Laboratory experiments with controlled environments are required to address these aspects. Toward the development of a coherent SSID, however, it is important that these two forms of data are collected simultaneously and with compatible methods, such that the results of the two approaches can be confidently combined and compared. In addition, since this protocol is intended to be used for the creation of a large-scale international database with additions carried out by several different and remote teams, it has been designed for efficiency, scalability, and information redundancy.

Protocol Design and Equipment
The first goal is achieved by conducting in situ questionnaires using a slightly altered version of Method A (questionnaire) from Annex C of the ISO/TS 12913-2:2018 technical specification [5], collected either via handheld tablets or paper copies of the questionnaire. Typically, a minimum of 100 responses are collected at each location during multiple 2-5 h sessions over several days. During the survey sessions, acoustic data are collected via a stationary class 1 or class 2 Sound Level Meter (SLM) (as defined in IEC 61672-1:2013 [32]) running throughout the survey period and through binaural recordings taken next to each respondent. These acoustic and response data are linked through an indexing system so that features of the acoustic environment can be correlated with individual responses or with the overall assessment of the soundscape, as required by researchers.
The second goal is achieved by making First-Order (or higher) Ambisonic recordings simultaneously with 360° video, which can be reproduced in a virtual reality environment. It has been shown that head-tracked binaural and multi-speaker ambisonic reproduction of acoustic environments recorded in this way has high ecological validity [33], particularly when paired with simultaneous head-tracked virtual reality video [22,34,35].
The on-site procedure to collect these data is separated into two stages, which are outlined in detail in Section 4. The stage during which the spatial audio-visual recordings are made for lab experiments is called the Recording Stage, while the stage during which questionnaires and environmental data are captured is called the Questionnaire Stage.
The procedure has been designed to include multiple levels of data and metadata redundancy, making it robust to on-site issues and human error. The most crucial aspect of the redundancy is ensuring perceptual responses can be matched with the appropriate corresponding environmental and acoustic data even when some information is lost or forgotten.

Labeling and Data Organization
In order to identify the many data components of the Recording and Questionnaire Stages and to associate these with their various corresponding data, the following labeling system is suggested. This system is focused on (1) relating all of the separate recordings and factors to specific questionnaire responses and (2) efficiency and consistency on site. A recent paper by Aumond et al. [14] demonstrated the importance of addressing multiple levels of factors which influence perception, from individual-, to session-, to location-level. The pleasantness models that incorporated these information levels showed a marked improvement over the equivalent individual-level-only or location-level-only models. The data organization system proposed here was designed in order to maintain this important information, and the levels of information for the data collected on site are shown in Table 1.
At the top level is the Location information. This includes information about the location which does not change day-to-day, and generally characterizes the architectural character of the space, or typical climate conditions for the area. As described in Section 2.2, each 'environmental unit' should be considered a new location. Therefore, if researchers want to investigate the differences in soundscape assessment in the middle of a small urban park and along the road next to the same park, these would be considered different locations since they would (typically) have different environmental factors, and should be given different names. The name chosen should be concise, but it should be obvious what location is referred to.
The next level is information which is specific to each session, labeled with a SessionID. This SessionID should contain the name of the location and a numerical index which will increase with each repeated session at that location. The SessionID is associated with the data collected during the Recording Stage, and with the data which are continuous throughout the Questionnaire Stage (the SLM and ENV data). For easy automatic processing, correct spelling and consistency with the format are crucial so that data can be filtered according to the SessionID or the location, as is often necessary.
In addition, for ease of automatic processing, it is recommended not to include spaces in the SessionID to avoid string splitting issues in analysis code.
Underneath each SessionID will be a set of GroupIDs. One GroupID is assigned for each group of participants. This should correspond to a single binaural recording and a single 360° photo. This will be used to (1) relate multiple surveys taken simultaneously and (2) link the recording and photo with the surveys. The GroupID is particularly crucial as it allows commonly missing data to be shared across multiple collection methods. For instance, occasionally paper questionnaires will be missing start and end time information. In this case, this information can be pulled directly from other questionnaires with the same GroupID. Where no questionnaires have the times, it is possible to extract an approximate start time from the binaural recording or 360° photo and then estimate an average end time.
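The recovery of missing times through a shared GroupID can be sketched as follows. This is a minimal illustration rather than the SSID project's own processing code: the field names `start_time` and `end_time` and the list-of-dictionaries layout are assumptions made for the example, and the fallback to recording or photo timestamps is left out.

```python
from datetime import datetime

TIME_FIELDS = ("start_time", "end_time")  # hypothetical field names


def fill_missing_times(responses):
    """Return copies of the responses with missing times filled from
    other questionnaires that share the same GroupID."""
    # First pass: remember the first known value per (GroupID, field).
    known = {}
    for resp in responses:
        for field in TIME_FIELDS:
            if resp.get(field) is not None:
                known.setdefault((resp["GroupID"], field), resp[field])

    # Second pass: fill the gaps without modifying the input records.
    filled = []
    for resp in responses:
        copy = dict(resp)
        for field in TIME_FIELDS:
            if copy.get(field) is None:
                copy[field] = known.get((copy["GroupID"], field))
        filled.append(copy)
    return filled


surveys = [
    {"GroupID": "RPJ201", "start_time": datetime(2019, 5, 1, 10, 0),
     "end_time": datetime(2019, 5, 1, 10, 7)},
    {"GroupID": "RPJ201", "start_time": None, "end_time": None},
    {"GroupID": "RPJ202", "start_time": None, "end_time": None},
]
completed = fill_missing_times(surveys)
```

Responses in a group where no questionnaire carries a time stay empty here, which flags them for the manual estimate from the binaural recording or 360° photo.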
The GroupID should have the following format: [a set of letters representing the location name][the SessionID index number][an incrementing index for each group]. For example, for the second session at Regent's Park Japanese Garden, the location name is RegentsParkJapan and the GroupID letters might be 'RPJ'; the SessionID would be 'RegentsParkJapan2', so the GroupIDs for that session would start at '201'. Therefore, for example, the tenth group of participants for that session would be labeled 'RPJ210'. This format ensures that, if the location or SessionID is not recorded for a questionnaire, it is still obvious which session it belongs to.
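The labeling format above can be expressed as two small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and the two-digit zero-padded group index is inferred from the 'RPJ201' and 'RPJ210' examples rather than stated as a rule in the protocol.

```python
def make_session_id(location_name, session_index):
    """Build a SessionID such as 'RegentsParkJapan2'."""
    # Spaces are rejected because they cause string-splitting issues later.
    if any(ch.isspace() for ch in location_name):
        raise ValueError("location name must not contain spaces")
    if session_index < 1:
        raise ValueError("session index starts at 1")
    return f"{location_name}{session_index}"


def make_group_id(location_code, session_index, group_index):
    """Build a GroupID such as 'RPJ210' (code + session index + group index)."""
    # Assumption: the group index is zero-padded to two digits, as in 'RPJ201'.
    if not 1 <= group_index <= 99:
        raise ValueError("group index must be between 1 and 99 in this sketch")
    return f"{location_code}{session_index}{group_index:02d}"


session_id = make_session_id("RegentsParkJapan", 2)
first_group = make_group_id("RPJ", 2, 1)
tenth_group = make_group_id("RPJ", 2, 10)
```

With these inputs, `session_id` is 'RegentsParkJapan2', `first_group` is 'RPJ201', and `tenth_group` is 'RPJ210', matching the example in the text.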

Location and Measurement Point Selection
To select the appropriate measurement point, it should be ensured that the following contextual factors representative of the site are present in the spatial recording: openness, greenness, presence of landmarks, dominant use (walking, staying), and social presence (related to the dominant use). These are identified as objective metrics often used in urban and landscape research [36][37][38][39][40], possibly contributing to soundscape assessment [23,41]. This relies on the researcher's own judgment: it is advised to observe the location for a moment and then choose the point representative of the context and the first-person user experience. For instance, in a park, it would probably be near a bench in the central area near the fountain; in a busy square, it would be a place where most people gather and have the best view of the landmark. While doing so, placement too near prominent vertical objects such as a statue, a wall, or a mast should be avoided as it might cause issues in later handling of the visual data (3 m is considered a safe distance from these features). Similar concerns apply to the audio data, and careful attention should be paid to avoid placing the recording equipment near extraneous noisy equipment or in acoustic shadows. Further guidance on this is given in Point 4 of Section 4.1. It is important to avoid placing the recording equipment at a position where no users are expected (e.g., do not put the equipment in the middle of a flower bed or a grass area that nobody uses).
For the purposes of this protocol, a single location was considered to be an 'environmental unit' wherein the environmental factors are consistent and which is typically perceived to constitute a single distinct area. The exact dimensions and delineation of the environmental unit will vary depending on the characteristics of the space, so it is ultimately up to the judgment of the researchers on site to select an appropriate measurement point to best capture the character of the environmental unit.

Equipment
The equipment listed in Table 2 is designed to facilitate both the audio-visual recording of the location and the collection of objective environmental factors, as given in Table 3. What equipment is brought on site should be adjusted depending on availability, needs of the researchers, and whether only one of the protocol stages will be carried out, or both. The equipment selected should be neutral and not noticeable. In general, this means dark or neutral colors as opposed to high-visibility colors and selecting compact equipment.
The use of class 1 or 2 sound level meters has been stipulated to maintain verifiable consistency and quality of data across all soundscape studies which make use of this protocol, as well as with data collected under various other environmental acoustics purposes. As the accuracy of acoustic information gathered at the site is the most vital in the discussion of soundscape indices, specific requirements have only been set out for the acoustic equipment. Class 1 is highly preferred, but consideration is made for cost and availability of equipment. It should be noted what standard of SLM was used in the data collection and appropriate consideration of the precision and tolerances of the equipment should be taken during the data analysis.

Techniques for Field Data Collection
There are several methods available for characterizing the physical environment and collecting soundscape assessments. Here, we will address the techniques employed in this protocol and general best practice for each of them.

Questionnaire Surveys
As stated above, the questionnaire is primarily based on Method A of ISO/TS 12913-2:2018. This method begins with a set of questions relating to the sound environment which are assessed on a 5-point Likert scale, coded from 1 to 5. A sample codebook to demonstrate the recommended variable naming and response coding is included in Appendix D.
The first section includes four questions relating to sound source identification, where the sound sources are divided into four categories: Traffic noise, Other noise, Sounds from human beings, and Natural sounds (labeled SSI01 through SSI04, respectively). These taxonomic categories of environmental sounds are based on the work done by Guastavino [44] and Brown, Kang, and Gjestland [45].
Following this are five questions addressing the participant's overall assessment of the surrounding sound environment, addressing overall acoustic quality, the appropriateness of the sound environment to the location, perceived loudness, and how often the participant visits the place and how often they would like to visit again (labeled SSS01 through SSS05, respectively).
The fourth section comprises the WHO-5 well-being index, asking how the participants have been feeling over the last two weeks, such as 'I have felt calm and relaxed'. The WHO-5 index is constructed to constitute an integrated scale in which the items add up related information about the level of the individual's general psychological well-being [47,48]. This information can provide additional insight into how exposure to pleasant or annoying soundscapes may impact psychological well-being, as was investigated by Aletta et al. [27], or, alternatively, how a person's current psychological status may influence their perception of the sound environment, as recently investigated by Erfanian, Mitchell, Aletta, and Kang [49]. Each of the five WHO questions (labeled WHO01 to WHO05) is assessed on a 6-point scale coded from 0 to 5.
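As an illustration of how the five items combine into a single score, the following sketch sums WHO01 to WHO05 into a raw score (0 to 25). The multiplication by 4 to obtain a 0 to 100 scale is the conventional WHO-5 scoring and is an addition here, since the protocol text only states the item coding; the function name is hypothetical.

```python
WHO_ITEMS = ("WHO01", "WHO02", "WHO03", "WHO04", "WHO05")


def who5_score(response):
    """Return (raw score 0-25, percentage score 0-100) for one response."""
    values = [response[item] for item in WHO_ITEMS]
    for item, value in zip(WHO_ITEMS, values):
        # Each item is coded from 0 to 5, as described in the questionnaire.
        if not isinstance(value, int) or not 0 <= value <= 5:
            raise ValueError(f"{item} must be an integer from 0 to 5")
    raw = sum(values)
    # Conventional WHO-5 scaling (assumption, not stated in the protocol).
    return raw, raw * 4


example = {"WHO01": 3, "WHO02": 4, "WHO03": 2, "WHO04": 5, "WHO05": 1}
raw_score, percent_score = who5_score(example)
```

For this example response the raw score is 15 and the percentage score is 60.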
The final section of the participant-facing questionnaire comprises five questions on the participant's demographic information (age [AGE00], gender [GEN00], occupational status [OCC00], education level [EDU00], ethnicity [ETH00], and local vs. tourist [MISC03]) and a free response for the participant to provide any additional comments they would like to make on the sound environment [MISC01]. It is important to note that the section on ethnicity, and to a lesser extent education level, will need to be adjusted to ensure the available responses are appropriate for the location where the survey is being conducted.
At the end of the questionnaire is a set of spaces for the researcher conducting the survey to fill out, covering the observed behavior of the participants, indexing and labeling metadata, and any additional notes. More information and guidance on this information is included below.
This questionnaire is intended to collect a consistent core set of perceptual responses and information about the participant, with space to add additional questions as required by specific research goals. Some examples of this which have been implemented by the various research groups are specific questions calling attention to water sounds and features, the perception of visual features, and an open response for identifying the dominant sound source. Given the proper labeling and coding, these additional questions can be fully integrated into the overall dataset, allowing the researchers the freedom to pursue their own research interests while maintaining consistency and compatibility with the overall database.
General notes for conducting the questionnaires:
• The core questionnaire is reported in Appendix C. The labels and corresponding scales are also reported. Ideally, the form should be completed and submitted on a tablet via a survey app (e.g., REDCap, Qualtrics, KoBoToolbox, or similar) so that data can then be easily downloaded in an .xlsx or .csv file. Using paper forms is also acceptable; however, researchers on site will need to take more careful note of information such as the time of response, and the information will need to be manually input after the session is completed. If using an electronic version, the system should be set up to record the start and end times and GPS coordinates for each survey.
• If using an electronic version, be sure to have enough tablets with internet connectivity (if required by the survey system) and sufficient battery life; if using the paper version, be sure to print enough copies. Even if using the electronic version, it is recommended to also print a number of paper versions as a backup or in case a large group agrees to participate at once.
• Regardless of the translation of the items, it is important that the label (e.g., SSI01) is kept, as well as the size and direction of the scales (1-5, etc.) to maintain data consistency.
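A simple consistency check on the labels and scale ranges can catch translation or data-entry problems before responses are merged into a shared database. The sketch below covers only the labels whose coding is stated in the text (SSI01 to SSI04 on 1-5 and WHO01 to WHO05 on 0-5); the function name and the dictionary layout are assumptions, and the full set of labels would come from the codebook in Appendix D.

```python
# Expected (min, max) coding per label, limited to scales stated in the text.
EXPECTED_RANGES = {
    **{f"SSI{i:02d}": (1, 5) for i in range(1, 5)},
    **{f"WHO{i:02d}": (0, 5) for i in range(1, 6)},
}


def check_response(response):
    """Return a list of problems found in one coded questionnaire response."""
    problems = []
    for label, (low, high) in EXPECTED_RANGES.items():
        if label not in response:
            problems.append(f"{label}: missing")
            continue
        value = response[label]
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{label}: {value!r} outside {low}-{high}")
    return problems


good = {label: 3 for label in EXPECTED_RANGES}
bad = dict(good, SSI01=0)
del bad["WHO05"]
good_problems = check_response(good)
bad_problems = check_response(bad)
```

Here `good_problems` is empty, while `bad_problems` reports the out-of-range SSI01 value and the missing WHO05 item.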

Contextual and Environmental Factor Data Collection
During each survey, the equipment listed in Section 2.3 is set up to capture the contextual and environmental data for the location. Table 3 lists the factors to be collected and at what stage they should be collected.

Spatial Audio-Visual Recordings
In order to capture the acoustic and visual information in the space for replication in a laboratory setting, 360° video and AMB audio are recorded to be used in Virtual Reality (VR) playback. The goal of this is two-fold: first, to enable researchers to document and replicate the in situ environment of the space as it was during a questionnaire survey session for lab experiments and, second, to capture environments in which performing a questionnaire survey is not feasible.
Typically, questionnaire surveys are carried out over a period of several days at the same location. The goal of these multiple sessions is to capture as many questionnaire responses as needed (100 for a particular soundscape is typically recommended [11]), which, in the experience of the authors, is prohibitively difficult to achieve in a single session in most locations. It is recommended that the repeated sessions are conducted under similar circumstances and environmental conditions. As such, it is not entirely necessary to repeat the spatial recordings each time a questionnaire survey is conducted. Instead, it is useful to use the spatial recording as a chance to gain a different perspective on the space under investigation. For instance, if the questionnaires are conducted in the middle of a large urban park, the first session could collect a spatial recording within the environmental unit of the questionnaire site, but the subsequent returns to the site could collect spatial recordings in a different environmental unit, say, along a road bounding the park, or in a space in the park which does not typically have many people. This enables the simultaneous expansion of the questionnaire database and the gathering of additional environments to investigate in a laboratory setting.
General notes for spatial recordings:
• The audio-video recordings can be done before or after the questionnaire survey.
• The purpose of the audio-video recordings is to capture representative recordings which can be reproduced in a laboratory setting. During the first time at a location, the focus should be on capturing the environment as experienced by the respondents to the questionnaires at that location. Therefore, the recordings should be performed in nearly the same spot, with similar lighting and environmental conditions. For further survey sessions, provided the conditions are similar, other recordings could be taken which provide additional perspectives around the space for reproducing in the lab.
• These recordings can be performed entirely separately from the questionnaire survey, if desired. Reasons for doing this may be (but are not limited to): location is not populated, making questionnaires impossible; specific locations or conditions are required for a lab experiment; time limitations require many sites in an area to be captured and in situ questionnaires could not be completed in time.
• The 360° video will take a significant amount of storage space. Researchers should ensure that there is ample free space on the camera SD cards prior to going out on site. If conducting multiple surveys away from their home institution (i.e., in another city), teams are recommended to bring a large external hard drive so that videos can be offloaded after each session.

Reference Recordings
A soundscape index, or any investigation of the impact of the physical environment on the soundscape, requires consistent and accurate measurement of the environment, most importantly calibrated measurement and recording of the acoustic environment. For this protocol, this has been achieved through the use of separate calibrated binaural recordings and measurements made with a calibrated sound level meter (SLM).

Procedure
Figure 1 shows the whole process of the on-site soundscape protocol. The relevant equipment in each row should be operating when the row is colored in, such that when multiple rows are shaded, multiple pieces of equipment should be running during that time period. The following section provides step-by-step instructions for conducting the in situ surveys, including the Recording Stage and Questionnaire Stage. Figure 2 shows an example of the recommended equipment setup.
Figure 2. Example of the recommended equipment setup. To the left is the equipment (color-coded to match Figure 1), with the ambisonic microphone and SLM microphone in the windscreen and the 360° camera on top of the tripod; to the right, one researcher interacts with the participant while the second researcher conducts the binaural recording. The body of the SLM and the multi-channel recorder are stored in a bag under the tripod, which can contain all of the pieces of equipment for easy transport.
The equipment should be assembled, checked, and calibrated prior to arriving at the measurement location. Calibrate the equipment according to the manufacturer's instructions. All sound level meters should have built-in methods to calibrate using a standard 94 dB 1 kHz tone calibrator. If a similar method is available for the ambisonic microphone, this should be used. If a built-in method is not available, but a calibrator can be fitted to the microphone capsules, then the ambisonic microphone should be calibrated by recording the 1 kHz signal through the system for each microphone capsule after the gain settings have been finalized on site (see below). If it is not possible to calibrate the ambisonic microphone, then the levels recorded will need to be compared to the levels taken simultaneously with the SLM. This is why it is crucial to have an appropriate quality, calibrated SLM included within the same setup as the AMB recordings.
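When a 94 dB, 1 kHz calibration tone has been recorded through a capsule, a common way to use it in post-processing is to derive an offset between digital level and sound pressure level. The sketch below illustrates that idea with synthetic sine waves standing in for real recordings; it is not the procedure of any particular recorder or software, the function names are hypothetical, and the resulting levels are unweighted. The offset is only valid while the gain settings stay unchanged after the calibration recording.

```python
import math


def rms(samples):
    """Root-mean-square of a sequence of samples (full scale = 1.0)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibration_offset_db(cal_samples, cal_level_db=94.0):
    """Offset that maps digital level (dB re full scale) to dB SPL."""
    return cal_level_db - 20 * math.log10(rms(cal_samples))


def level_db(samples, offset_db):
    """Unweighted sound pressure level of a recording made with the same gain."""
    return 20 * math.log10(rms(samples)) + offset_db


def sine(amplitude, freq_hz=1000.0, sample_rate=48000, seconds=1.0):
    """Synthetic tone used here in place of a real recording."""
    n = int(sample_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


# Pretend the calibrator tone was recorded at a digital amplitude of 0.1.
offset = calibration_offset_db(sine(0.1))
# A signal at half that amplitude should read about 6 dB lower.
quieter_level = level_db(sine(0.05), offset)
```

Applied per capsule, the same offset calculation gives one value for each ambisonic channel, which can then be cross-checked against the simultaneous SLM levels.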

1.
Set up the equipment by prioritizing the position of the 360° camera, and position the lens at average eye level (160-180 cm), as shown in Figure 2.
It is advisable to test the setup for video stitching issues and reconfigure if needed (e.g., the equipment will be partially visible in the raw video recording, so you need to test whether the chosen setup allows for efficient erasing, hiding, or patching of the exposed parts in post-processing). Companies selling 360° cameras usually offer free software for basic editing and previewing. It is advisable to position the camera as the highest item in the set to avoid the need for editing both the sky and the ground.

2.
Carefully position the AMB microphone so its axes are aligned with the axes of the 360° camera; the microphone's front (usually marked by the logo) and the camera's front should face the same direction. Many AMB microphones can be oriented either vertically or horizontally (end-fire); the orientation used should be noted and set in the relevant software settings. This is essential for informed post-processing. It is advisable to position the capsules of the AMB microphone and the capsule of the SLM as near to each other as possible, without introducing scattering effects. This can usually be done within the same windshield unit, but it is not essential to do so and depends on the available clamps and stands.

3.
The gain settings for the four ambisonic audio channels should be set to the same level. In some devices (such as the MixPre 10), this can be set by locking the channel gain settings to a single channel. Many devices also offer ambisonic plugins which simplify these settings and automatically link the gain settings; these should be used where available.

4.
Set the SLM to log sound levels and simultaneously record .wav audio. The recommended logging settings are given in Table 3. The SLM should be mounted and positioned according to standard guidance for environmental noise measurements, like that given in Section 9 of ISO 1996-2:2017 [19] or Section 5 of ANSI/ASA S12.9-2013/Part 1 [50]. Generally, the microphone should be a minimum of 1.2 m above the ground and a minimum of 1 m from any vertical reflecting surfaces.

5.
Attach the environmental meter(s) to the tripod. Care should be taken when positioning the environmental meter. Most units will include guidance on their use from the manufacturer; this should be followed where available. Some general items to keep in mind include not accidentally covering air quality sensor holes, not positioning light sensors in the shade of the other equipment, and not positioning temperature sensors in direct sunlight unless this is how they are intended to be positioned.
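The SLM in the setup above logs sound levels over time. When a single equivalent level is later needed for a period (for example, the time a participant spent on the questionnaire), the logged values are combined energetically rather than arithmetically: Leq = 10 log10((1/N) Σ 10^(Li/10)). The sketch below is a minimal illustration, assuming the levels were logged at equal time intervals; the function name `leq` is hypothetical.

```python
import math


def leq(levels_db):
    """Equivalent continuous level of equally spaced logged levels (dB).

    Each level is converted to relative energy, the energies are averaged,
    and the mean is converted back to decibels.
    """
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)
```

Because the average is taken over energy, short loud events raise the result far more than an arithmetic mean of the decibel values would suggest.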

Recording Stage
The following section provides step-by-step instructions for conducting the Recording Stage of the on site protocol, as shown in Figure 1.

1.
Double check all settings and file save locations on the recording equipment.

2.
Adjust gain settings to ensure there is no clipping. Good practice is to listen for what is expected to be the loudest sound event during the recording period (e.g., sirens) and set the gain such that the level is comfortably under clipping during this event.
Stand at the front of the camera/ambisonic microphone and clap. The clap can help synchronize the audio with the video, if necessary, and standing in line with the front of the 360° video can help with lining up the directionality of the two, if necessary.

5.
Retreat out of view of the camera, blending into the surrounding crowd, or otherwise make sure not to be obvious to someone watching the video.

6.
Record at least 5 min of consistent and representative audio and video. It is recommended to record for 15 min to give the best chance of being able to extract a solid 5 min of useful video and audio.

7.
Stop recording on all devices and ensure all files are saved properly.
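The gain check in the steps above can be verified after a short test recording by looking at the peak level relative to digital full scale. The following Python sketch is one way to do this, assuming samples normalized to the range -1 to 1. The 6 dB headroom threshold and the function names are illustrative assumptions, not values given by the protocol.

```python
import math


def peak_dbfs(samples):
    """Peak level in dB relative to full scale (samples in [-1, 1])."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak)


def has_headroom(samples, min_headroom_db=6.0):
    """True if the peak stays at least min_headroom_db below full scale.

    The default threshold is an assumed example value.
    """
    return peak_dbfs(samples) <= -min_headroom_db
```

If the loudest expected event (such as a siren) fails this check during a test recording, the gain would be reduced and the check repeated before starting the full recording.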

Questionnaire Stage
The following section provides step-by-step instructions for conducting the in situ questionnaires and their accompanying reference recordings as part of the Questionnaire Stage. Typically these are performed during the same working session as the Recording Stage, using the same set of equipment. The selection of an appropriate location and setup of the equipment should follow the guidance given in Section 2.2, while making sure the location selected is representative of where the respondents will be stopped. Wherever possible, the equipment should be assembled and located so as not to draw the attention of the respondents and particularly to avoid influencing their perception of the space.

1.
Double check all settings and file save locations on the recording equipment. If starting this stage immediately after the Recording Stage, make sure to rename or advance the index on the filenames for the SLM and environmental meters.

2.
Start recording on the SLM and environmental meter (or leave running from the preceding Recording Stage). These will continue running until the end of the Questionnaire Stage.

3.
Gather the tablets and/or paper questionnaires and prepare to approach potential participants.

4.
Approach participants and ask if they would be willing to take part in a research study. If the participants are in a group, they can participate at the same time, but should each fill out a separate questionnaire. When approaching participants, you should identify yourself as a researcher or student researching urban sound. We advise avoiding phrases such as "noise", "noise pollution", "noise disturbance" or other terms which carry a negative connotation. In general, explanations and answers to questions should strive to be as neutral as possible regarding the nature of the soundscape.

5.
Once the participant has consented to participate, hand them the questionnaire or tablet and provide them with basic instructions for answering the questionnaire. Emphasize that they should be responding to and assessing the current sound environment, in the current place. Misunderstanding this is common: many participants assume the questionnaire is focused on the sound environment at their home, or in the city in general. Where a mix of tablets and paper questionnaires is being used, each group should have at least one participant using a tablet such that start and end times and precise GPS coordinates can be pulled from the accompanying electronic questionnaire. While one researcher is interacting with the participants, the second should arrange the equipment for taking the binaural recordings and 360° photo.

6.
Once the participant has started answering the questionnaire, start recording the binaural audio.
If the participants are in a group and all are taking the survey at the same time, only one binaural recording is needed for the whole group. The researcher conducting the recording should strive to keep their head as stationary as possible and to avoid making any extraneous noise.
Make sure that at least 30 s of consistent audio is recorded while the participant is filling in the questionnaire. This should not include talking either from the researcher or the participant.
If talking or other intrusive (non-representative) sound occurs, extend the recording period to end up with a solid 30 s of good audio. The goal is to capture the sound environment which the participant was exposed to while filling out their questionnaire, but to exclude sounds which the participant is not likely considering as part of their assessment. Most commonly, this would be the researcher talking, or the participant themselves talking. Any other sounds which the participant was "naturally" exposed to should be included.
When taking the binaural recording, attempt to orient the head (artificial or researcher wearing a headset) in the same direction as the participants. This is not crucial as it is often impossible to achieve, but it is preferable. Be careful not to move the head during the recording.

7.
Note the GroupID in the metadata for the binaural recording, or make a manual note of the binaural recording file name and the GroupID separately.

8.
Take one 360° photo with the camera to capture the general setting. This can also be done at regular intervals during the survey session.

9.
When the participant has finished filling in the questionnaire, thank them for their participation and fill in the additional researcher questions at the end of the questionnaire. These help both to track the data collected and to document the conditions on site. The most important of these are:

Experience has shown this is possible up to about three groups at a time, with four researchers on site.

11.
Once the session is finished, stop the equipment and ensure all files are saved properly.

12.
After each session, make note of the character of the site and the environmental conditions during the survey. This might include, but is not limited to:
• Site typology and intended use (e.g., urban park, transit station, urban square)
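The requirement for at least 30 s of binaural audio free of talking can be checked after the session if the researcher notes when intrusive sounds occurred. The Python sketch below is a minimal illustration, assuming the intrusions are available as (start, end) times in seconds relative to the start of the recording. The function name and input format are hypothetical and not part of the protocol.

```python
def clean_windows(duration_s, intrusions, min_len_s=30.0):
    """Return (start, end) stretches free of intrusions, each >= min_len_s.

    duration_s: total length of the binaural recording in seconds.
    intrusions: iterable of (start, end) times of talking or other
                non-representative sound, in seconds.
    """
    windows = []
    cursor = 0.0  # end of the most recent intrusion seen so far
    for start, end in sorted(intrusions):
        if start - cursor >= min_len_s:
            windows.append((cursor, start))
        cursor = max(cursor, end)
    if duration_s - cursor >= min_len_s:
        windows.append((cursor, duration_s))
    return windows
```

An empty result would indicate that the recording does not contain a usable 30 s stretch, which is the situation the protocol avoids by extending the recording on site.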

Lessons from International Data Collection
As this protocol has already been implemented by several research groups across four countries, it has undergone a rigorous testing and development process. Throughout this process, adjustments have been made which resulted in the final protocol presented here. However, no process is perfect or applicable in all situations. As such, after consultation with the research groups involved, we have compiled the most common feedback and guidance to keep in mind when implementing this protocol.

Sampling
The research groups were instructed to try to keep the structure of respondents well-balanced. This often led to longer survey times and larger required sample sizes, as most comments from five research groups identified age and type of location as the most influential factors for participant sampling. However, while some groups reported higher response rates from younger members of the public (students), others reported higher response rates from older, highly educated people. A common observation was that public parks are the locations with the highest response rates, most probably due to the high number of people engaged in activities that leave enough time to take part in a survey. The type of space was also reflected in the sense of privacy. In locations that were more public, people in groups were more likely to take part in the survey, while in more private locations the opposite was true. Amongst other comments, whether a participant was a tourist or a local also had an influence on the response rate. Tourists seemed more likely to participate in the survey.
Several groups reported that excessive heat and cold negatively affected the response rates. One research group, which also conducted the survey in a residential area, identified privacy/ownership of the survey site as a major factor.

Data Collection
A group of three researchers seems to be the minimum number needed to conduct the survey, as observed by the partner research groups. A group of nine researchers on site proved to be the most effective number. The time needed to complete the survey varied greatly depending on the location.
Although the questions are written in a manner that emphasizes the focus on the actual acoustic environment perceived at the moment, additional care should be taken to ensure that participants properly understand this concept when they are approached. Researchers' comments are invaluable here for keeping track of outliers when a researcher feels that such misunderstandings or other factors (e.g., wearing headphones) have led to invalid or misleading data.

Equipment
Some partners had previous experience in soundscape research, but for all of them this was the first study that featured surveying a large number of public participants around a single measurement point. All the research groups found it very important to delegate one researcher/technician to look exclusively after the equipment and the quality of the recordings.
The intention of the Recording Stage is to record a first-person experience most representative of the location. Therefore, the researchers are instructed to 'make themselves invisible' in the recording. However, at some locations, various groups decided to put out a sign asking members of the public not to touch or come near the measurement point, as they had experienced passers-by touching the windshield out of curiosity.
The equipment setup has been designed to be as compact and unobtrusive as possible so as to limit any intrusion on the participant's experience of the space. From our experience, most participants do not end up with the equipment within their field of view during the questionnaire and often do not notice the presence of the stationary equipment. In some locations, this is not possible and participants may comment on its presence; however, over the thousands of surveys collected, only a small number of respondents have commented on the equipment as noticeably impacting their experience.

Translation
Regarding the on-site soundscape survey, the translation of the questionnaires (and in particular the perceptual adjectives used for the soundscape appraisal) is a key point to consider when using the protocol in regions where English is not the local language. Indeed, while the ISO/TS 12913-2:2018 document from which the soundscape-related questions of this protocol are derived aims at providing standardized scales, it does not provide official translations in languages other than English. Some perceptual constructs are difficult to render in different languages and people might assign different meanings to them (e.g., [51][52][53][54]). For this reason, in the soundscape research community, there is a growing interest in testing and validating reliable translations of the ISO soundscape adjectives [24], which will hopefully lead to widespread use of this soundscape tool. It is expected that these validated translations could simply be substituted for their English counterparts in this protocol when they become available.

Figure 2. Photo of a full survey carried out in a park in London during the Questionnaire Stage. To the left is the equipment (color-coded to match Figure 1), with the ambisonic microphone and SLM microphone in the windscreen and the 360° camera on top of the tripod; to the right, one researcher interacts with the participant while the second researcher conducts the binaural recording. The body of the SLM and the multi-channel recorder are stored in a bag under the tripod which can contain all of the pieces of equipment for easy transport.

Table 1. Labeling system for on site data collection. Regent's Park Japanese Garden is used as an example location. SLM: Sound Level Meter (acoustical factors); ENV: Environmental factors; BIN: Binaural; QUE: Questionnaires; PIC: Site pictures.

Table 3. Table of recommended context and acoustic measurement factors. * The recommended acoustic data settings are given here in order of importance. In cases where researchers do not have access to a meter capable of spectral logging, LAeq logging should be prioritized over spectral analysis. During both stages, spectral data can typically be extracted from the audio recordings, but accurately tracking the sound level is crucial. ** The recommended environmental factors are given here in order of importance. More flexibility is allowed in selecting which factors to record and investigate (compared to the acoustic data) as it is still unclear how and to what extent environmental factors influence soundscape assessment. However, previous studies have indicated that visual factors (i.e., lighting levels) and temperature are significant factors [43].