Semantic Crowdsourcing of Soundscapes Heritage: A Mojo Model for Data-Driven Storytelling

The current paper focuses on the development of an enhanced Mobile Journalism (MoJo) model for soundscape heritage crowdsourcing, data-driven storytelling, and management in the era of big data and the semantic web. Soundscapes and environmental sound semantics have a great impact on cultural heritage, also affecting the quality of human life, from multiple perspectives. In this view, context- and location-aware mobile services can be combined with state-of-the-art machine and deep learning approaches to offer multilevel semantic analysis monitoring of sound-related heritage. The targeted utilities can offer new insights toward sustainable growth of both urban and rural areas. Much emphasis is also put on the multimodal preservation and auralization of special soundscape areas and open ancient theaters with remarkable acoustic behavior, representing important cultural artifacts. For this purpose, a pervasive computing architecture is deployed and investigated, utilizing both client- and cloud-wise semantic analysis services, to implement and evaluate the envisioned MoJo methodology. Elaborating on previous/baseline MoJo tools, research hypotheses and questions are stated and put to test as part of the human-centered application design and development process. In this setting, primary algorithmic backend services on sound semantics are implemented and thoroughly validated, providing a convincing proof of concept of


Introduction
Cultural Heritage (CH) is considered very important from multiple perspectives of everyday modern human life, including but not limited to education, history, cultivation of cultural awareness, social engagement, entertainment, and well-being. The proliferation of Information and Communication Technologies (ICTs) and especially digital mobile devices has significantly propelled CH projects and associated featured services (websites, multimedia/mobile apps, etc.). In this context, ordinary users can navigate and virtually visit places and artifacts of cultural and heritage interest, literally without time or geographical restrictions. These services can be deployed when there is a chance of attending a physical environment with cultural value, for augmenting the whole experience (before, during, and after the visit), or for general infotainment activities. Apart from the cases of digital museums and exhibitions concerning artworks, historical buildings, monuments, and other cultural items, intangible CH has flourished through the processes of information capturing, documentation, and digital synthesis of CH storytelling experiences [1][2][3][4][5][6][7].
Among others, the audiovisual heritage associated with places, performances, and events can benefit from this progress in recording, managing, and authoring data-driven narratives [5][6][7][8][9]. In this context, average users can become active participants in the processes of contributing and exploiting multimedia content by experiencing, evaluating, and reinforcing the associated services. For instance, previous works have shown that applicable media assets can be quickly and massively crowdsourced, making use of the inherent audiovisual capturing and networking capabilities that modern mobile devices offer [10][11][12]. Apart from the data themselves, useful context-, time-, and location-aware metadata can be extracted to facilitate semantic information management and retrieval [13][14][15][16]. Through social tagging, it is possible to gather information about emotionally pleasant or unpleasant sounds in different urban areas [17]. However, as discussed in [10], not many ICT tools and/or services have been developed to support people in contributing audiovisual data, assisting toward the design of a CH framework.
Sustainability 2021, 13, 2714
Environments, either physical or artificial, bring together their own acoustic profiles. Distinct sound languages can shape a recognizable identity offering an individual experience to the human's sound perception [18]. The concept of the soundscape was introduced as early as 1977 by R. Murray Schafer, making the first attempts to describe what exactly a human ear hears or listens to, when in a particular and self-explained environment [19]. It was in 2008 when the International Organization for Standardization (ISO) established the working group ISO/TC 43/SC1/WG 54 "Perceptual assessment of soundscape quality." The objective of this group was to assist and promote consistency and compatibility between both theoretical and methodological approaches of soundscape studies and practice, developing the following definition, as given in ISO 12913-1, Section 2.3 [20]: "Soundscape is an acoustic environment as perceived or experienced and/or understood by a person or people, in context." Therefore, when discussing soundscapes heritage, the key issue is to focus not only on the meaning of sounds, but on their implicit impact on the everyday quality of life and the opportunity to promote genuine acoustic sustainability. Besides, the interdisciplinary field of soundscape studying and research also lies in the conservation of acoustic heritages [21,22].
Data-driven storytelling refers to making stories through data, i.e., the captured audiovisual content and its associated semantic metadata. In this perspective, multisite monitoring (offered by multiple mobile users) can be deployed, offering the option of selecting and/or augmenting the preferred viewpoint/reproduction configuration [14]. This feature matches well the empirical and strongly personalized aspects of perceiving soundscapes, as opposed to the somewhat neutral/impersonal acoustic environment capturing and reproduction [18][19][20][21]. Hence, the idea is to engage the audience in sound-related CH capturing and semantic description, thus forming a mediated way of experiencing soundscapes. Evidently, there are multiple aspects that can be assembled in this direction, encompassing all spatiotemporal, acoustic, visual, and semantic levels at the reproduction site. Nonetheless, the main goal here is to attract mobile users for collecting and contributing semantically enhanced media assets (i.e., audiovisual records with their pattern-related metadata), equipping them with the necessary Machine Learning (ML) capabilities for on-site sound detection and classification [15,16]. Such mobile applications would allow users to describe the associated scenes and sound-fields (both aurally and visually) and to share the soundscape experience as intangible CH storytelling. This notion of soundscapes, as perceived through the captured content, the offered retrieval/reproduction, and the associated sound (and video) semantics, will be considered throughout the rest of the paper.
The current work focuses on the collaborative collection and documentation of soundscapes and environmental sound semantics, which apart from CH, also significantly impact human life quality in multiple perspectives (as explained in the next sections). The whole approach has many similarities with sophisticated Mobile Journalism (MoJo) services, helping professional and citizen journalists collect news-items and shape them into featured data-driven storytelling [23,24]. Relying on the so-called MoJo-mate platform (Mobile Journalism Machine-Assisted Reporting) [23,24], an analysis is conducted regarding model elaboration and adaptation for the needs of soundscape heritage purposes. In this perspective, state-of-the-art machine and deep learning services are implemented both client-wise (mobile) and cloud-wise. This approach allows for multilevel semantic monitoring of sound-related heritage, while offering new insights toward sustainable urban and rural growth. Much emphasis is placed on capturing, preserving, and recreating soundscapes and open ancient theater acoustics, representing important cultural artifacts.

Related Work
Based on the preceding introduction, there are multiple perspectives concerning the related work around the discussed research domain. Data-driven storytelling, as a form of digital, sensemaking narrative, has recently received significant attention. Recognizing the increasing need to support novel means for integrating data visualization into narrative stories, featured cultural and audiovisual heritage projects deploy state-of-the-art technologies to capture, manage, and publish CH data through rich-media storytelling experiences [1][2][3][4][5][6][7][8][9]. Among others, related services or cultural activities (tomorrow's heritage) include tourist promotion and environmental preservation/awareness for landscapes and intangible artifacts [1][2][3], sites modeling/reconstruction and content restoration/documentation [4,5], and multi-disciplinary collaborations in research and education innovations [6][7][8][9]. Audiovisual and soundscape-related heritage initiatives also emerge, focusing on historical sound records and landscapes preserved, re-created, and reproduced as means of intangible CH expressions [25][26][27][28][29]. Furthermore, the impact of environmental sounds, noise, and soundscape components is analyzed on various aspects of modern human life, i.e., examining their associations with the residents' physical/mental health, perception, and behavior, aiming to unveil factors of sustainable growth and development and overall quality of life as well [30][31][32][33][34][35][36][37]. Social media soundscape information can serve for the prediction of health effects of noise pollution in different areas [38]. In this context, cooperative smart-sensing and crowdsourcing practices have been proposed and launched to raise public awareness toward soundscape conservation, safeguarding, and overall ecological consciousness through multimodal mapping capabilities [39][40][41][42][43][44][45][46].
Summing up, the conducted literature review revealed important aspects of soundscapes, i.e., environmental monitoring, sound and intangible cultural heritage, data-driven documentation, decentralized/smart sensing, etc., with diverse extensions on human health and sustainable growth indicators. Many related publications have attempted to illuminate most of the above viewpoints by utilizing mobile terminals and collaborative mapping [17][18][19][38][39][40][41][42][43][44][45]. However, to the best of our knowledge, no such multi-faceted approach (like the current one) has been reported, incorporating sophisticated on-site semantic analysis and crowdsourcing dynamics, as they are advanced in today's ubiquitous society (i.e., in the era of big data and the semantic web). The impact of the anticipated services is also strongly connected to featured projects, which have been deployed to discover and recreate sounds of the past, emanating from the perspectives of acoustic heritage, archaeo-acoustics, and historical acoustics. Such works, supported by limited historical/acoustic data, rely mainly on computational models and simulation outcomes to offer an intangible CH experience, projecting relationships between people and sound over time [46][68][69][70][71][72]. In this direction, we can foresee the substantial impact of the proposed MoJo-adapted system, which can document today's soundscapes to be experienced as tomorrow's heritage, taking advantage of semantically enhanced data-driven storytelling. Recalling the importance of ground-truth datasets and crowdsourcing audio semantics in the age of deep learning, the launched model can readily lead to massive soundscape data and metadata. The in-depth analysis of those repositories would reveal finer pattern correlations and taxonomies, with sharper conceptualization capabilities.

Project Motivation and Research Objectives
The related work presented in the previous section indicates that the field of crowdsourcing soundscape assets is very fruitful and mature, providing significant benefits for cultural heritage preservation and urban development. Audience engagement can be feasible, given a proper framework design. The motivation of the current project emanates from the idea of incorporating proper ML/DL analysis for soundscape semantics through a cloud-based architecture. For this reason, early backend implementations for General Audio Classification and Detection are presented and evaluated. The successful implementation of MoJo-mate, a mobile application offering machine-assisted reporting with semantically enhanced capture and documentation MoJo facilities [23,24], justifies this approach. The encompassed audio processing and recognition layers exhibit state-of-the-art time-, context-, and location-aware ubiquitous computing services, combined with generic/hierarchical pattern classification schemes [13][14][15][16]. These content analysis perspectives are considered ideal for meta-information augmentation of environmental sounds and soundscapes, which can be massively crowdsourced as User-Generated Content (UGC) to represent essential sites or places of intangible CH. The multilevel semantic interpretation of audio (and audiovisual) streams, contributed by both experienced and average users, will allow monitoring how the formed soundscapes have evolved and/or are still evolving over time and within special areas of interest. Typical examples include sensitive ecological zones, landscapes with environmental and cultural interest, places hosting cultural activities (in ancient or modern theaters and music halls), UNESCO world heritage sites, etc.
The utmost target is to collect the necessary volumes of data in an easy and entertaining way, provide in-situ/real-time and batch semantic analysis modes, augment the physical visiting experience, and enable data-driven storytelling through multiple auralization and visualization layers. Such techniques will allow the monitoring of the way acoustic comfort of historic urban and rural areas is affected by sound space components (e.g., cars, motorbikes, tourists) and, overall, the necessities of improving the environmental qualities. Another important aspect refers to assessing the mediated navigation experience of both physical and virtual visitors, with respect to the offered digital storytelling, derived by soundscapes and environmental acoustics recreation. No doubt, these perspectives are equally important for the processes of intangible CH collection, management, and preservation. In the long-term, sustainable growth and well-being indicators could be systematically monitored, correlated, and predicted in relation to the associated sound-field attributes (e.g., in heritage sites and areas featuring substantial environmental, cultural, or historical interest).
The work presented here is part of a broader project, aiming to collect and document multimedia semantics of soundscape heritage, to be later used for data-driven storytelling. The Logical User-Centered Design (LUCID) [6,7,11,23] was adopted through the whole process, emphasizing the audience engagement and reinforcement part. This was also one of the principal elements that had to be answered in the early beginnings of this undertaking, i.e., the degree to which targeted users would be interested to actively participate and contribute in this effort, which is aligned with the Analysis/Communication phase of standard application development procedures. Hence, a related survey was carefully set-up and executed to serve the needs of audience analysis. The second key factor would be to investigate whether mobile terminals and the associated algorithmic backend can be adapted to the task of crowdsourcing soundscape semantics. In this perspective, ML and DL systems were implemented as the initial/piloting algorithmic solutions and were thoroughly evaluated at various levels to provide a convincing proof-of-concept of the tested scenario.
Based on the above analysis, Research Hypotheses (RH) are stated and put to test, providing a convincing proof of concept of the proposed model, its feasibility, and effectiveness, emphasizing the semantic processing part:
Research Hypothesis 1 (RH1): It is both feasible and innovative to launch a Mobile Journalism application for soundscape heritage crowdsourcing and data-driven storytelling, and there is an audience willing to use the application and contribute.
In this context, the Research Questions (RQ) arising from the listed hypotheses are as follows:

Research Question 1 (RQ1): How can the MoJo framework be configured for soundscape heritage capturing and documentation? How can the crowdsourced media assets serve the needs for data-driven storytelling?
Research Question 2 (RQ2): What are the main classification taxonomies that can be incorporated in the initial backend implementations of soundscape recognition? What is the estimated accuracy and computational load of these algorithmic systems?
The rest of the paper is organized as follows. The system architecture and concept, as well as the experimental procedures, are presented and justified in the Materials and Methods section. Results and discussion illustrate the corresponding outcomes (and their thorough evaluation), providing multi-perspective analysis with regard to the stated hypotheses and questions. Conclusions are finally drawn, stressing the novel aspects and the contribution of the whole project, followed by the respective Summary section.

Integration of State-of-the-Art Audio and Soundscape Semantics on the Cloud
The main target of the current paper is to enhance the semantic aspects of capturing, managing, and recreating soundscapes, engaging the audience in the direction of mobile crowdsourcing and sharing related audio events. In this context, crowdsourced audio data can be comprehended in various ways, one of them being monitoring encountered soundscapes. Theoretically, this can be achieved by manually matching and managing different input streams from end-users, exploiting the aspects of semantic tagging and annotation at different levels of hierarchy. However, in real-world conditions, difficulties regarding user- and context-related heterogeneities arise, which require the employment of intelligent audio processing and interaction methods, to utilize and benefit from the underlying semantic information of audio data.
While many related processing strategies can be deployed on mobile computing environments, resources for processing and analyzing vast amounts of audio data in a mobile device are typically limited [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24]. Thus, a strong motivation for embracing cloud-based services emerges in this scenario. In this direction, accessible and highly capable cloud-based computing environments can facilitate the binding of semantically relevant content, by incorporating previous knowledge on individual soundscape characteristics (i.e., the rules that a listener would associate to a specific soundscape) [73].
Prevailing research on intelligent audio analysis and sound recognition is highly focused on the sub-fields of General Audio Detection and Classification (GADC) and Environmental Sound Recognition (ESR). The analysis aims at the semantic description of complex acoustic scenes, relying on a system that inputs an audio signal and outputs the semantic description of that signal. Hence, in this case, the meaningful aspects of a soundscape are to be detected and identified.
State-of-the-art approaches in computer audio intelligence motivate data-driven modeling, through machine learning. A wide variety of pre-processing and classification algorithms can deliver a solid generalization performance, given large amounts of training data. Moreover, the performance of these models is strongly dependent on the quality of the utilized data. For this reason, mobile devices can offer significant advantages in the direction of large-scale diversified labeled audio data gathering and the construction of generic ground-truth semantic audio databases [15].
Efficient pre-processing and semantic monitoring techniques can also be deployed as a front-end client-based system, given the ability to adapt to the variance in the acoustic environments and the respective sound recording conditions. This process can locally interact with the input signal and map it into a latent space, allowing users to monitor soundscape semantics on site, with the option to define patterns of interest and associate them with specific audio features, geolocations, and/or visual content [56,57]. The proposed modular architecture allows the attachment of multi-channel ambisonics sensors to the client terminal (i.e., soundfield microphones), to apply more sophisticated spatiotemporal localization and mapping that could facilitate the audiovisual content description and management [49][50][51][74,75]. On the other side, more demanding semantic analysis can be performed in a batch processing mode, as a cloud service, making use of recent advances in Convolutional Neural Networks (CNN), Deep Learning (DL), and multimodal decision-making systems [58][59][60][61][62][63][64][65]. The focus here lies in the discrimination of time-concurrent audio events in a hierarchical classification taxonomy. This processing type is more adapted to the audio domain and may have considerable advantages over end-to-end solutions. Moreover, a soundscape crowdsourcing approach is favored in the proposed methodology for constructing big datasets, as users are encouraged to contribute new labeled data while making use of the services. This real-world soundscape intervention approach to audio management systems can offer further conceptual analysis perspectives of crowdsourced audio data, layered on top of existing semantic analysis assets.
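To make the notion of time-concurrent events in a hierarchical taxonomy more concrete, the following minimal sketch shows one possible way to expand the leaf classes detected in a single audio segment into tags at every taxonomy level. The class names and the `PARENT` mapping are hypothetical illustrations and do not reproduce the taxonomy used in this work.

```python
# Hypothetical taxonomy: child -> parent (None marks a root class).
# These labels are illustrative assumptions, not the paper's taxonomy.
PARENT = {
    "speech": "voice",
    "singing": "voice",
    "voice": "human",
    "footsteps": "human",
    "human": None,
    "car": "traffic",
    "motorbike": "traffic",
    "traffic": "mechanical",
    "mechanical": None,
}


def ancestors(label):
    """Return the path from a label up to its root, label first."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def expand_segment(leaf_labels):
    """Map the concurrent leaf events of one segment to all taxonomy levels."""
    tags = set()
    for leaf in leaf_labels:
        tags.update(ancestors(leaf))
    return tags


# Two events overlapping in time within the same segment.
segment_tags = expand_segment(["speech", "motorbike"])
print(sorted(segment_tags))
```

Storing the expanded tag set per segment would let coarse queries (e.g., any "human" sound) and fine queries (e.g., "speech" only) be served from the same annotation.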

The Implemented Sound Heritage and Storytelling Model
Soundscapes can tell the story of spaces through time. While the acoustic scenes can characterize certain places and ecosystems, they are also in constant movement and evolution, as they change as a whole, and as temporary events occur, breaking the perceived continuity of sound. Treating environmental recordings in this scope allows the design of an interactive storytelling mode, where varying soundscapes can be in the spotlight of the narration.
When a crowdsourcing approach is adopted, definitive and linear storytelling is replaced by a collective narration, formed by combining the audio recordings and audiovisual assets that users provide. The criteria that individual listeners follow to access the available files define different perspectives and can form a vast number of stories that emerge from the provided soundscape recordings. An intuitive design can support interactive storytelling, facilitating the exploration of the dataset in creative ways. Two of the main aspects of treating soundscapes have already been mentioned and they refer to their spatial and temporal evolution. An interactive map, with a supplementary timeline option, can provide the functionality for filtering the data, using both the geographical and temporal information of the recordings. The user can access environmental recordings using an interactive world map, while the option of selecting the time interval within which the recording was created is available. Context-aware content-creating applications can provide such information without manual annotation at the time of the recording [11][12][13][14][15][16][74,75].
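The map-plus-timeline filtering described above can be sketched as a simple bounding-box and time-interval query over recording metadata. The `Recording` fields, the function name, and the sample entries are assumptions made for illustration only; they are not taken from the implemented platform.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Recording:
    """Hypothetical metadata record for one crowdsourced soundscape."""
    rec_id: str
    lat: float
    lon: float
    captured_at: datetime
    tags: tuple = ()


def filter_recordings(recordings, south, west, north, east, start, end):
    """Keep recordings inside the map bounding box and the timeline interval."""
    return [
        r for r in recordings
        if south <= r.lat <= north
        and west <= r.lon <= east
        and start <= r.captured_at <= end
    ]


# Illustrative (made-up) entries, not data from the study.
catalog = [
    Recording("a", 37.596, 23.079, datetime(2020, 7, 4, 21, 0), ("theater",)),
    Recording("b", 40.632, 22.959, datetime(2020, 7, 5, 9, 30), ("traffic",)),
    Recording("c", 37.597, 23.080, datetime(2019, 8, 1, 20, 0), ("theater",)),
]

hits = filter_recordings(
    catalog,
    south=37.0, west=22.5, north=38.0, east=23.5,
    start=datetime(2020, 1, 1), end=datetime(2020, 12, 31),
)
print([r.rec_id for r in hits])
```

This simple box test assumes the selected map area does not cross the antimeridian; a production service would more likely delegate such queries to a spatial index.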
Besides the straightforward spatiotemporal filtering of results, content-based retrieval can form different storytelling paths. Soundscapes that are far away in terms of distance or time may capture similar acoustic scenes, e.g., open theaters, cities, forests. Manual annotation from the content creators can provide a tagging scheme to retrieve relevant assets. By providing a data-driven analysis system on the cloud, several soundscape descriptors can be extracted automatically from the audio characteristics of the recordings. Users can form queries to browse through the dataset, based on the manual and automated tagging of data. In this approach, the integration of featured personalization and recommendation modules can push relevant content to the users, based on their queries and, overall, the monitoring of their behavior and interests.
So far, several scenarios of searching for audio content through textual input, as well as extracting textual descriptors from audio content, have been presented. However, modern trends in Human-Computer Interaction demand more intuitive query processes. In the context of soundscape storytelling, it is possible to retrieve audiovisual content through an audiovisual input query. By recording or providing a soundscape, users should be able to search the database through similarity checks (e.g., pattern matching). This will result in accessing content with audio characteristics that match the input. In the same way, by accepting not only audio recordings but also videos, or accompanying assets (e.g., users can upload photographs along with the environmental recordings), a mapping between different modalities can be created. By providing certain soundscapes, relevant content can be generated, and vice-versa. This interaction can provide great possibilities in the paths a user can follow to access different stories.
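A query-by-example search of this kind can be approximated by ranking stored items according to the similarity between feature vectors. The sketch below uses cosine similarity as one common choice; the three-dimensional vectors and item names are toy assumptions, and the actual descriptors and matching scheme of the system may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_by_example(query_vec, index, top_k=2):
    """Rank stored items by similarity to the query, most similar first."""
    scored = [
        (cosine_similarity(query_vec, vec), item_id)
        for item_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Toy index: item id -> hypothetical audio descriptor vector.
index = {
    "open_theater": [0.9, 0.1, 0.2],
    "city_street": [0.1, 0.9, 0.3],
    "forest": [0.2, 0.2, 0.9],
}

results = query_by_example([0.8, 0.2, 0.1], index)
print(results)
```

The same ranking function could serve cross-modal retrieval if audio and image assets were embedded in a shared vector space, which is an assumption beyond what is described here.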
Another meaningful parameter that can boost interactive storytelling functionality is the acoustic modeling of distinctive soundscapes, especially those related to cultural heritage (i.e., the cases of renowned ancient open theaters). This process of defining a transfer function can be used to estimate and imitate the acoustic behavior of a scene. If such a functionality is offered, users can provide studio-quality or close-miking recordings with no reverberation and simulate the reproduction of their recordings as if they had been made within various soundscapes [5,56]. It is essential to mention that related functionalities have been recently deployed on the MoJo-mate application, facilitating time-, context-, and location-aware audiovisual recordings with significant semantic enhancements concerning the encountered audio patterns and the surrounding acoustic behavior [11,12][16][17][18][19][20][21][22][23][24]. While these modalities have been successfully integrated and evaluated for the needs of MoJo capturing and publishing services, the proposed re-orientation can be even more valuable in the direction of preserving and demonstrating soundscape heritage. Furthermore, the collection of big data in a more organized manner and the gradual construction of semantically enhanced audio (and audiovisual) repositories can foster added-value services toward implementing diverse ground-truth sets and their utilization in more sophisticated semantic conceptualization automations. As already stated, such analysis perspectives can be correlated with human well-being, cultural heritage, and sustainable development indicators, which is very important in today's rapidly changing ubiquitous society.
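In the time domain, applying a measured or simulated transfer function to a dry recording amounts to convolving the signal with the impulse response of the space. The following sketch illustrates this with a direct-form convolution on toy sample values; the impulse response shown is a made-up example, not a measurement of any theater.

```python
def convolve(dry, impulse_response):
    """Direct-form linear convolution: y[n] = sum_k dry[k] * h[n - k]."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


# Toy "dry" signal (no reverberation) and a toy impulse response:
# direct sound followed by two weaker reflections.
dry = [1.0, 0.5, 0.0, 0.0]
room_ir = [1.0, 0.0, 0.6, 0.0, 0.3]

wet = convolve(dry, room_ir)
print(wet)
```

The nested loop is quadratic in signal length and is kept only for readability; for real recordings an FFT-based convolution would normally be preferred.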

The Proposed Model Architecture
The work presented emanates from the particularities residing in the vast increase in UGC. Evidently, mobile devices offer significant advantages in the direction of massive harvesting of large-scale diversified labeled audio data. Users' smartphones make the procedures of recording, recreating, and sharing audio and audiovisual material as simple as possible. Professional and nonprofessional users capture audiovisual content using mobile devices (smartphones and tablets) and upload it to the platform. However, multimedia data that are collected through crowdsourcing are often of low quality, due to nonprofessional hardware limitations and the lack of proper training. In this direction, mobile automations add a level of intelligence to assist the process. Difficulties regarding user- and context-related heterogeneities are overcome through the adoption of dedicated audio processing and interaction techniques for the semantic tagging and annotation of audio events.
To this end, the implementation of a 4-layer, cloud-based architecture is shown in Figure 1, offering audio-driven multimedia analysis and classification. Mobile terminals offer sensory and recording software to capture sound and audiovisual data, which can be enhanced with time, geolocation, and other context-aware metadata. The user can upload the created files on the cloud for analysis. The data handling layer is responsible for orchestrating and distributing the incoming data depending on the resource allocation, while also extracting audio tracks from audiovisual material and selecting the channels/segments to be further processed. Next, the audio processing and classification layer takes over, resulting in an assembly of salient (human-crafted) audio features, as presented on the left side of Figure 1 (terminal-wise analysis). A set of dedicated temporal feature integration processes is involved [54,57,59], attempting to classify the sounds identified in the given soundscape through typical Multi-Layer Perceptron (MLP) architectures. Apart from this on-site analysis, heavier processing is deployed on the cloud, utilizing state-of-the-art CNN architectures for machine-driven convolutional feature engines and finer pattern recognition (right side of Figure 1). Overall, these two independent flows employ different-complexity (and computational load) machine learning models, associated with the client-wise and server-wise (cloud) perspectives, as previously stated. The resulting entities are stored in a repository along with their semantic representation. Based on this information, an interactive map is created, augmented with a timeline bar and multiple semantic filtering options, taking into consideration time, location, and pattern-related tags. The captured audio streams are pinned in this multilevel information mapping so that spatiotemporal monitoring and auralization processes are offered as part of the storytelling. 
Hence, both the UGC contributors (displayed at the bottom of Figure 1) and the end-users/consumers (depicted at the top of Figure 1) can reproduce the evolution of sound and soundscapes over time, and in relation to the available semantic layers. The main goals of the proposed architecture concern the efficient and purposeful employment of cloud services and mobile artificial intelligence for the support of interactive soundscape exploration. More specifically, the current paper evaluates the individual and ensemble potentials of the two different semantic analysis processes (terminal- and cloud-wise), thus making a convincing proof of concept for their usefulness in the attempted CH data-driven storytelling.
Figure 1. The adopted semantic crowdsourcing model architecture. Terminal-site audio semantics is deployed through feature extraction, temporal integration (enhanced temporal integration (ETi)), and multi-layer perceptron (MLP)-driven pattern recognition. Server-wise semantics are applied in heavier processing modes using convolutional neural network (CNN) architectures for end-to-end content-based recognition. Captured audio (and audiovisual) data are enhanced with diverse semantic tags and pattern-related metadata, which are documented in the formed ground-truth repository. These media assets also augment the proposed data-driven cultural heritage (CH) storytelling model.
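As a rough illustration of the terminal-side flow, the sketch below performs a basic form of temporal feature integration: frame-level feature vectors are summarized over sliding windows by their per-feature mean and standard deviation, and the resulting vectors would then be passed to a classifier such as an MLP. This is a generic mean/deviation scheme assumed for illustration; it is not the ETi method of the cited works, and the window sizes and feature values are arbitrary.

```python
from statistics import fmean, pstdev


def integrate_features(frames, window, hop):
    """Summarize frame-level feature vectors over sliding texture windows.

    Each output vector concatenates the per-feature means and the
    per-feature population standard deviations of one window.
    """
    integrated = []
    for start in range(0, len(frames) - window + 1, hop):
        block = frames[start:start + window]
        columns = list(zip(*block))  # one tuple of values per feature
        means = [fmean(col) for col in columns]
        stds = [pstdev(col) for col in columns]
        integrated.append(means + stds)
    return integrated


# Four toy frames with two features each (illustrative values only).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0], [6.0, 3.0]]
vectors = integrate_features(frames, window=2, hop=2)
print(vectors)
```

Integrating over windows reduces the number of vectors the classifier must process, which is one reason such a step suits the limited resources of a mobile client.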

Experimental Setup

Concept Validation: Preparation of a Questionnaire Survey
The initial hypothesis (RH1) can be examined by answering typical questions for soundscape capturing, sharing, exploration, and specific aspects regarding users' cultural interests and habits, thus retrieving vital feedback. In order to grasp and monitor users' preferences the research utilized a quantitative survey method for data collection, with the formation of a corresponding online questionnaire.
Detailed information regarding this survey is provided in the associated results section, along with the assessment outcomes. An overview of the chosen inquiries is presented here, aiming to justify the adoption and configuration of the formed questionnaire. Hence, background-related questions (soundscape knowledge, relevance, previous use, etc.) were structured in a categorical form of potential answers, with 5-point Likert scales (1-5, from "Totally Disagree" to "Totally Agree" or from "Not at all" to "Very Often"). Binary values (i.e., gender) and higher-dimensional lists were also involved. The items were divided into three subsets, with the first involving basic characteristics/demographics of the users (questions 1-4), the second covering the participants' background/knowledge on soundscapes (questions 5-10), and the third containing suggested modalities and usability characteristics of the proposed mobile application (questions 11-17, in Table 1). The questionnaire was validated through discussions and focus groups with representative users and authorities of various kinds, including journalists, cultural and soundscape heritage enthusiasts, multimedia producers/programmers, technologists and researchers in machine/deep learning, environmental sound recognition, audio semantics, etc. The survey was updated based on the received feedback, investigating the audience interest in soundscapes and soundscape heritage, while also estimating the anticipated dynamics of the proposed approach.

Table 1 (excerpt: questions 13-17).

| # | Question (answer options) |
|---|---|
| 13 | Probability of using the app for own soundscape re-production (1-5) |
| 14 | Probability of using the app for others' soundscape re-production (1-5) |
| 15 | Technological maturity (Yes, No, Don't know) |
| 16 | App capability for soundscape sustainability (1-5) |
| 17 | App extra usability features and modalities (6 suggestions: soundscape search from soundscape, image search from soundscape, soundscape recommendations, given soundscape's related subjects, augmented reality, user sharing and combination) |

Table 1 synopsizes the final set of questions selected for the needs of this survey. During the survey preparation, all ethical approval procedures and rules suggested by the "Committee on Research Ethics and Conduct" of the Aristotle University of Thessaloniki were followed. The respective guidelines and information are available online at https://www.rc.auth.gr/ed/ (accessed on 2 March 2021). Moreover, the Declaration of Helsinki and the MDPI directions for the case of purely observational studies were also taken into account. Specifically, the formed questionnaire was fully anonymized, and the potential participants were informed that they agree to the stated terms upon sending their final answers, while they have the option of quitting anytime, without submitting any data.

Configuration and Validation of the Audio-Semantic Modalities
Aiming to conduct an objective evaluation of both the terminal- and server-side classification algorithms, a comparative evaluation between a lightweight feature-based method and a deep learning approach was carried out. As already explained, these two approaches represent the earliest algorithmic implementations that the project should launch, so they are investigated in this first study. Specifically, an Enhanced Temporal Integration (ETi) model [57] with a fully connected neural network (i.e., MLP) and typical 2-dimensional CNN topologies [58], proposed as the terminal- and server-side classification approaches, respectively, were tested on typical audio classification scenarios, utilizing common datasets. Again, the specific pattern analysis taxonomies are regarded as the minimum, though entirely adequate, pilot developments to provide a convincing proof of concept, while initiating the semantic crowdsourcing process and the gradual construction of the anticipated ground-truth repository.
The classification scenarios involve two datasets, according to a 3-class generic classification and an environmental 10-class scheme. The first one is simulated using the LVLib-v3 dataset [59], which follows the Speech/Music/Other (SMO) taxonomy, while the 10-class task is based on the UrbanSound8K dataset [55]. This decision is justified by the fact that the Other class of LVLib-v3 can be hierarchically split into more classes, which, for instance, can follow the scheme of UrbanSound8K [15]. On the one hand, LVLib-v3 includes 1.5 h of recordings, is available online at m3c.web.auth.gr/research/datasets (accessed on 2 March 2021), and specifies a 3-fold cross-validation strategy to make the results comparable across the algorithms of different creators. On the other hand, UrbanSound8K is a standard benchmark for environmental sound recognition and contains 8.75 h of field recordings, divided into 10 environmental sound categories.
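Evaluation with predefined folds, as specified for LVLib-v3, can be sketched as follows. This is a generic illustration and not the authors' evaluation code: the `cross_validate` helper, the majority-label stand-in classifier, and the toy samples are all hypothetical, and the use of the population standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev


def cross_validate(samples, train_fn, predict_fn):
    """Evaluate with predefined folds; each sample is (features, label, fold_id).

    For every fold, a model is trained on the remaining folds and scored on
    the held-out one. Returns (mean accuracy, std of accuracy, per-fold list).
    """
    fold_ids = sorted({fold for _, _, fold in samples})
    accuracies = []
    for held_out in fold_ids:
        train = [(x, y) for x, y, f in samples if f != held_out]
        test = [(x, y) for x, y, f in samples if f == held_out]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return mean(accuracies), pstdev(accuracies), accuracies


# Stand-in classifier: always predicts the most frequent training label.
def train_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]


def predict_majority(model, features):
    return model


# Toy samples (features unused by the stand-in classifier), three folds.
toy = [
    (None, "music", 1), (None, "music", 1), (None, "speech", 1),
    (None, "music", 2), (None, "speech", 2), (None, "other", 2),
    (None, "music", 3), (None, "music", 3), (None, "music", 3),
]
mean_acc, std_acc, per_fold = cross_validate(toy, train_majority, predict_majority)
print(per_fold, round(mean_acc, 3), round(std_acc, 3))
```

Fixing the fold assignment in the data, rather than shuffling at evaluation time, is what makes results comparable across different algorithms, as the text notes.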
Regarding the classification units, as aforementioned, the ETi with an MLP and a 2-dimensional CNN were deployed. Although the latest deep learning approaches can process raw waveform data [58], 2-dimensional topologies deliver a better balance between performance and computational cost and were therefore selected in this case. In addition, the ETi method has proved to be a lightweight solution for conventional feature-based classification, offering decent performance [59]. The CNN processes mel-spectrogram patches with a shape of 84 time-steps × 56 bands. Spectral analysis is executed with a window size of 512 samples and a step of 256 samples, at a sampling rate of 22,050 Hz. The convolutional network consists of four consecutive CPD blocks (each one containing successive Convolutional, Pooling, and Dropout layers), a Global Average Pooling (GAP) layer, and two Fully Connected (FC) layers with an additional Dropout layer in between. The convolutional layers use 16, 32, 64, and 128 filters, respectively, with a kernel size of 3 × 3, while the pooling size is set to 2 × 2. The first FC layer has 64 neurons, and the second (output) FC layer has as many neurons as there are classes. A schematic of the deployed CNN architecture is given in Figure 2 [54]. For the ETi-based classifier, a typical MLP setup was deployed with two hidden layers, featuring 64 and 32 neurons. Concerning the rest of the parameters, both networks follow the same configuration: the ReLU function was used as activation for all intermediate (Convolutional and Fully Connected) layers and SoftMax for the output layer, Categorical Cross-Entropy as the loss function, and Adam as the optimizer. Dropout was set to 25%.
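As a rough consistency check on the described CNN, the following sketch counts its trainable parameters and estimates the audio span covered by one input patch. It assumes a single-channel input, a bias term in every layer, no batch normalization, and no padding of the analysis frames; none of these details are stated in the text.

```python
SAMPLE_RATE = 22_050       # Hz, from the text
WINDOW, HOP = 512, 256     # samples, from the text
TIME_STEPS, BANDS = 84, 56  # patch shape, from the text


def conv2d_params(in_ch, out_ch, kernel=(3, 3)):
    """Kernel weights plus one bias per output channel."""
    return kernel[0] * kernel[1] * in_ch * out_ch + out_ch


def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out


def cnn_params(n_classes, filters=(16, 32, 64, 128), fc_units=64):
    """Parameter count of four conv blocks, GAP, and two FC layers."""
    total, in_ch = 0, 1  # assumed single-channel mel-spectrogram input
    for out_ch in filters:
        total += conv2d_params(in_ch, out_ch)
        in_ch = out_ch
    # Global Average Pooling yields one value per filter of the last block,
    # so the first FC layer sees in_ch inputs regardless of the patch size.
    total += dense_params(in_ch, fc_units)
    total += dense_params(fc_units, n_classes)
    return total


def patch_duration_seconds():
    """Audio span of one patch of TIME_STEPS frames (no frame padding assumed)."""
    n_samples = (TIME_STEPS - 1) * HOP + WINDOW
    return n_samples / SAMPLE_RATE


print(cnn_params(3), cnn_params(10))        # 105603 106058
print(round(patch_duration_seconds(), 3))   # 0.987
```

Under these assumptions, the network has roughly 106k parameters for the 10-class task, and each patch covers just under one second of audio. Pooling and Dropout layers add no parameters, and because of the GAP layer the count does not depend on the convolution padding mode.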

Concept Validation: Audience Analysis Results
To examine the proposed research question regarding the usefulness of an application similar to the one proposed, we undertook an online survey (N = 171). Data collection via an online survey appeared to be the most realistic and feasible method to reach a broad audience that would lead to a representative sample. From the collected sample, 61.4% of the respondents were females, 36.4% were males, while 2.3% preferred not to state their gender. Regarding the sample's distribution in the given age groups 18-25, 26-35, 36-45, 46-55, and above 55, the results are 30.4%, 48%, 15.8%, 5.3%, and 0.6%, respectively. In general, the results showed that many people are not familiar with what a soundscape is. In more detail, given six (6) common acoustic scenarios, the participants were asked to identify which of them could be considered soundscapes. The study shows that over 70% of the participants were able to identify the cases in which actual soundscapes were given (e.g., sound of a bell in a village), while, on the other hand, about 40% of them had difficulty recognizing a false-positive soundscape (e.g., a teleconference). The majority of the participants expressed their interest in the mediated soundscape experience targeted by the current project, as thoroughly analyzed below.
In order to balance the diversity of the sample, we selected 104 out of the 171 participants, namely those positively disposed toward soundscape heritage, who consider it an important factor for sustainability, especially in cultural places. This division was also dictated by the fact that some of the questions require a basic background and understanding of soundscapes. Thus, it would be unreliable or biased to give equal weight to the replies on soundscape heritage and semantics from those without a basic comprehension of the associated terms. The results from the selected sample (N = 104) show that only 30% of the participants explore soundscapes once a month. In addition, 30% of the participants record sounds and soundscapes frequently, while 66% of them record mostly cultural-related content. Moreover, 40% stated that they want soundscapes to be available for future reference and/or exploration. The selected sample also featured a clear interest in soundscape preservation over time, while the majority of them (69%) stated that they use their mobile devices for soundscape capturing and sharing. On the other hand, among the smaller percentage of participants not showing interest in sound heritage (13%) or being moderate about it (26%), almost half capture soundscapes quite often, thus constituting a group of potential application users.
It is noteworthy that although soundscape capturing, sharing, and reproduction are not that widespread, the selected participants showed a high interest in the proposed application. More specifically, 89% of the participants would use an application like the one proposed for soundscape capturing and sharing. Further, 77% would use the application for the reproduction of what was once recorded, either by themselves or other users. Finally, 87.5% of the participants believe that an application similar to the one proposed here would assist in the sustainability of soundscape heritage. Figure 3 provides graph statistics for both the whole (N = 171) and the subset group (N = 104), concerning some of the important questions (namely, #12, #13, #14, and #16). It can be noticed that most users are willing to capture and contribute soundscape recordings, especially the ones belonging to the selected subset (a mean value of 4.03 is observed with a standard deviation of ±1.11, compared to the 3.47 ± 1.11 respective values of the entire population). Likewise, almost all participants consider it very likely to reproduce their own or others' soundscapes, appraising the impact of the application on sound and soundscape heritage (again, the mean values are higher and with slightly smaller dispersion in the case of the selected sub-group). In summary, the results of the conducted survey validate the first hypothesis (RH1) and the associated research question (RQ1) that there is an audience willing to use the suggested MoJo application, contributing to soundscape heritage crowdsourcing and the subsequent data-driven storytelling (even subjects that do not fully comprehend the underlying principles of soundscape semantics).
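The mean ± standard deviation figures quoted for the Likert items can be derived from raw responses as in the following minimal sketch. The response list is a placeholder and not survey data, the `summarize_likert` helper is a hypothetical name, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was applied.

```python
from statistics import mean, stdev


def summarize_likert(responses, agree_threshold=4):
    """Summarize 5-point Likert responses.

    Returns the count, the mean, the sample standard deviation, and the
    share of responses at or above agree_threshold.
    """
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")
    share = sum(r >= agree_threshold for r in responses) / len(responses)
    return {
        "n": len(responses),
        "mean": mean(responses),
        "stdev": stdev(responses),
        "share_agree": share,
    }


# Placeholder answers to a single 1-5 item (not the actual survey responses).
placeholder_item = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]
summary = summarize_likert(placeholder_item)
print(f"{summary['mean']:.2f} ± {summary['stdev']:.2f}, "
      f"{100 * summary['share_agree']:.0f}% rated 4 or 5")
```

Applying the same function to the full sample and to the selected subset would give the kind of side-by-side comparison reported for Figure 3.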

Audio Classification Results
Classification results are presented (Table 2) in terms of accuracy statistics (mean value/standard deviation), as extracted from the associated evaluation on unknown samples (testing dataset). These figures indicate the expected generalization performance of the trained modalities, i.e., their ability to provide accurate classification estimates on entirely new/unknown data. On the LVLib-v3 dataset, both classification modules perform almost equally, achieving high scores, similar to relevant tests conducted in previous works [57][58][59][66]. As expected, the CNN classifier performs slightly better. As already explained, the UrbanSound8K dataset involves a 10-class scheme, making the classification problem more demanding, compared to the 3-class LVLib-v3 taxonomy. While a reasonable accuracy drop is noticed for this reason (i.e., in UrbanSound8K), the performance ratings of these models are in line with the current state-of-the-art on the same datasets. Concerning the UrbanSound8K dataset further, the high learning capacity of the CNN is more evident, making the performance gap wider, with the deep network clearly outperforming the conventional model. However, the utilized temporal integration technique ensures decent classification accuracy, making the feature-based approach capable of successfully accomplishing the more demanding task. The results show that the ETi lives up to the standards of deep learning approaches, especially when computational resources are limited [13,16,56]. This was further investigated through a computational complexity evaluation, which involves the measurement of prediction times for both models. The results are reported in relative terms because absolute measurements can vary significantly across processing units. Table 3 depicts the computational cost in terms of network size and prediction times.

Table 3. Network size and relative computational complexity for the ETi and CNN models.

| Model | Number of Parameters | Complexity |
|---|---|---|
| ETi | 15k | 1× |
| CNN | 100k | 2× |
It can be noticed that in the case of the ETi approach, the network size is significantly smaller, facilitating deployment on devices with low processing power. Nevertheless, the CNN is not so large as to preclude its deployment on modern mobile computing devices. Summing up, the CNN can equip both client- and cloud-wise semantic analysis services, while the ETi provides adequate performance at the lowest processing cost. These findings directed our decision to select the ETi and the CNN as the client- and cloud-wise classification solutions, respectively.
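A relative complexity figure of the kind reported in Table 3 can be obtained by timing predictions on the same input and normalizing by the fastest model. The sketch below is one plausible way to do this and not the authors' measurement protocol; the two stand-in "models" are simple placeholder functions, and taking the median over a few repeats is an assumption.

```python
import time


def median_prediction_time(predict_fn, batch, repeats=5):
    """Median wall-clock time (seconds) of predict_fn(batch) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(batch)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]


def relative_cost(times_by_model):
    """Express each model's time as a multiple of the fastest model's time."""
    baseline = min(times_by_model.values())
    if baseline <= 0:
        raise ValueError("timings must be positive")
    return {name: t / baseline for name, t in times_by_model.items()}


# Placeholder workloads standing in for a light and a heavier model.
def light_model(batch):
    return [sum(x) for x in batch]


def heavy_model(batch):
    return [sum(v * v for v in x) + sum(x) for x in batch]


batch = [[0.1] * 1000 for _ in range(200)]
times = {
    "light": median_prediction_time(light_model, batch),
    "heavy": median_prediction_time(heavy_model, batch),
}
print(relative_cost(times))
```

Reporting ratios rather than seconds keeps the comparison meaningful across devices, which is the motivation given in the text for the relative presentation.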
Overall, based on evaluation results of the trained models, and the justification concerning the selection of these two demanding datasets, the remaining research hypothesis (RH2) and question (RQ2) are validated/positively answered. Hence, the adopted audio classification schemes, suited for pattern-related soundscape semantics, can be served through relatively light-weight (concerning the required memory and computation load) ML and DL modules. Two related systems have been successfully trained and evaluated as the initial algorithmic backend solutions. The accuracy of those models is already more than satisfactory. However, it can be further enhanced through the users' feedback (and the implicated semi-supervised learning features) deployed within the proposed MoJo framework. Furthermore, the hierarchical and/or hybrid combination of the two taxonomies, along with the initiation of the crowdsourcing process, would lead to the gradual construction of a dedicated dataset. This problem-adapted ground-truth repository would facilitate the training of more sophisticated ML and DL networks with superior performance and additional semantic conceptualization perspectives.

Discussion
The current paper introduces MoJo services updated and adapted to the needs of semantic soundscape crowdsourcing, management, and data-driven storytelling. Based on the conducted experiments, the stated hypotheses have been fully verified, i.e., the audience is interested in such a mobile application (RH1). Furthermore, current technology is adequately mature to reliably deliver the wanted functionalities through General Audio Detection and Classification techniques deployed through Machine and Deep Learning networks to serve the required soundscape semantics (RH2). In addition, specific audio processing and semantic analysis features were tested in an effort to quantify the implementation parameters set in RQ1. The configured modalities, both client- and server-wise, exhibit remarkable accuracy with acceptable computational load. Based on the previous experience with the MoJo-mate platform [11][12][13][14][15][16][17][18][19][20][21][22][23][24], especially for the data shaping, presentation, and publishing part, the proposed model can efficiently deploy the desired data-driven storytelling and management services, which have a heavy impact on the CH domain. Concerning the technological adequacy and reliability that RQ2 inquires about, the proposed integration seems to overcome the expected difficulties and to suitably serve the desired semantic enhancement, documentation, and auralization/reproduction perspectives. Specifically, along with the above-mentioned low-level measurement modes, the software also provides long-term audio analysis capabilities, based on semantic audio processing concepts [56]. This higher-level mode brings real-time audio-pattern recognition, visually resulting in an event detection markup timeline. A dynamic audio-samples database is used as a pattern-storing matrix, which is configurable by users. Samples can be added by making a simple recording, and deleted as well.
Relying on the MoJo-mate application experience, a user-friendly measurement session manager is feasible, allowing each measurement to be easily stored on the mobile terminal memory and recalled on demand. Additional session measurement data can be stored, including title, location, user's comments, etc., while the position is automatically determined utilizing the device GPS. Likewise, timestamps are easily overlaid by the device, while a handy interface allows photo and video capturing of the measurement location, i.e., the recorded soundscape. A cloud-based session manager handles all the users' data, aiming at building a user-generated, spatiotemporal digital map used for storing measurements. Users can store, update, and retrieve raw audio data and their corresponding analysis output. All measurements uploaded to the cloud are accessible by anyone who uses the application. By exploiting the GPS sensor and cellular data capabilities, the application can easily classify and group measurements by geographical location and kind. Thus, a user can instantly check and confirm the correctness of a specific measurement by comparing it to similar ones, provided by other users. They can even obtain the desired data without making a measurement.
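Grouping measurements by location and kind, so that a user can compare a measurement against similar nearby ones, could be sketched as follows. The dictionary fields, the `similar_nearby` helper, the 1 km radius, and the sample coordinates are hypothetical; the great-circle distance uses the standard haversine formula with a mean Earth radius of 6371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def similar_nearby(query, measurements, radius_km=1.0):
    """Other measurements of the same kind within radius_km of the query."""
    return [
        m for m in measurements
        if m is not query
        and m["kind"] == query["kind"]
        and haversine_km(query["lat"], query["lon"], m["lat"], m["lon"]) <= radius_km
    ]


# Placeholder measurements (assumed fields: id, kind, lat, lon).
measurements = [
    {"id": "m1", "kind": "noise_level", "lat": 40.6320, "lon": 22.9400},
    {"id": "m2", "kind": "noise_level", "lat": 40.6325, "lon": 22.9410},
    {"id": "m3", "kind": "noise_level", "lat": 40.7000, "lon": 22.9500},
    {"id": "m4", "kind": "event_timeline", "lat": 40.6321, "lon": 22.9401},
]
neighbours = similar_nearby(measurements[0], measurements)
print([m["id"] for m in neighbours])
```

A linear scan is adequate for a sketch; a deployed service with many uploads would more plausibly rely on a spatial index.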
Audio recognition usually refers to different recognition tasks, like acoustic scene detection, speech recognition, and speaker recognition. Systems that implement such models are oriented to specific recognition scenarios. Applying audio recognition to soundscape management is a much more complicated task, since the information that can be extracted from the recordings is not pre-defined. Environmental noise can contain multiple layers of audio information and includes a great variety of possible temporal audio events. In the proposed approach, an ensemble of algorithms is combined to compose a hierarchical classification scheme. For example, an algorithm for acoustic scene classification can classify an acoustic scene as "river," while an audio event detector can recognize a "speech" audio event at a certain time, triggering algorithms that extract information concerning speaker diarization and spoken language, which in turn can trigger algorithms that transform speech to text, etc. This approach results in several layers or perspectives of audio monitoring, giving the user the possibility to browse through the data with different levels of information abstraction. In the context of environmental recordings, several information layers concerning acoustic characteristics, noise levels, etc. can also be included in the defined hierarchical scheme. Another interesting approach for analyzing complex scenes is automated audio (and audiovisual) captioning. This defines an end-to-end model that maps acoustic scenes to descriptive texts but can also correlate them with associated visual entities.
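The hierarchical, event-triggered scheme described above can be outlined with a small dispatcher. This is a structural sketch only: the `analyze_recording` function, the stub analyzers, and their return values are hypothetical stand-ins and do not perform any real recognition.

```python
def analyze_recording(recording, scene_classifier, event_detector, follow_ups):
    """Build layered annotations: scene, detected events, event-specific details.

    follow_ups maps an event label to a dict of {analysis_name: analyzer},
    where each analyzer is called as analyzer(recording, event).
    """
    result = {"scene": scene_classifier(recording), "events": []}
    for event in event_detector(recording):
        entry = dict(event, details={})
        # Only events with registered follow-up analyses trigger deeper layers.
        for name, analyzer in follow_ups.get(event["label"], {}).items():
            entry["details"][name] = analyzer(recording, event)
        result["events"].append(entry)
    return result


# Stub components standing in for trained models.
def stub_scene_classifier(recording):
    return "river"


def stub_event_detector(recording):
    return [
        {"label": "speech", "start": 2.0, "end": 5.5},
        {"label": "dog_bark", "start": 7.1, "end": 7.6},
    ]


follow_ups = {
    "speech": {
        "language": lambda rec, ev: "placeholder-language",
        "transcript": lambda rec, ev: "placeholder transcript",
    }
}

annotations = analyze_recording(
    "recording.wav", stub_scene_classifier, stub_event_detector, follow_ups
)
print(annotations["scene"], [e["label"] for e in annotations["events"]])
```

Each nesting level of the returned structure corresponds to one layer of information abstraction, so a browsing interface could expose the scene label first and reveal event-level details on demand.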

Summary
The current work focuses on the collaborative collection and documentation of soundscapes and environmental sound semantics. The whole approach has many similarities with sophisticated Mobile Journalism services, assisting professional and citizen journalists in collecting news items and shaping them into featured data-driven storytelling. Crowdsourcing media assets for cultural heritage is a fruitful field that can engage an audience through successful design and motivation decisions. Along with audio/multimedia content and metadata, semantic annotation can be incorporated through typical sound classification scenarios. A comparative evaluation between a lightweight feature-based machine learning network and a convolutional deep learning architecture was conducted for the terminal- and server-side algorithmic approaches, employing two different classification taxonomies with applicable audio datasets. Adopting the LUCID design and development methodology, audience engagement and reinforcement were triggered through an online survey, confirming that users are willing to contribute and appraise the impact of the application on crowdsourcing sound semantics and soundscape heritage.
The innovation of the paper lies in the incorporation of sophisticated on-site semantic analysis and crowdsourcing dynamics, as they are advanced in today's ubiquitous society (i.e., in the era of big data and the semantic web). Specifically, one of the advantages of this approach, which also highlights one of the main novelties of our work, is that besides collecting and storing resources (recordings of soundscapes and corresponding metadata) from users, it is possible to provide semantically enhanced services on the cloud. Environmental sound recognition is addressed in the paper as one of the featured functionalities using machine learning techniques. Relying on the so-called MoJo-mate platform, an analysis is held regarding model elaboration and adaptation for the needs of soundscape heritage. A four-layer, cloud-based architecture was deployed, incorporating two independent flows that employ different-complexity (and computational load) ML/DL models, associated with the client-wise and server-wise (cloud) perspectives for soundscape semantics. The achieved model performance supports the feasibility of the proposed system. The impact of the proposed MoJo-adapted system lies in the ability to document today's soundscapes to be experienced as tomorrow's heritage, taking advantage of semantically enhanced data-driven storytelling.