1. Introduction
Are there any connections between soundscapes and geography? Finnish geographer Johannes Gabriel Granö was the first to draw attention to sound as an element of landscape [1]. However, there are two theories about the origins of the term “soundscape”. One is that composer Alvin Lucier first used it in the 1960s [2]; the other is that urban planner Michael Frank Southworth first used it in a 1969 article to refer to the acoustic properties of cities [3,4]. It was later popularized by Canadian musician Raymond Murray Schafer, who pioneered acoustic ecology [5]. Since the 1980s, soundscapes have attracted the attention of scholars from many professions worldwide. Porteous and Mastin conducted a case study of soundscapes through field investigations [6]. Rodaway pointed out that sonic geography refers to the spatial organization of sounds and the characteristics of places in terms of sound [7]. Farina presented an overview of soundscape ecology from the aspects of principles, patterns, methods, and applications [8]. Schulte-Fortkamp et al. focused on the spatial and temporal relationships between people, activities, and places, alongside their impact on the sound environment through auditory sensation [9]. At the same time, many countries have established organizations and forums dedicated to soundscape research [10,11], among which the International Organization for Standardization (ISO) has proposed a standard for soundscapes, ISO/DIS 12913, consisting of four parts; the fourth part, “Design and intervention”, is under development [12,13,14]. The ISO defines a soundscape as an acoustic environment as perceived or experienced and/or understood by a person or people, in context [
12]. Since the beginning of the 21st century, the objectives of soundscape research have been continuously extended, and the research methods and contents have diversified to include multi-sensory landscape research on soundscape–smellscape–lightscape interactions [15,16,17,18]. In summary, geographers have joined the study of sound alongside other disciplines involved in soundscape research, such as acoustics, psychology, ecology, sociology, and architecture [4,11,16,19,20,21,22,23].
However, whichever disciplinary perspective it adopts, soundscape research has long focused on analyzing received sound source information. It lacks a description of the parameters of the geographical environment and of the receiver (e.g., the equipment used to acquire objective soundscape information), and of how these parameters influence sound generation, propagation, and reception; these are essential for fully constructing the soundscape distribution in a real geographical environment. A complete description and expression of the sound source, propagation medium, and receiver is a foundational part of soundscape research. Sound, as a part of the real world [
24,
25,
26], surrounds us in a complex geographical environment, and the study of soundscapes cannot be separated from the geographical environment, as soundscapes and geography are closely related. It is necessary to expand and enrich the study of soundscapes; however, there are few comprehensive and systematic studies of soundscapes from a geographical perspective.
Meanwhile, “one square inch of silence” is disappearing [
27]; many soundscapes are transient and change over time, which is a dynamic process and phenomenon that necessitates the recording of acoustic and geographical scene data and the preservation and reconstruction of soundscapes through certain techniques. Given the broad scope of soundscapes, constructing and maintaining a repository of sound data to reanalyze and study soundscapes from an interdisciplinary and cross-industry perspective would be an invaluable resource for scientists and practitioners [
28]. Most current research is based on soundscape data recording and collection by means of field questionnaires, interviews, and laboratory experiments [29,30,31,32,33,34,35,36], but these are based on subjective perceptions of soundscapes, which vary widely. Recently, the audio–visual interaction approach has become more popular in soundscape research [
37,
38,
39,
40]. Scholars have utilized digital photographs, videos, and virtual reality (VR) techniques in conjunction with microphone arrays to record the complete sound field for soundscape reproduction [
37,
39,
40,
41,
42]. For example, Li and Lau provided a review of research on audio–visual interactions in regard to soundscape assessment, including recording methods, reproduction techniques, and a detailed experimental environment design [
38]. This approach reproduces and visualizes recorded soundscape data in the laboratory using VR technology. However, there is no systematic approach to organizing and storing the information contained in the acquired audio, and even video, data. How to mine, organize, and store this implicit soundscape information is the starting point of this research. The mined soundscape information can then be used for subsequent soundscape analysis, 3D model reconstruction, and reproduction.
The ISO and Moving Picture Experts Group (MPEG) have been developing the Multimedia Content Description Interface, MPEG-7, since 1996, with the goal of providing an indexing, searching, and interoperability interface for multimedia resources to support content-based applications such as searching, management, and filtering. The MPEG-7 standard is a standardized description of the content of a multimedia message; it covers not the multimedia content itself but “data about data”. The standard includes various Descriptors (Ds), Description Schemes (DSs), and a Description Definition Language (DDL). In particular, MPEG-7 adopts XML because of its significant role in creating data structures for describing multimedia content. MPEG-7 can be used as a data model for storing and recording sound data, and the MPEG-7 audio standard specifies the representation of audio data at the feature and semantic levels rather than in a purely annotative form, as is the case with other metadata standards. This standard provides a standardized descriptive interface for describing, organizing, and expressing audio processing and analyses.
However, the MPEG-7 audio standard is limited to recording and retrieving objects and events in the audio through the characteristics of the audio itself. Its analysis of audio content lacks the constraints of geographical spatiotemporal information and does not integrate the audio information into the geographical scene. Storing only sound source-related information cannot completely describe the various scene elements, and the descriptions of those elements, covered in the soundscape. Geographical environmental parameters and receiver information are missing, which makes the stored data insufficient for constructing a complete soundscape. Therefore, this study explores a solution that, building on the existing MPEG-7 data model, describes and constructs a geo-soundscape data model completely and in detail, assists scholars who previously had to rely on subjective imagination to fill in missing information, and supports the retrieval and aggregation of soundscape information from different perspectives.
A lack of information on the geographical environment, receiver characteristics and parameters, and the sound source, and on the relationships and interaction mechanisms between them, makes it impossible to express the content of a soundscape fully. According to the “geographical scene” proposed by Lv et al. [24,25], sound and other elements are part of the geographical scene, and the soundscape is closely related to geography. Therefore, it is necessary to comprehensively expand the soundscape from the perspective of geography; however, Lv et al. did not further develop the concept, framework, and data model of the geographical soundscape.
The purpose of this study is to construct soundscapes from a geographical perspective; expand the existing soundscape research into geo-soundscape research; clarify the roles of sound elements, the geographical environment, and other elements in a geographical soundscape; analyze the correlations and interaction mechanisms between the elements; and describe and express the soundscape through various scene elements and element description dimensions. This study elaborates on geographical soundscape research through a concept, a framework, and a data model and provides new perspectives and directions for studying the sound environment. It also offers guidance for building a more reasonable urban landscape pattern.
  2. Concept and Framework for Geo-Soundscapes Based on Geographical Perspectives
Geographical perspective refers to a way of thinking that observes and recognizes the world from the perspective of the geography discipline and then analyzes and addresses problems. As 
Figure 1 shows, Lv describes the “geographical scene” carrying the six elements of time, place, people, things, events, and phenomena through seven descriptive dimensions: geographical semantics, spatial location, geometrical shape, attribute characteristics, element relationship, evolutionary process, and action mechanism [
24,
25,
26,
43]. The next step is to perceive and analyze the various scenes in the real world from the perspective of multidimensional geography, forming a highly condensed descriptive method to express the real world in an abstract manner and decompose the complex scenes into several describable units.
A soundscape is a typical complex geographical scene in which the sound emitted by a sound source is influenced and constrained by the geographical environment and ultimately acts on the receiver. The current definition of soundscape lacks an understanding of the action mechanism between sound and the geographical environment, and the soundscape is not composed of a single element; it is a synthesis formed by the interactions and interconnections between sound elements and geographical scene elements. Therefore, soundscape research needs to be reorganized from the definition and conceptual framework in order to (1) redefine the soundscape from the perspective of geography and comprehensively consider the information of sound source, geographical environment, and receiver, as well as the interactions and interconnections between sound elements and geographical scene elements; (2) construct a conceptual framework for the geo-soundscape, where, in addition to the content related to the sound source, the parameters of the geographical environment also need to be considered, as well as receiver characteristics, further geographical scene elements, and description information of the elements and other content.
This study further expands on soundscapes from the perspective of geography by defining the geo-soundscape on the basis of geographical scene cognition, thus enriching the conceptual framework of soundscapes. It also analyzes and expresses this conceptual framework through a hierarchy of low-, mid-, and high-level features, adding receiver information, geographical environment parameters, and further geographical scene elements and their descriptions. To lay the foundation for the geo-soundscape data model, the following must be integrated and analyzed: sound data and geographical scenes; basic features such as the receiver and geographical environment; geographical scene elements such as geographical objects and events; and scene element descriptions such as interactions and relationships between elements.
  2.1. Definition of Geo-Soundscape
In this study, “geographical soundscape (geo-soundscape)” is defined as a geographical complex in a certain spatial and temporal context that is composed of sound as a theme. Its background includes elements of physical, human, and information geography. These background elements and sound-related elements are interconnected and interact with each other to form a three-dimensional, dynamic, sound-themed complex.
Sound is an important part of the human-settlement environment and cannot exist independently of the geographical scene. The purpose of geo-soundscape research is to clarify the roles played by sound-related and background elements in a geographical scene, as well as the correlations between these elements. Understanding the interconnections and interactions between the elements makes it possible to simulate the distribution of geo-soundscapes in a real human environment and helps to bring harmony and balance to that environment and to create a sound environment suitable for human habitation.
  2.2. Geo-Soundscape Conceptual Framework
The geo-soundscape cannot be separated from the comprehensive effect of three aspects: the sound source, geographical environment during the propagation process, and receiver that receives the sound signal. In response to the current situation, in which only the extraction and calculation of the acquired acoustic signals are considered, this study expands multiple elements, such as the geographical environment and interaction relationship between them, to construct the conceptual framework of the geo-soundscape. The differences and similarities between a traditional soundscape and a geo-soundscape are shown in 
Figure 2. A traditional soundscape pays more attention to the perception of and reaction to sound by the subjective receiver (a human), while the geo-soundscape places more emphasis on the geographical environment and on the relationships and action mechanisms between the sound elements and the geographical environment elements. The receiver should not be limited to humans, who are only a special case; the data acquisition and expression of objective receivers, such as recording equipment, should also be considered. In the geo-soundscape, the sound source, receiver, and geographical environment are abstracted into six elements (time, place, people, things, events, and phenomena) and described along seven dimensions (geographical semantics, spatial location, geometrical form, attribute characteristics, element relationships, evolutionary processes, and action mechanisms) to form a complete geo-soundscape expression and description.
This study developed a description and expression of geo-soundscapes from the perspective of geographical scenes. The conceptual framework of the geo-soundscape is understood and analyzed from a hierarchical perspective, including in terms of both structural and content layering.
  2.2.1. Structural Layering
Structural layering is categorized, according to the audio structure, into audio frames, audio segments, audio, audio groups, and multiple audio groups. In this study, audio frame features were used for short-term audio processing. The audio segment features were statistically calculated based on the audio frame. Audio has a temporal structure and semantic content and can contain one category of consecutive sounds or multiple categories of sounds; therefore, the audio can be separated and segmented to detect the jump points of acoustic features and segment the audio units of a certain duration. An audio group is a combination of audio recordings from multiple locations within a certain scene range which can be used to calculate the spatial location of the sound source. Multiple audio groups are combinations of audio groups that expand the scene range. The audio data structure is shown in 
Figure 3.
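The relationship between frame-level and segment-level features can be illustrated with a minimal sketch. It assumes a synthetic signal, root-mean-square energy as the only frame feature, and a largest-difference rule for the jump point; the function names and parameter values are hypothetical and are not taken from the study's implementation.

```python
import math
from statistics import mean, pstdev


def split_frames(samples, frame_len, hop):
    """Cut a sample sequence into fixed-length frames, dropping any short tail."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_rms(frame):
    """Root-mean-square energy of one frame (a simple frame-level feature)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def segment_stats(frame_values):
    """Segment-level feature: statistics computed over frame-level values."""
    return {"mean": mean(frame_values), "std": pstdev(frame_values)}


def jump_point(values):
    """Index of the frame whose feature differs most from its predecessor."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return diffs.index(max(diffs)) + 1


# Assumed test signal: 1 s at 8 kHz, a quiet 440 Hz tone followed by a loud one.
rate = 8000
signal = [(0.1 if n < rate // 2 else 0.8) * math.sin(2 * math.pi * 440 * n / rate)
          for n in range(rate)]

frames = split_frames(signal, frame_len=400, hop=400)  # 50 ms frames, no overlap
rms = [frame_rms(f) for f in frames]
cut = jump_point(rms)                                  # frame index used to segment

print(len(frames), cut)
print(segment_stats(rms[:cut]), segment_stats(rms[cut:]))
```

Audio groups would then combine such per-recording results across several microphone positions; that step is omitted here.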
  2.2.2. Content Layering
In this study, the geo-soundscape was classified, according to its content-abstraction level and the degree of generalization of the content-expression concepts, into low-, mid-, and high-level feature layers. The existing soundscape conceptual framework includes audio parameters, audio features, geographical environmental information, and external microphone parameters. On this basis, this study further expands and adds geographical environment information, microphone features, and comprehensively acquired geographical scene elements and their description dimension information to construct a soundscape in a real geographical environment. 
Figure 4 shows that content layering is correlated with structural layering: some information, such as geographical environmental parameters, is obtained from on-site recordings; some, such as audio features, must be extracted from audio frames or segments; and some, such as the global scene of the high-level features, must be obtained from multiple audio groups.
The low-level feature layer is the base layer. It provides the information needed to extract element information from the geo-soundscape. Every piece of information in this layer can be obtained directly from the receiver, the geographical environment, or the audio itself. Audio-related information can be calculated from the audio and has been researched extensively. However, receiver and geographical environmental information cannot be obtained from the audio; it has been studied less and requires further examination.
The mid-level feature layer builds on the low-level feature layer and describes and expresses a local scene. Scene elements related to the sound source itself, and their description dimension information, must be extracted with relevant algorithms (signal processing, sound source separation, deep learning, and sound source localization). Other scene element information in the geographical scene relies on the geographical environment and receiver parameter information recorded and stored in the low-level feature layer. This layer uses the information obtained from the audio group to jointly extract geographical objects (time, place, people, things), phenomena, and events in the geographical scene. It also extracts the description dimensions of these scene elements (semantics, localization, geometry, attributes, relationships, processes, and action mechanisms) to provide the element and description information for final scene construction.
The high-level feature layer describes and expresses a global scene within a unified spatiotemporal framework. Scene elements such as time, place, people, things, phenomena, and events can be aggregated as the core and expressed based on seven dimensions: geographical semantics, spatial location, geometrical form, attribute characteristics, element relationships, evolutionary processes, and action mechanisms.
  3. Geographical Scene-Oriented Geo-Soundscape Data Model
A geographical scene-oriented geo-soundscape data model was constructed in this study from the perspective of geographical scene cognition. While retaining the original MPEG-7 model, the model adds a series of low-level feature information, including geographical environmental and microphone parameters. Based on this low-level feature information, mid-level feature information is extracted and integrated, and the elements in the soundscape are reanalyzed and redescribed from a geographical perspective. The scene elements are then aggregated, with one element as the core, to form a complete geo-soundscape. The model can describe in detail the sound information, element information, and correlations between the elements in the geo-soundscape, which satisfies storage needs from multiple perspectives and supports multi-conditional querying and flexible indexing. The model starts from the MPEG-7 basic framework and extends it into Geo-MPEG-7, which is expanded from the low through the mid to the high level so that the geo-soundscape data model is constructed layer by layer.
As 
Figure 5 shows, this study built a framework based on the MPEG-7 data model, followed the language rules of XSD, inherited the description of audio parameters and features from MPEG-7, and expanded other audio, microphone, environmental, and low-level features. From the perspective of geography, low-level features are refined into mid-level features, forming six elements with time, place, people, things, events, and phenomena as the core. They are described through the seven dimensions of geographical semantics, spatial location, geometrical form, attribute characteristics, elemental relationships, evolutionary processes, and action mechanisms, ultimately realizing the aggregation of high-level features with a certain scene element as the core and the construction of a geo-soundscape.
  3.1. Low-Level Feature Layer
The low-level feature layer includes audio features, audio parameters, microphone features, and geographical environment parameters. Audio features describe the characteristics of the audio itself and are the most important sources for obtaining and detecting information about geographical objects, phenomena, and events in a geo-soundscape. The audio features include MPEG-7 audio descriptors, time-domain, frequency-domain, and acoustic perception features. In addition, the parameters of the audio file are important, including the audio creation time (which requires extended recording), duration, format, number of channels, bit depth, sampling rate, and audio size.
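As a minimal sketch of how the file-level audio parameters listed above might be collected, the following example reads them from a PCM WAV file with the Python standard library. The dictionary keys are hypothetical, and the file modification time is used only as a stand-in for the creation time, which a WAV header does not store and which would need to be logged at recording time.

```python
import os
import tempfile
import wave
from datetime import datetime, timezone


def read_audio_parameters(path):
    """Collect file-level audio parameters from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        n_frames = wav.getnframes()
        rate = wav.getframerate()
        params = {
            "format": "WAV (PCM)",
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,   # bytes per sample -> bits
            "sampling_rate_hz": rate,
            "duration_s": n_frames / rate,
        }
    params["size_bytes"] = os.path.getsize(path)
    # Assumption: modification time approximates the creation time.
    mtime = os.path.getmtime(path)
    params["creation_time"] = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
    return params


# Write one second of 16-bit mono silence so the example runs without input data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    audio_params = read_audio_parameters(demo_path)

print(audio_params)
```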
Only the semantic category of the sound source can be extracted from the audio itself; the time at which the source sounded, its spatial location, its evolutionary process, and other information cannot be obtained in this way. It is therefore necessary to additionally record the audio creation time, the microphone receiver’s positional information (posture and spatial location), and other parameters in order to determine the sounding time of the source and to trace and track its position; other information related to the sound source can then be derived. The recorded frequency and intensity of sound are also limited by the performance of the microphone. Therefore, the internal parameters of the microphone (e.g., directivity, frequency range, and sound intensity measurement range) are also important and must be recorded. Currently, many devices can collect audio data, including professional microphones and smart terminal devices such as mobile phones and computers, which can collect sound signals with a certain accuracy and measurement range. Therefore, the microphone receiver features must also include information such as the type of acquisition equipment and the distance over which it can receive sound signals.
However, apart from the sound source, other geographical scene elements in the soundscape, such as the geographical environment encountered during sound propagation, can only be added with the help of external data. In this study, geographical environmental parameters refer to geographical scene elements that cannot be obtained directly from the audio and are unrelated to the sound source itself but play a crucial role in the propagation of sound signals. The geographical environmental parameters that constrain sound propagation can be regarded as all media through which sound signals pass, together with their properties. The media in a geo-soundscape are described by two categories of information: type and spatial location. By type, they consist of solids, liquids, and gases.
  3.2. Mid-Level Feature Layer
The mid-level feature layer describes the scene elements. The time element can be obtained from the metadata description of the audio file or from the parsed audio content. The location element is the spatial location of the study object under a specified reference system; it can also be the location of a sound source obtained using a localization algorithm based on the positions of multiple microphones. The people (character) element appearing in a soundscape is mainly obtained by parsing the audio file content.
The scene element description dimension information is used to comprehensively reflect and describe the spatial differentiation regularity, evolutionary process, and interaction relationships of geographical objects and phenomena. The semantic information of the sound source, which is a geographical scene element, can be extracted using sound source classification algorithms, such as audio feature extraction and deep learning. The spatial location of the scene element associated with the sound source may be the spatial location of the static element or that of the dynamic element. The locations of the elements in a scene can be calculated and determined based on audio signals over time. The geometrical form of the scene elements associated with the sound source can be described by geometrical shapes, such as points, lines, polygons, and volumes. Attribute characteristics are detailed characteristics that describe scene elements in a geo-soundscape, including spatiotemporal, physical, chemical, biological, and social attributes. Element relationships are sequential, temporal, spatial, and spatiotemporal. Because the sound emitted must have lasted for a certain period before it can be captured by a person or microphone, the change and development of the elements in the geo-soundscape cannot be described by sudden occurrences and disappearances, and the evolutionary process of sound-producing elements can only be described by continuous changes. The action mechanism of sound-producing elements includes physical, chemical, biological, social, and economic factors. On the one hand, there is an interaction mechanism between sound sources, propagation media, and receivers. On the other hand, there are certain element relationships between different elements resulting in the transmission of information, so there may be an action mechanism between elements.
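One way to picture a scene element together with its seven description dimensions is as a single record. The sketch below is an illustration only: the class name, field names, and all example values are hypothetical placeholders and do not reproduce the Geo-MPEG-7 schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

ELEMENT_TYPES = {"time", "place", "people", "things", "events", "phenomena"}
GEOMETRIES = {"point", "line", "polygon", "volume"}


@dataclass
class SceneElement:
    """A scene element with one field per description dimension (illustrative)."""
    element_id: str
    element_type: str                                        # one of the six elements
    semantics: str                                           # geographical semantics
    location: Optional[Tuple[float, float, float]] = None    # spatial location
    geometry: str = "point"                                  # geometrical form
    attributes: dict = field(default_factory=dict)           # attribute characteristics
    relationships: list = field(default_factory=list)        # element relationships
    process: list = field(default_factory=list)              # evolutionary process
    mechanism: str = ""                                      # action mechanism

    def __post_init__(self):
        if self.element_type not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {self.element_type}")
        if self.geometry not in GEOMETRIES:
            raise ValueError(f"unknown geometry: {self.geometry}")


# Placeholder values for a car horn treated as a "things" element.
horn = SceneElement(
    element_id="thing-1",
    element_type="things",
    semantics="car horn",
    location=(12.0, 3.5, 0.8),
    attributes={"physical": {"peak_level_db": 95}},
    relationships=[("spatial", "near", "place-1")],
    process=[(0.0, "onset"), (0.6, "sustain"), (1.0, "decay")],
    mechanism="sound propagates through air to the receiver",
)
print(horn.element_type, horn.semantics, horn.geometry)
```

Storing the evolutionary process as a sequence of timed states reflects the point made above that sound-producing elements change continuously and are not described by sudden occurrences.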
  3.3. High-Level Feature Layer
High-level features are expressed by geo-soundscapes, which are aggregated from elements of low- or mid-level features. The constructed geo-soundscapes can be categorized and analyzed from the perspectives of scale, dimension, scope, and domain. By scale, they can be categorized into general and refined geo-soundscapes; by dimension, into two- and three-dimensional geo-soundscapes; by scope, into large- and small-range geo-soundscapes; and by domain, into learning, working, living, shopping, war, and disaster scenes. Appropriate scene categories and divisions can be selected based on different application requirements.
In addition to the low- and mid-level features used to obtain feature information, as well as selecting scene types for soundscape construction under different application requirements, the retrieval and aggregation of sound data are also crucial for soundscape construction. Therefore, a high-level feature layer includes the aggregation of a certain scene element as its core. In practical applications, retrieval and aggregation solutions that meet multiple perspectives and conditions are needed. Therefore, based on geographical scene cognition, this study proposes taking a certain scene element as the core for aggregation under a unified spatiotemporal framework and the construction of a global scene with high-level features corresponding to a complete geo-soundscape. For example, using the time element as the core, we can aggregate all the audio at 23:00 on 7 May 2024; using the place element as the core, we can aggregate all the audio on the east side of a certain building; using the character element as the core, we can aggregate all the audio of Lucy; using the thing element as the core, we can aggregate all the audio of a certain car horn; using the event element as the core, we can aggregate all the audio of a certain earthquake disaster event; and using the phenomenon element as the core, we can aggregate all the audio of a certain torrential downpour.
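The aggregation examples above can be sketched as a grouping operation over mid-level records. The record layout, file names, and the `aggregate` helper below are hypothetical; they only illustrate the idea of choosing one scene element as the core.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical mid-level records: each clip lists the scene elements found in it.
records = [
    {"audio": "a01.wav", "time": datetime(2024, 5, 7, 23, 0),
     "place": "building east side", "people": ["Lucy"],
     "things": ["car horn"], "events": [], "phenomena": []},
    {"audio": "a02.wav", "time": datetime(2024, 5, 7, 23, 0),
     "place": "building west side", "people": [],
     "things": ["car horn"], "events": [], "phenomena": ["torrential downpour"]},
    {"audio": "a03.wav", "time": datetime(2024, 5, 8, 9, 30),
     "place": "building east side", "people": ["Lucy"],
     "things": [], "events": [], "phenomena": []},
]


def aggregate(records, core):
    """Group audio files by the values of one core scene element."""
    groups = defaultdict(list)
    for rec in records:
        value = rec[core]
        # Single-valued elements (time, place) are treated like one-item lists.
        values = value if isinstance(value, list) else [value]
        for v in values:
            groups[v].append(rec["audio"])
    return dict(groups)


by_time = aggregate(records, "time")
by_person = aggregate(records, "people")
print(by_time[datetime(2024, 5, 7, 23, 0)])   # clips recorded at 23:00 on 7 May 2024
print(by_person["Lucy"])                       # clips in which Lucy is heard
```

A practical system would match times and places by range or spatial query; exact-value grouping is used here only to keep the sketch short.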
  3.4. MPEG-7 Basic Framework and Its Inheritance Relationship by Geo-MPEG-7
To facilitate expansion, MPEG-7 defines only a root element, named MPEG7, an optional top-level sub-element (DescriptionUnit or Description), and an abstract top-level sub-element category. Therefore, users can extend and modify the top-level sub-elements by inheriting and implementing the top-level sub-element category. The DescriptionUnit can represent part of the multimedia content information using any type defined in the MPEG-7 framework, while the Description is a complete description of the multimedia content using the abstract type “CompleteDescriptionType”. As 
Figure 6 shows, this study selects the descendant types of the abstract type “CompleteDescriptionType” of the “Description” node for inheritance and expansion.
As 
Figure 7 shows, “CompleteDescriptionType” and its subclass “ContentDescriptionType” are abstract classes that cannot be used directly. The two define a series of description information and sensory information about the audio metadata, including description metadata (DescriptionMetadata), relationships between defined description element types (Relationships), the order in which instances appear (OrderingKey), and users’ emotional feedback to multimedia content (Affective), which are optional types that do not need to be implemented.
As 
Figure 8 shows, “ContentEntityType” is an instantiable type that inherits from “CompleteDescriptionType”. Its defined child node, “MultimediaContent”, is defined using the abstract type “MultimediaContentType” and relies on the instantiable classes that it inherits to populate it. These classes express audio content characteristics using a general audio description framework and application-specific tools.
As “MultimediaContent” is an element node that describes the information related to multimedia content, “ContentEntityType” (the type that contains the “MultimediaContent” node) is selected to expand the description information of non-multimedia content (e.g., microphone features and environmental parameter characteristics) and extend the type inheritance, as 
Figure 9 shows. This implies that when the Geo-MPEG-7 class is instantiated, the MPEG-7 content is completely included in the Geo-MPEG-7 framework.
  3.5. Geo-MPEG-7
The design of Geo-MPEG-7 in this study continues the MPEG-7 architecture. To facilitate the subsequent expansion, each node was defined differently. For ease of reading, the designed XSD is organized into six files (see “
https://github.com/lgnzgxl/Geo-MPEG-7 (accessed on 15 January 2025)”): basic; low-, mid-, and high-level feature layers; Geo-MPEG-7 overall structure type; and Geo-MPEG-7 inheritance nodes. The definition of the basic type is mainly expanded to basic data types, including simple element types with added attributes or limited value ranges, such as coordinates, trajectories, geometry, time periods, Euler angles, temperature, and humidity, and complex element types consisting of a series of elements and attributes, such as geographical semantics, spatial location, geometrical forms, attribute characteristics, element relationships, evolutionary processes, and action mechanisms. For convenience, several simple basic attributes were defined.
As 
Figure 10 shows, the element types expanded by Geo-MPEG-7 include three levels of feature layers: low, mid, and high.
  3.5.1. Information Extraction and Storage
Low-level features are the basic information related to the sound recording, which may be derived from the audio data or other related data. They can be features of the audio content extracted by algorithms or external parameters that cannot be obtained by relying only on the audio content.
The information extraction of mid-level features includes descriptions of the time, place, people, things, events, and phenomena present in the geo-soundscape and the semantics, localization, geometry, attributes, relationships, action mechanisms and evolutionary processes of these elements. Information about the geographical scene elements and element descriptions related to sound is extracted by applying source separation, sound source classification, and source localization algorithms to the audio data. Some of the information is extracted by relying on extended information recorded in the low-level features. Some information can be obtained through speculation on and judgment of the extracted information.
High-level feature extraction consists of two aspects: (1) soundscape data are retrieved and aggregated on demand according to low-level feature and mid-level feature information; (2) a geo-soundscape is constructed by combining the sound propagation model through a specific aggregation scheme.
Using the above method, the information acquired at each layer is recorded and stored in the Geo-MPEG-7 XML file as numbers, text, or attached pictures and documents.
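The layered storage described above can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions: every element and attribute name (`GeoSoundscapeRecord`, `LowLevelFeatures`, `SoundSource`, and so on), the file name, and all values are hypothetical placeholders and are not taken from the Geo-MPEG-7 schema.

```python
import xml.etree.ElementTree as ET


def build_record(recording_id: str) -> ET.Element:
    """Build a toy three-layer record; all names and values are placeholders."""
    root = ET.Element("GeoSoundscapeRecord", id=recording_id)

    # Low level: basic information tied to the recording itself.
    low = ET.SubElement(root, "LowLevelFeatures")
    ET.SubElement(low, "Duration", unit="s").text = "1.0"
    ET.SubElement(low, "AttachedFile", href="spectrogram.png")

    # Mid level: scene elements and their descriptions.
    mid = ET.SubElement(root, "MidLevelFeatures")
    source = ET.SubElement(mid, "SoundSource", category="human_voice")
    ET.SubElement(source, "Location").text = "1.0 2.0 1.5"

    # High level: how records are aggregated into a geo-soundscape.
    high = ET.SubElement(root, "HighLevelFeatures")
    ET.SubElement(high, "AggregationScheme").text = "by_source_category"
    return root


def sources_by_category(root: ET.Element, category: str) -> list:
    """Retrieve mid-level sound sources whose category attribute matches."""
    return root.findall(f"./MidLevelFeatures/SoundSource[@category='{category}']")


record = build_record("rec-001")
xml_text = ET.tostring(record, encoding="unicode")
```

The helper `sources_by_category` hints at the on-demand retrieval step of the high-level layer: once numbers and text are stored at named nodes, they can be filtered with a simple path query.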
  3.5.2. Illustration with Examples
Owing to space limitations, we use the extraction, recording, and description of audio features and environmental parameters as an example.
A human voice recorded for 1 s is used as the sound source (see Figure 11 for the spectrogram), and audio features are extracted based on the MPEG-7 audio standard; that is, the MPEG-7 audio encoder (see “https://mpeg7audioenc.sourceforge.net (accessed on 15 January 2025)”) is used for the calculation. The results are output to the Geo-MPEG-7 XML file; Figure 12 shows some of the stored results.
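To give a feel for what a frame-wise low-level audio feature looks like, the sketch below computes the mean-square power of consecutive frames, which is similar in spirit to the MPEG-7 AudioPower descriptor. It is not the MPEG-7 audio encoder used in this study, and it does not reproduce that tool's exact windowing or output format. The function name `frame_power`, the 8000 Hz sample rate, the 10 ms frame length, and the synthetic 100 Hz tone that stands in for the recorded voice are all assumptions.

```python
import math


def frame_power(samples: list, frame_len: int) -> list:
    """Mean-square power of each complete, non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


SAMPLE_RATE = 8000   # Hz, assumed for the example
FRAME_LEN = 80       # samples, i.e. 10 ms at 8000 Hz

# One second of a synthetic 100 Hz tone with amplitude 0.5 (a stand-in signal).
signal = [
    0.5 * math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)
]

powers = frame_power(signal, FRAME_LEN)
```

Each 10 ms frame here spans exactly one period of the tone, so every frame has a power close to 0.5² / 2 = 0.125. A series like `powers` is the kind of numeric result that would be written to the low-level feature layer.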
The geographical environment has a large impact on the propagation of sound. Relevant factors include the temperature and humidity of the atmosphere, the temperature of liquids, the material of solids, and the spatial location of the various media. Atmospheric temperature and humidity can usually be obtained directly through field observations, while liquid and solid temperatures can be measured with thermometers. The material of solids is usually recorded with the help of external data and should include a sound absorption coefficient and other sound-related parameters; if a parameter is missing, the relevant information can be obtained by querying the data.
Figure 13 shows a field diagram of the geographical environment parameters recorded for this 1 s human voice. After the sound is emitted, the propagation media it passes through mainly include air and solids such as doors, floors, ceilings, and walls. Atmospheric parameters can be obtained through on-site measurements; here, the temperature and humidity are 5 °C and 30%, respectively. The spatial location of the atmosphere can then be represented by a 3D spatial range covering all locations within the indoor scene. Solids, such as doors, floors, ceilings, and walls, can be attached as auxiliary information by means of corresponding polygonal geometric model files. As Figure 14 illustrates, all geographical environmental information that needs to be recorded can be stored at the corresponding nodes of Geo-MPEG-7 to describe and express the soundscape at that time as completely as possible.
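As a rough illustration of how such environmental parameters might be written to XML nodes, the following Python sketch stores the measured temperature (5 °C) and humidity (30%) together with references to geometry files for the solids. The node names (`GeographicalEnvironment`, `Atmosphere`, `Solid`), the attribute names, and the `.obj` file names are hypothetical and do not reproduce the actual Geo-MPEG-7 nodes shown in Figure 14.

```python
import xml.etree.ElementTree as ET


def build_environment(temperature_c: float,
                      humidity_percent: float,
                      solid_models: dict) -> ET.Element:
    """Store atmosphere parameters and solid geometry references (toy layout)."""
    env = ET.Element("GeographicalEnvironment")

    air = ET.SubElement(env, "Atmosphere")
    ET.SubElement(air, "Temperature", unit="degC").text = str(temperature_c)
    ET.SubElement(air, "Humidity", unit="percent").text = str(humidity_percent)
    # The atmosphere is assumed to fill the whole indoor scene.
    ET.SubElement(air, "SpatialExtent").text = "indoor_scene"

    # Each solid points to an external polygonal model file (hypothetical names).
    for name, model_file in solid_models.items():
        ET.SubElement(env, "Solid", name=name, geometry=model_file)
    return env


env = build_environment(
    5.0,
    30.0,
    {
        "door": "door.obj",
        "floor": "floor.obj",
        "ceiling": "ceiling.obj",
        "wall": "wall.obj",
    },
)
print(ET.tostring(env, encoding="unicode"))
```

Keeping the geometry in external model files and storing only a reference in the XML keeps the description compact, which matches the use of polygonal model files as auxiliary information described above.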
   5. Discussion
Many scholars have recognized the importance of the geographical environment, which affects the propagation and diffusion of sound, and have repeatedly emphasized the study of geographical environment information and acoustic parameters [44,45,46,47]. In general, the current conceptual framework for soundscapes includes the received audio features related to the sound source (e.g., MFCC, spectrogram, frequency, MPEG-7 audio descriptor), the semantic category information of the sound source, spatial location information, information on the person/group itself, and information on the geographical environment (e.g., climate, weather, wind direction, temperature, humidity, buildings, facilities). However, the internal and external parameters of the receiver, the element relationships, and the action mechanisms between sound elements and sound-related elements have not been explored, leaving the current soundscape framework incomplete. Consequently, no existing data model can adequately store and express the soundscape or fully characterize the soundscape distribution in real environments. Four problems follow. (1) The receiver’s unknown parameters cause deviations in the collection of sound source information; for example, mobile phones do not have the wide frequency band and sound intensity range of professional equipment. (2) The lack of location information for the geographical environment, and of its relationship and interaction with the sound source, means that we cannot determine the influence of the geographical environment on sound propagation. (3) The lack of a relationship or interaction between the receiver and the sound source means that we cannot completely capture the sound information. (4) The absence of a well-developed data model means that we cannot store and organize the expanded information and thus cannot achieve the complete construction of soundscapes in real environments.
By analyzing the actual soundscape distribution, this study shows that the complete construction of a soundscape depends on the completeness of the information in the soundscape and on the association and interaction between the various soundscape elements. Whereas the traditional MPEG-7 model can only store audio-related information, the proposed Geo-MPEG-7 model comprehensively supplements it with geographical scene information. The model emphasizes information aggregation over the whole soundscape while maintaining the original retrieval and filtering functions. Such an enhanced data model is essential for achieving a comprehensive description and effective management of soundscapes.
This study has some limitations. It focuses on the reproduction and exploration of objective scenes and therefore includes few records of subjective survey results. In addition, geographical environmental information is currently difficult to obtain through automated means, and further technological breakthroughs and innovations are needed.
Soundscape studies should be a melting pot at the intersection of numerous subject areas; they therefore require the integration and interpenetration of multidisciplinary knowledge. Consequently, it will surely be a difficult process for soundscape studies to be embraced and adopted by scholars, readers, and the general public while also better serving humankind. As long as the concepts, ideas, and perceptions of soundscapes can be used and applied appropriately to advance research in various disciplinary fields, soundscape research will become more comprehensive.
  6. Conclusions
This study reorganizes and extends the concept and framework of soundscapes from the perspective of geography by incorporating the geographical environment, receiver characteristics, and further acquired information on geographical scene elements and their description dimensions into soundscape research, extending it to geo-soundscape research. The MPEG-7 data model is improved and extended to form the Geo-MPEG-7 data model. The main findings of this study are as follows.
(1) This study draws on geographical scene cognition and expands the information about the geographical environment, receiver information, scene elements, and their descriptive dimensions in the soundscape to form a complete geo-soundscape conceptual framework. The expansion of this information facilitates a more comprehensive description and expression of soundscape content, which can break through the limitation of considering only sound source information and provide a systematic conceptual framework. Through relevant cases, this study shows that the framework can ensure the integrity of soundscape information and modeling possibilities.
(2) A scene-oriented geo-soundscape data model (Geo-MPEG-7) is proposed, which does not change the original audio storage structure and format but extends the description of the audio data. The Geo-MPEG-7 model comprehensively enriches geographical scene information while maintaining the original retrieval and filtering functions. The case study highlights that the model, which expands the related element classes, can store, organize, and describe real soundscape information more completely and is more flexible in querying and aggregating the soundscape information. It provides a feasible and scientifically grounded basis for comprehensively constructing soundscapes in a real geographical environment. It also highlights the significance and unique value of this research in expanding geographical information in the field of soundscapes.