Diversity Monitoring of Coexisting Birds in Urban Forests by Integrating Spectrograms and Object-Based Image Analysis

Abstract: In the context of rapid urbanization, urban foresters are actively seeking management monitoring programs that address the challenges of urban biodiversity loss. Passive acoustic monitoring (PAM) has attracted attention because it allows for the collection of data passively, objectively, and continuously across large areas and for extended periods. However, it continues to be a difficult subject due to the massive amount of information that audio recordings contain. Most existing automated analysis methods have limitations in their application in urban areas, with unclear ecological relevance and efficacy. To better support urban forest biodiversity monitoring, we present a novel methodology for automatically extracting bird vocalizations from spectrograms of field audio recordings, integrating object-based classification. We applied this approach to acoustic data from an urban forest in Beijing and achieved an accuracy of 93.55% (±4.78%) in vocalization recognition while requiring less than 1/8 of the time needed for traditional inspection. The difference in efficiency would become more significant as the data size increases because object-based classification allows for batch processing of spectrograms. Using the extracted vocalizations, a series of acoustic and morphological features of bird-vocalization syllables (syllable feature metrics, SFMs) could be calculated to better quantify acoustic events and describe the soundscape. A significant correlation between the SFMs and biodiversity indices was found, with 57% of the variance in species richness, 41% in Shannon's diversity index and 38% in Simpson's diversity index being explained by SFMs. Therefore, our proposed method provides an effective complementary tool to existing automated methods for long-term urban forest biodiversity monitoring and conservation.


Introduction
Biodiversity loss has been a major and challenging problem globally, and is a potential risk factor for pandemics [1]. The ongoing global COVID-19 pandemic has confirmed this concern. The loss of biodiversity has generated conditions that not only favored the appearance of the virus but also enabled the COVID-19 pandemic to surface [2,3]. Biodiversity conservation is, therefore, an urgent global task.
Although automated analysis techniques are rapidly improving, software tools still lag far behind actual applications [22,39–42]. We therefore suggest that advancing the theory and practice of soundscape ecology requires going beyond the temporal/frequency structure of sound alone and developing more tools that retain as much ecologically relevant information as possible from recordings, testing these methods in complex urban environments to clarify their robustness.
Remote sensing technology has been broadly used in many applications, such as extracting land cover/land use information. Object-based image analysis (OBIA) has emerged as an effective tool to overcome the problems of traditional pixel-based techniques for image data [43,44]. It classifies segments rather than pixels, and it incorporates meaningful spectral and non-spectral features for class separation, thereby providing a clear illustration of landscape patterns [43–46]. Owing to its superiority and efficiency [47], OBIA has been utilized in many different areas, such as computer vision [48,49], biomedical imaging [50,51], and environmental scanning electron microscopy (SEM) analysis [52–54]. Just as remote sensing images are numeric representations of the Earth's surface landscape consisting of water areas, forest land, wetlands, etc. [55], spectrograms are visual expressions of collections of various sound components (biophony, geophony and anthrophony). As such, could OBIA provide a novel perspective for extracting bird vocalizations when introducing advanced remote sensing tools into the soundscape field? Could we further digitally summarize vocalization patches and use them as ecologically relevant indicators of acoustic community patterns?
Based on the above hypotheses, an automated bird vocalization extraction method based on OBIA is presented here. We hypothesize that OBIA may allow for the extraction of bird vocalizations from recordings with complex background noise and the representation of long-term acoustic data as numbers describing biophony. From the perspective of community-level soundscape ecology, we are not necessarily concerned with species identification, but with achieving a numerical description of the qualitative patterns of species vocalizations [24]. Unlike other existing methods (whether manual or automatic), OBIA enables rapid identification of the number of bird vocalizations while providing multidimensional spectral, morphological, and acoustic traits. Examples of spectral variables include the mean value and standard deviation of a specific spectral band; morphological traits include size, perimeter, and compactness; acoustic traits include song length and frequency information.
In the present paper, we take a first look at how OBIA might offer a new perspective on current automated acoustic analysis methods and complement existing acoustic indices used for urban forest biodiversity assessments.

Study Area and Data Sets
For the present study, we selected an old urban forest in Zhongshan Park, Beijing (115°24′–117°30′ E, 39°38′–41°05′ N) as our case study area (Figure 1A). As the capital of China and the second-largest city in the world, Beijing is also a major node in the East Asian–Australasian bird flyway [56]. Beijing is a key corridor for birds' spring and autumn migrations, as it lies in the ecosystem transition zone from Northeast China to North China. Urban forests in Beijing play an important role in supporting roosting, breeding and other activities of birds, and as a result, they are rich in soundscapes.
Our data were derived from audio recordings continuously obtained during four consecutive sunny, windless days, from 18 to 21 May 2019, provided by three recorders positioned in Zhongshan Park (Figure 1B). Recorders were placed and fixed horizontally at a height of 2 m on healthy growing trees (Figure 1C,D). Auto-recording led to a total of 17,280 min of raw recordings, which were subsequently processed using the AudioSegment function in PYTHON v.3.7.2 and sampled into 15-s clips every 15 min, resulting in a sub-dataset of 1152 15-s clips. A sampling protocol of 15 s was used as it provided a tradeoff between ensuring an effective acoustic survey and a manageable amount of data for processing when there is no standardized protocol [57].
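As an illustration of this sampling step, the sketch below slices a continuous recording into 15-s clips every 15 min with pydub (the library providing the AudioSegment class mentioned above); the file names are hypothetical, as the authors' exact script is not given.

```python
# Minimal sketch of the 15-s / 15-min sampling protocol with pydub.
from pydub import AudioSegment

audio = AudioSegment.from_wav("recorder1_day1.wav")  # hypothetical continuous recording
step_ms, clip_ms = 15 * 60 * 1000, 15 * 1000         # one 15-s clip every 15 min
for i, start in enumerate(range(0, len(audio), step_ms)):
    clip = audio[start:start + clip_ms]              # pydub slices by milliseconds
    clip.export(f"clip_{i:04d}.wav", format="wav")   # hypothetical output naming
```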
Recordings were obtained using Zoom H5 acoustic sensors (Zoom Inc., Tokyo, Japan, System 2.40) with XYH-5 X/Y microphones. The Zoom H5 is a commercial digital recording device that has good sound reliability and has been successfully used in other soundscape ecology research [27,58]. Sensor parameters were set as follows: a sampling rate of 44,100 Hz, a bit depth of 16 bits, and two recording channels (stereo). Files were saved as non-compressed WAVE files.
Two trained technicians (Yu and Hanchen Huang) manually identified the acoustic events (AEs) in all audio clips for further use in evaluating the reliability and sensitivity of our approach. Over 95% of biological acoustic events (BEs) were from birds; thus, only bird sounds were identified to the species level (for the list of bird species, see Table A1), while BEs that were not bird sounds were identified to the family level. For example, cricket sounds were labelled Gryllidae. Because we were unable to distinguish individual animals based on their sounds, technicians identified and counted the total number of BEs for a given species (or family) in each clip. Anthropogenic and geophysical sounds were also counted and classified as AEs. The results of this process were finally confirmed by Hanchen Huang. Richness (S), Shannon's diversity (H') and Simpson's diversity (λ) indices were calculated for each spectrogram (i.e., per 15-s clip) to reflect the diversity of bird species as well as AE and BE types [12].

Methods
We developed an automated bird vocalization extraction approach based on OBIA, which followed a typical analysis workflow of bird vocalizations with three main steps [59]: preprocessing, automated extraction, and feature calculation. In line with this workflow, the processing details of each step of our approach are described in the following subsections. An overview of the approach is depicted in Figure 2.

Audio Recordings Denoising
The step prior to spectrogram analysis was denoising, and it was carried out to obtain clear vocalization patterns to improve extraction accuracy and minimize false positives [60]. We selected only noise reduction techniques that could perform batch and fast processing, to allow the proposed approach to be more generalizable and to reduce the effects of human operations.
Audio signals are characterized by the presence of higher energy in the low-frequency regions dominated by environmental noise. Therefore, we applied a high-pass filter with a cut-off frequency (800 Hz) below the lowest frequency at which bird songs are expected, with a 12 dB roll-off per octave [61–64]. This allowed for the elimination of current and environmental noise, mostly created by mechanical devices such as engines [65,66].
Since medium and high frequencies are the most useful part for discriminating between different acoustic events [67], we applied a widely used signal enhancement method, the Wiener filter with postfilter [68–70], to our recordings to minimize baseline and white noise and improve the quality of bird sound recordings. In field recordings, bird songs are transient, but a considerable amount of background noise is nearly stationary. The Wiener filter eliminates this quasi-stationary noise, provided that it approximates a Gaussian distribution [22].
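The paper does not specify its filtering implementation; the following is a minimal Python sketch of the two denoising steps under stated assumptions: a second-order Butterworth high-pass at 800 Hz (a second-order filter gives roughly the 12 dB/octave roll-off described) followed by SciPy's Wiener filter, with a hypothetical input file and an illustrative Wiener window size.

```python
# Sketch of the denoising chain: high-pass at 800 Hz, then Wiener filtering.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt, wiener

rate, audio = wavfile.read("clip_0001.wav")          # hypothetical 15-s clip
if audio.ndim == 2:                                  # stereo -> mono for simplicity
    audio = audio.mean(axis=1)

# 2nd-order Butterworth high-pass at 800 Hz (~12 dB/octave roll-off)
sos = butter(2, 800, btype="highpass", fs=rate, output="sos")
filtered = sosfiltfilt(sos, audio.astype(float))

# Wiener filtering to suppress quasi-stationary background noise
denoised = wiener(filtered, mysize=101)              # odd window size, illustrative
```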
To verify the denoising effect, we randomly chose 80 clips from the full dataset (including anthropogenic AEs) and listened to them before and after the denoising process to assess its efficacy.

Short-Time Fourier Transformation (STFT) Spectrogram
Bird vocalizations of songs or calls are complex, non-stationary signals with a great degree of variation in intensity, pitch, and syllable patterns [71]. One of the proven methods for joint time-frequency domain analysis of non-stationary sound signals is the STFT [59]. The STFT spectrogram is a two-dimensional convolution of the signal and window function [72]: the X-axis represents time, the Y-axis represents frequency, and the amplitude of a particular frequency at a particular time is represented by its color in the image [73].
Our STFT spectrogram was calculated with a Hamming window of 1024 samples, no zero padding, and a 75% overlap between successive windows. A peak amplitude of −25 dB was set to standardize the spectrograms [42].
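For concreteness, a minimal SciPy sketch of these spectrogram settings is given below on a synthetic 15-s test signal; the reading of the −25 dB peak standardization (shifting the spectrogram so its maximum sits at −25 dB) is our assumption, not stated in the paper.

```python
# Sketch of the STFT settings: Hamming window, 1024 samples, 75% overlap, no zero padding.
import numpy as np
from scipy.signal import stft

rate = 44100
t = np.arange(15 * rate) / rate                      # synthetic 15-s test signal
signal = np.sin(2 * np.pi * (2000 + 200 * np.sin(2 * np.pi * 3 * t)) * t)

f, frames, Z = stft(signal, fs=rate, window="hamming",
                    nperseg=1024, noverlap=768)      # 768 = 75% of 1024; nfft defaults to nperseg
spec_db = 20 * np.log10(np.abs(Z) + 1e-12)           # amplitude in dB
spec_db = spec_db - spec_db.max() - 25               # assumed -25 dB peak standardization
```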

Bird Vocalization Extraction
Generally, bird vocalizations are classified into songs (longer-term) and calls (shorter-term). In the present study, the separation between songs and calls was not considered because both types use syllables as fundamental units [74]. All vocalizations were segmented and classified at the syllable level. To facilitate understanding, a BE refers to either a song (or call) in recordings or a syllable in spectrograms [10].
Bird syllable extraction is the key part of the proposed OBIA. All processing steps were conducted in one framework. Firstly, we adopted a multi-resolution segmentation (MRS) algorithm to segment spectrograms into image objects. Then, we manually selected samples representing a pure spectrum of bird syllables to generate potential extraction features using a Classification and Regression Tree (CART). Finally, to extract bird syllables from background noise, we applied the rulesets established from the identified potential features to all spectrograms.

Segmentation
Segmentation creates new meaningful image objects according to their spectral properties (Figure 3), including subdividing and merging operations [75]. We applied the MRS algorithm embedded in eCognition Developer™ to generate image objects. This algorithm consecutively merges pixels or existing image objects into larger objects based on relative homogeneity within the merged object [53,54,76]. The process uses three key parameters: scale, shape, and compactness [77]. A coarser scale value allows larger, more heterogeneous objects to form, involving more spectral values. Shape defines the relative influence of color (spectral value) and shape on the formation of the segments, while compactness defines whether the boundaries of the segments should be smoother or more compact [78]. In our segmentation, we employed a "trial and error" method of visual inspection to determine the optimal segmentation parameters [53], as bird syllables and their shapes on spectrograms are easily recognized by the human eye. After several attempts, we finally selected a scale of 40, a shape of 0.2, and a compactness of 0.5 to produce segmented image objects.
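eCognition's MRS algorithm is proprietary; purely as an open-source analogue (not the authors' tool), graph-based segmentation from scikit-image can split a spectrogram image into candidate objects, as sketched below with illustrative parameters and a placeholder image.

```python
# Illustrative stand-in for multi-resolution segmentation: Felzenszwalb's
# graph-based segmentation groups homogeneous regions of a spectrogram image.
import numpy as np
from skimage.segmentation import felzenszwalb

rng = np.random.default_rng(0)
img = rng.random((513, 1290)).astype(np.float32)     # placeholder spectrogram (freq x time), scaled to [0, 1]

segments = felzenszwalb(img, scale=40, sigma=0.5, min_size=20)  # parameters illustrative only
print(f"{segments.max() + 1} image objects produced")
```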

Classification
Techniques from machine learning and computational intelligence have been used in bird vocalization analysis [79,80]. In the present study, CART [81] was applied to construct accurate and reliable predictive models for syllable extraction. CART does not require any special data preparation, only a good representation of the problem. Creating a CART model involves selecting input variables and split points on those variables until a suitable tree is trained. To operate CART, we randomly selected 288 syllable samples, one for each of the 288 audio clips, and divided them into training samples (70%, n = 202) and testing samples (30%, n = 86). Meanwhile, the ten most explanatory object features, including spectral, shape and textural characteristics (Table 1), were identified by calculating the feature variations between syllable samples and other objects, and were imported into CART as input variables. The representation of the CART model is a binary tree derived from recursive binary splitting (or greedy splitting). The binary tree allows clear classification rules to be read directly from the model diagram. In Table 1, Rel. imp stands for relative importance; GLCM (grey-level co-occurrence matrix) is a tabulation of how often different combinations of pixel grey levels occur in a scene.
The complexity of a decision tree is defined by the number of splits in each tree. Simpler trees are preferred, as they are easier to understand and less likely to overfit the data. Trees can be pruned to further improve performance. The fastest and simplest pruning method is to work through each leaf node in the tree and evaluate the effect of removing it using a hold-out test set. Leaf nodes are removed only if this results in a decrease in the overall cost function for the entire test set. Node removal is stopped when no further improvements can be made.
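scikit-learn's decision tree can stand in for a CART implementation; the sketch below uses cost-complexity pruning evaluated on a hold-out set, which is analogous to (though not identical with) the leaf-removal procedure described above, and the 70/30 split mirrors the sampling design. The data are placeholders, not the authors' object features.

```python
# Sketch of CART training with hold-out-based pruning (scikit-learn stand-in).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((288, 10))                            # placeholder for the ten object features (Table 1)
y = (X[:, 0] > 0.5).astype(int)                      # placeholder syllable / non-syllable labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

# Grow a full tree, then walk the cost-complexity pruning path, keeping the
# pruned tree that scores best on the hold-out test set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
         for a in path.ccp_alphas]
best = max(trees, key=lambda tree: tree.score(X_te, y_te))
print(f"depth: {best.get_depth()}, hold-out accuracy: {best.score(X_te, y_te):.3f}")
```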
We introduced two indicators, relative cost (RC) and rate of change (ROC), to evaluate the performance of the CART model. The value of RC ranges from 0 to 1, with 0 indicating a perfect model with no error and 1 indicating random guessing. Similarly, the value of ROC ranges from 0 to 1, with higher values suggesting better performance. The resulting optimal decision tree consisted of three features: brightness, intensity, and grey-level co-occurrence matrix (GLCM) homogeneity (Figure 4). It had an RC of 0.124 and a ROC of 0.979, indicating the reliability of the prediction model. The final step was to apply the ruleset generated from CART to classify all image objects.
Figure 4. The optimal decision tree resulting from CART. Instances with a split value greater than the threshold (in parentheses) were moved to the right.

Feature Representation of Extracted Syllables
The time-frequency pattern of syllables displayed via spectrograms is a useful representation of species information, which can also characterize bird vocalization patterns [82,83]. Based on the remote sensing framework, where syllables on spectrograms are comparable to landscape patches on maps, our approach could easily calculate and analyze morphological descriptors. Each extracted syllable was characterized by a sequence of feature vectors.
Considering their good performance with respect to feature analysis of syllables in previous studies, instantaneous frequency (IF) [59] and amplitude [84] were adopted in this work. Each parameter was further statistically analyzed to obtain more detailed descriptive values such as the maximum (IF_MAX), minimum (IF_MIN), mean, and central frequency. The mean instantaneous frequency (IF_MN) is the first moment of the spectrogram relative to frequency, and it can be calculated using Equation (1):

$$f_i(t) = \frac{\int f \, S(t, f) \, \mathrm{d}f}{\int S(t, f) \, \mathrm{d}f} \qquad (1)$$

where $f_i(t)$ is the mean instantaneous frequency at time t, and $S(t, f)$ is the spectrogram at frequency f and time t.
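A self-contained sketch of Equation (1) follows, computing the first moment over frequency per time frame on a synthetic tone; any spectrogram arrays of the same shapes would do.

```python
# Sketch of Equation (1): mean instantaneous frequency per STFT frame.
import numpy as np
from scipy.signal import stft

rate = 44100
t = np.arange(15 * rate) / rate
f, frames, Z = stft(np.sin(2 * np.pi * 3000 * t), fs=rate,
                    window="hamming", nperseg=1024, noverlap=768)

S = np.abs(Z)                                        # magnitude spectrogram S(t, f)
if_mn = (f[:, None] * S).sum(axis=0) / (S.sum(axis=0) + 1e-12)  # Eq. (1), one value per frame
IF_MAX, IF_MIN, IF_MN = if_mn.max(), if_mn.min(), if_mn.mean()  # clip-level descriptors
```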
In addition to the time-frequency characteristics, we also summarized the landscape metrics that were of practical meaning for syllable patches. Datasets were rasterized in R (https://www.r-project.org (accessed on 28 May 2021)) and imported into FRAGSTATS [85]. For each extracted syllable (patch level), we calculated area, shape index, border index, length, width (duration), etc., to obtain a fundamental spatial and morphological understanding of each patch. Landscape-level metrics were also summarized and nondimensionalized for each spectrogram, reflecting time-frequency patterns of acoustic activities, including: (1) NP, the number of syllable patches; (2) CA, the sum of the areas of all patches; (3) SHAPE_MN, the mean value of the shape index of each patch; (4) TL, total bandwidth occupancy of all patches, calculated from a transformation of patch length; and (5) TW, total duration of all patches, calculated from a transformation of patch width. To facilitate reading, we abbreviate all these syllable feature metrics as SFMs (Table 2).
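FRAGSTATS is a standalone package; as an illustrative Python analogue (not the authors' pipeline), per-patch quantities of this kind can be read off a classified syllable mask with scikit-image, as sketched below with a toy mask.

```python
# Illustrative patch metrics from a classified syllable mask (FRAGSTATS analogue).
import numpy as np
from skimage.measure import label, regionprops

syllable_mask = np.zeros((513, 1290), dtype=bool)    # hypothetical freq x time mask
syllable_mask[180:220, 100:160] = True               # one toy "syllable patch"

patches = regionprops(label(syllable_mask))
NP = len(patches)                                    # number of syllable patches
CA = sum(p.area for p in patches)                    # total patch area
durations = [p.bbox[3] - p.bbox[1] for p in patches]   # patch width ~ duration in frames
bandwidths = [p.bbox[2] - p.bbox[0] for p in patches]  # patch length ~ bandwidth in bins
```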

TW (Total Width)
The sum of the widths of all patches belonging to a given spectrogram; the transformation ÷200 converts this sum to the total duration of the acoustic events (s).

Statistical Analysis
All statistical analyses were performed in R version 4.0.5 (R Core Team, Vienna, Austria, 2021).
Because the probability distribution of the raw data failed the Kolmogorov–Smirnov test for normality, transformations were performed using the bestNormalize package [86], which attempts a range of transformations and selects the best one based on a goodness-of-fit statistic, ensuring that transformations are chosen consistently. It also removes the effects of order-of-magnitude differences among variables.
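bestNormalize is an R package; as a rough Python analogue of the same idea (an assumed equivalent, not the authors' code), one can test normality with the Kolmogorov–Smirnov statistic and apply a Yeo–Johnson transform, which also standardizes the scale of each variable.

```python
# Sketch: normality check, then a normalizing transform (Python analogue of bestNormalize).
import numpy as np
from scipy.stats import kstest, zscore
from sklearn.preprocessing import PowerTransformer

x = np.random.default_rng(0).lognormal(size=200)     # placeholder for one raw SFM variable
print(kstest(zscore(x), "norm").pvalue)              # raw data: typically fails normality

x_t = PowerTransformer(method="yeo-johnson").fit_transform(x.reshape(-1, 1))
print(kstest(x_t.ravel(), "norm").pvalue)            # transformed data: much closer to normal
```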

Accuracy Assessment
Manual inspection of BEs from audio clips was used as a reference for accuracy assessment. To minimize errors, we marked each syllable patch with a serial number during counting to avoid missing or double-counting. We also recorded the time spent by the technician on each spectrogram. We used relative error (RE) as a measurement of accuracy, which is the ratio of the absolute error to the reference value [87]. Specifically, accuracy was calculated by dividing the number of syllables correctly identified through the automated approach by the total number of BEs identified through manual inspection as a reference, and RE is its complement.
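A minimal sketch of this bookkeeping with hypothetical counts (the function and numbers are illustrative):

```python
# Sketch of relative error versus accuracy for one spectrogram.
def relative_error(n_correct: int, n_reference: int) -> float:
    """RE = |automated - reference| / reference; equals 1 - accuracy here."""
    return abs(n_correct - n_reference) / n_reference

re = relative_error(n_correct=87, n_reference=93)    # hypothetical counts for one clip
accuracy = 1 - re
```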

Correlation Analysis
Correlation analysis (Spearman's rho, p < 0.01) was performed between the number of extracted syllable patches and the numbers of biological and all acoustic events to further verify the reliability of the automated bird vocalization extraction approach. A second Spearman's rho correlation test was then performed for the relationships between SFMs and bird species richness (S). Non-parametric correlation analysis was selected because not all data were normally distributed, despite being transformed.
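A minimal sketch of such a test with SciPy, using hypothetical per-clip counts:

```python
# Sketch of the Spearman's rho test between extracted patches and reference BEs.
from scipy.stats import spearmanr

np_patches = [12, 0, 34, 7, 21, 5]                   # hypothetical NP per clip
be_counts = [11, 1, 30, 8, 19, 6]                    # hypothetical manual BE counts per clip
rho, p = spearmanr(np_patches, be_counts)
print(f"Spearman's rho = {rho:.2f}, p = {p:.3f}")
```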

Modelling
To test the efficacy of SFMs as a biodiversity proxy, we used a random forest (RF) machine learning procedure (randomForest package) [88,89] to predict bird species biodiversity from SFMs calculated from matching recordings.
RF is a meta-estimator and one of the most accurate learning algorithms available. The RF algorithm aggregates many decision trees and combines the results of multiple predictions, while ensuring that the ensemble model makes fair use of all potentially predictive variables and prevents overfitting [90]. In addition, RF accommodates multivariate collinearity among predictors, which is convenient for modelling the nonlinear effects of variables [27]. We used a bootstrapping cross-validation method to select the model structure with the lowest median mean squared error (MSE) and highest R2 between tested data and predicted values [12]. MSE was also used to measure the importance of each variable: a higher percentage increase in MSE indicated a greater contribution of that variable to the model's predictive power [91].
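The authors used R's randomForest package; the sketch below reproduces the same ideas with scikit-learn (an assumed analogue with placeholder data): tuning the number of variables tried per split, judging fits by out-of-bag (OOB) and test-set scores, and ranking predictors with permutation importance in place of %IncMSE.

```python
# Sketch of RF regression on SFMs with mtry tuning and variable importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1152, 14))                           # placeholder matrix: 14 SFMs per clip
y = X[:, 0] * 5 + rng.normal(0, 0.5, 1152)           # placeholder richness response

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Try one to six variables per split, keeping the forest with the best OOB score.
forests = [RandomForestRegressor(n_estimators=500, max_features=m, oob_score=True,
                                 random_state=0).fit(X_tr, y_tr)
           for m in range(1, 7)]
best = max(forests, key=lambda rf: rf.oob_score_)
print(f"OOB R^2: {best.oob_score_:.2f}, test R^2: {best.score(X_te, y_te):.2f}")

# Permutation importance stands in for randomForest's %IncMSE ranking.
imp = permutation_importance(best, X_te, y_te, n_repeats=20, random_state=0)
```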

Approach Reliability
With pre-processing denoising, we removed more than 85% (n = 77) of the anthropogenic acoustic events (for a denoising example, see Figure 5A,B). The number of syllables identified by the automatic extraction approach varied from 0 to 183 per spectrogram. The REs of the approach ranged from 0.00% to 11.52% across all spectrograms, with an average of 6.45% (±4.78%), suggesting that the automated extraction process yielded high accuracy (93.55%) (for an example of identified syllables, see Figure 5).

A high correlation between the NP values and the numbers of bio-acoustic events and of all acoustic events (r = 0.71, p < 0.01; r = 0.89, p < 0.01) was observed in the Spearman's rho correlation matrix (Table 3). The CA (area), TL (duration) and TW (frequency) values of extracted syllables were also significantly correlated with bio-acoustic events and all acoustic events, albeit less strongly (Table 3). In addition, the new automated approach was much more efficient than manual inspection. For manual inspection, on average, it took approximately 27 s of effort to analyze 15 s of acoustic data, roughly eight times longer than the automated approach (3.5 s, the time taken for the whole process shown in Figure 2, averaged per 15-s spectrogram). This is because analysts tend to replay recordings to confirm results [92], and the time spent on loading and annotating vocalizations must also be accounted for. Further, the difference in efficiency is expected to become more significant with increasing amounts of data, because spectrograms can be batch processed using our approach.

SFMs' Correlation with Biodiversity
High correlation coefficients (r > 0.5) between SFMs and bird species richness were found, except for the shape index and MIF (avg, min, and max) (Figure 6). Simpson's and Shannon's diversity indices were also correlated with SFMs, albeit less strongly. NP always had the highest correlation with the three diversity indices.

Prediction of Biodiversity
Random forest regression models confirmed that combinations of SFMs are good predictors of biodiversity (Table 4). Bird species richness was predicted well (R2 = 0.57), but the acoustic diversity of bird communities was less reliably predicted (Simpson's and Shannon's diversity indices had R2 of 0.38 and 0.41, respectively). These results suggest that SFMs have great potential for tracking acoustic communities, even in the presence of considerable anthrophony (human-induced noise) in an urban environment [29,93]. We used all 14 SFMs as predictors (including soundscape and patch levels) in each RF regression model and investigated the relative contributions of each SFM. These metrics described soundscapes (syllable patches) from different perspectives and levels. The results demonstrated that SFMs effectively explained different pieces of information in acoustic recordings, likely because their unique mathematical properties reflect different dimensions of a soundscape. The number of syllable patches (NP) was the strongest single predictor in the best model found for richness (Figure 7a). In the best models predicting Simpson's and Shannon's diversity, the SFM with the highest importance was the total length of patches (20% and 23% of variance explained, respectively) (Figure 7b,c). It is noteworthy that NP, TL, and PI (patch intensity) were always the top three predictors in the three models (the sum of their contributions was 68%, 57%, and 65% in models a, b, and c, respectively), suggesting that NP, TL and PI can be considered the most promising SFMs. All other indices exceeded the analytic threshold [63,94], suggesting that they contributed little to predictive power.

Note for Table 4: the number of variables tried at each split was tuned from one to six; for each value, a new random forest was generated and evaluated by both the error rate on the test set and the out-of-bag (OOB) error. See Appendix A for descriptions of variables that are not described in the text.

Discussion
When analyzing biophony in urban environments, anthrophony could lead to potential false positives [95]. Low recognition accuracy is often attributed to noise [96], which affects the whole process unless removed initially. Hence, at present, most methods for analyzing biophony in urban environments are trained and tested on relatively low numbers of high-quality recordings that have been carefully selected [22]. This may lead to better results but will limit the generalization of their approach to real field recordings, especially in urban areas with complex acoustic environments. Therefore, in our study, we used only noise reduction, which meant that batch and fast processing could be performed.
According to the Spearman's rho correlations, NP was always more strongly correlated with AEs than with BEs, suggesting that SFMs were still somewhat influenced by human-generated noise, even after the denoising process. SFMs were hardly affected by constant-intensity noise (e.g., noise from aircraft or automobile traffic; for an example, see Figure 5D: persistent noise was not extracted by the algorithm) [29,33]. In addition to pre-processing filtering, which eliminated most of the noise, CART models minimized the confounding effects of noise and syllables through training samples. The inherent properties of constant-intensity noise are different from those of bird vocalizations and are easily recognized by the model. For example, the brightness (Figure 4) of bird vocalizations was generally greater than 80, while the values for noise were 30–60. Nevertheless, some intermittent human noises, such as car horns and ringtones, might be extracted together with bird vocalizations using our approach. However, we believe that testing the approach in other habitats, such as natural forests or biodiversity conservation areas, will yield more encouraging results.
By treating spectrograms as images, previous studies have applied image processing techniques to extract bird vocalizations [10,59,60,83,96–98], such as the widely used median clipping [41,99,100] and frame- or acoustic-event-based morphological filtering [66]. There are plenty of toolboxes available to extract acoustic traits [22], such as central frequency, highest frequency, lowest frequency, initial frequency, loudest frequency and so on [10,59], which are basically time-frequency characteristics only. To the best of our knowledge, all these studies aimed to identify or classify one or several bird species specifically. However, when focusing on the entire ecosystem, the species-level approach misses the forest for the trees [101]. Unlike studies aiming at the recognition of one or more species, ours attempted a global estimate of the acoustic output of the community. Our results indicated that SFMs are a promising complement to the existing indices working as biodiversity proxies when rapid assessments are required, because SFMs were significantly correlated with diversity indices. SFMs allow for the effective interpretation of different pieces of bio-information in recordings, probably because their unique mathematical properties reflect different components of the soundscape, preserving more of the potentially ecologically relevant information.
Using the proposed approach, we could collect a series of data including acoustic traits on the time-frequency scale (Figure 8), such as the duration and mean, maximum, and minimum frequency of acoustic events, and morphological characteristics of acoustic events (Figure 9), such as the size and shape characteristics of each syllable patch. Such features (i.e., SFMs) may contribute to a more nuanced understanding of the acoustic environment of the study area from multiple perspectives. In Figure 8, the frequency patterns of syllable patches are shown across frequency intervals. The soundscape was dominated by mid-frequency sounds (3–5 kHz), which accounted for over 50% of syllable patches. Mid-frequency sounds are generally attributed to biophony, especially bird species ranging from larger birds such as the Eurasian magpie (Pica pica) to smaller species such as the Oriental reed warbler (Acrocephalus orientalis). At the other end of the frequency spectrum, the number of patches within individual high-frequency intervals was quite low (biophony patches in the highest four frequency intervals accounted for 1.5%) and fell mainly within 20–22 kHz; these may be attributed to some night-flying moths of the family Noctuidae or to mating calls from grasshoppers. In Figure 9, the average shape index (SHAPE) over 24 h rose rapidly at the dawn chorus to reach the peak of the day, fell rapidly and remained steady until dusk, when it rose rapidly again, and then fluctuated and fell until the morning. The daily pattern of SHAPE is consistent with previous studies of other acoustic indices [27,31,102,103], thus reflecting a daily activity pattern and highlighting distinct dawn and dusk bird choruses. However, compared to other indices that focus only on sound intensity, SHAPE provides a new perspective on patterns of complexity of bird songs: songs of the dawn and dusk choruses tended to be more complex and elaborate than daytime songs. This is mainly related to defending territory and/or attracting a mate [104].
Borrowing a framework from landscape ecology, many types of SFMs can effectively be used to interpret different aspects of acoustic information and different components of the soundscape, which is presumably attributable to the mathematical properties of SFMs and to the introduction of a spatial concept. The SFMs calculated in this study, such as the area, compactness, roundness, and shape index of the patches, could be easily generated in eCognition Developer. In addition, many existing open-source platforms (e.g., the landscapemetrics package in R) and software (e.g., QGIS) integrate entire workflows that provide similar functions. Measurement of biophony across multiple dimensions has been considered useful for detecting variations in the behavior and composition of acoustic communities and, as a result, for better monitoring their dynamics and interactions with habitats [29]. These results support the possibility that PAM could offer a more comprehensive picture of biodiversity than traditional inspection [63].
Under the high pressure of a surplus of data, and facing the lack of technology, funding and standardized protocols [11], most passive monitoring now lasts one to three years at most [105], while ecosystem conservation and ecological change detection usually require at least ten years. In particular, there is usually a lag period when measuring the benefits of planted forests, as individual trees need to grow and stands need to mature to form a stable structure [4]. Short-term monitoring may lead to a reduction in the quality and reliability of data [106]. This emphasizes the significance of utilizing and applying PAM within the framework of a monitoring strategy, with defined objectives, effective indicators, and standardized protocols [20].
There is growing acknowledgement from governments and related sectors that urban greenery is not monitored adequately given its crucial roles in biodiversity provisioning and ecosystem support [33,107]. A rich and diverse biophony usually indicates a stable and healthy ecosystem [26]. With its government-led design, planning, and implementation, the in-depth greening project in Beijing has indeed enhanced green space in the plain area, if only in terms of the total increase in trees and in connected urban forest and park patches [108]. However, the large-scale transition between cropland and forest generated by the afforestation process has the potential to lead to the loss of original wildlife habitat. By long-term monitoring of biodiversity patterns and processes, we can better assess the positive and negative impacts of afforestation projects.
According to our preliminary results, the proposed approach (with high computational efficiency and accuracy) may benefit further research on the rapid assessment and prediction of biodiversity in urban forests, providing an indirect but immediate measurement of bird activity dynamics across enhanced spatio-temporal scales. This would facilitate the application of PAM and the formulation of a standardized sampling protocol. Furthermore, a robust automated approach could support PAM as part of citizen science research. This would benefit developing countries that lack financial budgets, experts, and capacity for massive data processing. Globally, only 5% of PAM studies are conducted in regions of Asia, western Oceania, northern Africa, and southern South America, where some countries still have no record of using PAM [20]. As our approach does not require a priori data, it facilitates the implementation of long-term ecosystem monitoring in developing countries where baseline data are not available.

Conclusions
In 2019, the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) warned that the unprecedented deterioration of natural resources was deracinating millions of species and reducing human well-being worldwide. However, global biodiversity loss has not attracted as much public attention as global climate change.
To evaluate the diversity of coexisting birds in urban forests and ultimately facilitate the assessment of afforestation benefits in urbanized Beijing, we developed an automated approach to extract and quantify bird vocalizations from spectrograms, integrating the well-established technology of object-based image analysis. The approach achieved high recognition accuracy (93.55%) for acoustic events at much higher efficiency (eight times faster) than traditional inspection methods. In addition, it provided multiple aspects of acoustic traits, such as quantity, song length, frequency bandwidth, and shape information, which can be used to predict bird biodiversity. In our case, 57% of the variance in bird species richness could be explained by the selected acoustic and morphological features. The proposed soundscape evaluation method sheds light on long-term biodiversity monitoring and conservation during the upcoming global biodiversity crisis.

Acknowledgments: We thank Hanchen Huang for bird species identification, Junyou Zhang for writing advice, and Shi Xu and Qi Bian for field investigation.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Area
The number of pixels forming a patch. If unit information is available, the number of pixels can be converted into a measurement. In scenes that provide no unit information, the area of a single pixel is 1 and the patch area is simply the number of pixels that form it. If the image data provide unit information, the area can be multiplied by the appropriate factor.

Border_ind (Border Index)
The border index describes how jagged a patch is; the more jagged, the higher its border index. This feature is similar to the shape index, but the border index uses a rectangular approximation instead of a square. The smallest rectangle enclosing the patch is created, and the border index is calculated as the ratio between the border lengths of the patch and of the smallest enclosing rectangle.