Soundscape Characterization Using Autoencoders and Unsupervised Learning

Passive acoustic monitoring (PAM) through acoustic recorder units (ARUs) shows promise in detecting early landscape changes linked to functional and structural patterns, including species richness, acoustic diversity, community interactions, and human-induced threats. However, current approaches primarily rely on supervised methods, which require prior knowledge of collected datasets. This reliance poses challenges due to the large volumes of ARU data. In this work, we propose a non-supervised framework using autoencoders to extract soundscape features. We applied this framework to a dataset from Colombian landscapes captured by 31 audiomoth recorders. Our method generates clusters based on autoencoder features and represents cluster information with prototype spectrograms using centroid features and the decoder part of the neural network. Our analysis provides valuable insights into the distribution and temporal patterns of various sound compositions within the study area. By utilizing autoencoders, we identify significant soundscape patterns characterized by recurring and intense sound types across multiple frequency ranges. This comprehensive understanding of the study area’s soundscape allows us to pinpoint crucial sound sources and gain deeper insights into its acoustic environment. Our results encourage further exploration of unsupervised algorithms in soundscape analysis as a promising alternative path for understanding and monitoring environmental changes.


Introduction
Acoustic landscape ecology, also known as soundscape ecology, is a field within ecoacoustics dedicated to describing and studying the sounds present in natural landscapes [1].The goal is to extract information about various types of sounds or sources originating from human activity (anthropophonies), physical phenomena (geophonies), and biotic sources (biophonies), the latter referring to sounds emitted by living organisms such as animal vocalizations [1][2][3].These acoustic signals serve as valuable indicators for assessing species diversity and abundance, habitat use, and population dynamics [4].Additionally, changes in the acoustic landscape, often influenced by human activities, can reflect the impacts of habitat loss and degradation on wildlife populations [5].In this context, the soundscape emerges as a crucial area for monitoring the ecological integrity of landscapes and detecting early signals of ecological change.Studying soundscapes can, thus, significantly contribute to biodiversity conservation efforts.
Soundscape studies employ acoustic recording units (ARUs) to capture sound over predefined periods [6], which can range from days to weeks or even months.ARUs are programmed to be activated at specified intervals, recording audio for a predetermined duration before entering a standby mode for a set period and then resuming recording.Recent advancements in recording technology, characterized by low energy consumption and increased storage capacity, have facilitated operation over extended sampling periods and recording times.Consequently, significant volumes of soundscape data are being generated, creating a demand for the development of automated tools for efficient data processing and analysis [7].
Machine learning is now a popular solution for developing innovative data processing frameworks due to its exceptional performance across various domains, including computer vision, semantic analysis, natural language processing, automatic speech, audio recognition, and machinery fault prevention and diagnosis [8,9].Within the realm of ecoacoustics and soundscape analysis, there is extensive usage of machine learning algorithms such as Random Forests [10], Support Vector Machines [11], and Neural Networks [1,12] to extract meaningful features from acoustic data.Recent advancements in ecoacoustics and soundscape research have leveraged deep features and Neural Networks to enhance sound type identification.For instance, Dufourq et al. [13] employed transfer learning to adapt existing Convolutional Neural Networks (CNNs) for bioacoustic classification.Furthermore, architectures such as ResNet, EfficientNet, MobileNet, and DesNet have been deployed to accurately identify acoustic scenes involving humans, birds, insects, and silence [14], and the MobileNetV2 architecture has been successfully employed for classifying biophonies, geophonies, anthropophonies, and silence [1].These advancements underscore the growing contribution of machine learning to advance our understanding of acoustic environments.
Supervised methodologies in acoustic analysis often yield high-accuracy results but face challenges due to their reliance on labeled data for model training and testing.These challenges include the need for expert analysis to identify patterns in audio [15,16], assumptions about data structure [10] such as the set of possible animal vocalizations [17], time-consuming sample labeling with practical issues like lacking specific timestamps and frequency bands, and handling overlapping acoustic events in time and frequency [15,16].Moreover, feature extraction methods are sensitive to noisy data [7].These limitations drive the exploration of unsupervised learning as a compelling alternative.Unsupervised methods can circumvent many difficulties inherent in supervised approaches.For instance, Keen et al. proposed a framework combining unsupervised Random Forest, K-means clustering, and t-SNE to evaluate acoustic diversity [10].Ulloa et al. introduced the Multiresolution Analysis of Acoustic Diversity (MAAD) method, which decomposes the acoustic community into elementary components called soundtypes based on their time and frequency attributes [18].These contributions pave promising paths for leveraging unsupervised learning to understand and explore soundscape composition beyond specific sound types, often limited to biophonies, thus capturing valuable information across acoustic environments.
In this paper, we introduce a novel methodology for characterizing soundscapes using autoencoders, providing a fresh perspective that goes beyond traditional species-specific approaches.Our approach utilizes autoencoders to effectively uncover large-scale patterns within sound recordings, thereby capturing the broader ecological context.By focusing on feature extraction with autoencoders, we aim to bridge the gap between identified patterns, metadata, and the ecological attributes of the landscape.This integration enables a comprehensive evaluation of the soundscape, considering both acoustic patterns and their ecological significance.Moreover, we introduce an unsupervised framework designed to explore novel landscape attributes through the lens of soundscape heterogeneity.This framework promotes a holistic understanding of the acoustic environment, facilitating the identification and interpretation of previously unexplored patterns and associations.In summary, the novelty of our work lies in the synthesis of cutting-edge autoencoder technology with soundscape ecology, offering a forward-thinking approach to characterize and interpret acoustic landscapes within a broader ecological context.

Related Work
The majority of deep learning methodologies for ecoacoustics applications reported in the literature rely on supervised approaches, whereas end-to-end unsupervised frameworks have not been sufficiently studied.However, some approximations have emerged, e.g., Rowe et al. [19] used autoencoders to characterize sound types and identify groups corresponding to species vocalizations.Nevertheless, although the methodology is unsupervised, it requires prior information about the species for validation.Dias et al. [20] also explored autoencoders for feature extraction and data visualization; additional, they used acoustic indices and spectral features to characterize sites from Costa Rica and Brazil.Best et al. [21] introduced a new method for encoding vocalizations, using an autoencoder network to obtain embeddings from eight datasets across six species, including birds and marine mammals, also employing clustering and dimension reduction techniques such as DBSCAN and UMAP.Akbal et al. [22] collected a new anuran sound dataset and proposed a hand-modeled sound classification system through an improved one-dimensional local binary pattern (1D-LBP) and Tunable Q Wavelet Transform (TQWT), obtaining a 99.35% accuracy in classifying 26 anuran species.Gibb et al. [15] discuss the limitations of supervision in soundscape analysis and propose variational autoencoders to embed latent features from acoustic survey data and evaluate habitat degradation.Rendon et al. [23] proposed the Uncertainty Fréchet Clustering Internal Validity Index, which was assessed using real-world and synthetic data, including a soundscape dataset identifying the transformation of ecosystems.On the other hand, Allaoi et al. [24] investigated the problem of treating embedding and clustering simultaneously to uncover data structure reliably by constraining manifold embedding through clustering and introduce the UEC method.Given the evolution and significant interest within the scientific community regarding this study area, we conducted a comprehensive review of machine learning applications in soundscape ecology and ecoacoustics.In this review, we compiled a list of methods encompassing both supervised and unsupervised learning, as well as deep and traditional approaches [25].Similarly, we summarize the latest highly related works in Table 1.This review of related work highlights the importance of further investigating unsupervised methodologies to facilitate the exploration of acoustic data and assist with the identification of representative patterns associated with landscape heterogeneity.Moreover, one observes a trend towards investigating and relating heterogeneity with landscape attributes indicators of ecosystem health.

Materials and Methods
The methodology introduced in this work is summarized in Figure 1.The initial step following data acquisition involves generating time-frequency spectrogram representations, which serve as input to train an autoencoder architecture for feature extraction.Two data processing strategies are executed: a supervised pathway and an unsupervised pathway.The supervised data processing pathway gives us a baseline to compare the performance of the autoencoder features against more standard representations adopted in soundscape ecology.We evaluated the performance of a Random Forest classifier on the features learned by the proposed autoencoder, taking, as baselines, features obtained with the VGGish architecture and a feature vector of acoustic indices.Acoustic indices have been proven effective to capture acoustic variations reflecting ecosystem attributes, while the VGGish [26] Deep Neural Network is commonly employed for video and audio analyses.In the unsupervised data processing pathway, features extracted by the autoencoder and their projections are clustered using K-means, and the resulting clusters are characterized based on their temporal patterns and spectral ranges.We utilized multiple metrics of cluster cohesion and separation to establish a quantitative baseline for assessing cluster quality.Further details on this process are provided below.

Study Site
Our experimental investigation utilized a dataset sourced from the Jaguas Colombian tropical forest.ARUs were placed near the Jaguas hydroelectric power plant in the northern region of Antioquia (6°26 ′ N, 75°05 ′ W; 6°21 ′ N, 74°59 ′ W) (refer to Figure 2 for the location of the recorders).Jaguas spans a protected area of 50 km², characterized by an elevation gradient from 850 to 1300 m above sea level.The reserve predominantly comprises secondary forests (70%), with the remaining areas consisting of a vegetation mosaic (23%), degraded surfaces (5%), and grassland (2%).Renowned for its rich communities of terrestrial vertebrates, the protected area plays a crucial role in biodiversity conservation at the regional scale [27,28].For data acquisition, 31 Song Meter SM4 devices (Wildlife Acoustics, Inc., Maynard, MA, USA) were deployed throughout the study site (technical details are summarized in Table 2).Data were sampled for one minute at 44.1 kHz every 15 min, resulting in four .wavfiles per hour and a total of 20,068 audio files overall.We sub-sampled the dataset to 22,050 because the audible soundtypes of interest occur mostly under 12 KHz.The recorders were configured with 16 dB of gain at a resolution of 16 bits and stereophonic sound.The complete dataset at its original sampling frequency had a size of 212.4 GB.The dataset was collected by the herpetological group of Antioquia (GHA) between 11 May and 26 June 2018.Table 2. Data acquisition and sampling parameters: Rec refers to the recorders, NSS to the number of sites studied, RD to the recording duration, SR to the sample rate, SS to the sub-sampling rate, RP to the periodicity of recording, MU to the memory used to store the recordings, AL to the cover type labels available across sites.

Pre-Processing
We implemented the method described in [29] for automatically detecting and removing recordings with heavy rainfall.These recordings were excluded due to their high signal-to-noise ratio and the presence of various sound types, such as geophonies (e.g., rivers) and biophonies (e.g., cicadas), which often mask a wide range of frequencies in ecological audio.Removing these instances was necessary to prevent overfitting or biases in the feature extraction and clustering tasks.As a result of this pre-processing step, 16,968 (12,313 forest and 46,151 non-forest) recordings remained for subsequent analysis.

Spectrograms Computation and Parameterization
The data were processed in batches using Python 3.8 and PyTorch 2.2 software.Following standard practice in ecoacoustics data analysis and the knowledge generated around audio featuring using spectral representations [17,[30][31][32][33], the audio was converted to the time-frequency domain spectrogram representation using the Short-Time Fourier Transform (STFT), as shown in Equation ( 1).Each one-minute recording was split into five 12 s segments.This division allowed us to obtain spectrograms with dimensions of 515 × 515 by applying a Hamming window of length 1028 and an overlap equal to half the window size.To maintain a squared output, the number of frequency bins was kept equal to the window length.
where n and k are the time and frequency indices, respectively, and l is the relative displacement of the current audio segment in terms of steps of L samples [34].The Hamming window is defined as in Equation (2): An audio segment duration of twelve seconds is unlike the standard practice in the field, which is typically less than five seconds [19].Our choice of 12 s segments was influenced by the Short-Time Fourier Transform (STFT) parameters, which yield 515 bins in the frequency axis.The effective management of these segments demanded the development of a customized data loader for loading and processing individual segments using the STFT.This process allowed us to generate spectrograms in batches of 14 audio files, yielding 70 sub-audios per batch.

Autoencoders
Autoencoders are Deep Neural Network architectures tailored to leverage the potential of deep learning for automatic and unsupervised feature extraction.The primary objective is to train a multi-layer network capable of learning a low-dimensional embedding from the input high-dimensional data while preserving relevant patterns [35].Within this architecture, there exists a middle layer where features are extracted via abstract representations derived from convolutional operations across each level of the network [36].This intermediate layer is commonly referred to as the latent space, and the process of generating this space is known as the encoding phase.Following the encoding phase, a decoding process is carried out using the learned representation within the latent space.The Neural Network is configured to execute the encoding process in an "inverse direction", meaning features are projected from a low-dimensional representation back to the input space, thereby reconstructing the original data.Theoretically, the outputs of the autoencoder should closely resemble the original data.Consequently, the latent space effectively represents the original input data using fewer dimensions, implying efficient information compression.
Mathematically, let x ∈ R D be an input vector; an autoencoder maps x to a latent space z ∈ R D ′ using a deterministic function f θ e (x) = s(Wx + b), according to the parameters θ = {W, b}, where W is a weight matrix and b is a bias.The new representation vector z can be returned to the original input space using a second function [37,38].This is represented in Equation (3).
Due to y resulting from transformations performed with the encoding and decoding functions f θ , g θ ′ , the parameters θ * e and θ * d can be expressed in terms of x as in Equation ( 4): where L denotes the Magnitude Square Error (MSE) loss function.This metric calculates the average Euclidean distance between corresponding pixels of the original and reconstructed spectrograms.The optimization process involves separately adjusting the parameters θ e and θ d to minimize the MSE, thereby enhancing the fidelity of the reconstructed spectrograms.

Feature Projection
After processing spectrograms through the autoencoder architecture, the dimension of the latent space is determined by the size of the resulting images and the number of channels.The resulting feature vectors have a size of img_width × img_height × num_channels = 9 × 9 × 64 = 5184, which represents a significant reduction in input dimensionality, compared to the original spectrogram dimensions of 515 × 515 = 265,225.Thus, the original information can be preserved using representations with only 2% of the original image size.
We experimented with PCA, t-SNE, and UMAP for dimension reduction in order to have grounds for a comparative analysis within the context of soundscape data.The techniques have been employed both for visualization purposes and also to reduce the dimensionality of the autoencoder feature space prior to classification.

Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is employed for the dimensionality reduction of datasets characterized by a substantial number of dependent variables.The primary objective is to preserve essential information within the dataset through the transformation of variables into uncorrelated variables known as Principal Components (PCs) [39].By eliminating redundancy, PCA enhances computational efficiency and reduces the risk of overfitting [40].We selected the number of principal components maintaining a 90% of the variance.

t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-distributed Stochastic Neighbor Embedding (t-SNE) was introduced as a technique to project high-dimensional data into a low-dimensional representation space [41].It is a variation of Stochastic Neighbor Embedding (SNE), which compares conditional probabilities representing similarities between data points in different dimensions [41].
Conditional probabilities are calculated from the Euclidean distance between data points x i and x j as in Equation (5).
where σ i is the variance of a Gaussian centered on x i .Then, the conditional probability q j|i of the projected counterparts y i and y j is obtained as in Equation ( 6).
The technique seeks to minimize the mismatch between conditional probabilities p i|j and q i|j .

Uniform Manifold Approximation and Projection (UMAP)
The UMAP (Uniform Manifold Approximation and Projection) dimension reduction algorithm is grounded in a rigorous mathematical foundation rooted in Riemannian geometry and algebraic topology.It stands as an alternative to t-SNE that notably boasts significantly faster performance compared to most t-SNE implementations, rendering it more efficient for handling large datasets.Moreover, its mathematical formulation seeks the preservation of both local and global structures inherent in the data [42].Over time, UMAP has emerged as a highly popular non-linear projection technique, particularly for visualizing intricate patterns delineated by features in two or three dimensions [43].

K-Means Clustering
Clustering serves as an unsupervised learning technique for exploratory data analysis, particularly useful when there is limited prior knowledge about the data and its distribution [44].The underlying principle involves comparing intrinsic features and generating clusters based on similarities or minimum distances computed from the data features.Clustering algorithms are generally categorized into two main groups: probability model-based approaches and non-parametric approaches [45].The latter is further subdivided into hierarchical or partitional methods, offering a variety of algorithms to choose from [24,[45][46][47].In this study, we employed the K-means algorithm, which falls under the category of partitional algorithms [47,48].K-means clustering partitions the data into distinct clusters based on the similarity of data points, providing a straightforward and efficient approach to clustering analysis.
Given a dataset X = x 1 , . . ., x N , where N is the number of samples, the K-means algorithm finds k centroids C = c 1 , . . ., c k minimizing the mean distance between each data sample in x and their nearest centroids.The objective function is defined in Equation (7).
where E is the energy, and a i is the minimum distance between samples and centroids, i.e., a i = arg min j∈1,...,k ||x i − c j ||.
There are many variations of K-means [49]; in our study, we employed the traditional Lloyd's algorithm [44].We did not perform comparisons with other clustering algorithms because our scope is centered on features and the information they extract.Moreover, previous work has shown that K-means is accurate and computationally efficient compared to others [47,49].Additionally, several authors have succeeded in combining K-means clustering with Deep Neural Networks [48], suggesting it is a suitable choice for our purposes of exploring feature distributions in the data.

Performance Metrics
The following metrics have been employed to assess the performance of classification models in the supervised pathway: • Accuracy: provides a global assessment of the model's correctness by quantifying the ratio of correctly predicted instances to the total number of instances.Accuracy is defined in Equation (8).
• Recall: also known as sensitivity, or the true positive rate, recall measures the model's capability to accurately identify positive instances from the entire pool of actual positive instances, as depicted in Equation ( 9).

Recall =
True Positives True Positives + False Negatives ( • F1-score: the F1-score is obtained as the harmonic mean of precision and recall, offering a balanced measure that considers both false positives and false negatives, as described in Equation (10).
The following metrics have been employed to assess the quality of clustering models in the unsupervised pathway:

•
Silhouette Coefficient: provides a measure of cluster cohesion and separation, as described in Equation (11).Cohesion is assessed based on the similarity of data instances within a single cluster, while separation is determined by the dissimilarity between instances from different clusters.
where a(i) is the average distance from the i-th data point to other data points in the same cluster (cohesion) and b(i) is the smallest average distance from the i-th data point to data points in a different cluster, minimized over clusters (separation).The silhouette score for the entire dataset is the average of the silhouette score for each instance.The overall silhouette score can be calculated as in Equation (12).
• Calinski-Harabasz (CH) index: measures the ratio of between-cluster variance to within-cluster variance.It helps in assessing how well-separated the clusters are from each other.This index is calculated using Equation (13).
where B k is the between-cluster scatter matrix, W k is the within-cluster scatter matrix, N is the total number of data points, and k is the number of clusters.• Davies-Bouldin (DB) index: computes the average similarity between each cluster and its most similar cluster.It provides insights into the compactness and separability of the clusters.The DB index is computed as in Equation (14).
where n is the number of clusters, σ i is the average distance from the centroid of cluster i to the points in cluster i, c i is the centroid of cluster i, and d(c i , c j ) is the distance between centroids c i and c j .

Experiments
We conducted experiments on the spectrograms generated from the Jaguas dataset processed as detailed in Section 3.1.We also considered metadata such as the timestamp of each recording, the recorder location, and the recording site cover type (forest or non-forest).

Autoencoder Architecture and Training
We allocated 98% of the dataset for training a vanilla autoencoder [35] with the architecture outlined in Figure 3.To assess the model's performance and generalization capability, we reserved the remaining 2% of the dataset for testing purposes.During each epoch of training, this subset was utilized to evaluate the Mean Squared Error (MSE) as a test error metric.Additionally, spectrograms from this test subset were passed through the autoencoder to obtain reconstructions, enabling a visual comparison between the original and reconstructed spectrograms.This approach allowed us to monitor the autoencoder's performance on unseen data and verify its capability to reconstruct spectrograms from the test set.
The encoding section of the autoencoder network comprises four convolutional layers with Rectified Linear Unit (ReLU) activation functions interspersed between them.Similarly, the decoding section follows the same structure, employing four deconvolutional layers with ReLU activation functions and a sigmoidal activation function for the final layer.The embedding space was derived using the encoding network by applying a flattening operation over the final layer.As a result, the dimension of the output was determined by the number of channels and the residual image after the convolutional layers, resulting in an embedding space of 5184 features.This outcome was achieved because the output channels were fixed to 64, and the residual image dimension was 9 × 9 (64 × 9 × 9 = 5184).

Supervised Learning Approach
We executed the supervised data processing pathway to compare the effectiveness of autoencoder features with more standard approaches in soundscape ecology.Considering that the dataset lacked information about biodiversity, species richness, or the presence of specific sound types, we relied on labels indicating the landscape type of the recording site, as either a forest or non-forest, provided by researchers involved in the sampling and analysis effort.We, thus, ran the RF classifier on the three input feature spaces:

•
The autoencoder features, extracted from our trained autoencoder architecture.

•
Feature vectors comprising sixty distinct acoustic indices computed using the scikitmaad Python module [50].
• The VGGish feature embedding, obtained from a pre-trained Convolutional Neural Network (17 layers) inspired by the VGG networks typically used for sound classification.
For the classification experiments, the dataset was divided into 80% of the samples for training and 20% for testing.We maintained uniform parameters and distributions across all classification models to ensure a direct comparison of results.In the case of the RF classifier, we set the maximum depth to 16 and the random state to 0. We applied dimensionality reduction to reduce the size of the feature space and use a feature space of the same size across all methods.The RF algorithm was then trained to predict landscape types using the reduced feature spaces obtained from each method.
Model performance was evaluated in terms of accuracy, recall, and F1-Score metrics, as described in Section 3.7.These metrics provide a comprehensive assessment of the performance of our classification models that take class imbalance into account.The analysis offers insights into the effectiveness of the three feature extraction techniques in predicting landscape types from the acoustic data.

Unsupervised Learning Approach
In the non-supervised data processing pathway, we employed the embedded autoencoder features to explore potential relationships between the temporal occurrences of acoustic events and the clusters identified with the K-means clustering algorithm.Furthermore, we employed PCA, t-SNE, and UMAP to generate two-dimensional projections of the features for visualization purposes.The visualizations are enriched with metadata concerning the recording hour and location in order to highlight potential temporal and spatial patterns.This allowed us to gain insight into the temporal and spatial distribution of acoustic events.By adopting this comprehensive approach, we were able to uncover valuable patterns and associations within the acoustic data, thereby contributing to a deeper understanding of the underlying dynamics of the soundscapes under investigation.
We evaluated the clustering quality using three commonly employed metrics: the Silhouette Coefficient (SLT), the Calinski-Harabasz (CH) index, and the Davies-Bouldin (DB) index.These metrics were applied while varying the number of clusters K from 3 to 35, allowing us to analyze the clustering performance across different cluster numbers.

Autoencoder Training
We conducted a thorough evaluation of the vanilla Autoencoder Neural Network, analyzing its learning rate curve over ten epochs.The Mean Squared Error (MSE) between the input and reconstructed spectrograms was computed, providing insights into the network's performance.Additionally, we visually compared the reconstructions generated by the autoencoder with the corresponding input data samples.The autoencoder's learning curve exhibited a rapid convergence, reaching the inflection point or "elbow" in less than one epoch (approximately 500 iterations) (Figure 4), indicating its efficiency.Following the inflection point, the mean error stabilized around 0.16, with a slight decrease observed in subsequent epochs (again, see Figure 4).Despite small fluctuations observed in the MSE across successive epochs, discernible improvements in the reconstructions were notable.Specifically, there was an accentuation of temporal patterns and an enhancement of background delineation.Accurate reconstructions of soundtype patterns, particularly in low and middle frequencies, were observed.However, challenges were encountered with higher frequencies beyond 8 kHz, where weak sounds became imperceptible in the reconstructions.Soundtypes with broad spectral ranges required multiple epochs to adjust to the original pattern.On the other hand, concerning background noise, the network initially depicted remarkable repeated patterns, gradually diminishing with each epoch.This Deep Neural Network was employed to extract a feature embedding, which served as an input for both the supervised and unsupervised data analysis pathways.

Feature Projections
We generated reduced embeddings of the original autoencoder's feature space using various dimensionality reduction techniques.Initially, we employed the traditional Principal Component Analysis (PCA) method to identify and organize components based on decreasing variance.This approach effectively determined a minimal number of components that accounted for over 90% of the data variance, facilitating feature space reduction while preserving performance and information content.Although 30 components captured 90% of the variance, we opted for 60 components to align with the dimensionality of the feature vector of acoustic indices.In addition to PCA, we utilized T-SNE and UMAP for dimensionality reduction, considering them state-of-the-art methods.We preserved the same number of components (60) to maintain consistency across feature vectors.Consequently, feature vectors of size 60 were input to the classification and clustering tasks.For visualization purposes, only two dimensions per method were used to showcase data distribution, as depicted in Figure 5.By examining these visualizations, one can discern patterns and relationships within the data based on the recording hour or location labels, providing insights into the temporal and spatial distributions of acoustic events across the dataset.
In Figure 5, we compare distributions using the recording time and recorder location as labels for each data point.In terms of the temporal aspect, PCA effectively distinguishes data points across different hours.However, more favorable outcomes were observed in the views obtained with UMAP and t-SNE, which demonstrated superior capability to separate hours into distinct distributions.On the other hand, discerning a clear pattern in the distribution of recording locations is not straightforward.Nevertheless, UMAP and t-SNE again exhibited superior separation for certain segregated points.Moreover, the absence of discernible segregation among points corresponding to various recording locations is regarded as a favorable outcome, as we did not expect to discover notorious patterns that might be introduced from the bias of the recorders.Therefore, we consider that spatial patterns deserve further investigations, given the lack of distinct separations, indicating that a more in-depth exploration of potential spatial correlations is required.

Classification of Landscape Type Using Supervised Learning
We trained a Random Forest classifier using the data point labels of "forest" or "nonforest", considering the reduced feature spaces relative to the three feature extraction methodologies.The forest and non-forest classification was based on a land cover map derived from satellite images.Forest cover exclusively included forested areas, while non-forest cover comprised all other land cover types (such as grasslands and secondary vegetation) [51].Our objective in this experiment was to compare the performance of the features obtained with the proposed autoencoder with methods recognized as effective in the literature, such as acoustic indices and the VGGish architecture.Additionally, we aimed to enrich our analysis by incorporating contextual information.Specifically, we were interested in leveraging satellite-derived labels and biological content extracted from the audio of the sites.This was motivated by the understanding that various factors, including degradation, landscape transformation, and other conditions, can interfere with acoustic patterns.Figure 6 presents the results pertaining to the original autoencoder embedding and after applying dimensionality reduction to 60 components, along with the acoustic indices and VGGish features.These results offer insights into the effectiveness of different feature extraction methodologies in distinguishing between forest and non-forest environments.The normalized autoencoder features exhibited the best classification metrics, with an F1 score of 90.4%, recall of 88.7%, and accuracy of 92.8%.Additionally, its projections, particularly with UMAP and 60 components, yielded comparable results, indicating that reducing the original feature space from 5184 to 60 components is computationally efficient while preserving relevant information.While PCA-based projections showed improved results compared to the baseline methodologies of acoustic indices and VGGish, there was a notable discrepancy compared to the original features and UMAP.This suggests that UMAP has superior capacity to compress relevant information in certain contexts.However, it is worth noting that results may vary depending on the specific task and landscape attribute being assessed.Conversely, the feature spaces computed with the baseline methodologies exhibited comparatively inferior performance and demonstrated limitations despite their frequent use.In the case of VGGish, although it computes spectrograms approximately every minute, compared to our solution which computes spectrograms every twelve seconds, there is a significant difference in all the classification metrics, indicating that spectrograms extracted from longer time series than those reported in the state of the art can reveal discriminant patterns.

Unsupervised Learning Results
To explore the potential of unsupervised methods in understanding landscape heterogeneity patterns without relying on species-specific analysis, we investigated temporal and sound type relationships within the clusters obtained with K-means, following the methodology outlined in [52].We systematically generated clusters with K ranging from three to thirty.For each clustering iteration, we computed both individual and average Silhouette scores (SLT), as well as the Davies-Bouldin and Calinski-Harabasz indices, serving as metrics of cluster quality.
Table 3 shows the results for models obtained with varying values of K and four input feature spaces (the autoencoder features and the three-dimensional reduced spaces obtained with PCA, t-SNE and UMAP).One observes that the best scores in all metrics were obtained with three clusters in most cases, and the scores tended to decrease as the number of clusters increased.However, there are a few exceptions, e.g., the UMAP features yielded better DB index scores with 35 clusters, whereas the t-SNE features yielded the best DB index with 7 clusters and a better CH index with 30 clusters.Notably, the UMAP features yielded better cluster quality metrics, indicating they consistently excelled in creating welldefined, compact, and segregated clusters, yielding an SLT of 0.41, DB index of 0.85, and CH index of 86,427.27,as detailed in Table 3.Moreover, when computing the metrics' means and standard deviations, UMAP showed the best mean values for SLT and CH and the best deviation value for DB; in contrast, t-SNE had the best values of DB mean and Silhouette and CH deviations.Nevertheless, both t-SNE and PCA contributed to score improvements, with PCA showing a slight enhancement, and t-SNE achieved the highest DB index using seven clusters as mentioned before, indicating that although UMAP and t-SNE are similar theoretically, spaces projected by these methods are well differentiable.On the other hand, contrasting the number of clusters among projection methods, we identified that when clusters increased, several patterns in alternating bands of frequency were accentuated, indicating that clusters could be considered similar based on the quantitative metrics, but there were variations in the biological content associated with animal vocalizations, i.e., for unsupervised studies, the metrics should be aligned to the biological content of recordings due to clusters with lower scores being able to capture relevant information and patterns which are sensitive to being masked by other frequency ranges.We have confirmed our hypothesis regarding the effectiveness of deep features in revealing macro patterns within the soundscape.We computed centroid spectrograms by passing the centroid feature vector in each cluster to the decoder part of the autoencoder.In this manner, we reconstructed spectrograms associated with inputs generated by the K-means clustering algorithm.This process allowed us to visualize the representative spectrograms corresponding to each cluster, providing insights into the characteristic acoustic patterns captured by the clusters (refer to Figure 7).This part of our methodology is a novel way to represent clusters' information for soundscape applications due to the fact that, in general, cluster prototypes are ignored in clustering procedures, despite containing valuable information about the composition of the clusters.
Specifically, the distinct centroid spectrograms (refer to Figure 7) effectively delineate noticeable changes in frequency bands, primarily reflecting distinct occupied bands.Our analysis revealed that cluster one mostly represented a quiet soundscape.Upon further investigating the cluster contents, we found mainly light rainfall, some insect sounds, and natural sounds like river sounds for cluster one.Given the frequent occurrence of cicadas and rainfall in most recordings and the study area, this result was expected.Cluster two also featured recordings of insects, albeit with additional biophonies such as anuran vocalizations represented in lower frequency ranges around 2 KHz, as it can be seen in Figure 7. Cluster three captured a medium-high spectral band, predominantly around 4 kHz, showcasing anuran and bird vocalizations; also, we can specify that these patterns correspond to animal vocalizations because there are accentuated intensities for thin frequency bands (2 Khz to 3 Khz, 3.5 Khz to 4 Khz, and 5 Khz to 6 Khz).These frequency ranges are commonly utilized by animals as part of their acoustic niche and are relevant for studying bioacoustic richness [53].Finally, we computed histograms for the three clusters discussed above to examine temporal trends within the groups, as depicted in Figure 8.This analysis unveiled distinct temporal distributions among the clusters.For instance, we discerned two primary time intervals in the three clusters.The most pronounced temporal interval spanned clusters three, encompassing a nocturnal time-frame from 6 p.m. to 5 a.m.Given Colombia's proximity to the equator, the sunrise and sunset hours remain nearly the same all yearround, typically around 6 a.m. and 6 p.m., respectively.A diurnal time-frame was prevalent in the remaining clusters.However, it is noteworthy that cluster two exhibited broader temporal ranges.Most recordings in cluster two primarily spanned 10 a.m. to 6 p.m., while in cluster one, recordings between 4 a.m. and 11 a.m.prevailed.The latter indicates the presence of acoustic landscapes characterized by silence, indicating a minimal detection of biophonies, geophonies, and anthropophonies.

Discussion and Future Work
We introduce an unsupervised learning framework based on an autoencoder architecture to extract representative features from soundscapes of a Colombian region, in association with the K-means clustering algorithm to reveal patterns present in acoustic recordings.Following ten epochs, we assessed the accuracy of the representations using Mean Squared Error (MSE) and the learning rate curve of the Deep Neural Network, and we conducted a visual inspection by comparing spectrogram inputs and outputs.During evaluation, we identified challenges in assessing the reconstructions despite achieving low MSE values.The absence of a defined baseline or cut-off point made it difficult to establish a clear threshold to assess reconstruction accuracy.We addressed this issue by conducting supervised learning experiment using landscape labels to discriminate between forest and non-forest areas.This allowed us to compare the performance of autoencoder features with baseline methodologies such as acoustic indices and VGGish.The classification results revealed that autoencoder features outperformed several metrics, especially after reducing the feature space with dimensional reduction methods (PCA, t-SNE adn UMAP).Despite UMAP not yielding the highest classification scores, it significantly reduced the embedded space while maintaining performance close to that obtained using the full feature space.
For the unsupervised learning experimentation, because UMAP allows us to project back features to the original space, we could create centroid spectrograms, which are spectrograms made from the reduced information of the reduced space, giving a better interpretability to the clusters and guiding along with conventional metrics such as Silhouette for the selection of a correct number of clusters.Additionally, employing metadata of the dataset was possible to establish temporal dynamics which are associated with acoustic biodiversity represented by the occupied range of frequencies.
In the unsupervised learning experimentation, as the autoencoder enables the projection of features back to the original space, we could generate cluster centroid spectrograms from the reduced information of the feature space.These spectrograms enhanced cluster interpretability, and in conjunction with conventional metrics, they can aid in finding an appropriate number of clusters.Additionally, the metadata of the dataset enabled us to establish temporal dynamics associated with acoustic biodiversity in the clusters, represented by the occupied range of frequencies.
We are confident that our proposal will expand the research horizons of non-speciesspecific soundscape studies, further exploring the multifaceted dimensions of biocomplexity.Nevertheless, ongoing work persists, especially in unsupervised studies within soundscape and ecoacoustics.Establishing a precise distribution of the data remains challenging, given the presence of not only explicit and observable patterns but also intricate dynamics within ecosystems.Deep learning algorithms hold promise in capturing these complex dynamics; however, challenges persist in interpreting and understanding ecological behaviors.Despite these obstacles, our efforts contribute to advancing the field by addressing these complexities and striving for deeper insights into the dynamics of soundscapes and ecoacoustic environments.

Conclusions
Despite the significant advances and remarkable progress in leveraging machine learning for ecoacoustic analysis, several challenges persist regarding conservation and monitoring strategies.A primary obstacle is the limited availability of labeled data essential for training machine learning algorithms, hindering the accurate recognition of vocalizations at the species level for animals such as birds and insects.Moreover, the dynamic and intricate nature of the acoustic landscape demands algorithms capable of handling vast amounts of data in real time and scaling to manage the environment's complexity effectively.Thus, ecoacoustics, on a large scale, is imperative to glean environmental attributes and changes.Unsupervised learning emerges as a practical solution for analyzing and exploring soundscapes capable of addressing some of these challenges.Unsupervised learning offers several advantages, including independence from labels, versatility in examining multiple sound sources present in recordings, and the automatic exploration of relationships among acoustic patterns.However, it is crucial to note that unsupervised learning relies on informative features to perform clustering effectively.Efforts to develop robust feature representations remain integral to advancing unsupervised learning methods in ecoacoustic analysis.
Our findings pave the way for further explorations in soundscape ecology, advocating for the integration of unsupervised learning approaches to gain a deeper understanding of landscape heterogeneity.The success of autoencoder-based features in classification tasks, alongside the identification of meaningful patterns in unsupervised clustering, highlights their potential as valuable tools in soundscape research.Moreover, our research underscores the importance of selecting appropriate dimensionality reduction techniques for unsupervised learning in soundscape studies.UMAP emerges as a robust method, demonstrating its ability to reveal meaningful patterns and heterogeneity within acoustic landscapes.Moving forward, future research endeavors should prioritize the refinement of spatial analysis techniques, the exploration of alternative unsupervised learning methodologies, and a comprehensive comparison of ecological and biological information across various dimensionality reduction methods tailored to the specified number of clusters.These initiatives will further enhance our understanding of acoustic environments and their ecological significance.

Figure 1 .
Figure 1.Overview of the principal stages in the methodological framework based on pre-processing, supervised learning, unsupervised learning, clustering analysis and results, and classification results.
Geographical distribution of recorders (31 Song Meter SM4 devices) in the Jaguas protected area in Antioquia, Colombia.

Figure 3 .
Figure 3. Proposed Autoencoder Neural Network.The embedded space is reduced to 5184 by applying a flattening to the residual image after the convolutional layers.

Figure 4 .
Figure 4. Autoencoder Learning Rate Curve: The similarity between reconstructions (upper figures) and original inputs (lower figures) was evaluated visually.Additionally, the difference between inputs and reconstructions was quantified using the Mean Squared Error (MSE) across ten epochs.

Figure 5 .
Figure 5.The projections of the feature space obtained with PCA, t-SNE, and UMAP are displayed in the rows.The points in the plots represent the audio segments, each distinct color shade maps a specific hour or location label.This setup favors a comparative analysis of data distributions across different dimensions and labeling schemes.

Figure 6 .
Figure 6.Results of classification with Random Forest using autoencoder features (before and after dimensionality reduction) as compared with a vector of acoustic indices and VGGish features.

Figure 7 .
Figure 7. Centroid spectrograms computed by passing the feature vector of each cluster centroid to the decoder part of the autoencoder.

Figure 8 .
Figure 8. Histograms of audio membership in each cluster using K-Means and UMAP projections at different times of day.

Table 1 .
Review of related work: Our selection criteria focused on articles that utilize autoencoders for representing soundscape data, perform clustering analysis on spectrogram data, and introduce novel methods for enhancing clustering performance and integrating features into clustering for embedded representations.

Table 3 .
Quality metrics (Silhouette coefficient (SLT), Davies-Bouldin index (DB), and Calinski-Harabasz index (CH)) of cluster models obtained with K-means over the autoencoder features and UMAP, t-SNE, and PCA projections, for varying values of K.The best values for each configuration are shown in bold, and the best global values are shown in blue.SLT values are in the range of [−1, 1], where higher values indicate better cohesion and separation.Lower values of DB are better, indicating dense and well-segregated clusters, whereas higher values of CH index are better.
MINCIENCIAS (Colombia).This research and APC were funded by Colombian National Fund for Science, Technology and Innovation, Francisco Jose de Caldas-MINCIENCIAS grant number 111585269779.Not applicable.