Amidst the growing recognition of the immense value of forest ecosystems in combating climate change and supporting biodiversity, the demand for rapid, precise, and robust methods to monitor these vital ecosystems is on the rise [1]. Forests, which shelter the majority of terrestrial biodiversity, span approximately 4.06 billion hectares, covering 31% of the world's land surface. They function as crucial carbon reservoirs and play an indispensable role in climate regulation [2]. To track progress toward climate and biodiversity goals, and to monitor deforestation, degradation, and forest responses to climate change, there is an increasing need for large-scale, cost-effective monitoring, ideally with automated data collection and processing up to the final information product. In the last decade, remote sensing (RS) technologies have become markedly more accessible and widely applied, providing data with resolutions fine enough to discern individual trees, as demonstrated in various studies employing high-resolution data [3,4,5,6,7,8,9,10,11]. While this presents opportunities for enhancing our understanding of forests, it also poses challenges in data interpretation [12,13].
Ongoing research explores capitalizing on symmetries and commonalities within network layers to enhance decision making and model robustness. However, methods (2) and (3) entail manual, non-standardized modifications of the training data (2) and/or the network architecture (3), making them well suited for specific applications but not for benchmarking and ranking ML and DL algorithms. This underscores the need for standardized and extensive training datasets, particularly when benchmarking the performance of various ML and DL models in forest management.
1.1. Related Benchmark Datasets
Back in 2018, the International Society for Photogrammetry and Remote Sensing (ISPRS) published true orthophotos and surface models together with semantic labels describing the apparent land cover class, such as impervious surface, building, low vegetation, tree, and car [21]. These datasets have been widely used for testing and ranking machine learning algorithms.
Recently, a new multimodal benchmark dataset for RS (MDAS) was added to the ML4Earth platform; it comprises SAR, multi-spectral, and hyper-spectral imagery, as well as a surface model [22]. The annotation provides pavement, low vegetation, soil, tree, roof, and water as labels. This dataset is dedicated to the use of satellite data from Sentinel-1 and -2 and EnMAP [23].
To the best of our knowledge, the focus of all benchmark datasets so far lies on semantic interpretation, i.e., distinct classes on a nominal scale are assigned.
In contrast, high-resolution remote sensing imagery (HR-RSI) benchmark datasets released in recent years, including WHU-RS19 [24], UC Merced [25,26], PatternNet [13], and RESISC-45 [27], focus on object classes and have proven successful for classification tasks. The diversity in structure and design of these datasets presents a unique opportunity to explore the utility of combining them into a larger meta-dataset (MDS) [28], addressing challenges such as heterogeneous image sizes and varied spatial resolutions within classes.
Adding to this landscape, several new benchmark datasets complement these efforts: ReforesTree [1] focuses on forest carbon stock estimation within carbon offsetting certification standards, outperforming satellite-based estimates. Addressing tree species classification in central Europe, TreeSatAI [29] leverages multi-sensor data. Designed for training deep neural networks, Barknet 1.0 [30] comprises over 23,000 high-resolution bark images from 23 different tree species in forests of eastern Canada. The NeonTreeEvaluation benchmark dataset [31] assesses crown detection in the United States using RGB, LiDAR, and hyper-spectral data. LuoJiaSET [32] is a large-scale training sample database system for the intelligent interpretation of remote sensing imagery.
Moreover, the OpenEarthMap dataset [33] introduces a benchmark for global high-resolution land cover mapping, comprising 2.2 million segments of 5000 aerial and satellite images. With manually annotated eight-class land cover labels at a 0.25–0.5 m ground sampling distance, OpenEarthMap allows semantic segmentation models trained on it to generalize worldwide, providing a valuable resource for advancing remote sensing methodologies.
The next step from a mathematical point of view is the use of parameters on a cardinal scale. This leads directly to a regression approach, because continuous values, rather than distinct classes, have to be predicted from satellite data. Such a dataset, especially one for forest characterization, has not been presented so far. This is exactly where our project idea of Wald5Dplus comes into play: the creation of a labeled benchmark for forest characterization, not classification. Continuous parameters such as the crown area per tree type, the tree height, and the crown volume, amongst others, are attached to each pixel in the ARD cube. The annotation of parameters on a cardinal scale places new demands on the labeling, which are briefly explained in the following.
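The shift from nominal to cardinal labels can be made concrete with a minimal sketch; the array shapes, the parameter order, and all numeric values below are illustrative assumptions, not the actual Wald5Dplus schema:

```python
import numpy as np

# Nominal scale (classification): one class index per pixel.
class_map = np.array([[0, 1, 1],
                      [2, 1, 0],
                      [2, 2, 0]])  # hypothetical tree-type indices

# Cardinal scale (characterization): continuous values per pixel.
# Three assumed parameters: crown area fraction, tree height (m),
# and crown volume (m^3), stacked along the last axis.
param_cube = np.stack([
    np.full((3, 3), 0.65),   # crown area fraction (example value)
    np.full((3, 3), 23.4),   # tree height in metres (example value)
    np.full((3, 3), 310.0),  # crown volume in m^3 (example value)
], axis=-1)
```

A model trained on such labels is evaluated with regression metrics such as RMSE per parameter rather than classification accuracy.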
1.2. Requirements for Training Data
In the domain of RS and the application of AI techniques, a series of key requirements concerning training data come to the fore. These requirements are of considerable significance within the context of RS and encompass both general and forest-specific aspects.
High-Quality and Well-Labeled Data: A fundamental requirement in the field of RS is the availability of high-quality, meticulously labeled datasets. This precision ensures that AI algorithms can be trained and validated on information that is both accurate and dependable. Machine learning, a cornerstone of AI, relies heavily on data quality; research emphasizes that the successful utilization of machine learning techniques in RS applications necessitates high-quality data, particularly well-labeled datasets [34]. Such well-labeled data serve as the bedrock upon which AI models are constructed and validated.
Accessibility of Publicly Available Datasets: A pivotal requirement is the accessibility of publicly available datasets accompanied by validation data. These datasets serve as indispensable benchmarks for the development and validation of algorithms, allowing researchers to verify their methods against established standards and compare them with the state of the art. In various standard RS applications, frequently employed datasets serve as reference points for algorithm testing; their abundance underscores the importance of making high-quality training samples available, as highlighted in the literature [14,35,36].
Diversity and Representativeness: Training data should encompass a wide range of scenarios to enable AI models to generalize effectively. This diversity is crucial in RS, where real-world conditions can vary significantly. Machine learning techniques rely on diverse training data to ensure that models can adapt to different conditions and achieve robust generalization. In this context, variations in land cover, seasonal changes, and different environmental conditions are essential to ensure that AI models can handle the intricacies of RS applications [37].
Spatial and Temporal Coverage: The training dataset should provide comprehensive spatial and temporal coverage [37,38], ensuring that AI models can effectively adapt to diverse regions and monitor temporal dynamics with precision. High-quality training samples, representative of a wide range of geographic locations and temporal changes, are fundamental in addressing the challenges of RS data, especially in the context of forests. Broad spatial and temporal coverage aligns with the need to capture fine-grained spatial and temporal changes in RS applications.
Data Resolution: Training data should align with the spatial and temporal resolution of the RS data used for analysis. This matching resolution is essential for enabling AI models to capture and respond to temporal changes with accuracy, as highlighted in the literature; aligning training data resolution with RS data resolution is crucial for the effective application of AI in RS [14].
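As a sketch of this resolution matching, coarse reference labels can be brought onto a finer image grid by nearest-neighbour resampling; the 20 m and 10 m grid sizes here are assumptions for illustration only:

```python
import numpy as np

def upsample_nearest(labels, factor):
    """Nearest-neighbour upsampling: replicate each label cell so a
    coarse label grid matches a finer image grid."""
    return np.repeat(np.repeat(labels, factor, axis=0), factor, axis=1)

labels_20m = np.array([[1, 2],
                       [3, 4]])               # labels on a coarse 20 m grid
labels_10m = upsample_nearest(labels_20m, 2)  # matched to a 10 m image grid
```

For continuous parameters, interpolation (e.g., bilinear) may be preferable to simple label replication.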
Quantity and Sample Size: Adequate training samples are pivotal for optimizing AI models [34,39]. The quantity of training data should match the complexity of the analysis task and the specific requirements of the AI model under consideration. The importance of an ample sample size in mitigating the risk of model underfitting is well established in machine learning, including in RS.
Consistency and Continuity: Consistency in labeling and data quality [38] throughout the training dataset is imperative for ensuring the reliability of AI models in RS tasks, especially those involving time-series data. Additionally, maintaining continuity in data collection is essential for effectively monitoring changes and trends. Both are recognized in the literature as components of robust AI model development.
Annotated Metadata: Annotated metadata [14,32], providing comprehensive information on a dataset's source, acquisition date, geographical location, and any preprocessing steps applied, enhance the interpretability and utility of training data. These metadata provide essential context, enabling researchers to better understand the information used for AI model development; their critical role in the effective utilization of training data is widely acknowledged.
Data Balance: Maintaining a balanced representation of classes or categories within training data is vital, especially in classification tasks. Unbalanced datasets can present a substantial obstacle during model optimization, especially when specific classes are infrequent or poorly represented [38,40]. Ensuring equitable representation of classes is a recognized strategy to prevent biases and skewed results, and achieving data balance is crucial for the accurate classification of RS data [34].
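One common remedy for such imbalance, sketched here as an assumption rather than a method prescribed by the cited works, is to weight classes inversely to their frequency during training:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency,
    so infrequent classes contribute more to the training loss."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])  # class 0 dominates
weights = inverse_frequency_weights(labels)      # rare classes weigh more
```

This mirrors the "balanced" heuristic found in common ML libraries; alternatives include oversampling rare classes or undersampling dominant ones.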
In summary, these requirements collectively underscore the pivotal role of training data in the accuracy and effectiveness of AI models in RS applications, particularly in the forest context. They align with the findings of [34,35,36,37,39] and are fundamental in ensuring that AI techniques are effectively applied to RS data, ultimately advancing the field and promoting robust, reliable, and insightful analyses.
1.3. Concept of the Wald5Dplus Benchmark Data Cube
In addressing the challenges posed by RS data and developing a benchmark dataset that integrates RS data and reference data, effective strategies are imperative. These strategies are vital for ensuring that the benchmark dataset meets the rigorous requirements of ML and DL while capitalizing on the power of AI techniques. To this end, a multifaceted approach has been adopted, which is described in this section (Figure 1).
The Key Data Source Sentinel: One of the cornerstones of this approach is the utilization of data acquired by the Sentinel satellite missions. Sentinel-1 and Sentinel-2, part of the European Space Agency’s (ESA) Copernicus program, offer substantial advantages. Sentinel-1, through its radar technology, provides insights into the forest canopy, offering a unique view into the dense vegetation. Sentinel-2, on the other hand, offers a view of the forest foliage through multi-spectral imagery. The temporal frequency of data acquisition by these missions, with their weekly and bi-weekly revisit times, provides an ideal basis for monitoring temporal dynamics and capturing variations in forest attributes.
Fusion of Sentinel Data: While both Sentinel-1 and Sentinel-2 (©ESA) sensors offer unique advantages by themselves, the fusion of Sentinel-1 and Sentinel-2 data provides an outstanding opportunity to derive rich information about forest attributes, including tree species and canopy height.
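A minimal form of such fusion is channel stacking of co-registered acquisitions; the image size and band counts below (two Sentinel-1 backscatter channels, ten Sentinel-2 spectral bands) are assumptions for illustration, not the actual cube layout:

```python
import numpy as np

h, w = 256, 256
rng = np.random.default_rng(0)

# Synthetic stand-ins for co-registered acquisitions on a common grid.
s1 = rng.random((h, w, 2), dtype=np.float32)   # e.g., VV and VH backscatter
s2 = rng.random((h, w, 10), dtype=np.float32)  # e.g., multi-spectral bands

# Fuse by stacking along the channel axis into one multi-modal cube.
fused = np.concatenate([s1, s2], axis=-1)
```

Co-registration (same grid, same projection) is the precondition that makes this simple stacking meaningful.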
Variability: Variability refers to the diversity and differences present in the input data. In RS, this can manifest as variations in the data captured due to differences in environmental conditions, sensor characteristics, or the objects being observed (e.g., different tree species in a forest). ML algorithms thrive when exposed to a diverse range of input data, because they can learn and adapt better when they encounter a wide array of situations and patterns.
For effective training and deployment of AI models, the benchmark dataset aims to address the challenges posed by this variability in RS data. While traditional methodologies often attempted to reduce variability by using techniques like channel combinations (e.g., combining different spectral bands to calculate indices like the NDVI), AI methods are more adept at handling high variability in data. Unlike traditional methods that try to simplify the data by reducing variability, AI techniques have the capability to work with data that exhibit a wide range of attributes, including outliers or extreme data points.
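The NDVI mentioned above illustrates such a channel combination: a ratio of near-infrared and red reflectance that compresses two bands into one index. The band values here are synthetic stand-ins, not real Sentinel-2 measurements:

```python
import numpy as np

# Synthetic reflectance values for a 2x2 pixel patch.
nir = np.array([[0.45, 0.50],
                [0.40, 0.55]])  # near-infrared band (e.g., Sentinel-2 B8)
red = np.array([[0.05, 0.10],
                [0.08, 0.07]])  # red band (e.g., Sentinel-2 B4)

# Normalized Difference Vegetation Index, bounded in [-1, 1].
ndvi = (nir - red) / (nir + red)
```

Such hand-crafted indices reduce variability by design, whereas the benchmark keeps the full band set and leaves feature extraction to the model.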
The benchmark dataset leverages the inherent tolerance of AI techniques to diverse data attributes, including those that deviate significantly from the norm. Rather than eliminating variability, it adapts to it. Additionally, preprocessing techniques are implemented to prepare the data in a format suitable for AI algorithms. This preprocessing may include explicit normalization, which ensures that data are scaled or adjusted to a standardized format [41]. Normalization is particularly important for some machine learning techniques, such as support vector machines, which rely on the data being in a specific range to work effectively.
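As a sketch of such explicit normalization (a generic z-score variant, assumed here rather than taken from [41]), each band is rescaled to zero mean and unit variance before being fed to a scale-sensitive learner such as an SVM:

```python
import numpy as np

def zscore_normalize(cube, axis=(0, 1)):
    """Per-band z-score normalization: subtract each band's mean and
    divide by its standard deviation (guarding against zero spread)."""
    mean = cube.mean(axis=axis, keepdims=True)
    std = cube.std(axis=axis, keepdims=True)
    return (cube - mean) / np.where(std == 0, 1.0, std)

rng = np.random.default_rng(42)
bands = rng.random((64, 64, 4)) * 1000.0  # raw bands on a wide value range
normalized = zscore_normalize(bands)       # each band now ~N(0, 1) scaled
```

Tree-based models are largely scale-invariant, so whether normalization matters depends on the chosen learner.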
Challenges with Training Data Volume: Generating large training datasets, especially in forest-related applications, presents a unique set of challenges. Unlike land cover or agricultural land identification, where readily available datasets such as LUCAS points [42] or INVEKOS data can be employed, forest classification faces distinct challenges. The spatial and temporal heterogeneity of forests, including variations in tree species and canopy height, necessitates extensive and specific training datasets. To address this, the benchmark dataset leverages the copious data provided by Sentinel-1 and Sentinel-2 (©ESA, 2020 and 2021), enhancing the ability to generate large-scale training data. This is particularly important where existing data, such as the Bundeswaldinventur data [43], do not align with the required spatial and temporal coverage for forest classification.
Meeting Training Data Requirements: The benchmark dataset is meticulously crafted to address the multifaceted requirements of training data in the realm of RS and AI applications, capitalizing on the abundance of data from distinct geographical regions, each of vital significance in enhancing the robustness of AI models.
Data Quality and Meticulous Labeling: The cornerstone of this dataset is an unwavering commitment to data quality and meticulous labeling. This ensures that every data point is characterized by a high degree of precision, free from errors, and labeled with painstaking accuracy. The quality of labeling is central to the successful development of AI models. Furthermore, the dataset offers comprehensive and consistent data quality throughout, maintaining the highest standards for accurate and reliable information.
Diversity and Representativeness: To enable AI models to generalize effectively and tackle the intricacies of real-world RS applications, the benchmark dataset encompasses a wide range of forest scenarios. These scenarios span different geographical regions, including the Bavarian Forest National Park (2016), the Steigerwald Forest (2017), and the Kranzberg Forest (2020), all situated in southeastern Germany (Table 1). While these regions share a common geographical location, they exhibit distinct characteristics due to variations in environmental conditions, tree species, and forest structure. This diversity ensures that AI models can adapt to the heterogeneous nature of RS data and make accurate predictions across a spectrum of scenarios.
Spatial and Temporal Coverage: Comprehensive spatial and temporal coverage is a pivotal aspect of the benchmark dataset. It spans various geographic locations, each with its unique ecological and environmental attributes. Moreover, the dataset captures changes over time, providing temporal dynamics with precision. This broad coverage equips AI models with the ability to adapt to diverse regions and monitor temporal changes, enhancing their capacity to analyze RS data effectively.
Data Resolution: The benchmark dataset is meticulously aligned with the spatial and temporal resolution of RS data used for analysis. This strategic alignment ensures that AI models can effectively process and interpret the level of detail present in RS data. By matching the resolution, the dataset empowers AI models to capture and respond to temporal changes with a high degree of accuracy, contributing to the robustness of their performance.
Quantity and Sample Size: Adequate training samples are pivotal in optimizing AI models for RS applications. The benchmark dataset takes this requirement seriously, ensuring that the quantity of training data is commensurate with the complexity of the analysis task and the specific needs of the AI models under consideration. The provision of large yet manageable training datasets minimizes the risk of model underfitting, fostering the development of accurate and reliable AI models.
This paper emphasizes (1) the development of a unique benchmark dataset, explicitly crafted to meet the exacting requirements of ML and DL, particularly within RS and forest-related applications. By synergistically integrating Sentinel-1 and Sentinel-2 (©ESA, 2020 and 2021) data with AI techniques, it bridges the divide between diverse RS data and AI model application, significantly enhancing the ability to address data variability. (2) The results highlight its success in facilitating the monitoring of forests and a wide array of tree parameters. Moreover, the challenges associated with generating large training datasets are met with a strategic focus on harnessing the capabilities of these satellite missions. By creating large yet manageable training datasets, the benchmark dataset intends to empower AI models to unlock the wealth of information present in RS data, ultimately advancing RS and AI applications in forest contexts.