1. Introduction
Point cloud data are frequently encountered in geospatial applications, particularly as light detection and ranging (LiDAR) returns [1]. As spatial coverage of LiDAR data expands to more localities, and as point density and return counts increase, the volume of LiDAR data can rapidly accumulate and become a big data problem. The subsequent processing of these data can become challenging and even overwhelming, especially if access to the proprietary software and systems commonly used for transforming geospatial point cloud data is inadequate. Tree species classification and digital soil mapping applications need wall-to-wall coverage of environmental covariate data throughout study areas that can encompass hundreds, if not thousands, of square kilometers. Consequently, there is interest in a streamlined process for generating a variety of environmental covariates from LiDAR data.
For terrain analysis, topographic covariates are typically constructed from a digital terrain model (DTM) [2]. Generally, continuous surfaces corresponding to a triangulated irregular network (TIN) [3,4] of meshes are derived from LiDAR point cloud data, from which covariates such as a DTM or digital surface model (DSM) are produced. Often, if these features are analyzed in conjunction with data from different sensors or with imagery of pixelized covariates, they are output as rasterized environmental covariate layers. Each rasterized layer comprises one resultant value per pixel, and its computation is optimized when appropriate resources are employed. However, when modifying a calculation or processing point cloud data for larger study areas, these methods can be difficult to implement without sufficient computational resources.
Structured query language (SQL) can be used to derive features from point cloud data. Within SQL [5], each point of a point cloud can constitute a record, that is, a row, in a data table. Identification keys can be linked to each record for storage in structured form within a database, from which SQL can expedite merging with other data sources. Queries can readily be customized to control which data are selected, so as to better screen data for the subsequent analysis. Specific clauses within SQL can aggregate data to set coordinate grids such that each record constitutes a pixel of a rasterized layer. This effectively summarizes the data to resolve certain features, which can be used as environmental covariates and predictors in subsequent research analysis.
This approach of harnessing SQL to rasterize covariates is demonstrated here on LiDAR point cloud data collected for a boreal study region in Northern Ontario, Canada. An overview of the LiDAR point cloud data structure is given. SQL commands are used to generate the basic features of DTM, DSM, and canopy height model (CHM) from LiDAR point cloud data, as well as a gap fraction, as calculated in the following steps. The rasterized output from these SQL queries is displayed in the results section. Also conveyed is how important these LiDAR-derived covariates were for a subsequent tree species classification. Further enhancements for deriving covariates with SQL, as well as other considerations, are also briefly discussed.
2. Materials and Methods
2.1. Data
LiDAR point cloud data, collected by aerial survey in 2016 and 2017, were obtained from Land Information Ontario (LIO) [6]. The surveyed region corresponded to an expanse of the District of Cochrane centered along the Ontario Highway 11 corridor, in Northern Ontario, Canada. The point cloud data [6] were stored in LAS (LASer) format, with each record corresponding to a retrieval. An easting coordinate (X) [meter], northing coordinate (Y) [meter], and vertical elevation (Z) [meter] were each recorded, along with the return count per retrieval. The easting and northing coordinates conformed to the NAD (North American Datum) 1983 UTM (Universal Transverse Mercator) Zone 17 N (North) projection. Other fields pertaining to LiDAR intensity, scan angle, and time [second] with regard to the sensor were also recorded. A depiction of the LiDAR data structure, as retained within the LAS format, is presented in Table 1.
These LiDAR LAS tile files each corresponded to 1 km², and were recorded with an approximate point density of 8 retrievals per m² [6]. Each LAS file was about 0.5 GB in size. In total, 6085 LAS tiles [6] of similar size were amassed for the study region, from which a few separate study areas were delineated.
2.2. Methodology
When obtained from LIO, the LAS files were compressed in the LAZ format. To recover the LAS files, the LAZ files were converted en masse with LASzip version 3.2.9 software [7]. The laspy library in Python was employed for reading in the contents of the LAS files. The variant of SQL adopted was SQLite [8]; the SQL commands were executed via a cursor of the sqlite3 module in Python.
SQL can produce a data table of summary variables via the group by clause [5]. Aggregate functions set how values are combined within each group, so as to obtain one row for what can represent one pixel of a rasterized layer. The round function can be used to generate the pixel coordinates, here for a 30 m by 30 m cell size. The resulting eastings and northings constitute the keys for each record, that is, for each pixel of the rasterized data. Furthermore, the same keys can be used for merging with other rasterized layers that are saved as SQL data tables. An example of an SQL command that can be executed for computing the DTM, DSM, and CHM (defined here as the DSM minus the DTM) is depicted in Figure 1.
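The group by aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Figure 1: the table name `points`, its column names, the sample coordinates, and the expression `round(v / cell - 0.5) * cell` for snapping to the lower cell edge are all hypothetical choices made for this example.

```python
import sqlite3

CELL = 30.0  # cell size in metres (assumed parameter name)

conn = sqlite3.connect(":memory:")
# Hypothetical table layout: one row per LiDAR retrieval.
conn.execute("CREATE TABLE points (x REAL, y REAL, z REAL, num_returns INTEGER)")
# Made-up points: three fall in one 30 m cell, one in the neighbouring cell.
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?, ?)",
    [
        (500015.0, 5400005.0, 250.0, 1),
        (500022.0, 5400012.0, 262.5, 3),
        (500025.0, 5400028.0, 251.0, 2),
        (500045.0, 5400010.0, 249.0, 1),
    ],
)

# Subtracting 0.5 before rounding snaps each coordinate to the lower edge
# of its cell; the snapped easting and northing then act as the pixel key.
query = """
    SELECT round(x / :cell - 0.5) * :cell AS easting,
           round(y / :cell - 0.5) * :cell AS northing,
           max(z)          AS dsm,
           min(z)          AS dtm,
           max(z) - min(z) AS chm
    FROM points
    GROUP BY easting, northing
    ORDER BY easting, northing
"""
rows = conn.execute(query, {"cell": CELL}).fetchall()
for row in rows:
    print(row)
```

Each returned row is one pixel, keyed by its lower-left easting and northing, so changing `CELL` is enough to change the output resolution.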
Note from Figure 1 how the round function has been adapted to create the lower grid coordinates of the pixels for the desired spatial resolution of 30 m. To derive features by quick computation, extreme values have been used for the metrics; the DSM relates to the maximum and the DTM to the minimum elevation of the LiDAR point cloud data within each specified pixel. SQL queries have also been carried out to generate a gap fraction covariate, specified here as the fraction of retrievals per pixel with only one return. The SQL command for forming a summary table with the calculated features of DSM, DTM, CHM, and gap fraction, each rounded to two decimal places here for formatting purposes, is detailed in Figure 2.
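A summary query of this kind might look like the sketch below. It is an assumption-laden illustration rather than the command in Figure 2: the `points` table, the `num_returns` column, and the sample values are hypothetical, and the gap fraction is computed by averaging the 0/1 result of the comparison `num_returns = 1`, which is one of several equivalent ways to count single-return retrievals in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL, z REAL, num_returns INTEGER)")
# Made-up retrievals: four in one 30 m cell (two with a single return),
# one in the neighbouring cell.
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?, ?)",
    [
        (500015.0, 5400005.0, 250.123, 1),
        (500022.0, 5400012.0, 262.456, 3),
        (500025.0, 5400028.0, 251.0, 2),
        (500018.0, 5400020.0, 255.0, 1),
        (500045.0, 5400010.0, 249.0, 1),
    ],
)

# In SQLite a comparison evaluates to 0 or 1, so avg(num_returns = 1)
# is the fraction of retrievals in the cell that had only one return.
summary = conn.execute(
    """
    SELECT round(x / 30.0 - 0.5) * 30.0          AS easting,
           round(y / 30.0 - 0.5) * 30.0          AS northing,
           round(max(z), 2)                      AS dsm,
           round(min(z), 2)                      AS dtm,
           round(max(z) - min(z), 2)             AS chm,
           round(avg(num_returns = 1), 2)        AS gap_fraction
    FROM points
    GROUP BY easting, northing
    ORDER BY easting, northing
    """
).fetchall()
for row in summary:
    print(row)
```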
The SQL commands presented in Figure 1 and Figure 2 were executed per LAS file. In practice, the study areas corresponded to domains greater than 1 km²; some study areas encompassed more than 1000 km². In that case, numerous LAS files needed to be processed sequentially. Batch scripts were compiled to read in one LAS file at a time, execute these queries, and then append the summary table output to a table containing the data from the files previously processed for the study area. The output was summarized per easting and northing coordinate, with one unique value for each combination of easting and northing coordinates. Batch processing was performed within Python via the os and glob modules, which read files matching a specified pathname pattern, as with the LAS files [6].
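The batch loop could be organized along the following lines. This is a sketch under several assumptions: small comma-separated text files stand in for LAS tiles so that the example runs without laspy, the function and table names are hypothetical, and the final merge step (which keeps point counts so that cells straddling a tile edge collapse to one row) is one possible way to obtain a unique value per coordinate pair, not necessarily the authors' procedure.

```python
import csv
import glob
import os
import sqlite3
import tempfile

TILE_SQL = """
    INSERT INTO summary
    SELECT round(x / :cell - 0.5) * :cell AS easting,
           round(y / :cell - 0.5) * :cell AS northing,
           max(z), min(z), count(*), sum(num_returns = 1)
    FROM points
    GROUP BY easting, northing
"""

def read_csv_points(path):
    """Stand-in reader: yields (x, y, z, num_returns) per retrieval.
    For real LAS tiles a laspy-based reader would be swapped in here."""
    with open(path, newline="") as fh:
        for x, y, z, n in csv.reader(fh):
            yield float(x), float(y), float(z), int(n)

def summarise_tiles(pattern, conn, read_points=read_csv_points, cell=30.0):
    """Process every file matching `pattern`, one at a time, appending
    each per-tile summary to the `summary` table, then merge duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summary (easting REAL, northing REAL, "
        "dsm REAL, dtm REAL, n_points INTEGER, n_single INTEGER)"
    )
    for path in sorted(glob.glob(pattern)):
        # Only one tile's points are held in the database at any time.
        conn.execute("DROP TABLE IF EXISTS points")
        conn.execute("CREATE TABLE points (x REAL, y REAL, z REAL, num_returns INTEGER)")
        conn.executemany("INSERT INTO points VALUES (?, ?, ?, ?)", read_points(path))
        conn.execute(TILE_SQL, {"cell": cell})
        conn.commit()
    # A cell on a tile edge appears once per tile; re-aggregate to one row.
    return conn.execute(
        """
        SELECT easting, northing, max(dsm), min(dtm), max(dsm) - min(dtm),
               1.0 * sum(n_single) / sum(n_points)
        FROM summary
        GROUP BY easting, northing
        ORDER BY easting, northing
        """
    ).fetchall()

# Demo with two made-up "tiles" that share one 30 m cell.
tiles = {
    "tile_a.txt": [(500015.0, 5400005.0, 250.0, 1), (500022.0, 5400012.0, 262.5, 3)],
    "tile_b.txt": [(500025.0, 5400028.0, 251.0, 1), (500045.0, 5400010.0, 249.0, 1)],
}
with tempfile.TemporaryDirectory() as folder:
    for name, pts in tiles.items():
        with open(os.path.join(folder, name), "w", newline="") as fh:
            csv.writer(fh).writerows(pts)
    conn = sqlite3.connect(":memory:")
    combined = summarise_tiles(os.path.join(folder, "*.txt"), conn)
print(combined)
```

Storing the counts rather than the per-tile gap fraction is a deliberate choice in this sketch: fractions from two tiles cannot be averaged correctly without knowing how many retrievals each was based on.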
For the subsequent rasterization, the appended summary data needed to be restacked, with consecutive northing coordinates ranging across the rows and consecutive easting coordinates across the columns. It was crucial to have the data restacked in the proper order of the grid coordinates, with no gaps; a missing value specification was set for those grid coordinates with no LiDAR summary information. Pandas [9] was used to pivot the data into two-dimensional arrays corresponding to rasterized layers. This was performed for the DTM, DSM, CHM, and gap fraction from the appended summary table. Afterwards, each pivot table was exported as a text file. Header information giving the number of rows and columns, the lower left easting and northing coordinates of the study area, the cell size for the spatial resolution, and the no data specification for missing values was added to the beginning of each text file. Each file was then loaded in ArcGIS Pro version 2.2.0 software via the ASCII (American Standard Code for Information Interchange) to raster conversion tool, after which the relevant coordinate projection was specified for each resulting raster layer. ArcGIS software was used to display the rasterized covariate figures.
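The restacking and header steps can be illustrated with the sketch below. Note the substitution: the authors pivoted with Pandas, whereas this example uses plain Python so that it runs without third-party packages. The function name `to_ascii_grid`, the no data value of -9999.0, and the sample records are assumptions; the header keywords follow the Esri ASCII grid convention, in which data rows are listed from the northernmost row downwards.

```python
def to_ascii_grid(records, cell, nodata=-9999.0):
    """Restack (easting, northing, value) records into ASCII grid text.

    Eastings and northings are assumed to be lower cell edges on a
    regular grid with spacing `cell`; cells with no record get `nodata`.
    """
    records = list(records)
    x_min = min(e for e, _, _ in records)
    y_min = min(n for _, n, _ in records)
    # Key each value by integer (row, column) to avoid float comparisons.
    indexed = {
        (int(round((n - y_min) / cell)), int(round((e - x_min) / cell))): v
        for e, n, v in records
    }
    nrows = max(r for r, _ in indexed) + 1
    ncols = max(c for _, c in indexed) + 1
    header = [
        f"ncols {ncols}",
        f"nrows {nrows}",
        f"xllcorner {x_min}",
        f"yllcorner {y_min}",
        f"cellsize {cell}",
        f"NODATA_value {nodata}",
    ]
    # Rows are written north to south, columns west to east, with no gaps.
    body = [
        " ".join(str(indexed.get((r, c), nodata)) for c in range(ncols))
        for r in range(nrows - 1, -1, -1)
    ]
    return "\n".join(header + body) + "\n"

# Two made-up pixels on opposite corners of a 2 x 2 grid.
grid_text = to_ascii_grid(
    [(500010.0, 5400000.0, 1.5), (500040.0, 5400030.0, 2.5)], cell=30.0
)
print(grid_text)
```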
3. Results
3.1. Feature Generation
The LiDAR data were processed on a desktop computer with an Intel® Core™ i7-7700 central processing unit (CPU) at 3.60 GHz and 8.00 GB of random access memory (RAM). A 10 TB capacity external hard drive stored the LiDAR data. About 350 km² of LiDAR data were processed per day, with each LAS file covering 1 km². This amounted to around 3 to 5 min to process one 0.5 GB LAS file. To ensure faster computing, approximately 200 GB of LAS files per day were transferred to an internal hard drive on the desktop computer. A database contained the resulting output tables from the SQL queries, from which the respective features were generated.
A DTM computed following the SQL from Figure 1 is depicted in Figure 3. For comparison purposes, Figure 3 also shows a DTM of 30 m spatial resolution that was obtained from the Ontario Ministry of Natural Resources (MNR), compiled in 2013.
The provincial DTM was generated from a combination of a DSM derived from RADAR (radio detection and ranging), an Ontario base mapping dataset, and DTM points and contours [10]. The LiDAR-derived DTM in Figure 3 offered more precision, to the centimeter level, in elevation. The LiDAR retrievals attained via airborne survey provided measurements for remote forested localities that were difficult to survey by other conventional means.
The DSM and CHM, with CHM calculated as the difference between the DSM and DTM, are both presented in Figure 4. Compared to the LiDAR-derived DTM in Figure 3, the DSM in Figure 4 is grainier due to the variation in vegetation within the overlying canopy. However, the land use is apparent when the CHM is computed, as shown in Figure 4; agricultural fields, infrastructure lines, water bodies, and forested tracts are clearly distinguished. Also discerned are localities of primary forest, with higher CHM values, and wetlands, with typically lower CHM values; for this study area, this was due to the prevalence of stunted black spruce (Picea mariana) within the wetland environments.
The gap fraction, as derived in Figure 2, is shown in Figure 5. Note here that the gap fraction is correlated with the CHM and discerns locations with no canopy vegetation cover. This covariate is relevant for land use classification, and assists with differentiating by vegetation density, so as to identify localities of denser forest that do not necessarily have taller tree heights within the canopy.
3.2. Research Application
The aforementioned process was used to compute fine spatial resolution (2 m by 2 m) features for a tree species classification [11] within the Abitibi River Forest (ARF) in the District of Cochrane, Ontario, Canada. Here, the cell size, as specified in Figure 1, was modified from 30 m to 2 m, adapting to a finer scale so as to facilitate a tree species classification representative of various settings within a boreal forest context. For this study, surface reflectance was obtained from WorldView-2 satellite imagery of the same spatial resolution. LiDAR data were processed to generate a DTM and CHM, and topographic features were derived from the DTM via SAGA (System for Automated Geoscientific Analyses) GIS (geographic information system) version 7.6.3 software [12]. A listing of the top 10 most important features for this study, as ascertained by mean decrease in Gini from a random forest [11], is presented in Table 2.
As indicated in Table 2, LiDAR-derived features attained the highest respective variable importance. Most of these features related to topography, with the exception of those for surface reflectance and CHM. For tree species classification within boreal environs, topographic attributes were pertinent, as topography impacts where certain tree species grow [13]. In this study [11], the random forest attained an overall site level accuracy of 0.79, with a corresponding Cohen’s kappa of 0.69, when assessed on a validation set. Thus, the adoption of finer spatial resolution features enhanced the classification for tree species, which otherwise was less accurate when utilizing coarser spatial resolution products.
4. Discussion
Depending on the variant of SQL implemented, aggregation functions beyond the arithmetic mean, maximum, or minimum can be available, such as the median or other metrics. Where they are not, a metric such as the median can still be calculated by sorting and indexing operations, with additional querying by means of intermediate data tables for a multistep calculation. The complexity of the query command can be increased via a where clause that includes nested subqueries, to take into account other considerations for selecting data. Queries can also be executed to identify outliers within the point cloud data; these outliers can correspond to erroneous measurements, tower structures, or flocks of birds flying underneath the airborne LiDAR sensor.
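As one illustration of the sorting and indexing route, a per-cell median elevation can be obtained by ranking the points within each cell and keeping the middle rank or ranks. The sketch below is an assumption, not the authors' query: it uses common table expressions in place of stored intermediate tables, relies on window functions (available in SQLite 3.25 and later), and the table, column names, and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL, z REAL)")
# Made-up elevations: four points in one 30 m cell, one in the next.
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [
        (500015.0, 5400005.0, 250.0),
        (500022.0, 5400012.0, 262.5),
        (500025.0, 5400028.0, 251.0),
        (500018.0, 5400020.0, 255.0),
        (500045.0, 5400010.0, 249.0),
    ],
)

median_rows = conn.execute(
    """
    WITH cells AS (                      -- step 1: assign each point to a cell
        SELECT round(x / 30.0 - 0.5) * 30.0 AS easting,
               round(y / 30.0 - 0.5) * 30.0 AS northing,
               z
        FROM points
    ),
    ranked AS (                          -- step 2: sort and index within cells
        SELECT easting, northing, z,
               row_number() OVER (PARTITION BY easting, northing ORDER BY z) AS rk,
               count(*)     OVER (PARTITION BY easting, northing)            AS n
        FROM cells
    )
    -- step 3: integer division picks the middle rank for odd n and the two
    -- middle ranks for even n; averaging them gives the median.
    SELECT easting, northing, avg(z) AS median_z
    FROM ranked
    WHERE rk IN ((n + 1) / 2, (n + 2) / 2)
    GROUP BY easting, northing
    ORDER BY easting, northing
    """
).fetchall()
print(median_rows)
```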
The simplified procedure discussed in the methodology section for computing the DSM and DTM, as in Figure 1, can be sufficient for generating features for the purposes of digital soil mapping research. However, if necessary, these calculations can be further refined, particularly with respect to deriving what can be considered a more representative DTM. For that, depending on the desired complexity, a combination of queries can be executed and stored within intermediate data tables, to take other criteria into account. Note that the effect of extreme values can be mitigated by initially selecting a smaller cell size for the evaluation of the maximum and minimum of the LiDAR point elevations. These values can then be aggregated by averaging over the intended larger cell size, to obtain a more characteristic measurement value for the pixel. Hence, SQL can be regarded as a powerful approach for deriving sophisticated features from point cloud data.
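The two-stage aggregation described in this paragraph might be sketched as follows. The 10 m fine cell size, the table and column names, and the sample points are assumptions chosen for illustration; the fine cell centre (lower edge plus half the fine cell size) is used when assigning fine cells to coarse cells, which is a choice made here to keep the assignment away from cell boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL, z REAL)")
# Made-up points: two 10 m cells inside one 30 m cell, plus one further east.
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [
        (500011.0, 5400001.0, 250.0),
        (500012.0, 5400002.0, 260.0),
        (500021.0, 5400001.0, 252.0),
        (500022.0, 5400003.0, 270.0),
        (500045.0, 5400005.0, 249.0),
    ],
)

smoothed = conn.execute(
    """
    WITH fine AS (      -- stage 1: extremes within small cells
        SELECT round(x / :fine - 0.5) * :fine AS fx,
               round(y / :fine - 0.5) * :fine AS fy,
               max(z) AS zmax,
               min(z) AS zmin
        FROM points
        GROUP BY fx, fy
    )
    -- stage 2: average the fine-cell extremes over the larger cell
    SELECT round((fx + :fine / 2) / :coarse - 0.5) * :coarse AS easting,
           round((fy + :fine / 2) / :coarse - 0.5) * :coarse AS northing,
           avg(zmax) AS dsm,
           avg(zmin) AS dtm
    FROM fine
    GROUP BY easting, northing
    ORDER BY easting, northing
    """,
    {"fine": 10.0, "coarse": 30.0},
).fetchall()
print(smoothed)
```

In this toy data a single high point (270.0) no longer sets the 30 m DSM on its own; it is averaged with the maximum of the neighbouring fine cell.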
CHM and gap fraction features, as calculated from Figure 1 and Figure 2, were important predictors for digital soil mapping research [14]. In addition, the LiDAR-derived DTM was employed for the generation of a suite of topographic features, some of which were subsequently ranked with high variable importance and thus used as predictors for soil property modeling. As sections of the study region were forested and remote with respect to accessibility, the LiDAR-derived DTM likely attained better accuracy for some localities than did the provincial DTM compiled from RADAR and other products [10].
SQL can facilitate data fusion with other feature variables when these are merged within a database. Specifically, SQL can be adopted for efficiently extracting rasterized imagery data for just the locations of sampling sites. Because data tables within SQL are structured, analysis can be conducted with more control and precision, improving replicability and overall confidence in the data analysis.
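Extraction at sampling sites can be expressed as a join on the pixel keys. The sketch below assumes a hypothetical `raster` table keyed by lower-left easting and northing at 30 m, and a hypothetical `sites` table of sample coordinates; neither table layout comes from the study, and the values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raster (easting REAL, northing REAL, chm REAL, gap_fraction REAL)")
conn.execute("CREATE TABLE sites (site_id INTEGER, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO raster VALUES (?, ?, ?, ?)",
    [(500010.0, 5400000.0, 12.5, 0.5), (500040.0, 5400000.0, 0.0, 1.0)],
)
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [(1, 500017.0, 5400009.0), (2, 500100.0, 5400100.0)],
)

# Each site is snapped to its pixel key with the same rounding used to build
# the raster table; LEFT JOIN keeps sites that fall outside the coverage.
site_values = conn.execute(
    """
    SELECT s.site_id, r.chm, r.gap_fraction
    FROM sites AS s
    LEFT JOIN raster AS r
           ON r.easting  = round(s.x / 30.0 - 0.5) * 30.0
          AND r.northing = round(s.y / 30.0 - 0.5) * 30.0
    ORDER BY s.site_id
    """
).fetchall()
print(site_values)
```

Sites with no matching pixel come back with NULL values (`None` in Python), which makes gaps in coverage explicit rather than silently dropping the site.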
Other covariate conceptualizations, particularly in deriving vegetation features relating to canopy structure from LiDAR retrievals of higher point density and return counts, will be an objective for future research. Other supplemental attribute information recorded with the retrievals, as well as different types of LiDAR sensors, will also be investigated for feasibility, especially for computing features for tree species classification.
Author Contributions
Conceptualization, R.P. and B.H.; methodology, R.P.; software, R.P.; validation, R.P.; formal analysis, R.P.; investigation, R.P.; resources, R.P.; data curation, R.P. and B.H.; writing—original draft preparation, R.P.; writing—review and editing, B.H.; visualization, R.P.; supervision, B.H.; project administration, B.H.; funding acquisition, B.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded in part by the Ontario Ministry of Agriculture, Food and Rural Affairs (OMAFRA), grant number ND2017-3179, and the Natural Sciences and Engineering Research Council of Canada (NSERC), for a Discovery Grant awarded to Dr. Baoxin Hu.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors on request.
Acknowledgments
LiDAR data were obtained from the Ontario Ministry of Natural Resources (MNR) via Land Information Ontario (LIO) and contain information licensed under the Open Government license—Ontario.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
1. Garcia, M.; Saatchi, S.; Ferraz, A.; Silva, C.A.; Ustin, S.; Koltunov, A.; Balzter, H. Impact of Data Model and Point Density on Aboveground Forest Biomass Estimation from Airborne LiDAR. Carbon Balance Manag. 2017, 12, 1–18.
2. Franklin, S.E. Interpretation and Use of Geomorphometry in Remote Sensing: A Guide and Review of Integrated Applications. Int. J. Remote Sens. 2020, 41, 7700–7733.
3. Khosravipour, A.; Skidmore, A.K.; Isenburg, M.; Wang, T.; Hussin, Y.A. Generating Pit-Free Canopy Height Models from Airborne Lidar. Photogramm. Eng. Remote Sens. 2014, 80, 863–872.
4. Southee, F.M.; Treitz, P.M.; Scott, N.A. Application of Lidar Terrain Surfaces for Soil Moisture Modeling. Photogramm. Eng. Remote Sens. 2012, 78, 1241–1251.
5. Sumathi, S.; Esakkirajan, S. Structured Query Language. In Fundamentals of Relational Database Management Systems; Springer: Berlin/Heidelberg, Germany, 2007; Volume 47, pp. 111–212.
6. Airborne Imaging. Final Report For Project Cochrane LiDAR; A Clean Harbors Company: Calgary, AB, Canada, 2018.
7. Isenburg, M. LASzip. Photogramm. Eng. Remote Sens. 2013, 79, 209–217.
8. Hipp, R.D. SQLite. Available online: https://www.sqlite.org/index.html (accessed on 20 June 2024).
9. McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010.
10. Spatial Data Infrastructure. Provincial Digital Elevation Model Technical Specifications v3.0; Queen’s Printer for Ontario: Peterborough, ON, Canada, 2013.
11. Pittman, R.C.; Hu, B. Contribution of Topographic Features and Categorization Uncertainty for a Tree Species Classification in the Boreal Biome of Northern Ontario. GIScience Remote Sens. 2023, 60, 2214994.
12. Conrad, O.; Bechtel, B.; Bock, M.; Dietrich, H.; Fischer, E.; Gerlitz, L.; Wehburg, J.; Wichmann, V.; Böhner, J. System for Automated Geoscientific Analyses (SAGA) v. 2.1.4. Geosci. Model Dev. 2015, 8, 1991–2007.
13. Kulha, N.; Pasanen, L.; Holmström, L.; De Grandpré, L.; Kuuluvainen, T.; Aakala, T. At What Scales and Why Does Forest Structure Vary in Naturally Dynamic Boreal Forests? An Analysis of Forest Landscapes on Two Continents. Ecosystems 2019, 22, 709–724.
14. Pittman, R.; Hu, B.; Webster, K. Improvement of Soil Property Mapping in the Great Clay Belt of Northern Ontario Using Multi-Source Remotely Sensed Data. Geoderma 2021, 381, 114761.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).