1. Introduction
Sinkholes are common, naturally occurring geological hazards that cause sudden and often devastating ground collapse, which can damage nearby infrastructure, cause loss of life, and threaten water and environmental resources [
1,
2,
3]. Sinkholes commonly form in karstifiable rocks, such as carbonates (e.g., dolomite and limestone) and evaporites (e.g., salt, gypsum, and anhydrite) [
1,
4]. In the United States (US) alone, the annual cost due to damages caused by sinkholes exceeds
$300 million [
5]. However, the actual cost of sinkhole damage in the US is expected to be higher because sinkhole damage is not tracked nationally [
5]. In the past fifty years in South Africa (SA), thousands of sinkholes, subsidence, and crack events have transpired in the Gauteng province (the country’s most populous province), whose surface area is approximately
dolomite [
6,
7]. According to Buttrick et al. [
7], in 2011, a community of approximately 30,000 households was relocated to more stable ground (i.e., ground experiencing less instability due to dolomite) at a cost exceeding US
$600 million at that time. Buttrick et al. [
7] further reported that approximately 4–5 million people worked and resided on dolomitic land in 2011 in SA.
The occurrence of sinkholes is worldwide [
7], with 15.2% of the global ice-free continental surface containing karstifiable carbonate rock [
4]. In 2020, Goldscheider et al. [
4] reported that from a global perspective, 1.18 billion people (approximately 16.5% of the global population at that time) lived on karst terrain. Three common types of sinkholes form in karstifiable terrain: dissolution, cover–subsidence (C–S), and cover–collapse (C–C) [
8]. It is also worth mentioning that a sinkhole can be a combination of the three common types of sinkholes [
8].
Infrastructure that is highly susceptible to damage resulting from sinkhole formations is linear infrastructure (such as roads, bridges, and railways) [
9,
10,
11]. In the US, roadways are especially susceptible to damage caused by sinkhole formations due to the presence of carbonate minerals and carbonate rock along the roadways [
11]. According to Rizzo and Bryson [
11], an essential aspect of preventing significant traffic disruption caused by sinkhole damage is to have a system that detects soil raveling or voids early on. Regarding railways, the emergence of sinkholes along or under railway tracks may lead to high maintenance costs, service interruptions, and, in the worst case, train derailment and loss of human life [
10]. Guerrero et al. [
10] list several factors that affect the probability of train derailment due to the formation of sinkholes occurring along the railway tracks. One of the factors Guerrero et al. [
10] listed that affects the probability of train derailment is the vulnerability of the infrastructure (i.e., the railway tracks). The vulnerability of the infrastructure depends mainly on the incorporation and culmination of sinkhole protection measures in the design of the infrastructure, as well as the installation of sinkhole monitoring systems [
10]. The sinkhole monitoring system, in this case, should be capable of detecting subtle ground deformations and anticipating sinkhole collapse [
10].
As a result of the threat that sinkholes pose to infrastructure and lives, Sinkhole Detection and Monitoring Systems (SDMSs) must be implemented to ensure that engineering countermeasures can be planned and carried out to prevent catastrophic infrastructure damage and loss of life from occurring. The preceding statement is supported by the literature, where Möller et al. [
9] state that monitoring systems that can provide a continuous record of potential indicators of sinkhole formation are recommended, and Nam et al. [
12] state that early prediction and detection of sinkholes are pertinent measures for protecting citizens and infrastructure. In addition, Hoai et al. [
13] state that accurate tracking of sinkholes is essential to protect human life and prevent property damage, and Möller et al. [
9] further state that developing a sinkhole early warning detection system would protect infrastructure and save lives. Rizzo and Bryson [
11] also state that early sinkhole detection along roadways will enable the Department of Transportation in the US to track the sinkhole’s growth rate, which will aid in deciding when to disrupt traffic flow for repairs to be conducted and budget for the repairs. Planning and budgeting the repairs enables a proactive maintenance approach instead of a reactive one.
Only the formation of the cover–subsidence (C–S) sinkhole is discussed in this paper as it is the type of sinkhole that is the paper’s focus. Understanding the formation process of the C–S sinkhole was essential for developing the experimental methodology, which the authors previously published in [
1].
C–S sinkholes develop when carbonate bedrock (which contains openings or cavities) is covered in an overburden of permeable sediment and sand [
8]. The granular sediment in the overburden spalls into the carbonate bedrock via the openings in the bedrock [
8]. As the sediment continues to spall into the carbonate bedrock, a column of the sediment forms in cavities in the bedrock; this process is commonly referred to as
piping [
8]. As the process of infilling in the bedrock continues, a noticeable depression in the surface of the overburden forms [
8]. The surface depression will grow as long as the infilling process continues, resulting in a surface depression that can be several meters in depth and diameter [
8]. The resulting surface depression is a C–S sinkhole.
Yumba et al. [
2] published a conference paper in 2023 based on the same experiment as the present authors. Yumba et al. [
2] utilized fibre Bragg grating (FBG) strain data gathered from a C–S sinkhole (sinkhole in dry sand) experiment and from another experiment in which a cover–collapse sinkhole (sinkhole in wet sand) formed to compare the strain distributions of the two types of sinkholes. According to Yumba et al. [
2], the strain distribution along the vertical centerline above the C–S and cover–collapse sinkhole cavities provides insightful information into the deformation pattern of the soil. The strain values recorded by the FBG sensors, placed along the vertical centerline above the sinkhole cavity, indicated the extent and magnitude of the strain induced by the sinkhole event [
2]. In other words, the strain recorded by the FBG sensors can help quantify the impact of the C–S sinkhole and provide insights into the physical changes happening in the surrounding area. Yumba et al. [
2] found that strain variation for forming a C–S sinkhole had three phases: before, during, and after the sinkhole has formed.
Figure 1 shows the induced strain during the three phases of the C–S sinkhole formation process.
Figure 1 shows that during phase one (before sinkhole formation), the strain recorded by the FBG sensors was constant and near zero; during phase two (during the sinkhole formation), the strain changed significantly [
2]. During phase three (after C–S sinkhole formation has ceased), the strain values stabilized at different values.
The work presented in this paper contributes towards developing a sinkhole monitoring system that can geometrically characterize a sinkhole in terms of its volume using FBG sensor strain data. Of the three common types of sinkholes listed earlier, the current research study focuses on C–S sinkholes. The main aim of the current paper was to conduct an in-depth Exploratory Data Analysis (EDA) on all three phases of the collected FBG strain data and present a Machine-Learning (ML) model that can output the volume of the C–S sinkhole when given strain measurements from the FBG strain sensors. The purpose of conducting the more in-depth EDA was to develop an understanding of the connectivity, patterns, and relationships between the FBG strain data and the volume of the C–S sinkhole. From there, the developed understanding was used to guide our choice of a supervised ML algorithm to output the volume of the C–S sinkhole when given the FBG strain data.
The remainder of the paper is structured as follows:
Section 2 discusses the experimental methodology used to collect the FBG strain and C–S sinkhole volume data from a C–S sinkhole,
Section 3 discusses the theoretical background of the probabilistic and statistical analysis methods used to conduct an EDA on the collected FBG strain and C–S sinkhole volume data,
Section 4 discusses the data analysis procedure,
Section 5 discusses the data analysis results,
Section 6 discusses the results from the Weighted Least Squares (WLS) regression, Support Vector Regression (SVR), and eXtreme Gradient Boosting (XGBoost) models, which were fitted to the phase two data to model the volume of the C–S sinkhole, and
Section 7 concludes the paper and discusses future work.
2. Materials and Methods of the C–S Sinkhole Formation
This section provides a comparative discussion with prior research studies, highlighting advancements introduced in the present work. In addition, this section discusses the experimental methodology used to collect the FBG sensor strain and C–S sinkhole volume data from a C–S sinkhole. Yumba et al. [
14] conducted an experiment to investigate the FBG strain profile during a C–C sinkhole formation by placing 18 FBG strain sensors at varying levels above the cavity. Yumba et al. [
14] placed nine FBG strain sensors vertically and nine FBG strain sensors horizontally above the cavity. It is important to note that the study conducted by Yumba et al. [
14] focused on a C–C sinkhole formation rather than a C–S sinkhole formation. In addition, the study conducted by Yumba et al. [
14] focused on investigating the strain profile during the C–C sinkhole formation rather than determining the volume of the C–C sinkhole. Möller et al. [
9] conducted an experiment to investigate the feasibility of utilizing Distributed Fibre Optic Sensing (DFOS) as an early warning system for sinkholes. The type of sinkhole investigated by Möller et al. [
9] was a C–S sinkhole because the setup utilized mimicked progressive subsurface migration of the overburden material into a void, which is characteristic of C–S sinkhole formations. Möller et al. [
9] placed DFOS cables horizontally at various depths. It is important to note that the study conducted by Möller et al. [
9] utilizes DFOS instead of FBG and that the DFOS cables were placed horizontally. In addition, the study conducted by Möller et al. [
9] was not focused on using the data obtained from the DFOS cables to determine the volume of the C–S sinkhole but was instead focused on early warning detection. Labuschagne et al. [
3] conducted a study investigating the use of FBG strain sensors for C–S sinkhole monitoring. While the study conducted by Labuschagne et al. [
3] focused on a C–S sinkhole, the FBG strain sensors used in the study were placed horizontally above the cavity. Labuschagne et al. [
3] used twelve FBG strain sensors across four fibre cables (3 FBG strain sensors per fibre cable). The fibre cables were placed horizontally at varying depths [
3].
From the discussion above, the other studies involving optical fibre sensing and sinkholes were mainly focused on investigating strain profiles and the feasibility of using optical fibre sensing technology to detect sinkhole formations. None of the studies discussed above were focused on using the data collected from the sensors to determine the volume of a sinkhole. In addition, not all of the studies were focused on C–S sinkholes, and the studies that were focused on C–S sinkholes had placed the FBG strain sensors horizontally above the cavity and were only focused on investigating strain profiles. The novelty of the work presented in this paper is the orientation in which the FBG strain sensors were placed above the cavity and the approach of using the data collected from the FBG strain sensors to determine the volume of a C–S sinkhole. The FBG strain sensors were placed vertically above the cavity of the C–S sinkhole. The decision was made to install FBG strain sensors vertically rather than horizontally in order to minimize soil disturbance during installation. In field applications, horizontal installation would require removing large amounts of the surface layer, which increases the risk of triggering a sinkhole formation. Vertical placement reduces excavation effort and preserves ground stability. Furthermore, the work presented in this paper uses fewer FBG strain sensors than the studies discussed above. The study presented by Yumba et al. [
14] used nine FBG strain sensors placed vertically above a C–C sinkhole cavity, and the study presented by Labuschagne et al. [
3] used twelve sensors placed horizontally above a C–S sinkhole cavity. The work presented in this paper uses three FBG strain sensors on one fibre cable installed vertically above a C–S sinkhole cavity.
A C–S sinkhole was formed under controlled conditions in a laboratory by placing Cullinan Silica sand (obtained from Cullinan Mine, Pretoria, Gauteng, South Africa) over a water-filled balloon (which acted as the cavity) in a Perspex box, embedding FBG strain sensors vertically above the cavity, and allowing the water-filled balloon to deflate using a calibrated flow-valve. As the water-filled balloon deflated, FBG strain sensor measurements were recorded, and the volume of the C–S sinkhole was calculated. The deflation of the water-filled balloon allowed for a surface deformation to occur (i.e., a C–S sinkhole). More details on the experimental methodology can be found in [
1]. It should be noted for the rest of the paper that FBG strains S1, S2, and S3 refer to FBG strain sensors 1, 2, and 3 from [
1], respectively.
4. Data Analysis Procedure
It should be noted that in the following discussions, the term “data” refers to the collected FBG strain measurements and the C–S sinkhole volume. The data were analyzed to determine the relationships amongst the variables (e.g., to check whether the strain measurements from the different sensors correlated with the C–S sinkhole volume) and to determine whether the data were normally distributed. Data from the individual phases were first analyzed in isolation (i.e., phase one, then phase two, and, lastly, phase three), after which the results from all three phases were compared. The comparison was made to understand how the forming C–S sinkhole affects the data over time (i.e., how the structure of, and relationships amongst, the collected data change). Insights into how the C–S sinkhole formation affects the structure and patterns in the strain data aided in determining which ML algorithms are appropriate for determining the C–S sinkhole volume.
Figure 2 shows the data analysis procedure conducted on the data in all three phases of the C–S sinkhole formation process.
As seen in
Figure 2, the data analysis process began with outlier detection and handling of the outliers (i.e., deciding whether to remove the detected outliers or not). Outlier detection was achieved by using box plots. After conducting outlier detection, tests for normality were conducted. As can also be seen in
Figure 2, the normality tests included analyzing the collected data’s histograms. The reason for including histograms as part of the data analysis procedure is that, according to Jesussek and Volk-Jesussek [
15], analytical methods used to test for normal distribution (such as the K–S, S–W, and A–D tests) have significant drawbacks, and it is recommended to use analytical methods in conjunction with graphical methods (such as histograms and Quantile–Quantile plots). The collected data were tested for normality, and the relationships among the FBG strain sensors were also assessed. If the data were found to be linearly related and normally distributed, Pearson’s correlation coefficient was calculated to determine the strength of the linear correlation. After this, Spearman’s Rank correlation coefficient was calculated to determine the strength of the monotonic correlation between the FBG strain sensors and the C–S sinkhole volume. DataTab [
35] is an online subscription-based statistics calculator that was used to create the relevant visualizations (histograms and scatter plots) and perform the appropriate statistical tests (K–S, S–W, and A–D tests as well as Pearson’s correlation coefficient and Spearman’s Rank correlation coefficient). DataTab was also utilized to conduct hypothesis testing on the calculated Pearson’s correlation and Spearman’s Rank correlation coefficients. Computational R was used to plot line charts.
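As an illustration of the analytical normality checks named above, the following minimal Python sketch computes the K–S distance between a sample and a normal distribution fitted to that sample. It is not the DataTab implementation used in this study; the function names and the synthetic samples are assumptions made for illustration only, and no p-value is computed. Note that when the mean and standard deviation are estimated from the sample itself, the standard K–S critical values do not strictly apply, so in practice a statistics package (for example, the kstest, shapiro, and anderson functions in scipy.stats) would be used alongside the graphical checks.

```python
import math
import random


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic_normal(sample):
    """K-S distance D between the empirical CDF of `sample` and a normal
    distribution fitted with the sample mean and sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        cdf = normal_cdf(x, mean, std)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Synthetic (hypothetical) samples, not the experimental strain data.
rng = random.Random(0)
bell_shaped = [rng.gauss(0.0, 1.0) for _ in range(500)]
skewed = [rng.expovariate(1.0) for _ in range(500)]

d_bell = ks_statistic_normal(bell_shaped)
d_skewed = ks_statistic_normal(skewed)
print(f"D (bell-shaped) = {d_bell:.3f}, D (skewed) = {d_skewed:.3f}")
```

A larger D indicates a larger departure from the fitted normal curve, which is the same departure that a histogram with a normal overlay shows graphically.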
The following section presents and discusses the results obtained from the data analysis procedure.
5. Data Analysis Results and Discussion
This section presents and discusses the data analysis results on all three phases of the C–S sinkhole formation process.
Figure 3 shows the raw FBG strain data for FBG strain sensors S1, S2, and S3 versus time.
In
Figure 3, the three vertical dashed lines indicate the transition from phase one to two and phase two to three. As shown in
Figure 3, there was significant strain during phase two of the C–S sinkhole formation. In [
1], the authors reported that the C–S sinkhole formed for approximately
. However, in
Figure 3, phase two stopped
after the C–S sinkhole had started forming. Thus, there was a period of
after the control valve had closed, and strain fluctuations were still recorded. The strain fluctuations during the
period after the control valve was closed were due to the Cullinan Silica sand still settling. In order to have a clearer view of the strain data patterns and trends, a moving average with a window size of 150 was applied to the collected data.
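A moving average of this kind can be sketched in Python as follows. The sketch assumes a trailing (not centered) window and uses a short, hypothetical list of strain readings with a window of 4 so the arithmetic is easy to follow; the source does not state which window alignment was used, and the study itself applied a window size of 150.

```python
from collections import deque


def moving_average(values, window):
    """Trailing simple moving average; emits one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque()
    running = 0.0
    out = []
    for v in values:
        buf.append(v)
        running += v
        if len(buf) > window:
            # Drop the oldest reading so the buffer holds `window` values.
            running -= buf.popleft()
        if len(buf) == window:
            out.append(running / window)
    return out


# Hypothetical strain readings, not the experimental data.
raw = [0.0, 2.0, -2.0, 1.0, -1.0, 10.0, 12.0, 11.0]
smoothed = moving_average(raw, window=4)
print(smoothed)
```

The smoothed series has window − 1 fewer points than the raw series, and a larger window suppresses more of the short-term fluctuation at the cost of blurring the phase transitions.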
Figure 4 shows the moving average of the FBG strain data for sensors S1, S2, and S3 with a window size of 150 versus time.
Figure 4 shows that the FBG strain for all three sensors was relatively stable around the
x-axis during phase one. During phase two, the FBG strain changed drastically, which was also found by Yumba et al. [
2] when conducting a temporal analysis on the same dataset. In addition, during phase three, the strain for all three FBG strain sensors stabilized below the
x-axis.
Figure 5 shows the raw C–S sinkhole volume versus time.
In
Figure 5, the volume of the C–S sinkhole was calculated using the methodology outlined in [
1]. The volume of the C–S sinkhole is the inverse of the volume of the balloon (i.e., cavity).
Figure 5 shows that the change in C–S sinkhole volume was linear, which was expected since the methodology used in [
1] for calculating the volume was a linear function. The data collected during phases one, two, and three is represented by
,
, and
, respectively. The dataset consists of
data points. Each data point consists of the C–S sinkhole volume at that particular point during the experiment and the FBG strain measurements from
,
, and
. Therefore, each data point has four features. In the dataset, we have the following:
,
, and
. Notably,
and
. Therefore, when choosing ML algorithms to classify the data points as belonging to
,
, or
, special consideration will have to be given to ML algorithms that are robust against imbalanced datasets. The following section presents and discusses outlier detection in the FBG strain data across all three phases of the C–S sinkhole formation process.
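One common way of making a classifier less sensitive to class imbalance is to weight each class inversely to its frequency. The Python sketch below illustrates that idea only; the phase labels and the counts of 20, 70, and 10 data points are hypothetical and are not the counts of the dataset described above, and this weighting scheme is an assumption rather than the method adopted in this paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


# Hypothetical phase labels; the real class counts are not used here.
labels = ["phase1"] * 20 + ["phase2"] * 70 + ["phase3"] * 10
weights = balanced_class_weights(labels)
for phase, w in sorted(weights.items()):
    print(f"{phase}: weight = {w:.3f}")
```

With these weights, each class contributes equally to a weighted loss, so the smallest class is not dominated by the largest one.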
5.1. Outlier Detection in FBG Strain Data
Depending on the phase and the strain sensor considered (S1, S2, or S3), a box plot shows either no notable outliers or some outliers above the upper whisker and/or below the lower whisker. The decision was made not to remove outliers in any phase because: 1. The dataset is already imbalanced, and removing data points would imbalance it further; 2. Outliers affect statistical and ML results, so retaining them may aid in understanding how robust future ML models are to outliers in FBG strain data; and 3. The outliers may indicate new phenomena that warrant further investigation of the interaction and stress transfer mechanisms between fibre cables and soil during the various phases of the C–S sinkhole formation process. The following section presents and discusses the results of normality testing on the collected data during phase one.
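The box-plot criterion can be sketched in Python as follows, assuming the common convention that the whiskers extend to 1.5 times the interquartile range (IQR) beyond the quartiles. The source does not state which whisker rule the plotting tool applied, so the factor of 1.5, the quartile method, and the strain values below are all assumptions made for illustration.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return points outside the fences [Q1 - k*IQR, Q3 + k*IQR] and the fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical strain readings, not the experimental data.
strain = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0]
outliers, fences = boxplot_outliers(strain)
print("fences:", fences, "outliers:", outliers)
```

Because the outliers were retained in this study, such a function would serve only to flag points for inspection, not to filter the dataset.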
5.2. Normality Test on FBG Strain Data During Phase One
It should be noted that since this section analyzed data during phase one of the C–S sinkhole formation process (i.e., before the C–S sinkhole started forming), it was assumed that the C–S sinkhole volume was zero. As a result, no C–S sinkhole volume data were analyzed and presented during the phase one data analysis process. The assumption that the C–S sinkhole volume during phase one was zero aided in avoiding the inclusion of non-relevant metrics. Furthermore, during the phase one data analysis process, the focus was primarily on precursor conditions leading to the C–S sinkhole formation (i.e., phase two).
Figure 6 shows the histograms of the FBG strain measurements for S1, S2, and S3 during phase one of the C–S sinkhole formation process with a normal distribution overlay and the locations of their respective means.
It can be seen in
Figure 6 that the strain data from FBG S1, S2, and S3 did not follow a normal distribution because none of the histograms were symmetrical about the mean and did not follow the typical bell-shaped curve (shown by the normal distribution overlay).
Table 2 shows the K–S, S–W, and A–D test results for FBG strain S1 during phase one of the C–S sinkhole formation process. It should be noted that with regard to the
p-values in
Table 2,
Table 3 and
Table 4, the null hypothesis was that the data followed a normal distribution, and the alternative hypothesis was that the data did not follow a normal distribution. The significance level that was selected was
. For the remainder of the paper and with regards to normality tests, it should be noted that
, and
represent the
p-values for the K–S, S–W, and A–D tests, respectively.
From
Table 2, the K–S, S–W, and A–D test results indicated that the data distribution for FBG strain S1 during phase one was not normally distributed,
,
,
,
and
.
Table 3 shows the K–S, S–W, and A–D test results for FBG strain S2 during phase one of the C–S sinkhole formation.
From
Table 3, the K–S, S–W, and A–D test results indicated that the data distribution for FBG strain S2 during phase one was not normally distributed,
,
and
.
Table 4 shows the K–S, S–W, and A–D test results for FBG strain S3 during phase one of the C–S sinkhole formation.
From
Table 4, the K–S, S–W, and A–D test results indicated that the data distribution for FBG strain S3 during phase one was not normally distributed,
,
and
.
All tests for normality (histograms, K–S test, S–W test, and A–D test) indicate that the strain data from all three FBG strain sensors deviate significantly from the normal distribution during phase one of the C–S sinkhole formation. The implication thereof is that statistical methods and ML algorithms that assume the normality of data should not be used during phase one of the formation of the C–S sinkhole. The following section discusses the scatter plots from the data collected during phase one.
5.3. Scatter Plots for FBG Strain During Phase One
Figure 7 shows the scatter plots and regression lines between all three FBG strain sensors during phase one of the C–S sinkhole formation process.
Figure 7 shows that all three FBG strain sensors exhibited a high positive linear correlation with respect to each other during phase one of the C–S sinkhole formation process, which was expected since, as can be seen in
Figure 3 and
Figure 4, all three FBG strain sensors were approximately stable around the
x-axis. The high positive linear correlation implies that linear ML regression algorithms may be appropriate for the data. However, high levels of multicollinearity may be present, given the high positive linear correlations between the sensors. To conduct hypothesis testing with Pearson’s correlation coefficient, the variables (i.e., the FBG strain data in this case) must be normally distributed, and there must be a linear relationship between the variables [
15]. Thus, while Pearson’s correlation coefficient could be calculated, no statistical inferences from the correlation could be made due to the FBG strain data not following the normal distribution. As such, Pearson’s correlation coefficient analysis was not conducted on the FBG strain data collected during phase one of the C–S sinkhole formation process.
5.4. Spearman’s Rank Correlation Analysis During Phase One
Table 5 provides Spearman’s Rank correlation analysis results between the three FBG strain sensors and their corresponding
p-values. For the calculated
p-values, the null hypothesis was that there was no monotonic correlation between the FBG strain sensors, and the alternative hypothesis was that there was a monotonic correlation at a significance level of
between the FBG strain sensors during phase one.
Table 5 shows a very high positive monotonic correlation between all three FBG strain sensors during phase one. The correlation between all three FBG strain sensors was statistically significant, with all their
p-values being less than
. The statistically significant and very high positive correlations between the three FBG strain sensors indicate that high multicollinearity levels might be present in the data during phase one. In regression analysis, multicollinearity refers to the situation when two or more predictor variables are highly correlated (i.e., one predictor variable can be represented by a linear combination of the other predictor variables) [
15,
17,
18,
36]. Multicollinearity amongst predictor variables may result in regression coefficients being unstable and not interpretable, and in the case where the predictor variables’ relations are rigorously linear, the coefficients become biased [
15,
18]. It is pertinent to note that multicollinearity does not impact the prediction outcome in regression analysis [
15]. The degree of multicollinearity we may be seeing here makes sense, given that the sensors were monitoring the same phenomenon, were placed vertically above the cavity, and were on the same fibre cable. Since the FBG strain sensors were on the same fibre cable, the FBG strain sensors were not independent. Hence, high levels of multicollinearity were expected.
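Spearman’s Rank correlation coefficient is Pearson’s correlation coefficient computed on the ranks of the data, which is why it measures monotonic rather than strictly linear association. The Python sketch below illustrates this with tie-aware average ranks. The function names and the three short series are hypothetical and stand in for the sensor readings; the sketch does not reproduce the DataTab calculations or the hypothesis tests reported in Table 5.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical series: b rises monotonically but nonlinearly with a; c falls.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 4.0, 9.0, 16.0, 25.0]
c = [5.0, 4.0, 3.0, 2.0, 1.0]
print(f"Pearson(a, b) = {pearson(a, b):.3f}, Spearman(a, b) = {spearman(a, b):.3f}")
print(f"Spearman(a, c) = {spearman(a, c):.3f}")
```

In this example the monotonic but nonlinear pair has a Spearman coefficient of one while its Pearson coefficient is below one, which shows the difference between monotonic and linear correlation.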
5.5. Normality Test on FBG Strain Data During Phase Two
Figure 8 shows the histograms of the FBG strain data for S1, S2, and S3, as well as the C–S sinkhole volume during phase two with a normal distribution overlay and locations of their respective means. Upon visual inspection, it was evident that the histograms in
Figure 8a–c did not conform to the normal distribution overlay. Specifically, the histograms exhibited pronounced peaks around their respective means. Furthermore, the histograms were not symmetrical around their respective means. From
Figure 8d, it can be seen that while the histogram of the C–S sinkhole volume was approximately symmetrical about the mean, the histogram did not conform to the normal distribution overlay and exhibited a more uniform distribution.
Table 6 shows the K–S, S–W, and A–D test results for FBG strain S1 during phase two of the C–S sinkhole formation process. It should be noted that, with regard to the
p-values in
Table 6,
Table 7,
Table 8 and
Table 9, the null hypothesis was that the data followed a normal distribution, and the alternative hypothesis was that the data did not follow a normal distribution.
From
Table 6, the K–S and S–W test results indicate that the data distribution for FBG strain S1 during phase two was not normally distributed,
and
. However, the A–D test result indicates that the data distribution for FBG strain S1 was normally distributed during phase two,
. Given that the K–S test and S–W test, along with visual inspection of the histogram, indicate that the data distribution for FBG strain S1 during phase two was not normally distributed, we reject the null hypothesis and conclude that the FBG strain S1 data were not normally distributed during phase two.
The contradictory results between the K–S test, S–W test, and histogram on the one hand and the A–D test on the other highlight the importance of conducting more than one test for normality on the collected data, as doing so increases confidence in any conclusions made regarding the normality of the data.
Table 7 shows the K–S, S–W, and A–D test results for FBG strain S2 during phase two of the C–S sinkhole formation process.
From
Table 7, the K–S, S–W, and A–D test results indicated that the data distribution for FBG strain S2 during phase two was not normally distributed,
,
,
,
and
.
Table 8 shows the K–S, S–W, and A–D test results for FBG strain S3 during phase two of the C–S sinkhole formation.
From
Table 8, the K–S and S–W test results indicate that the data distribution for FBG strain S3 during phase two was not normally distributed,
and
. However, the A–D test on DataTab returned infinity for the test statistic, indicating that the test statistic had become excessively large, which made it impossible to compute the
p-value. Given that three of the four tests (K–S test, S–W test, and histogram) indicate that the data distribution for FBG strain S3 during phase two was not normally distributed, we reject the null hypothesis and conclude that the FBG strain S3 data were not normally distributed during phase two.
Table 9 shows the K–S, S–W, and A–D test results for the C–S sinkhole volume during phase two.
From
Table 9, the K–S, S–W, and A–D test results indicate that the C–S sinkhole volume data distribution during phase two was not normally distributed,
,
and
. We therefore reject the null hypothesis and conclude that the C–S sinkhole volume data were not normally distributed during phase two.
Tests for normality indicated that the strain data from all three FBG strain sensors and the C–S sinkhole volume deviated significantly from the normal distribution during phase two of the C–S sinkhole formation. The implication thereof is that statistical methods and ML algorithms that assume the normality of data should not be used during phase two of the formation of the C–S sinkhole. The following section presents and discusses the scatter plots obtained from the data collected during phase two.
5.6. Scatter Plots of FBG Strain and C–S Sinkhole Volume During Phase Two
Figure 9 shows the scatter plots and regression lines between all three FBG strain sensors and the C–S sinkhole volume during phase two of the C–S sinkhole formation process.
Figure 9a–c indicate that the FBG strain sensors exhibited no strict linear correlation during phase two. The areas where FBG strain S1 exhibited a negative linear correlation with FBG strain S2 and S3 were expected because, as seen in
Figure 4, the strain profile of FBG strain S1 is inversely proportional (i.e., mirrored) to the strain profiles of FBG strain S2 and S3. As FBG strain S1 experienced an increase in tensile strain, FBG strain S2 and S3 experienced an increase in compressive strain. The increase in compressive strain for FBG strain S2 and S3 was expected because, as the C–S sinkhole continued to grow in size, more and more of the Cullinan Silica sand was displaced closer towards the cavity, where FBG strain S2 and S3 were located. In other words, as the C–S sinkhole grew, FBG strain S2 and S3 were more compacted by the Cullinan Silica sand due to the piping process. FBG strain S1 experienced the opposite. FBG strain S1 was located closer to the surface; as the C–S sinkhole grew, FBG strain S1 was stretched due to the piping process. A resulting implication is that there is an inflection point between FBG strain S1 and S2 whereby the direction of the strain was reversed.
The regression line in
Figure 9d indicates that there was no linear relationship between FBG strain S1 and the C–S sinkhole volume during phase two. In
Figure 9e,f, although linear regression lines were fitted to the scatter plots, the data distribution did not exhibit a clear linear correlation between FBG strain S2 and S3 and the C–S sinkhole volume. The variability and dispersion of the data points indicate that a linear model does not adequately describe the relationship between these variables. Thus, linear ML algorithms are not appropriate for determining the volume of the C–S sinkhole during phase two of the C–S sinkhole formation. As in phase one, the FBG strain data and C–S sinkhole volume data were not normally distributed, so Pearson’s correlation coefficient analysis was not conducted on the phase two data because no statistical inferences could be drawn from it.
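One way to quantify how poorly a straight line describes a scatter plot is to fit the line by ordinary least squares and inspect the coefficient of determination. The following is a minimal sketch in plain Python, not the authors' procedure; the function name `linear_fit_r2` and the arrays `strain`, `volume_linear`, and `volume_curved` are hypothetical and do not contain the measured data.

```python
def linear_fit_r2(x, y):
    """Fit y = intercept + slope * x by least squares; return (intercept, slope, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give R^2 = 1 - SS_res / SS_tot.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not the measured strain or volume).
strain = [i / 10 for i in range(-20, 21)]
volume_linear = [3.0 - 2.0 * s for s in strain]  # exactly linear relationship
volume_curved = [s ** 2 for s in strain]         # strong but non-linear relationship

r2_linear = linear_fit_r2(strain, volume_linear)[2]
r2_curved = linear_fit_r2(strain, volume_curved)[2]
print(f"R^2 linear: {r2_linear:.3f}, R^2 curved: {r2_curved:.3f}")
```

In this sketch the curved relationship yields an R^2 close to zero even though the two variables are strongly related, which is the situation in which a fitted regression line is a poor summary of the data.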
5.7. Spearman’s Rank Correlation Analysis During Phase Two
Table 10 provides Spearman’s Rank correlation analysis results between the three FBG strain sensors and the C–S sinkhole volume and their corresponding p-values during phase two. For the calculated p-values, the null hypothesis was that there was no monotonic correlation between the FBG strain sensors, and the alternative hypothesis was that there was a monotonic correlation at a significance level of 0.05 between the FBG strain sensors during phase two.
From Table 10, Spearman’s Rank correlation showed a statistically significant and very high, positive monotonic correlation between FBG strain S3 and S2 during phase two. Spearman’s Rank correlation also showed that FBG strain S1 exhibited a statistically significant and moderate positive monotonic correlation with FBG strain S2 and S3. The results of Spearman’s Rank correlation analysis between all of the FBG strain sensors during phase two were found to be statistically significant, with all of their respective p-values being less than the significance level of 0.05. Thus, we reject the null hypothesis and conclude that the alternative hypothesis (i.e., a monotonic correlation between the FBG strain sensors during phase two) is highly unlikely to be due to chance.
From Table 10, Spearman’s Rank correlation showed a statistically significant and moderate negative correlation between FBG strain S1 and the C–S sinkhole volume during phase two. Spearman’s Rank correlation analysis also found that the C–S sinkhole volume exhibited a statistically significant and very high negative correlation with FBG strain S2 and S3. The results of Spearman’s Rank correlation analysis between all of the FBG strain sensors and the C–S sinkhole volume during phase two were found to be statistically significant, with all of their respective p-values being less than the significance level of 0.05. Thus, we reject the null hypothesis and conclude that the alternative hypothesis (i.e., a monotonic correlation between the FBG strain sensors and the C–S sinkhole volume) is highly unlikely to be due to chance. Hence, the strain data cannot be ruled out as predictors of the C–S sinkhole volume during phase two. Thus, nonparametric ML algorithms should be used when determining the C–S sinkhole volume, given the raw FBG strain data.
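For readers who want to reproduce this kind of analysis, Spearman’s rank correlation can be sketched as Pearson’s correlation applied to the ranks of the data, with tied values sharing their average rank. The sketch below uses only the Python standard library. The permutation-based p-value is a substitute chosen for illustration and is an assumption, not the method used by DataTab; the names `average_ranks`, `spearman`, and `permutation_p_value` and the arrays `strain` and `volume` are hypothetical and do not contain the measured data.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |rho| reaches the observed |rho|."""
    rng = random.Random(seed)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    observed = abs(pearson(rank_x, rank_y))
    shuffled = rank_y[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(rank_x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical values for illustration only: volume falls monotonically,
# but not linearly, as strain rises.
strain = [0.5 * i for i in range(1, 21)]
volume = [-(s ** 3) for s in strain]

rho = spearman(strain, volume)
p_value = permutation_p_value(strain, volume)
print(f"rho = {rho:.3f}, permutation p-value = {p_value:.4f}")
```

Because the coefficient depends only on ranks, a perfectly monotonic but non-linear relationship still yields a rho of magnitude one. In practice, a library routine such as `scipy.stats.spearmanr` would normally be used instead of this hand-written version.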
5.8. Normality Test on FBG Strain Data During Phase Three
In this section and the subsequent sections, FBG strain data collected during phase three of the C–S sinkhole formation process were analyzed by following the data analysis procedure outlined in
Figure 2. During phase three, the formation of the C–S sinkhole ceased. As a result, the C–S sinkhole’s volume was constant and not analyzed further.
Figure 10 shows the histograms of the FBG strain from S1, S2, and S3 during phase three of the C–S sinkhole formation process, with a normal distribution overlay and the locations of their respective means.
It can be seen in
Figure 10 that the strain data from FBG S1, S2, and S3 did not follow a normal distribution due to asymmetry and skewness and because the histograms did not follow the typical bell-shaped curve (shown by the normal distribution overlay).
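The asymmetry visible in a histogram can also be summarized numerically. The following is a minimal sketch, assuming the population (biased) moment definitions of skewness and excess kurtosis; it is not part of the authors' procedure, and the function names and the two example lists are hypothetical rather than measured strain values.

```python
def central_moment(data, order):
    """Population central moment of the given order."""
    mean = sum(data) / len(data)
    return sum((v - mean) ** order for v in data) / len(data)


def skewness(data):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3, which is 0 for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


# Hypothetical values for illustration only (not measured strain data).
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0]

print("skewness (symmetric):   ", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```

Values far from zero are an informal indication of departure from normality; they complement, rather than replace, the formal tests reported in the tables.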
Table 11 shows the K–S, S–W, and A–D test results for FBG strain S1 during phase three of the C–S sinkhole formation process. It should be noted that with regard to the p-values in Table 11, Table 12 and
Table 13, the null hypothesis was that the data followed a normal distribution, and the alternative hypothesis was that the data did not follow a normal distribution. The significance level that was selected was 0.05.
From Table 11, the K–S and S–W test results indicate that the data distribution for FBG strain S1 during phase three was not normally distributed (see the p-values reported in Table 11). However, the A–D test on DataTab returned infinity for the test statistic, indicating that the statistic had become excessively large and that the p-value could not be computed. Given that three of the four checks (K–S test, S–W test, and histogram) indicate that the data distribution for FBG strain S1 during phase three was not normally distributed, we reject the null hypothesis and state that it is highly unlikely that the alternative hypothesis is due to chance during phase three for FBG strain S1.
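One plausible mechanism for an infinite A–D statistic is numerical: the statistic sums logarithms of the fitted normal CDF and of its complement, and if an observation lies so far in the tail that the CDF rounds to exactly 0 or 1 in floating point, one logarithm becomes negative infinity. Whether this is what occurred in DataTab is an assumption that the source does not confirm. The sketch below illustrates the effect with the standard A–D formula for a normal distribution whose mean and standard deviation are estimated from the sample; the function names and both data sets are hypothetical and are not the measured strain values.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def safe_log(p):
    """Natural log that maps 0 to -inf (math.log(0.0) would raise ValueError)."""
    return math.log(p) if p > 0.0 else -math.inf


def anderson_darling_normal(data):
    """A-D statistic A^2 against a normal with mean and std estimated from data."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    cdf = [normal_cdf((v - mean) / std) for v in sorted(data)]
    total = 0.0
    for i in range(n):
        # The weight (2i - 1) for 1-based i becomes (2i + 1) for 0-based i.
        total += (2 * i + 1) * (safe_log(cdf[i]) + safe_log(1.0 - cdf[n - 1 - i]))
    return -n - total / n


# Hypothetical values for illustration only (not measured strain data).
well_behaved = [float(v) for v in range(1, 51)]
with_spike = [0.01 * i for i in range(200)] + [1000.0]  # one extreme outlier

print("A^2 (well behaved):", anderson_darling_normal(well_behaved))
print("A^2 (with spike):  ", anderson_darling_normal(with_spike))
```

In the second data set the outlier lies more than ten estimated standard deviations above the mean, so the fitted CDF evaluates to exactly 1.0, its complement to 0.0, and the statistic to infinity, for which no p-value can be computed.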
Table 12 shows the K–S, S–W, and A–D test results for FBG strain S2 during phase three.
From Table 12, the K–S and S–W test results indicate that the data distribution for FBG strain S2 during phase three was not normally distributed (see the p-values reported in Table 12). However, similarly to FBG strain S1, the A–D test on DataTab returned infinity for the test statistic. Given that three of the four checks (K–S test, S–W test, and histogram) indicate that the data distribution for FBG strain S2 during phase three was not normally distributed, we reject the null hypothesis and state that it is highly unlikely that the alternative hypothesis is due to chance during phase three for FBG strain S2.
Table 13 shows the K–S, S–W, and A–D test results for FBG strain S3 during phase three.
From Table 13, the K–S, S–W, and A–D test results indicate that the data distribution for FBG strain S3 during phase three was not normally distributed (see the p-values reported in Table 13). Thus, we reject the null hypothesis and state that it is highly unlikely that the alternative hypothesis is due to chance during phase three for FBG strain S3.
Tests for normality indicated that the strain data from all three FBG strain sensors deviated significantly from the normal distribution during phase three of the C–S sinkhole formation. The implication thereof is that statistical methods and ML algorithms that assume the normality of data should not be used during phase three of the formation of the C–S sinkhole. The following section discusses the scatter plots from the data collected during phase three.
5.9. Scatter Plots of FBG Strain Data During Phase Three
Figure 11 shows the scatter plots of the FBG strain sensors during phase three of the C–S sinkhole formation process.
Figure 11a indicated that there was a strong positive linear correlation between FBG strain S1 and S2 during phase three. Figure 11b,c indicated that FBG strain S3 exhibited a weak to moderate correlation with FBG strain S2 and S1, respectively. The correlations shown can be explained by the fact that the C–S sinkhole had stopped forming, and, as shown in Figure 3 and Figure 4, all three FBG strain sensors stabilized during phase three.
Similarly to phases one and two, the FBG strain data were found to be not normally distributed, resulting in Pearson’s correlation coefficient analysis not being performed on the phase three data since no statistical inferences could be made from Pearson’s correlation coefficient.
5.10. Spearman’s Rank Correlation During Phase Three
Table 14 provides Spearman’s Rank correlation analysis results between the three FBG strain sensors and their corresponding p-values during phase three. For the calculated p-values, the null hypothesis was that there was no monotonic correlation between the FBG strain sensors, and the alternative hypothesis was that there was a monotonic correlation at a significance level of 0.05 between the FBG strain sensors during phase three. From
Table 14, FBG strain S1 exhibited a very high and monotonically positive correlation with FBG strain S3. Also, in
Table 14, FBG strain S2 exhibited moderate and monotonically positive correlations with FBG strain S1 and S3.
All of the results from Spearman’s Rank correlation analysis were statistically significant since all of the respective p-values between the FBG strain sensors were less than the significance level of 0.05. Therefore, we reject the null hypothesis and conclude that it is highly unlikely that the alternative hypothesis is due to chance during phase three.
5.11. Phase One, Two, and Three Data Analysis Comparison
Across all three phases, it was found that none of the collected data followed the normal distribution. In phase one, the scatter plots indicated a positive linear correlation between all three FBG strain sensors. However, during phase two, the correlations did not remain the same, with none of the FBG strain sensors exhibiting a strict linear correlation with respect to each other. No strict linear relationship was found between the C–S sinkhole volume and the FBG strain sensors. During phase three, the three FBG strain sensors exhibited weak to moderate linear correlations with respect to each other. However, since the criteria for normality were not met in any of the phases, Pearson’s correlation analysis could not be conducted in any of the three phases. As such, statistical and ML techniques that assume normality and linearity should not be used on the collected data in any of the three phases.
During phase one, all three FBG strain sensors exhibited statistically significant high to very high monotonically positive correlations with respect to each other. However, during phase two, the monotonic correlation between FBG strain S2 and S3 increased, while the monotonic correlations of FBG strain S1 with FBG strain S2 and S3 decreased. The reason is that, during phase one, all three FBG strain sensors were stable around the x-axis. During phase two, FBG strain S1 experienced predominantly tensile strain. In contrast, FBG strain S2 and S3 experienced lower tensile strain levels and transitioned from tensile strain to compressive strain earlier than FBG strain S1. During phase three, all three FBG strain sensors exhibited statistically significant moderate to very high monotonically positive correlations with respect to each other. The following section presents and discusses the results of ML implementation during phase two.
7. Conclusions and Future Work
The data collected across all three phases were not normally distributed, and the FBG strain sensors did not maintain strict linear correlations with respect to each other. Therefore, statistical techniques and parametric ML algorithms, which require the data to be normally distributed and the predictor variables to be linearly correlated, are inappropriate for the collected data. The FBG strain sensors exhibited statistically significant and moderate to very high monotonic correlations with respect to each other across all three phases. High levels of multicollinearity may be present due to statistically significant and very high monotonic correlations between the three FBG strain sensors across the different phases. All of the FBG strain sensors exhibited statistically significant and moderate to very high monotonic correlations with respect to the C–S sinkhole volume during phase two. Therefore, the strain data cannot be ruled out as predictors of the C–S sinkhole volume during phase two.
WLS, SVR, and XGBoost were fitted to phase two of the collected data. WLS obtained the lowest R² values and the highest RMSE values. SVR showed significant improvement over WLS. XGBoost effectively captured the relationship between the FBG strain sensors and the C–S sinkhole volume. Thus, XGBoost is a strong candidate for determining the volume of the C–S sinkhole. If more sophisticated methods are used to measure the C–S sinkhole volume with greater precision, then more accurate implementations of XGBoost can be obtained.
The strain measurements obtained from the FBG strain sensors are promising for using AI to determine when a C–S sinkhole has started growing, is growing, and has stopped growing since the strain patterns vary significantly from phase to phase. Future work will include using more nonparametric AI algorithms and time series analysis to determine which phase the C–S sinkhole is in, investigating more sophisticated techniques for determining the volume of the C–S sinkhole, investigating multicollinearity amongst the predictor variables (i.e., the FBG strain sensors) and investigating data transformations to linearize the relationships between the three FBG strain sensors across all three phases of the C–S sinkhole formation process. In addition, future work will also include investigating how different soil compositions, sensor placements, and sinkhole conditions impact sinkhole volume estimation using FBG strain data.