Big Data Analysis Framework for Water Quality Indicators with Assimilation of IoT and ML

According to the United Nations, the Sustainable Development Goal ‘6’ seeks to ensure the availability and sustainable management of water for all. Digital technologies, such as big data, Internet of Things (IoT), and machine learning (ML) have a significant role and capability to meet the goal. Water quality analysis in any region is critical to identify and understand the standard of water quality and the quality of water is analyzed based on water quality parameters (WQP). Currently, water pollution and the scarcity of water are two major concerns in the region of Uttarakhand, and the analysis of water before it is supplied for human consumption has gained attention. In this study, a big data analytics framework is proposed to analyze the water quality parameters of 13 districts of Uttarakhand and find the correlation among the parameters with the assimilation of IoT and ML. During the analysis, statistical and fractal methods are implemented to understand the anomalies between the water quality parameters in 13 districts of Uttarakhand. The variation in WQP is analyzed using a random forest (RF) model, and the dataset is segmented location wise and the mean, mode, standard deviation, median, kurtosis, and skewness of time series datasets are examined. The mean of the parameters is adjusted with the coefficient of variation based on the standard values of each parameter. The turbidity in almost all the experimental sites has a normal distribution, with the lowest mean value (0.352 mg/L) and highest (11.9 mg/L) in the Pauri Garhwal and Almora districts, respectively. The pH of the water samples is observed to be in the standard range in all the experimental sites, with average and median values being nearly identical, at 7.189 and 7.20, respectively. However, the pH mode is 0.25. The Cl⁻ concentration varies with mean values from the lowest (0.46 mg/L) to the highest (35.2 mg/L) over the experimental sites, i.e., the Bageshwar and Rudraprayag districts, respectively. Based on the analysis, it was concluded that the water samples were found to be safe to drink and in healthy condition in almost all the districts of the state Uttarakhand, except for the Haridwar district, where some increase in contaminants was observed.


Introduction
According to the Sustainable Development Goal '6', there is a need to ensure the availability and sustainable management of water and sanitation for all [1]. Concerning

• A big data analytics-based framework is proposed in this study for water quality parameter analysis.
• The variation in WQP is analyzed using a random forest (RF) model, and the mean of the parameters is adjusted with the coefficient of variation based on the standard values of each parameter.
• The predictability analysis of 13 experimental sites is carried out with the fractal approach.
• The results were obtained from the statistical analysis over the different experimental sites using two approaches, i.e., skewness and kurtosis curves, and correlation matrix tables for each parameter over the studied experimental sites.
Electronics 2022, 11, 1927
This research paper is organized as follows. The WQI applications in the context of the extended fractal analysis for water quality parameters (WQPs) are reviewed in Section 2. Section 3 covers the materials and methods. Section 4 covers the results and discussion. Section 5 concludes the paper.

Background
WQIs characterize the state of a water body in terms of pollution load segmentation and identification of water quality. Dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), pH, NH3-NL, and SS are the parameters for these indices, and there is a strong need for spatial and temporal monitoring [16]. A comprehensive analysis with simulation of the water quality datasets is required, as it is challenging to assess spatial variation among the WQIs. Remote sensing techniques and ground-level surveys of the parameters are useful to policymakers in spatial and temporal studies of quality assessment [17]. Each water quality parameter may be used to analyze the quality of drinking water and the related quality management initiatives for stakeholders. The influence of water parameters and quality index assessment was investigated for 28 years of data in the upper uMngeni watershed for the 11 samples (pH, electrical conductivity, temperature, turbidity, total suspended particles, NH4-N, NO3-N, PO4-P, and total phosphorus) [18]. There is an urgent need for a comprehensive analysis of WQIs, especially in the Himalayan region, including statistical, fractal, and geospatial techniques [19]. The identification of interlinkages between the environment and water pollutants can help to enhance the forecasting of chemical exposure and helps in the evaluation of pollutant export over complex ecosystems [20]. The links between some particular water pollutants and selected hydrological variables were elucidated using a modified least-square analysis, due to the substantial co-dependence of the hydrological processes [21]. The WQIs were used to categorize groundwater from the permanent wells in the Kanchipuram district of Tamil Nadu, where 13 groundwater quality parameters were used to conclude water quality [22,23].
There are two traditional approaches to analyzing the WQI data: statistical and fractal. Both approaches may reveal the spatial variation in the parameters, as the Himalayan region has undulating geography, as well as a huge network of surface water sources. However, information from ground observations of a single indicator alone fails to resolve the complicated geographical terrain, which influences the hydrological cycle either directly or indirectly [24]. Fractal analysis was performed on the quality of rural domestic wastewater under the condition of dissolved oxygen stability [25]. Fractal analysis, i.e., the Hurst exponent (H), fractal dimension (FD), and predictability index (PI) of the water parameters, leads to better knowledge, as well as the ability to explain the better-approximated values [5]; statistically, it can reduce the uncertainty in the analysis of large datasets. The water quality has been evaluated at multiple places for each water parameter using statistical methods. To determine the trend and predictability of water quality, regression, correlation coefficient, autoregressive integrated moving average (ARIMA), Box-Jenkins, residual autocorrelation function (ACF), residual partial autocorrelation function (PACF), lag, fractal, Hurst exponent, and predictability index were calculated [26].
Regularly inspecting the quality of the water is good practice for ensuring that the water is safe to drink. Therefore, it is suggested [27] that a real-time monitoring system is required to assess and forecast future water quality indicators using ML and the IoT. The time series pattern in the data was extracted using a long short-term memory neural network (LSTM NN); the WQPs were obtained using sensors such as a pH sensor, turbidity sensor, and total dissolved solids (TDS) sensor, and the data were utilized to forecast future parameter values [28]. The sensors, Arduino, and NodeMCU in the IoT module can be embedded in the water supply to continuously monitor the parameters, which helps users to receive early alerts in case of increased contaminant WQPs [29]. The water quality prediction model necessitates the usage of high-quality data. Big data is produced at a high rate in the process of building and operating smart water quality monitoring systems based on the IoT, making water quality data difficult to handle. A network model was examined [30] and found to be capable of merging distributed observation data with geologically separated local models using smart sensor-based federated learning. Furthermore, an optimal scheduler is proposed to improve the overall network system's efficiency by making use of real-time massive data arrivals. A deep learning approach based on the long short-term memory (LSTM) algorithm can be used for forecasting the water quality indicators in IoT systems [31].
In the Karoon River of Iran, ANN approaches, such as multilayer perceptron (MLP), radial basis network, and adaptive neuro-fuzzy inference system (ANFIS) models, are used [32] to compute DO, BOD, and COD levels. The models included nine input water quality variables, namely EC, pH, Ca, Mg, Na, turbidity, PO4, NO3, and NO2, all of which were measured in river water. In addition, when the RMSE and MAE indices and the coefficients of correlation (r²) for predicting DO, BOD, and COD were compared, the MLP/BP model outperformed the ANN-RBF and ANFIS models [33]. It is suggested that the ANN can be utilized as an effective tool to compute and forecast river water quality metrics.
These approaches provide complete and efficient results, which are useful in the comprehension of surface water recharge modeling, especially in delicate ecology [33]. The understanding of water-sensitive parameters and major contaminated source identification is advantageous for drinking water model construction and water quality analysis modeling.

Materials and Methods
In this section, we present the water quality parameters that are used for the analysis, along with the statistical and fractal approaches for the ground water quality analysis over the 13 districts of Uttarakhand.

Big Data-Based Framework for Water Quality Analysis
In this study, we propose a big data-based framework for carrying out the water quality analysis of Uttarakhand. Figure 1 illustrates the framework, which comprises three different components named as ground water quality monitoring, big data analytics platform, and water data. The first component describes the problems that are identified as relevant to the ground water quality, which consist of water quality degradation, land subsidence, interconnected ground water depletion, and ground water storage reduction. At present, this study focuses on the problems of water quality degradation and land subsidence. In water quality degradation, the quality of water is analyzed based on the contamination level identified in the water. This study analyzes the WQPs, namely turbidity, Cl⁻, Fe²⁺, As++, NO₃⁻, pH, Ca, Mg, fluoride (F⁻), total dissolved salts (TDS), alkalinity, hardness, and sulphate (SO₄²⁻), of 13 districts to conclude the water quality.
Land subsidence is a broad term that indicates the downward vertical movement of the Earth's surface, produced by both natural and human forces. Regarding the data, the field data, remote sensing data, and simulated data are considered for the water quality analysis. In the big data analytics platform, the WQP data of the 13 districts are collected, and the usable data are derived through data processing. Downscaling refers to the strategies used to regionalize information from global climate models and generate fine-scale climate change projections. Fine resolution information is required to better localize WQP. Through the process of downscaling, big data analytics solutions can address the mismatch between regional-scale data and local-scale information. Statistical modeling, data mining, and pattern discovery and categorization are the analytics methods used to analyze the collected data for decision making.
Figure 2 illustrates the architecture that presents the implementation of IoT and ML for the identification of the WQIs of every site based on the WQP data. The sensor nodes obtain the field data related to the water parameters of every district of Uttarakhand. The field data are communicated to the cloud server through LoRa and a gateway.
The field data available in the cloud server are pre-processed to transform the raw data into an understandable format. A range of data-driven models exists, including statistical multivariate methods, such as weights of evidence and logistic regression, and a set of AI or ML methods, such as genetic algorithms, adaptive neuro-fuzzy inference systems, support vector machines (SVM), artificial neural networks (ANN), and, more recently, random forest (RF) [34]. In hydrogeology investigations, multivariate statistical approaches and ANN are the most widely utilized [35]. Unfortunately, these techniques have a multitude of limitations, including the sensitivity of logistic regression to outlier values and the opacity of neural networks. ML has advanced significantly in recent years, and new methods have been developed to tackle some of the concerns stated for the frequently used methods [36].

Assimilation of IoT and ML (References)
An emerging form of ML that employs ensembles of regressions is gaining attention; this ensemble learning technique applies the same basic process repeatedly to generate numerous predictions, which are then averaged to form a single model [37]. RF is one such ensemble learning method, and it is increasingly used for land-cover categorization from sensed data, as well as in other domains connected to the environment and water resources [38]. The RF algorithm uses an ensemble of regression (or classification) tree models, in which a succession of separate trees is constructed from random subsamples of the original data [39]. Each subsample has a decision tree, which is used to forecast the response variable (or a class). The integration of many trees increases the likelihood of developing an effective prediction model. The accuracy of Algorithm 1 is mostly determined by the strength of the individual tree classifiers and their interdependence [40].
Here, an RF model is applied to the pre-processed data to identify the variability in the WQIs of every site. The WQPs of the 13 districts of Uttarakhand are retrieved in Excel format. The dataset is classified according to the latitude and longitude of the respective district. Mean, mode, standard deviation, median, kurtosis, and skewness are used to examine the time series datasets. On the basis of the standard values of each parameter in every district (site), the variation in WQP is evaluated using the random forest (RF) model [41], and the means of the parameters are tuned with the coefficient of variation (CV). The generated time series dataset is then employed in the fractional Brownian motion (FBM).
Algorithm 1 Proposed random forest algorithm
1: Given a training set X = x_1, x_2, ..., x_n
2: For each element x ∈ X, there is a response y ∈ Y, where Y = y_1, y_2, ..., y_n
3: Bagging is repeated B times, where B is an independent parameter
4: Each time, a random sample is drawn with replacement from the training set, and a decision tree is fitted to this sample
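The bagging steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the model used in the study: each tree is reduced to a one-split regression "stump" on a single input, and the names `fit_stump`, `fit_bagged_trees`, and `predict_forest` are hypothetical helpers introduced only for this example.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs by minimising squared error."""
    values = sorted(set(xs))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct inputs.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:
        # Only one distinct input value in the sample: predict the overall mean.
        m = statistics.fmean(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_trees(xs, ys, n_trees=25, seed=0):
    """Steps 3-4 of Algorithm 1: draw B bootstrap samples and fit one tree to each."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):  # n_trees plays the role of B
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the individual tree predictions (regression case)."""
    return statistics.fmean(predict_stump(s, x) for s in forest)
```

Averaging over the B trees is what distinguishes the ensemble from a single tree: a tree fitted to an unlucky bootstrap sample has limited influence on the final prediction.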

Water Quality Parameters Used
The current study is carried out under various stress conditions, which include different elevations. The stress conditions refer to the geographical and geological variation in the 13 districts of the state Uttarakhand. The datasets of the surface water quality parameters used in the study are obtained from the website of the Ministry of Water and Sanitation, the Govt. of India, for the year 2016. Daily monitoring of the WQPs at each site shows anomalies, which indicate an irregular pattern of water quality, and it is essential to analyze the linkages among them.
The fractal indices FD, H, and PI of the mentioned thirteen water quality indices (WQIs) are generated. The statistical measures, such as mean, mode, standard deviation, median, kurtosis, skewness, and coefficient of variation (CV) of each parameter at each experimental site, are computed and compared with the geospatial distribution. The box plots of all 13 water quality parameters (1-turbidity, 2-chloride (Cl⁻), 3-iron (Fe²⁺), 4-arsenic (As++), 5-nitrate (NO₃⁻), 6-pH, 7-calcium (Ca), 8-magnesium (Mg), 9-fluoride (F⁻), 10-total dissolved salts (TDS), 11-alkalinity, 12-hardness, 13-sulphate (SO₄²⁻)) obtained from the different experimental sites (13 districts) are shown in Figure 3. In India, water quality assessment and regulation are overseen by the Ministry of Drinking Water and Sanitation, Govt. of India, which has given standard limits for the WQPs. Table 1 illustrates the WQPs as per the Indian standard acceptance range (source: Methodology Manual for Groundwater quality mapping, Rajiv Gandhi national drinking water mission [42]).

Table 1. Observed water quality parameters with their symbols and site-wise distribution of the observed water quality parameters [42].

Statistics Analysis
The statistics of the WQP time series, including indices and multivariate measures, are analyzed using the mode, mean, and median, together with the standard deviation of the series datasets and spatially averaged frequency values. The variability of the sample dataset is assessed using the standard deviation (Std), and kurtosis is used to estimate the peakedness of the data. Skewness is used to estimate the symmetry between the data points at a particular location. The range of variation within a sample time series is determined by the coefficient of variation. Regression analysis is carried out to understand the interrelationship in the water quality parameters (WQPs) through regression between the dependent variable (Y) and independent variable (X), which is represented by the following regression equation:
where C is the integration constant.
The predicted values for the variables are ψ(X), ψ(Y), and ψ(XY), and σ denotes the standard deviations of the variables.
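The descriptive statistics named in this section can be sketched for a single parameter at a single site as follows. This is an illustrative sketch only: the population (divide-by-n) convention and excess kurtosis are assumptions, as the text does not state which conventions were used, and the helper names `describe` and `pearson_r` are hypothetical. The correlation function uses the standard Pearson form written with expected values and standard deviations, which is assumed here to correspond to the ψ and σ symbols above.

```python
import math
import statistics


def describe(series):
    """Descriptive statistics for one WQP time series at one site."""
    n = len(series)
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)  # population standard deviation (assumed)
    m3 = sum((x - mean) ** 3 for x in series) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in series) / n  # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(series),
        "mode": statistics.mode(series),
        "std": std,
        "skewness": m3 / std ** 3 if std > 0 else 0.0,
        "kurtosis": m4 / std ** 4 - 3.0 if std > 0 else 0.0,  # excess kurtosis
        "cv": std / mean if mean != 0 else math.nan,  # coefficient of variation
    }


def pearson_r(x, y):
    """Correlation as (E[XY] - E[X]E[Y]) / (sigma_X * sigma_Y)."""
    n = len(x)
    ex, ey = statistics.fmean(x), statistics.fmean(y)
    exy = sum(a * b for a, b in zip(x, y)) / n
    return (exy - ex * ey) / (statistics.pstdev(x) * statistics.pstdev(y))
```

Applying `pearson_r` to every pair of parameters at a site yields a correlation matrix of the kind referred to in the text.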

Mathematical Analysis
Fractal prediction is applied to the WQPs to analyze the irregular pattern in the data time series and to determine whether its structure is chaotic, random, or deterministic. Identification, categorization, and mapping of intensive water characteristics necessitate a continual assessment of the natural resource data.

Fractal Dimension (FD)
Scientists and hydrologists are concerned with distinguishing between clean and polluted water and developing relationships between their characteristics. Furthermore, local officials, especially those in developing countries, are finding it difficult to make drinkable water available with the increased levels of contaminants. The oscillations of environmental and soil parameters, along with the dynamic interactions between subsurface hydrology, influence the statistical association of water systems.

Hurst Exponent (H)
The Hurst exponent (H) is calculated using typical wavelet approaches and evaluated by the regression equation. The coefficient of each water parameter indicates whether or not it has Brownian time series (or true random walk) behavior with other variables. Table 2 summarizes the statistical and fractal analyses and each WQI analysis.
The fractal and fractal dimension concept [43] was formulated by Benoit Mandelbrot in the year 1975; the term derives from "fraction" or "fractured", and fractals are governed by their self-similarity characteristic. This means that they exhibit similar features across a wide range of scales and that a single part of the system has comparable qualities to the entire fractal. A variety of approaches exist for estimating the fractal dimension of a fractal set, differing in their efficiency and computational complexity. In several domains, box-counting is a popular technique for analyzing image properties such as texture segmentation, classification, and graphic analysis. Other prominent advantages of FD approaches include being able to distinguish deterministic from random behavior in time series datasets in the form of variance and spectra distribution. The fractal is a mathematical approach used in fractal geometry to investigate naturally complex phenomena, such as the structures of clouds and geographic boundaries, as well as to differentiate between glacial and fluvial morphology. The degree of multifractality of glacier and river landscapes has been examined, and it was found that glaciers have a more complicated structure than rivers. The Hausdorff dimension is the most fundamental definition of FD; however, other common interpretations that are simple to calculate include box-counting and box dimension.
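The box-counting idea mentioned above can be illustrated with a short sketch. It is a generic textbook estimate for a set of 2-D points and is not taken from the study: the dimension is approximated by the slope of log N(ε) against log(1/ε), where N(ε) is the number of boxes of side ε that contain at least one point. The function names are hypothetical.

```python
import math


def box_count(points, eps):
    """Number of eps-sized grid boxes occupied by at least one point."""
    return len({(math.floor(x / eps), math.floor(y / eps)) for x, y in points})


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def box_counting_dimension(points, box_sizes):
    """Estimate the dimension as the slope of log N(eps) versus log(1/eps)."""
    log_inv_eps = [math.log(1.0 / eps) for eps in box_sizes]
    log_counts = [math.log(box_count(points, eps)) for eps in box_sizes]
    return least_squares_slope(log_inv_eps, log_counts)


# Sanity checks on sets whose dimension is known: a line segment and a filled square.
line = [(i / 4096, i / 4096) for i in range(4096)]
square = [(i / 64, j / 64) for i in range(64) for j in range(64)]
sizes = [1 / 2 ** k for k in range(1, 5)]
dim_line = box_counting_dimension(line, sizes)
dim_square = box_counting_dimension(square, sizes)
```

For real data the estimate depends on the range of box sizes chosen, so the sizes used should be reported together with the resulting dimension.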
The Hurst exponent (H) is a fractal measure with exponential scaling; it is an indicator of the long-term memory of a time series. H and FD are also inextricably linked, indicating the roughness of a surface. A time series can be persistent (0.5 < H ≤ 1) or anti-persistent (0 ≤ H < 0.5); when the data are not inter-correlated, H = 0.5 indicates that the series is unpredictable. Because it provides statistical self-similarity relationships, this method is applied in a variety of complicated engineering domains. The H of a real-valued time series is defined as follows in terms of the exponential growth scaling relation:

$$\left\langle \frac{R(n)}{S(n)} \right\rangle = C\,n^{H} \quad \text{as } n \to \infty \qquad (4)$$

where C is a constant, the angular brackets ⟨···⟩ denote the expected value, S(n) is the standard deviation of the first n data of the series {X_1, X_2, ..., X_n}, and R(n) is their range. H is calculated using the R/S technique and using the wavelet approach for the time series.
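The R/S estimate of H can be sketched as follows. The sketch assumes the classical rescaled-range convention, in which R(n) is the range of the cumulative deviations from the window mean; the source does not reproduce its own definition of R(n), so this choice, the window sizes, and the function names are assumptions made for illustration.

```python
import math
import random
import statistics


def rescaled_range(chunk):
    """R/S of one window: range of cumulative mean-deviations over the window std."""
    mean = statistics.fmean(chunk)
    cumulative, running = [], 0.0
    for x in chunk:
        running += x - mean
        cumulative.append(running)
    r = max(cumulative) - min(cumulative)
    s = statistics.pstdev(chunk)
    return r / s if s > 0 else None  # constant windows carry no information


def hurst_rs(series, window_sizes=(8, 16, 32, 64, 128, 256)):
    """Estimate H as the slope of log<R/S> against log n, following Equation (4)."""
    log_n, log_rs = [], []
    for n in window_sizes:
        chunks = [series[i:i + n] for i in range(0, len(series) - n + 1, n)]
        values = [v for v in map(rescaled_range, chunks) if v is not None]
        if values:
            log_n.append(math.log(n))
            log_rs.append(math.log(statistics.fmean(values)))
    x_bar, y_bar = statistics.fmean(log_n), statistics.fmean(log_rs)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_n, log_rs))
    den = sum((x - x_bar) ** 2 for x in log_n)
    return num / den


def classify(h):
    """Label a series using the H ranges quoted in the text."""
    if h > 0.5:
        return "persistent"
    if h < 0.5:
        return "anti-persistent"
    return "unpredictable"


# Usage on synthetic data: uncorrelated noise and its running sum.
rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]
walk, total = [], 0.0
for e in noise:
    total += e
    walk.append(total)
h_noise, h_walk = hurst_rs(noise), hurst_rs(walk)
```

On short series the R/S estimate is known to be biased upward, so values slightly above 0.5 for uncorrelated data should not be over-interpreted as persistence.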

Evaluation of Wavelets Approach for H
Let f(t) be a self-affine random process with position parameter t (i.e., time or distance), let a > 0 be a dilation factor, and let w(t) be an initial wavelet.
The continuous wavelet transform of f(t) is taken with respect to the shifted, dilated, and scaled version w_{t,a}(t′) of the wavelet, and is specified as the following:
If f(t) is self-affine, the dispersion of W(t, a) will vary monotonically with the dilation factor as the following:
When the exponent δ lies between −1 and 3 (i.e., −1 ≤ δ ≤ 3), H is specified as the following:
where FGN denotes fractal Gaussian noise and FBM denotes fractional Brownian motion. H is defined in terms of the fractal dimension (D) [43] as the following:
where H is the Hurst exponent and D is the fractal dimension, lying between 1.0 and 2.0. The predictability index (PI) is then given as
The data series is uncertain for values of PI near zero, whereas the data series is predictable for values close to 1.
Figure 4 shows the quantitative correlation between the major water quality metrics, indicating that fluctuation in these characteristics is caused by variability throughout the originating ecosystem and is influenced by the topographical conditions through which the water flows down.
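The link between H, D, and PI can be illustrated with a small sketch. The two relations used below, D = 2 − H and PI = 2|D − 1.5|, are the forms commonly used in the fractal time-series literature and are consistent with the ranges stated above (D between 1.0 and 2.0, PI near zero for an uncertain series and close to 1 for a predictable one). They are assumptions here, because the corresponding equations are not reproduced in the text, and the function names are hypothetical.

```python
def fractal_dimension(h):
    """Assumed relation D = 2 - H for a time series with Hurst exponent H."""
    return 2.0 - h


def predictability_index(d):
    """Assumed relation PI = 2 * |D - 1.5|; 0 means uncertain, 1 means predictable."""
    return 2.0 * abs(d - 1.5)


def summarize(h):
    """Return (D, PI) for a given Hurst exponent."""
    d = fractal_dimension(h)
    return d, predictability_index(d)


# A series with H = 0.5 (D = 1.5) has PI = 0, i.e., it is unpredictable.
d_random, pi_random = summarize(0.5)
# A strongly persistent series with H = 0.9 has D close to 1.1 and PI close to 0.8.
d_persistent, pi_persistent = summarize(0.9)
```

Under these assumed relations, both strongly persistent (H close to 1) and strongly anti-persistent (H close to 0) series give PI close to 1, since PI measures the distance of D from the Brownian value 1.5 in either direction.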

Distribution of the WQPs under Different Stress Conditions: A Quantitative and Qualitative Analysis
The mean, median, mode, and standard deviation of the daily data series for the year 2016 have been plotted for all 13 experimental sites. In site 3 (Chamoli district), the highest mean values are observed for Cl⁻, Fe²⁺, As++, NO₃⁻, and hardness; in addition, its TDS is the second highest among all the experimental sites. The mean value and standard deviation (127.6) of Cl⁻ are observed to be the highest in experimental site 4 (i.e., Champavat). The mean Ca concentration is found to vary among the experimental sites.
The mode values of turbidity, Cl⁻, Fe²⁺, As++, NO₃⁻, pH, and SO₄²⁻ are within almost the same range (0-1) in all experimental sites, whereas Ca, Mg, alkalinity, and hardness show the highest variation in the districts Nainital, Bageshwar, Champavat, USN, and Haridwar, respectively.
The median values of the parameters are also almost within the same range for all experimental sites, except for Cl⁻, which is found to have the highest concentration in Chamoli, and Ca, which is high in Pithoragarh, followed by the Bageshwar, USN, and Chamoli districts. The concentration of TDS is observed to be high in the experimental sites Nainital, Pithoragarh, Almora, Uttarkashi, USN, Champavat, and Bageshwar. The highest alkalinity is observed in Nainital, followed by the Uttarkashi, Pithoragarh, Bageshwar, Pauri Garhwal, and Almora districts. The hardness concentration is found to be the highest in Haridwar, followed by Pithoragarh, Bageshwar, Dehradun, USN, and Uttarkashi, and is lowest in Almora.
The standard deviations of parameters such as turbidity, Fe²⁺, As++, NO₃⁻, pH, and F⁻ are within almost the same ranges among all experimental sites, whereas some fluctuations are observed for Cl⁻, Ca, Mg, TDS, alkalinity, hardness, and SO₄²⁻. The statistical measures of hardness, i.e., mean, mode, median, and standard deviation, are found to be higher in experimental site 3 (Chamoli district). The variation in the turbidity, Cl⁻, Fe²⁺, NO₃⁻, pH, and F⁻ parameters is almost within the satisfactory level in all experimental sites.

Spatial Variation in WQPs in Different Stress Conditions
Geo-spatial maps are prepared using remote sensing and GIS techniques, which represent the distribution of WQPs over the region. Figure 5 shows the geo-spatial extrapolation maps of the GWQ parameters over the 13 districts of the state Uttarakhand, which reveal their distribution under different geographical and geological stress conditions. The minimum and maximum values of the WQPs observed in the studied region are as follows. The observed point values of As++ range from 0.0 to 0.007 mg/L, and the maximum is observed in the Rudraprayag district. The Cl⁻ content in water ranges from a minimum of 4.43 mg/L to a maximum of 36.1 mg/L in the study region, and higher quantities are found in the Rudraprayag and Champavat districts. The Ca concentration ranges from a minimum of 7.77 mg/L to a maximum of 123.4 mg/L and is observed to be higher in the Pithoragarh, Bageshwar, and Haridwar regions. A higher alkalinity of 306.6 mg/L is observed in the Haridwar district. The total dissolved salt (TDS) concentration varies from 3.08 to 258.5 mg/L in the study region, and a high level is observed in the Nainital district.
F⁻ varies from 0.10 to 2.66 mg/L, and a higher concentration is observed in the USN district. The hardness level is observed to vary from 45.1 to 371.8 mg/L in the whole study region, and the highest concentration is found in the Haridwar district. The Mg content in the water samples is found to be between 5.03 and 55.9 mg/L, and higher values are observed in Pithoragarh, followed by the Bageshwar district. The SO₄²⁻ level varies between 0.012 and 40.5 mg/L, and higher levels are observed in the USN district. The NO₃⁻ level is between 1.36 and 14.6 mg/L, and a higher level is observed in the Haridwar district. The pH values lie almost within the standard range, except in Haridwar, Rudraprayag, and some regions of the Dehradun and Almora districts. The Fe²⁺ concentration has a minimum of 0.03 mg/L and a maximum of 1.05 mg/L, and a higher concentration is found in the Haridwar region. The turbidity level in the water samples is found in the range of 0.35 to 11.9 NTU, and a higher level is observed in the Almora district.
In site 5, persistent behavior is observed among almost all WQIs, except Ca, TDS, and alkalinity. In site 6, persistent behavior is observed among almost all the WQIs, except F −. In site 7, persistent behavior is observed among Fe 2+, pH, Ca, Mg, and hardness, while anti-persistent behavior is partially observed in turbidity, Cl −, NO 3 −, F −, TDS, SO 4 2− and alkalinity. In site 8, turbidity, Cl −, Fe 2+, NO 3 − and TDS show persistent behavior, whereas pH, Ca, Mg and, partially, hardness and SO 4 2− show anti-persistent behavior. In site 9, turbidity, Cl −, Fe 2+, pH, Ca, Mg, NO 3 − and TDS show persistent behavior, and F −, hardness and SO 4 2− show anti-persistent behavior. In site 10, turbidity, Cl −, Fe 2+, NO 3 −, TDS and alkalinity show persistent behavior, and pH, Ca, Mg, F −, hardness and SO 4 2− show anti-persistent behavior. In site 11, all the studied WQIs show persistent behavior except pH, and in site 12, all WQIs show unpredictable behavior with respect to one another.
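The persistent, anti-persistent and Brownian labels used above are conventionally tied to the Hurst exponent H of a time series: H above 0.5 is read as persistent, H below 0.5 as anti-persistent, and H near 0.5 as a Brownian (random-walk) series. The following is a minimal sketch of a rescaled-range (R/S) estimate of H for a single series. It is an illustration under stated assumptions, not the procedure used in this study: the function names, the window-doubling scheme and the tolerance of 0.05 around 0.5 are all hypothetical choices, and the R/S estimate is known to be biased upward for short series.

```python
import math


def rescaled_range(chunk):
    """R/S of one window: range of cumulative mean-deviations over the SD."""
    n = len(chunk)
    mean = sum(chunk) / n
    running, cumulative = 0.0, []
    for x in chunk:
        running += x - mean
        cumulative.append(running)
    sd = math.sqrt(sum((x - mean) ** 2 for x in chunk) / n)
    if sd == 0:
        return None  # a constant window carries no information
    return (max(cumulative) - min(cumulative)) / sd


def hurst_exponent(series, min_window=8):
    """Slope of log(mean R/S) against log(window size), windows doubling."""
    n = len(series)
    log_size, log_rs = [], []
    size = min_window
    while size <= n // 2:
        rs_values = []
        for start in range(0, n - size + 1, size):
            rs = rescaled_range(series[start:start + size])
            if rs:
                rs_values.append(rs)
        if rs_values:
            log_size.append(math.log(size))
            log_rs.append(math.log(sum(rs_values) / len(rs_values)))
        size *= 2
    if len(log_size) < 2:
        raise ValueError("series too short for an R/S estimate")
    # Ordinary least-squares slope of log_rs on log_size.
    mx = sum(log_size) / len(log_size)
    my = sum(log_rs) / len(log_rs)
    num = sum((a - mx) * (b - my) for a, b in zip(log_size, log_rs))
    den = sum((a - mx) ** 2 for a in log_size)
    return num / den


def classify(h, tolerance=0.05):
    """Map a Hurst exponent to a label; the tolerance is an assumed choice."""
    if h > 0.5 + tolerance:
        return "persistent"
    if h < 0.5 - tolerance:
        return "anti-persistent"
    return "Brownian (random walk)"
```

A steadily trending series is classified as persistent and a strictly alternating series as anti-persistent, which matches the qualitative reading of the two terms in the site-wise discussion.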

WQPs Analysis
In this section, the WQPs are analyzed through the variation in each parameter across the different stress conditions (experimental sites). The variation is characterized by distribution statistics, namely the mean, median, mode and standard deviation (Figure 4), and the spatial variation in skewness and kurtosis is shown in Figure 7. The correlation matrix among the WQPs is summarized in Table 3 for each of the experimental sites 1-13; it presents the correlation of each WQP with every other WQP. In site 1, alkalinity is strongly correlated with Ca and Mg, and hardness is also strongly correlated with Mg. Across the 13 sites, alkalinity and hardness are strongly correlated with Mg, Ca and TDS, except in sites 2, 5, 6, and 7. The table also indicates that the WQPs are within the standard values except at a few sites where the alkalinity is high. Based on the skewness, kurtosis, SD and correlation matrix results for the data series of sites 1-13, a comprehensive analysis has been carried out and is described in the following subsections.
The pH of the water samples is observed to be in the standard range in all the experimental sites, with the average and median values being nearly identical, at 7.189 and 7.20, respectively; however, the pH mode is 0.25. The close agreement between the mean and median, together with a standard deviation (SD) of 0.691, indicates that the behavior is normal and the pH distribution is symmetrical. Negative skewness and correspondingly negative kurtosis are observed in almost all the experimental sites, except the Tehri and Pithoragarh sites, which show high kurtosis; hence, except for these two sites, the curve is platykurtic. The pH shows Brownian time-series behavior with F −, persistent behavior with Ca, Mg, Fe 2+ and NO 3 −, and anti-persistent behavior with turbidity, Cl −, TDS, alkalinity, hardness and SO 4 2−.
TDS is observed at all the experimental sites, and the fitted curve does not follow the usual behavior because the mean, median, and mode values vary widely. The TDS values are not closely related within the time series dataset for any of the experimental sites; hence, the standard deviation is very high. TDS shows persistence behavior with parameters such as Cl −, pH, F −, SO 4 2−, and turbidity, which exhibit a Brownian time series. It has a negative skew and a platykurtic curve for a few experimental sites. NO 3 −, Cl − and alkalinity show persistent behavior in almost all the experimental sites, while Fe 2+, Ca, Mg, and hardness show anti-persistent behavior.
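The distribution statistics used throughout this section can be computed per parameter and per site as sketched below. This is a minimal illustration and not the study's own code: the function name `describe` and the sample readings are hypothetical, the SD is the population form (division by n), and kurtosis is the Pearson form in which a normal distribution has a value of 3, so that values below 3 are labeled platykurtic.

```python
import math
from statistics import mean, median, multimode


def describe(values):
    """Mean, median, mode(s), population SD, skewness and Pearson kurtosis."""
    n = len(values)
    mu = mean(values)
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("constant series: skewness and kurtosis undefined")
    kurtosis = m4 / m2 ** 2
    if kurtosis < 3:
        shape = "platykurtic"
    elif kurtosis > 3:
        shape = "leptokurtic"
    else:
        shape = "mesokurtic"
    return {
        "mean": mu,
        "median": median(values),
        "mode": multimode(values),  # list, since a series may be multimodal
        "sd": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": kurtosis,
        "shape": shape,
    }


# Hypothetical pH readings for one site, for illustration only.
ph_stats = describe([6.8, 7.0, 7.2, 7.2, 7.4, 7.6])
```

For a symmetric sample such as the one above, the mean and median coincide and the skewness is zero, which is the pattern the text describes as normal, symmetrical behavior.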

Alkalinity shows the higher average value, which indicates the alkaline nature of the data series. In most of the experimental sites, alkalinity shows the highest median and mode values, which are not substantially identical, indicating that the sample data are widely distributed. In a few experimental sites, the dataset shows a high standard deviation and a skewness value close to zero. The curve between skewness and kurtosis indicates that the series is symmetrical and platykurtic. In most of the experimental sites, alkalinity shows persistent behavior with the Cl −, F −, TDS, and SO 4 2− parameters, and Brownian behavior is also observed, while in a few cases alkalinity shows anti-persistent behavior with turbidity, Fe 2+, pH, Ca, Mg, and hardness.
Hardness shows large mean, median, and standard deviation values for almost all the experimental sites, indicating widely distributed datasets that differ markedly from the mode value; the data series does not behave normally. While the mode is zero in most of the sites, a few sites have large values, especially in the Kumaun region. In most of the experimental sites, the datasets have a skewness value close to one or zero. The curve between skewness and kurtosis indicates that the series is symmetrical and platykurtic. The data series of a few sites have negative skewness values with a Brownian time series. Turbidity, Cl −, NO 3 −, pH, Ca, F −, TDS, alkalinity, and SO 4 2− all exhibit persistent behavior, while the Fe parameter exhibits anti-persistent behavior.
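The pairwise correlations discussed in this section (for example, alkalinity and hardness against Ca and Mg) can be outlined with a small correlation-matrix routine. The sketch below assumes the Pearson coefficient, which the text does not state explicitly; the function names and the sample values are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(vx * vy)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of parameters a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical readings for one site (mg/L and NTU), for illustration only.
site = {
    "Ca": [20.0, 40.0, 60.0, 80.0],
    "Mg": [5.0, 10.0, 15.0, 20.0],
    "turbidity": [3.0, 1.0, 4.0, 2.0],
}
matrix = correlation_matrix(site)
```

In this made-up sample, Ca and Mg rise together and are perfectly correlated, while turbidity is uncorrelated with Ca; one such matrix per site gives a layout comparable to a per-site correlation table.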

Variation in Chloride, Fluoride, Nitrate and Sulphate
The Cl − concentration varies in mean value from the lowest (0.46 mg/L) to the highest (35.2 mg/L) across the experimental sites, in the Bageshwar and Rudraprayag districts, respectively. The median varies from 0 to 1, except for the Pithoragarh and Almora districts, where it is 3.1 and 9, respectively, and the mode values are of the order of 0, indicating that the data are normally distributed between the sample points. The SD also follows a normal pattern, except for the large values of 135.8 and 127.6 for the Champavat and Rudraprayag districts. The kurtosis and skewness variation shows that the curve is symmetrical, except for some experimental sites with larger kurtosis. With the parameters F −, SO 4 2−, and turbidity, Cl − shows Brownian time series (true random walk) behavior. As a result, the curve has a platykurtic shape. Turbidity, NO 3 −, TDS and alkalinity show persistent behavior, while Fe 2+, pH, Ca, Mg, and hardness show anti-persistent behavior.
The F − datasets for all the experimental sites show normal behavior, and the mean and median values are closely related. The skewness and kurtosis curves indicate the platykurtic nature of all the sites, which also have a lower standard deviation value, demonstrating the closeness of the data points. F − shows a Brownian time series (true random walk) with the studied parameters such as NO 3 − and hardness. F − displays persistent behavior with turbidity, Cl −, SO 4 2−, TDS, and alkalinity, as well as anti-persistent behavior with Fe 2+, pH, Ca, and Mg in a few experimental sites.
For all the experimental sites, NO 3 − varies with a normal distribution, as is evident from the mean and median values of the series. In addition, the standard deviation of the dataset over all experimental sites appears to be within the standard values, which indicates that the sample data are close to one another. The skewness and kurtosis variation is symmetrical; hence, the curve is platykurtic, as the kurtosis is less than 3. In most experimental sites, NO 3 − shows persistent behavior with Cl −, F −, SO 4 2−, TDS, alkalinity, hardness, and turbidity, while it shows anti-persistent behavior with Ca, Mg, Fe 2+ and pH.
In almost all the experimental sites, SO 4 2− shows a mode value of 0, and the mean and median values differ markedly from each other. The datasets are dispersed, as evidenced by the higher standard deviation. The skewness and kurtosis values of the different data series indicate that the series are symmetrical and platykurtic. In most cases, SO 4 2− shows true random-walk behavior, and it exhibits both persistent and anti-persistent behavior with the turbidity, Cl −, TDS, and alkalinity parameters when the different experimental sites are compared.

Variation in Iron, Arsenic, Calcium and Magnesium
The average, median, and mode values of Fe 2+ are nearly equal; hence, its distribution is observed to be under standard or normal conditions. The data points are close to one another, as is evident from the standard deviation, which varies from 0.03 to 7 across the experimental sites. The skewness and kurtosis curves are platykurtic for almost all the experimental sites because of the low variation; where the skewness is high, the distribution is not symmetrical and the kurtosis value is large, which points to heavier outliers in the sample dataset. Brownian time series (true random walk) behavior is observed between Fe 2+ and the NO 3 −, F −, and hardness parameters. Fe 2+ displays persistent behavior with the pH, Ca, and Mg parameters, but anti-persistent behavior with Cl −, TDS, alkalinity, and SO 4 2−.
The datasets for As ++ are available for only a few experimental sites, where it is observed in small amounts. The mean, median, and mode values of As ++ are nearly equal; hence, its distribution is observed to be under standard or normal conditions. The data points are close to one another, as is evident from the standard deviation, which varies from 0.0 to 0.07 in the respective experimental sites. The skewness and kurtosis curves are platykurtic for almost all the experimental sites because of the low range of variation.
Ca varies across all the experimental sites: the mean, median, and mode values are widely dispersed, and abnormalities are found in a few datasets. The high standard deviation suggests that the Ca levels are widely scattered, and generally low values are observed. The curve between skewness and kurtosis is not platykurtic and is positively skewed for almost all the sites. Ca normally shows persistent behavior with Cl −, TDS, alkalinity, hardness, Mg, F −, SO 4 2−, and turbidity, while it shows anti-persistent behavior with Fe 2+ and pH.
Mg is observed to show normal variation except in the Kumaun region (the plain regions of Uttarakhand). The data series for all the experimental sites are found to be dispersed; hence, the skewness and kurtosis curves do not reflect normal behavior. The average, median, and mode of Mg are within the standard range, while the standard deviation indicates that the Mg values are widely dispersed. The curve is platykurtic and positively skewed for all the experimental sites, except for Tehri Garhwal. In almost all the experimental sites, Mg shows a Brownian time series (true random walk) with the pH and alkalinity parameters. Mg exhibits persistent behavior with turbidity, Cl −, NO 3 −, Ca, TDS, hardness, F −, and SO 4 2−, while it exhibits anti-persistent behavior with Fe 2+ and As ++.

The Findings from the Results
In this study, the variation and distribution of the WQPs at 13 experimental sites are analyzed, together with their spatial variation obtained by remote sensing and GIS methods under different stress conditions.
In addition, the predictability of the WQPs at the 13 experimental sites is evaluated with the fractal approach. Skewness and kurtosis curves are computed for each parameter at the 13 experimental sites. In all 13 experimental sites (districts), non-platykurtic curves were observed in the water quality parameters, which helps to identify the comprehensive characteristics of the water quality indices. The curve between skewness and kurtosis is not symmetrical, as evidenced by the positive skewness of the data series. Kurtosis shows large values for datasets such as turbidity, Cl −, Fe 2+, Ca, and SO 4 2−, which indicates that these curves are not platykurtic. For almost all the experimental sites, the variation of NO 3 −, Ca, Mg, TDS, alkalinity, and hardness is in normal form, with a high coefficient of variation observed for each. Taken together, most of the WQIs show Brownian time-series behavior. In most of the experimental sites, F − exhibits Brownian time series behavior, while Ca, Mg, Fe 2+, and NO 3 − exhibit persistent behavior, and turbidity, Cl −, TDS, alkalinity, hardness, and SO 4 2− exhibit anti-persistent behavior. Turbidity, Cl −, NO 3 −, F −, TDS, alkalinity and SO 4 2− have a persistent nature with Fe, Ca, Mg, and NO 3 − for almost all the experimental sites, and all have a stable relationship with the hardness of the water. When Fe 2+ is compared with the Cl −, TDS, alkalinity, and SO 4 2− parameters, an anti-persistent tendency can be observed. NO 3 − has an anti-persistent relationship with Ca, Mg, Fe 2+, and pH, while with the other parameters, including turbidity, Cl −, TDS, alkalinity, hardness, and SO 4 2−, it shows persistent behavior. Mg has anti-persistent behavior only with Fe 2+, whereas Ca has anti-persistent behavior with both the Fe and pH parameters in almost all the sites. With Fe 2+, pH, Ca, and Mg, the parameters F −, alkalinity, TDS, and SO 4 2− show an anti-persistent nature.
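The coefficient of variation referred to above is the ratio of the standard deviation to the mean, which makes the dispersion of parameters with different units and magnitudes comparable. The following minimal sketch ranks parameters by this ratio; the function names and the sample values are hypothetical, and the population SD (division by n) is an assumed choice.

```python
import math


def coefficient_of_variation(values):
    """Population SD divided by the absolute mean (dimensionless)."""
    n = len(values)
    mu = sum(values) / n
    if mu == 0:
        raise ValueError("coefficient of variation undefined for zero mean")
    sd = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sd / abs(mu)


def rank_by_dispersion(columns):
    """Return (parameter, CV) pairs sorted from most to least dispersed."""
    cvs = {name: coefficient_of_variation(v) for name, v in columns.items()}
    return sorted(cvs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical readings, for illustration only.
ranking = rank_by_dispersion({
    "TDS": [10.0, 20.0, 30.0],
    "pH": [7.0, 7.2, 7.4],
})
```

In this made-up sample, TDS ranks above pH because its spread is large relative to its mean, which is the sense in which a parameter is described as having a high coefficient of variation.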
The various indices show a consistent pattern, indicating that the fluctuations in the WQPs are within an adequate level of one another. The fractal and statistical analyses together were found to be a better approach for calculating the linkage among the water quality indices. The water samples were found to be safe to drink and in healthy condition in almost all the districts of the state of Uttarakhand, except for the Haridwar district, where some increase in contaminants was observed.
The proposed architecture enables us to identify the WQIs and confirm the water quality of every district based on the ground data from the 13 districts of the Uttarakhand state. In addition, an IoT- and ML-based framework was implemented to carry out the statistical multivariate approaches using the RF model to find the correlation among the parameters. In future, a broad deployment of this architecture would allow the administration to recognize which districts' ground water quality is worsening and to view water quality data in real time on a cloud server, enabling it to comply with the requirements to purify the water before supplying it to the people. The limitations of the proposed architecture are the calibration of the many sensors and the limited availability of sensors capable of acquiring real-time data.

Conclusions
Sustainable management and availability of good-quality water ensure the health of the people living in a region. This can be achieved with the integration of big data, IoT and ML. In this study, a big data framework is proposed to find the correlation between the WQPs of 13 districts of the Uttarakhand state. In addition, an IoT- and ML-based framework was implemented to carry out the statistical multivariate approaches using an RF model. The variation in the WQPs is analyzed using the RF model, the dataset is segmented location-wise, and the mean, mode, standard deviation, median, kurtosis, and skewness of the time series datasets are examined. The means of the parameters are adjusted with the coefficient of variation based on the standard values of each parameter. The water samples were found to be safe to drink and in healthy condition in almost all the districts of the state of Uttarakhand, except for the Haridwar district, where some increase in contaminants was observed.