Thailand Raw Water Quality Dataset Analysis and Evaluation

Abstract: Sustainable water quality data are important for understanding historical variability and trends in river regimes, as well as the impact of industrial waste on the health of aquatic ecosystems. Sustainable water management practices depend heavily on reliable and comprehensive data, prompting the need for accurate monitoring and assessment of water quality parameters. This research describes a reconstructed daily water quality dataset that complements rare historical observations for six stations along the Chao Phraya River in Thailand. Internet of Things technology and a Eureka water probe sensor are used to collect and reconstruct the water quality dataset for the period from June 2022 to February 2023, with Turbidity, Optical Dissolved Oxygen, Dissolved Oxygen Saturation, Spatial Conductivity, Acidity/Basicity, Total Dissolved Solids, Salinity, Temperature, Chlorophyll, and Depth as the recorded parameters. The presented dataset comprises a total of 211,322 data points, which are separated into six CSV files. The dataset is then evaluated using the Long Short-Term Memory (LSTM) algorithm, which achieves a Mean Squared Error (MSE) of 0.0012256 and a Root Mean Squared Error (RMSE) of 0.0350080. The proposed dataset provides valuable insights for researchers studying river ecosystems, supporting informed decision-making and sustainable water management practices.


Summary
The assessment of environmental health can be accomplished by considering five key factors: soil, water, climate, natural vegetation, and landscapes. Of these elements, water plays the most critical role in supporting human life and the survival of various ecosystems [1]. Its importance extends to drinking, household use, food production, and recreation, making safe and clean water an essential requirement for public health [2]. Therefore, it is vital to maintain proper water quality to prevent significant harm to human well-being and to maintain an ecological balance for other species [3]. Water pollution, a significant global problem, requires ongoing evaluation and international efforts to effectively manage water resources, from a broader perspective to individual

Ref. [23]: the dataset focuses on evaluating the water quality of surface water in the Kalingarayan Canal, specifically concerning heavy metal pollution in the Tamil Nadu region [24]. The dataset includes measurements of eight specific heavy metals, namely iron, copper, manganese, chromium, zinc, cadmium, lead, and nickel.

Dataset Description
The collected data contain various kinds of information, such as error logs, wipe schedules, and sensor logs; therefore, the data are first filtered to separate the sensor logs from the other records. The data obtained from the database are MySQL (.sql) files, which are filtered and converted into Comma Separated Values (.csv) files, one per station. The files follow the naming pattern "XX Logs.csv", where XX is the station ID, to make it easier to categorize them by station. There are six CSV dataset files: s1 Logs.csv, s2 Logs.csv, s3 Logs.csv, s4 Logs.csv, s5 Logs.csv, and s15 Logs.csv. In the dataset, the comma (,) separates columns, and the dot (.) is the decimal mark. The first row of each CSV file contains the column titles, which are shown in Figure 1. The distribution of the dataset for station 1 is illustrated in Figure 2a-j. Lower turbidity (NTU) indicates better water clarity, while a higher optical dissolved oxygen (HDO) level is desirable. Lower values of Spatial Conductivity (SPCOND) indicate less saltiness, and the pH of water should ideally be between 6.5 and 8.5. Total dissolved solids (TDS) values below 1000 are preferable. Salinity represents the dissolved salt content of a body of water, and the temperature typically falls within the range of 43 to 68 degrees Fahrenheit.
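The file conventions and reference ranges described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the column names (`id`, `station_id`, `turb_ntu`, `ph`, `tds`) and the sample rows are hypothetical, since the real header row is the one shown in Figure 1 and may differ. Only the pH range of 6.5 to 8.5 and the TDS limit of 1000 come from the text.

```python
import csv
import io

# Hypothetical excerpt of a station file; the real header row is shown in
# Figure 1 and may use different column names. The values are made up.
SAMPLE_CSV = """id,station_id,turb_ntu,ph,tds
1,s1,12.5,7.1,180.0
2,s1,15.0,8.9,175.5
3,s1,11.2,6.8,1020.0
"""


def load_rows(text):
    """Parse CSV text that uses a comma separator and a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=","))


def out_of_range(row):
    """Return the parameters that fall outside the ranges quoted in the text."""
    flags = []
    # float() already expects the dot as the decimal mark.
    if not 6.5 <= float(row["ph"]) <= 8.5:
        flags.append("ph")
    if float(row["tds"]) >= 1000:
        flags.append("tds")
    return flags


rows = load_rows(SAMPLE_CSV)
flagged = {row["id"]: out_of_range(row) for row in rows}
```

To read one of the published files instead of the inline sample, the same `csv.DictReader` call could be applied to a file opened with `open("s1 Logs.csv", newline="")`, after adjusting the column names to the actual header.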

The dataset consists of 16 columns, with ID as the primary column and station_id as the marker for each station. Table 2 describes each of the water quality parameters collected from sensors at each station. The dataset obtained still contains noise in the form of values lost when the internet connection dropped; therefore, data cleansing is carried out using formula (1), where d_i is the noisy data point, and i−1 and i+1 correspond to the previous and next valid measurements relative to the missing data point i.

Data preprocessing holds significant importance within the data analysis and machine learning pipeline. It encompasses the identification and rectification of errors, inconsistencies, and inaccuracies in a dataset to enhance its quality and reliability. In the current scenario, the provided datasets were collected from six distinct water stations, which introduced inconsistencies in the data formats. To address the issue of missing data, the standard data range for each parameter is outlined in Table 2. This step ensures that the dataset remains consistent and reliable for further analysis.
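Formula (1) itself is not reproduced in this text, so the sketch below is an assumption rather than the authors' exact rule. It fills each missing point with the mean of the previous and next valid measurements, which is one common way to use the neighbours i−1 and i+1. The function name `fill_gaps`, the fallback for gaps at the edges, and the sample values are all hypothetical.

```python
def fill_gaps(values):
    """Fill None entries from the nearest valid neighbours on each side.

    Assumed rule (not taken from the paper): the mean of the previous and
    next valid measurements; at the edges, the single available neighbour.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        # Search backwards and forwards in the ORIGINAL data for valid points.
        prev_valid = next(
            (values[j] for j in range(i - 1, -1, -1) if values[j] is not None),
            None,
        )
        next_valid = next(
            (values[j] for j in range(i + 1, len(values)) if values[j] is not None),
            None,
        )
        if prev_valid is not None and next_valid is not None:
            filled[i] = (prev_valid + next_valid) / 2
        elif prev_valid is not None:
            filled[i] = prev_valid
        elif next_valid is not None:
            filled[i] = next_valid
    return filled


# Made-up turbidity readings with dropped values marked as None.
turbidity = [10.0, None, 14.0, None, None, 20.0]
cleaned = fill_gaps(turbidity)
```

Because the neighbours are always looked up in the original series, a run of consecutive missing points receives the same fill value; a linear interpolation across the gap would be an alternative design.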
Table 3 shows the sum, mean, standard deviation, minimum value, and maximum value for Station 1 after the data preprocessing step. The correlation between the collected sensor parameters is presented in Figure 3. One important relationship is between hdo_sat and hdo, which are closely correlated because dissolved oxygen saturation is the dissolved oxygen value in mg/l converted to a percentage (%). Additionally, spcond is closely correlated with both tds and salinity, indicating their interdependence. Interestingly, tds, salinity, and spcond are negatively correlated with both hdo and hdo_sat, suggesting that an increase in these variables may result in a decrease in water quality. On the other hand, variables such as turb_ntu, pH, chl, and temp exhibit low to medium correlations with other variables, implying that their impact on water quality might be more nuanced and influenced by additional factors. Understanding these interconnections aids in comprehending the complex dynamics of water quality assessment and management. For more details, Figure 4 presents a graph showing the correlation between the Spatial Conductivity, TDS, and Salinity parameters, as these three variables influence each other. Figure 5 displays the correlation between HDO and HDO Saturation values, where HDO influences HDO Saturation.
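A correlation matrix of the kind shown in Figure 3 can be computed with the Pearson coefficient. The sketch below is a minimal pure-Python illustration; the readings are made up (hdo_sat is deliberately built as a fixed multiple of hdo) and are not taken from the dataset, and the text does not state which correlation coefficient the authors used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up readings for illustration only; column names follow the text.
columns = {
    "hdo": [6.0, 5.5, 5.0, 4.2, 3.9],
    "hdo_sat": [78.0, 71.5, 65.0, 54.6, 50.7],
    "tds": [150.0, 180.0, 210.0, 260.0, 300.0],
}

names = list(columns)
# Full pairwise matrix stored as a dict keyed by (row, column) name.
corr = {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}
```

In this toy example the hdo and hdo_sat columns are perfectly correlated, and tds is negatively correlated with hdo, which mirrors the direction of the relationships described above without reproducing their actual magnitudes.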

Methods
This research selected six stations in Thailand based on their geographical location from upstream to downstream. In terms of implementation, a total of twenty-two stations have been strategically deployed along the Chao Phraya River. However, this dataset presents data from the six stations for which permission was obtained from the authorities. The location of each station is shown in Table 4. Figure 6a shows the specific location of the station installation in central Thailand. Figure 6b presents the distribution of stations along the river. Figure 6c-h provides detailed information on the location of the six stations from satellite imagery.

Hardware Specification
The Eureka Manta +35 (Austin, TX, USA) multiprobe sensor (Figure 7) is used to retrieve water quality data. Data from the sensor are collected using a Mini PC K6-F13D (Bangkok, Thailand) as an IoT gateway and are stored in a database. Figure 8 shows the deployment of one of the stations. To maintain the quality of the raw data collected, the Metropolitan Waterworks Authority of Thailand calibrates the sensors once per month (Figure 9).

System Overview
The data obtained are stored in a database and displayed on http://rwc.mwa.co.th/page/info/ (accessed on 1 March 2023), which is a platform for displaying water quality data. An overview of the monitoring system used to collect the data is depicted in Figure 10. The system is divided into three main parts, namely the Local Area Network (LAN), the Cloud Server, and the Web-Based data visualization. In the LAN section, the sensor is read in Python and is connected to the IoT gateway via a serial communication protocol. The data obtained at each station are then sent to the cloud server (MySQL) over the HTTP protocol. The collected data can be accessed by the public through web-based applications.
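The gateway step described above (read a sensor line, then forward it over HTTP) could look roughly like the following sketch. Everything specific in it is an assumption: the field order in `FIELDS`, the comma-separated line format, the JSON payload, and the placeholder URL are hypothetical, because the text does not describe the probe's serial output or the server's API. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical field order; the probe's real serial output is not given here.
FIELDS = ("temp", "ph", "spcond", "turb_ntu", "hdo")


def parse_serial_line(line):
    """Turn one comma-separated serial line into a dict of float readings."""
    values = [float(part) for part in line.strip().split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


def build_request(station_id, reading, url="http://example.invalid/api/readings"):
    """Prepare (but do not send) an HTTP POST carrying one reading as JSON.

    The URL is a placeholder, not the real cloud server endpoint.
    """
    body = json.dumps({"station_id": station_id, **reading}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reading = parse_serial_line("29.4,7.2,310.0,18.5,5.1\n")
request = build_request("s1", reading)
```

On a real gateway, the line would come from the serial port rather than a string literal, and `urllib.request.urlopen(request)` would transmit the prepared request to the server.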

Neural Network
The concept of a Neural Network (NN) is an imitation of the structure of the human brain's neural network. Neural networks, a fundamental component of deep learning, consist of multiple layers that work in harmony to process and analyze data. The key layers in a neural network are the input layer, hidden layer(s), and output layer. The input layer receives the initial data and serves as the network's entry point. Hidden layers, positioned between the input and output layers, extract meaningful patterns and representations from the input. Finally, the output layer provides the final predictions or results of the network's computations.

Neural networks use forward propagation to generate predictions by passing data through the layers, and backward propagation to adjust parameters based on prediction errors during training [28]. Many types of neural network algorithms are well known. This iterative process enables accurate predictions and the learning of complex relationships in various tasks. Figure 11 shows the commonly used three-layer neural network structure, consisting of input, hidden, and output layers [29].

The use of Backpropagation (BP) in neural network architectures involves several specific steps, beginning with the initialization of weights and biases within the range of −1 to 1. Subsequently, the input value for each node in the hidden layer is calculated, as given in Equation (1). Equation (2) is then used to compute the output value for each node in the hidden layer. Equations (1) and (2) are reused to calculate the input and output values at the output layer. The error between the predicted and real values is calculated using Equation (3), and then BP is carried out. The error value in the hidden layer is calculated using Equation (4), and the weights between nodes are then updated with Equations (5) and (6). Apart from the weights, the biases also need to be updated, using Equations (7) and (8). The calculation of Equations (2)-(11) is repeated sequentially until the stopping conditions are met. Table 5 provides descriptions of the symbols used in the equations in this article.
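The procedure above can be sketched in plain Python for a three-layer network. This is a generic textbook form with sigmoid units and a squared-error loss; the paper's own Equations (1)-(8) are not reproduced in this text and may differ in notation or detail. The toy OR-gate data, the hidden layer size, the learning rate, and the fixed number of epochs used as the stopping condition are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network and return the mean loss per epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Weights and biases initialised in the range -1 to 1, as in the text.
    w_ih = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_ho = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_o = rng.uniform(-1, 1)
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            # Forward pass: weighted input, then activation, layer by layer.
            h = [
                sigmoid(sum(w * xi for w, xi in zip(w_ih[j], x)) + b_h[j])
                for j in range(n_hidden)
            ]
            y = sigmoid(sum(w * hj for w, hj in zip(w_ho, h)) + b_o)
            total += 0.5 * (target - y) ** 2
            # Backward pass: output error first, then hidden-layer errors
            # (computed before any weight is changed).
            delta_o = (target - y) * y * (1 - y)
            delta_h = [h[j] * (1 - h[j]) * w_ho[j] * delta_o for j in range(n_hidden)]
            # Update weights and biases.
            for j in range(n_hidden):
                w_ho[j] += lr * delta_o * h[j]
                for i in range(n_in):
                    w_ih[j][i] += lr * delta_h[j] * x[i]
                b_h[j] += lr * delta_h[j]
            b_o += lr * delta_o
        losses.append(total / len(samples))
    return losses


# Toy data (logical OR), used only to show that the error shrinks.
OR_SAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
losses = train(OR_SAMPLES)
```

Comparing `losses[0]` with `losses[-1]` shows the mean error falling as the forward and backward passes are repeated.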

Dataset Experiments and Evaluation
In this study, LSTM is used to evaluate water quality datasets, especially those involving time series data. LSTM is an algorithm developed from the Recurrent Neural Network (RNN), and it was designed to address the exploding and vanishing gradient problems that traditional RNNs face when information must be retained over long sequences [30]. The main difference between the standard RNN structure and the LSTM lies in the repeating module. A standard RNN has a simple structure, for example a single tanh layer, whereas an LSTM has several layers that interact in a particular way [30]. Figure 12 shows the three main parts of the LSTM architecture, namely the Forget, Input, and Output Gates (FG, IG, OG). In Equations (10)-(15), h_{t−1} (the previous output) and x_t (the current input) are the inputs to the FG, IG, Cell Update, and OG at time t.
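For reference, the widely used textbook formulation of the LSTM cell is given below. It is offered as a sketch of what Equations (10)-(15) most likely express, not as a reproduction of them: the weight matrices W, the biases b, the candidate state \tilde{C}_t, and the ordering of the equations are standard notation assumed here, and the authors' symbols may differ.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(C_t\right) && \text{(output)}
\end{align}
```

Here \sigma is the sigmoid function, \odot is element-wise multiplication, and [h_{t-1}, x_t] denotes the concatenation of the previous output and the current input, which is why both appear as inputs to every gate.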
Data 2023, 8, x FOR PEER REVIEW 13 of 18

LSTM's ability to capture long-term dependencies and handle sequential data makes it suitable for analyzing and predicting water quality parameters over time. The dataset used has gone through data preparation and preprocessing. At this experiment and evaluation stage, data obtained at the S1-Sam Lae station were used. The dataset is divided into training and testing sets, as shown in Figure 13. The training dataset spans July 2022 to December 2022 (75%), while the testing dataset covers January 2023 to February 2023 (25%).

The parameter settings used in this study are listed in Table 6, where Adam is used as the optimizer. We assessed the LSTM model's predictive capabilities over a 45-day horizon and found that it predicted a 10-day span accurately, which corresponds to an accuracy of 22.2% (10 of 45 days). The error rate in this scenario is high and the accuracy correspondingly low because the model has not yet been fine-tuned. The trained LSTM model is evaluated on the test data by calculating performance metrics, namely mean squared error (MSE) and root mean squared error (RMSE), to assess its accuracy in predicting water quality parameters. Statistical results for the turbidity predictions are shown in Figure 14 and Table 7.

MSE and RMSE are commonly used metrics and are especially suitable when the underlying data follow a Gaussian distribution, which this research assumes. While the choice of MSE and RMSE is reasonable under Gaussian assumptions, it is important to recognize that real-world datasets may deviate from this ideal distribution. Ref. [31] provides illustrative examples of situations in which the data deviate from normality and highlights the importance of considering non-Gaussian behavior in practical applications, particularly in water analysis and risk assessment. This can be a topic for further research.
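The two evaluation metrics follow directly from their definitions, as the short sketch below shows. The `actual` and `predicted` values are made up for illustration and are not taken from Table 7 or from the model's output.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, expressed in the same units as the data."""
    return math.sqrt(mse(actual, predicted))


# Made-up scaled turbidity values, for illustration only.
actual = [0.20, 0.25, 0.30, 0.35]
predicted = [0.22, 0.24, 0.33, 0.31]

example_mse = mse(actual, predicted)
example_rmse = rmse(actual, predicted)
```

Because RMSE is the square root of MSE, the two reported figures can be checked against each other: the square root of 0.0012256 is approximately 0.03501, which agrees with the reported RMSE of 0.0350080 to within rounding.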

Figure 14. Experimental result of LSTM in predicting turbidity value.

Table 6. LSTM tuning parameters setting.

Parameter                                                               Value
Transfer function from the input layer to the hidden layer              Sigmoid
The function responsible for activating the neural network              Tanh
The function used for optimizing the neural network                     Adam
The count of elements in the input layer                                1
The number of neurons in the hidden layer                               64
The count of elements in the output layer                               1
The size of each batch used for training                                32
The time step used in the neural network model                          60
The rate at which the neural network learns and adjusts its weights     0.001
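The time step of 60 and the 75%/25% chronological split can be made concrete with the following sketch. It shows one common way of preparing a univariate series for an LSTM with one input and one output element; the helper names `chronological_split` and `make_windows`, the fraction-based cut (the paper splits by calendar date), and the synthetic series are assumptions for illustration.

```python
def chronological_split(series, train_fraction=0.75):
    """Split without shuffling so the test set is strictly later in time."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def make_windows(series, time_step=60):
    """Pair each run of `time_step` values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(series) - time_step):
        inputs.append(series[start:start + time_step])
        targets.append(series[start + time_step])
    return inputs, targets


# Synthetic series standing in for one scaled parameter such as turbidity.
series = [float(i) for i in range(200)]

train, test = chronological_split(series)
x_train, y_train = make_windows(train)
```

Each training example is then a window of 60 consecutive readings with the next reading as its target. Note that a segment must be longer than the time step to yield any windows, so the test period needs more than 60 readings.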

Table 1. Previous river water quality dataset.

Table 2. Dataset column name and type. Table 2 describes each of the water quality parameters' data.

Table 4. Locations of six stations.