1. Introduction
High-quality water resources are vital for supplying drinking water to humans and sustaining natural ecosystems, and also for supporting human activities and development [1,2]. It is now well established that several interacting factors, such as population growth, intensive agriculture, urbanization, and industrial activity, increase water demand, especially in an uncertain context of climate change [3]. According to a recent United Nations (UN) report, 1.5 million people die each year from diseases caused by contaminated water, and water contamination causes 80% of health problems in low-income countries [4]. In fact, five million fatalities and 2.5 billion illnesses were accounted for during the period covered by this report. Therefore, the assessment and prediction of water quality are required to establish whether water is suitable for a certain use and, if not, to find relevant remedies or precautions; however, water quality is determined by many measures that quantify dissolved substances. As a result, the assessment of all interacting factors in groundwater bodies (and/or in a surface water lagoon) is often insufficient in low-income countries because the process is expensive and exhausting [5]. Consequently, reducing the subjectivity and cost of water quality assessment is a major challenge, and several tools are being developed to determine water cleanliness and purity [6,7].
The Water Quality Index (WQI) is a well-accepted indicator used by several international and national organizations to classify water quality at a certain location and time. Some researchers have proposed modifications to its calculation; for instance, Uddin, Nash [8] presented twenty-one WQI models for assessing drinking water quality, such as the Horton index, the National Sanitation Foundation index (NSF-WQI), and the Bascaron index (BWQI), among others. To calculate the index, physicochemical parameters must be gathered. The resulting indicator allows the general public to know the water quality in aquifers [9]. It can also be used to evaluate water characteristics in relation to human health and natural quality effects [10], or even to decipher its impact on water poverty risk [11].
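As a rough illustration of how such an index is assembled from physicochemical parameters, the sketch below uses the weighted arithmetic form, in which each parameter is rated against a standard value and the ratings are combined using relative weights. This is only one of the many WQI formulations mentioned above, and the function name, parameter names, weights, and standard values are hypothetical placeholders rather than values taken from this study.

```python
def weighted_arithmetic_wqi(concentrations, standards, weights):
    """Return a weighted arithmetic WQI (an assumed, generic formulation).

    q_i = 100 * C_i / S_i is the quality rating of parameter i,
    W_i = w_i / sum(w) is its relative weight, and WQI = sum(W_i * q_i).
    """
    total_weight = sum(weights[name] for name in concentrations)
    wqi = 0.0
    for name, concentration in concentrations.items():
        rating = 100.0 * concentration / standards[name]
        wqi += (weights[name] / total_weight) * rating
    return wqi


# Hypothetical sample, standards (mg/L), and weights, for illustration only.
sample = {"calcium": 100.0, "magnesium": 25.0, "nitrate": 75.0}
standards = {"calcium": 200.0, "magnesium": 50.0, "nitrate": 50.0}
weights = {"calcium": 2, "magnesium": 2, "nitrate": 5}

print(round(weighted_arithmetic_wqi(sample, standards, weights), 2))
```

In this form, a parameter that exceeds its standard (here the hypothetical nitrate value) pushes the index up in proportion to its weight, which is why heavily weighted contaminants dominate the final score.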
Indicators such as the WQI are often calculated through a complex and time-consuming process. Therefore, many methodologies have been proposed to predict these indicators easily and accurately, with a view to applying them at larger scales rather than to a specific municipality or small catchment. These models make it possible to anticipate compliance (and noncompliance) with quality requirements in the short and long terms [12]. Water quality monitoring and forecasting are carried out using a variety of methods, including computational intelligence techniques (such as genetic algorithms and artificial neural networks), which have received increasing attention in environmental time-series prediction research because they can model nonlinear systems and are robust to noisy data, leading to more accurate results [13,14,15]. Machine learning thus helps to reduce the time needed to compute the WQI for each sample: whereas determining the WQI for 100 samples with equations is time-consuming, using machine learning (a classification learner) significantly reduces the time required [12,13,14,15].
Recently, traditional machine learning models such as the Decision Tree, which has been frequently used in many fields and applications [16,17], have been applied to water quality assessment. Ensemble Trees (ET), which are considered more accurate predictors than any of the individual learning algorithms, have also been tested [18,19]. Discriminant analysis (DA) has also been used in several studies around the world to predict water quality by generating discriminant functions (DFs) that group nonoverlapping data based on scores on one or more quantitative predictor variables [20,21]. Other researchers have used K-nearest neighbors (KNN) to classify and predict water quality [22,23].
Likewise, several new studies have assessed the behavior of the WQI using machine learning algorithms in many regions of the world [24,25]. For instance, Support Vector Machines (SVM) have been proposed as a robust technique for water quality prediction in a free-form wetland environment because of the many variables influencing water quality [26]. Some approaches adopted SVM to predict sediment load concentration in an arid watershed in India (Samantaray, Sahoo [27]) or to predict the boundaries of water quality limits, for example, in the Kelantan River in Malaysia (Kurniawan, Hayder [28]). Koranga, Pant [29] proposed a machine learning model to predict the water quality of Nainital Lake in India. Tan, Yan [30] used a least squares support vector machine to predict water quality time series data from China. Mohammadpour, Shaharuddin [13] forecasted the WQI in freely constructed wetlands in Malaysia using a support vector machine. Other studies undertaken in Algeria to test the effectiveness of SVM [31,32,33,34] confirmed that it provides accurate results in less time and can run with fewer data than other algorithms. However, there is a lack of studies that offer decision-makers effective tools for predicting the water quality index to improve water resource planning and that can be used at larger scales in arid areas.
Therefore, in this research, classification techniques were used to predict the WQI for several water samples collected from an arid area, namely the Naama province in Algeria, which shows clear signs of water pollution and scarcity. To accomplish this objective, MATLAB was used because it provides a set of classification learner methods, including SVM, among others [35]. The main goals of this study can be summarized as follows: (i) assess the physicochemical properties of different water points (samples) on a large scale (12 municipalities); (ii) determine the water quality of the study area based on the WQI; (iii) apply the classification learner technique to develop a classification model for dry areas and estimate the model’s accuracy with respect to WQI values, which were divided into classes (excellent, good, poor, and very poor or unsafe water) to facilitate their interpretation; (iv) predict the WQI using the best classifier, that is, the one that yields the best prediction accuracy; and (v) offer decision-makers effective tools for predicting the water quality index to improve water resource planning and management in arid areas. We hypothesize that the proposed prediction model will reduce the time needed to determine the water quality state compared with conventional equations.
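The class labels used as classification targets in goal (iii) can be sketched as a simple binning of WQI values. The cut-offs of 50, 100, and 200 below are assumptions inferred from the class ranges reported in the Discussion (they are not stated as thresholds in the text), and the function name is hypothetical.

```python
def classify_wqi(wqi):
    """Map a WQI value to one of the four classes named in this study.

    The cut-offs 50, 100, and 200 are assumptions inferred from the
    class ranges reported in the Discussion, not stated thresholds.
    """
    if wqi < 50:
        return "excellent"
    if wqi < 100:
        return "good"
    if wqi < 200:
        return "poor"
    return "very poor or unsafe"


# Endpoints of the class ranges reported in the Discussion.
for value in (33.32, 99.26, 183.82, 365.7):
    print(value, classify_wqi(value))
```

A classifier such as an SVM would then be trained to predict these labels directly from the physicochemical measurements, so that the index does not have to be computed by hand for every new sample.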
4. Discussion
Table 3 shows that calcium concentrations varied considerably, from 12.0 to 832.0 mg/L (average value of 137.69 mg/L). The highest values are well above the European standards for calcium in drinking water, which range from 75 to 200 mg/L [56]. Magnesium concentrations also varied considerably, from 3.0 to 560.0 mg/L (average value of 76.03 mg/L). These ranges extend well beyond other reference values found in the literature, such as 78–155 mg/L (calcium) and 28–54 mg/L (magnesium) in Slovakia [58] and 8–197 mg/L (calcium) and 1.6–110 mg/L (magnesium) in Egypt [59].
Moreover, Table 3 also depicts strong variations in the sodium levels of the groundwater samples. The values ranged from 5.0 mg/L to 2967.0 mg/L (average value of 186.4 mg/L), with an extremely high coefficient of variation of 169.17%. Similar values were found in Ghana (22.15–2769.5 mg/L) [60] and South Africa (48–6971 mg/L) [61]. In the present study, the observed potassium concentration ranged from 1.00 to 59.0 mg/L; these values are lower than those identified in some studies carried out in Ghana (0.21–126 mg/L) [62]. The chloride concentration ranged from 10 to 443 mg/L, whereas values of 21–110 mg/L were observed in Tunisia [63]. Sulfate concentrations ranged from 38 to 2370 mg/L (average of 376.78 mg/L), and nitrate concentrations also varied considerably, from 1.0 to 390.0 mg/L (mean value of 26.82 mg/L), bearing in mind that the limits for drinking water are 10 mg/L in the United States and 50 mg/L according to the World Health Organization [56].
The bicarbonate values at the water sampling points in our study area range between 20 and 529 mg/L, while electrical conductivity and mineralization varied considerably, from 290 to 8660 µS/cm and from 186 to 5493 mg/L, respectively.
Across the different water quality levels, pH varied considerably, from 6.58 to 10.60 (a span of about four pH units), with a mean value of 7.71 and a coefficient of variation of 6.64%. The highest values fall outside the permissible limits (6.5–8.5) for drinking water proposed by the World Health Organization [56].
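The coefficient of variation quoted for several parameters is the standard deviation expressed as a percentage of the mean; for example, the reported pH mean of 7.71 and coefficient of variation of 6.64% imply a standard deviation of roughly 0.51. A minimal sketch of the calculation follows, assuming the sample standard deviation is used (the text does not say which estimator was applied) and using hypothetical readings rather than the study's data.

```python
import statistics


def coefficient_of_variation(values):
    """Return the coefficient of variation in percent (sample std / mean)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical pH readings, for illustration only (not the study's data).
ph_readings = [6.9, 7.2, 7.5, 7.8, 8.1]
print(round(coefficient_of_variation(ph_readings), 2))
```

Values above 100%, such as the 169.17% reported for sodium, indicate that the standard deviation exceeds the mean, which is typical of strongly skewed concentration data.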
The results shown in Table 4 revealed that 14.8% of the samples fell into the excellent category (Class I; values ranging from 33.32 to 49.48), 62.7% were classed as good (Class II; values from 50.9 to 99.26), 17.2% fell into the poor category (Class III; 100.15 to 183.82), and 5.3% were unsafe for drinking (Class IV; values from 202.9 to 365.7). Bearing in mind that 75% of water comes from groundwater [64] and considering the huge amount of data required to calculate the WQI, it is worth comparing these findings with other prediction studies. Authors have found better values for predicting the WQI using the SVM model than with other approaches in Malaysia, with a coefficient of determination (R²) equal to 0.8796 [64], or R² = 0.9, also in Malaysia [65], using LSSVM (least squares SVM), and R² = 0.87 in Iran [66]. However, even better results were found in Poland, where the authors obtained R² = 0.99 using neural networks [67]. Similarly trustworthy results were achieved in Ethiopia, Vietnam, and Brazil, among others [68,69,70,71].
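For readers comparing the coefficients of determination quoted above, the following sketch shows one common definition, one minus the ratio of the residual sum of squares to the total sum of squares. The cited studies may have used a different variant (for instance, the squared Pearson correlation), and the observed and predicted values here are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted WQI values, for illustration only.
observed = [40.0, 60.0, 80.0, 120.0, 200.0]
predicted = [45.0, 55.0, 85.0, 110.0, 205.0]
print(round(r_squared(observed, predicted), 4))
```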
5. Conclusions
Predicting water quality indexes is extremely important for maintaining the availability of drinking water resources and for monitoring pollution. Planning and managing water resources can therefore greatly benefit from precise groundwater quality predictions. Accordingly, this work sought to create an effective forecasting model for predicting groundwater quality using the water quality index (WQI) in the Wilaya of Naama, located in the southwestern region of Algeria. Conventional approaches evaluate water suitability for drinking and domestic purposes based on many characteristics and indexes. Although these techniques are reliable tools, they can be costly and time-consuming. Therefore, this study proposes an alternative machine learning method for predicting water quality using only a few simple water quality criteria. The data used in the study were collected from 169 groundwater samples from 12 municipalities in the Wilaya of Naama. A set of representative supervised machine learning algorithms was used to estimate the WQI. Based on the WQI results, four classes were defined: excellent, good, poor, and very poor or unsafe water. A large share (62.7%) of the samples showed good water quality. Regarding the prediction tools, the main results showed that Support Vector Machine (SVM) algorithms classify groundwater quality with high accuracy (95.4%) when using standardized data and with lower accuracy (88.88%) when using raw data. A strong agreement between observed and predicted water quality data was therefore obtained in the present manuscript. These results offer a useful performance assessment tool for decision-makers, and further investigation can be undertaken by integrating the findings of this research on a large scale in arid areas.
In conclusion, the SVM model is a simple and effective empirical model to simulate water quality, and the method presented in this work is sufficiently general to be applied to a wide range of arid areas.