Data Grouping Method for the Purpose of Forecasting the Mechanical Strength of Plastic Soils

: The aim of this work was to develop a method of data grouping (DGM) that enables the selection of regression equations for forecasting soil penetration resistance based on an easily available and small set of input data: soil moisture content, soil bulk density and the grain size distribution of the soil. Models for forecasting the penetration resistance were created by selecting regression equations for speciﬁc intervals of granulometric variability of soil fractions. A ﬁeld measurements campaign was conducted and soil samples were taken from the subsoil on 43 proﬁles, at depths of 25–30, 35–40, 45–50 and 55–60 cm. It was found that the dry bulk density is much less useful for predicting the penetration resistance of plastic soils than soil moisture. The study also showed that it is possible to forecast the soil penetration resistance on the basis of the gravimetric moisture content and the soil speciﬁc surface.


Introduction
Excessive compaction of soils by wheels of machines and vehicles is one of the most serious agricultural problems, which has been known about for years [1]. Soil compaction can cause adverse changes in soil properties leading to the disappearance of aggregate structure, reduction of water and air conductivity, reduction of water retention, etc. These adverse changes result in reduction in crops [2], increase of production costs [3], and increase of environmental threats [4]. The soils particularly susceptible to compaction include heavy loams, clays, and light loams [5]. Especially dangerous is the excessive compaction of the subsoil, because the effects are long-lasting and attempts to loosen the soil are energy-consuming, often ineffective, and in some soil and atmospheric conditions may cause greater losses than benefits [6].
The subsoil compaction increases when the soil strength is exceeded. The soil compaction strength may be characterized by using various indicators, such as penetration resistance (PR) among others. The resistance to penetration is called the cone index when divided by the cone base surface area. This method is relatively easy and quick to use.
Comparing the PR with other soil properties, including other strength properties, makes knowledge of the penetration resistance value practical. The PR value indirectly allows one to determine, for example: the soil pre-compression stress [7], the draught force of tillage implements, vehicle trafficability and the growth (or elongation rate) of plant roots in the soil [8][9][10]. From the point of view of soil protection against excessive compaction, it is important to know the soil pre-compression stress. Information on the soil pre-compression stress along with the distribution of stresses from the pressures exerted on the surface can predict the load of tractors' or machines' running gears [11].
There is a need to be able to predict penetrometric resistance from basic soil properties such as the soil composition, bulk density and water content [8,[12][13][14]. It is generally found that for any one soil, it is quite easy to produce an empirical equation that accounts for differences in bulk density and water content. However, the predictions are often not so good when different soils are being compared [8]. Therefore, the empirical mathematical models to predict the pre-compression of the soil are created by selecting the regression equations for a given range of variability of the soils researched [15,16]. The determination of the ranges of the variability of the soils requires an assumption of their grouping (classification). It may be noted that it is not advisable to use the symbols of the marked soil grades for this purpose. This is because the currently used systems classify the soil formations on the basis of their graining and therefore could possibility include significantly different soils in the same granulometric group or could place almost identical soils into different groups [17,18].
The aim of the work was to develop the method of soil data grouping (DGM) enabling the selection of regression equations to predict penetrometric resistance. Models that were useful in practice were sought, enabling estimation of the current value of penetrometric resistance on the basis of an easily available and small set of input data, the acquisition of which does not require complicated experimental research. A simplifying assumption was made that the strength of soils with similar intrinsic properties depends on soil moisture and bulk density. This approach allows one to omit other factors that affect penetrometric resistance, e.g., soil depths, soil type, sampling time etc. Due to the fact that it is difficult to choose the same prognostic model for cultivated and uncultivated soil [19], the research was limited to the subsoil layer of soils.
In view of the huge number of factors affecting the physical-mechanical properties of soils, modeling of strength properties is an extremely difficult task. In order to eliminate unrecognized mutual interactions of various factors on soil strength, a staged method of iteration to divide the studied soils into groups was applied in a way that would eliminate disturbances caused by unrecognized natural soil variability.

Materials and Methods
The study consisted of the following stages: collecting a large set of data, developing the method of soil data grouping, selecting regression equations for PR forecasting and assessing them. These stages are described in detail below ( Figure 1).

The Characteristic of the Researched Soils
The development of the data grouping method started with the collection of experimental data. Experiments were performed on plastic soils of the Szczecin Lowland, north-west Poland. Soil material was taken from 13 sites. On each site, 1 or 4 profiles of soil pits were described, in which the first planned measurements were carried out-a total of 43 profiles. In the following years, measurements were repeated in soil pits, 2 to 3 m apart from each of the previously described profiles-a total of 57 pits. The measurements were carried out in the spring time (24 March-10 May) and in autumn , in a non-tilled sub-soil layer, at depths of 25-30, 35-40, 45-50 and 55-60 cm. In the sites where ploughing was performed to a depth of 30 cm, a layer of 25-30 cm was omitted.
The time of performing field measurements and taking soil samples for laboratory tests was dependent on the soil moisture [20]. Undisturbed soil samples were collected using cylinders of 50 mm high and 50 mm diameter. Each sample was protected against moisture loss during transport by fitting the protective covers over the sample rings. For each of the soil layers in each soil pit, 4 samples were used to determine: the soil gravimetric moisture content (w w ), the dry bulk density (ρ d ) and the soil moisture at pF2 (w pF2 ), using a gypsum board. The w pF2 was determined in order to assess soil moisture at the time of PR measurement. The w w and w pF2 and density ρ d of the soil were determined by a drying-weight method (oven-dried at 105 • C for 24 h). The PR was measured with a cone of 30 • angle and a base area of 1 cm 2 , using the Penetrologger made by the Eijkelkamp company, the Netherlands. The penetration rate was 2 cm·s −1 . The measurements of the PR were repeated 10 times for each soil layer in each soil pit.

The Characteristic of the Researched Soils
The development of the data grouping method started with the collection of experimental data. Experiments were performed on plastic soils of the Szczecin Lowland, north-west Poland. Soil material was taken from 13 sites. On each site, 1 or 4 profiles of soil pits were described, in which the first planned measurements were carried out-a total of 43 profiles. In the following years, measurements were repeated in soil pits, 2 to 3 m apart from each of the previously described profiles-a total of 57 pits. The measurements were carried out in the spring time (March 24-May 10) The granulometric composition was determined by the Bouyoucos-Casagrande method modified by Prószyński [21]. The pycnometric method was used to determine the density of the solid particles (ρ s ), using the pycnometer G-L (100 cm 3 ) made by the WPL Gliwice company (Poland). The humus content (Z pr ) was determined with the use of the Tiurin method and the pH reaction of the soil (pH KCl )-by means of the electrometric method. The calcium carbonate content (CaCO 3 ) was determined by the Scheibler method. The plastic limit (P L ) and the liquid limit (L L ) according to Atterberg were determined. Data of the characteristics of the soil itself are presented in the Appendix A (Table A1). The characteristic of the soils was enlarged by the information on the properties calculated on the basis of the results of the granulometric composition and the humus content. The calculated density of solid particles (ρ B ), the calculated dry density (ρ dB ), and the general porosity (n B ) were determined with the use of the Brogowski equations [22]. The field water capacity was calculated with the use of the Trzecki [23] formulae with the participation of (WPP z ) or without the participation of the humus content (WPP b ). The method proposed by Prusinkiewicz and Proszek [24] was used to calculate the grain average diameter (S z ), the specific surface (Z D ) and the dispersion index (S D ). The content of the easily dispersing clay in the soil (RCD) was calculated in accordance with Czyż [25] on the basis of the clay content (fraction <0.002 mm) and the organic substance. The stability index (S) was estimated with the use of the Pieri equation [26] on the basis of the content of humus and the fractions of silt and clay. More details about the calculation-ρ B , ρ dB , n B , WPP z , WPP b , S z , Z D , S D , RCD and S-are given in the Appendix A.
Determined and calculated properties for each soil layer in each measurement term formed data cases ( Figure 2).  Table 1 gives the number of profiles and pits per site, the location of arable fields of individual sites, soil types and depth of soil cultivation. The soils were classified as a Phaeozems, Cambisols and Luvisols according to the WRB FAO system. Maximum soil tillage depth was from 15 to 30 cm. The data collected in the form of cases (data rows) was divided into two subsets: the main set (275 cases) and the validation set (77 cases), which was used to estimate the predicted error of the PR. The data obtained in the years 2003-2006 formed the basic set, and the data obtained in the years 2007-2012 constituted the validation set. The basic set is data collected from sites for which 4 profiles were described and measurements were taken in 2 repetitions-a total of 8 pits per site (Table1). The validation set is data from sites for which one profile was described and measurements were carried out in 2 replications (sites: Sł, NP, Re). The validation set was increased by data collected from pits made in 2007-2012 (sites: Ku, Ob1, Os, Sk, St). Every case (data row) consisted of one dependent variable (PR), two independent variables (ww, d), and other properties describing soils, including those named grouping parameters: fractions of granulometric composition, Zpr, PL, LL, ρB, ρdB, nB, WPPb, Sz, ZD, SD; RCD, S, WPPz. The validation set was diversified in terms of soil describing properties as well as the place and depth of sampling. The set of validation data obtained this way was not used to create regression equations.   Table 1 gives the number of profiles and pits per site, the location of arable fields of individual sites, soil types and depth of soil cultivation. The soils were classified as a Phaeozems, Cambisols and Luvisols according to the WRB FAO system. Maximum soil tillage depth was from 15 to 30 cm. The data collected in the form of cases (data rows) was divided into two subsets: the main set (275 cases) and the validation set (77 cases), which was used to estimate the predicted error of the PR. The data obtained in the years 2003-2006 formed the basic set, and the data obtained in the years 2007-2012 constituted the validation set. The basic set is data collected from sites for which 4 profiles were described and measurements were taken in 2 repetitions-a total of 8 pits per site (Table 1). The validation set is data from sites for which one profile was described and measurements were carried out in 2 replications (sites: Sł, NP, Re). The validation set was increased by data collected from pits made in 2007-2012 (sites: Ku, Ob 1 , Os, Sk, St). Every case (data row) consisted of one dependent variable (PR), two independent variables (w w , ρ d ), and other properties describing soils, including those named grouping parameters: fractions of granulometric composition, Z pr , P L , L L , ρ B , ρ dB , n B , WPP b , S z , Z D , S D ; RCD, S, WPP z . The validation set was diversified in terms of soil describing properties as well as the place and depth of sampling. The set of validation data obtained this way was not used to create regression equations.

The preliminary Grouping Tests
The development of this method was preceded by attempts to group (stack) the observations in regards to the content of the particles <0.02 mm, their division into a various number of the sets (see Figure 3) and checking the values of the multiple regression coefficient (R 2 ) for the dependence of the PR on the selected independent variables. Because of their widespread use, the selected variables were: the gravimetric moisture content and the dry bulk density. It was also noted that despite the importance of particle sizes, the content of the fine particles poorly distinguishes among the soils in some ranges of their variability. This is because fine particle size described only a part of the granulometric composition of the soil. Other parameters (indicators) were used to arrange the cases, which better diversified the soils. Those parameters were highly correlated with the content of fine particles and calculated on the basis of the largest possible number of the different granulometric fractions. The best result was obtained by dividing the observations (cases) into four sets (defined as quartiles). It was found that higher values of the determination coefficient could be obtained by taking into consideration the number of the cases between the quartiles 1 and 3 within each of the sets. Such a procedure caused a rejection of the cases smaller than the quartile 1 and bigger than the quartile 3. It was therefore decided that a given set should be smoothed out before the rejection, the purpose of which was to reduce the interferences resulting from the natural variability of soils. It was Agronomy 2020, 10, 578 6 of 19 found a priori that such parameters used for this procedure, should correlate with the particle content <0.02 mm and be associated with the other selected soil characteristics, affecting the soil strength. quartiles). It was found that higher values of the determination coefficient could be obtained by taking into consideration the number of the cases between the quartiles 1 and 3 within each of the sets. Such a procedure caused a rejection of the cases smaller than the quartile 1 and bigger than the quartile 3. It was therefore decided that a given set should be smoothed out before the rejection, the purpose of which was to reduce the interferences resulting from the natural variability of soils. It was found a priori that such parameters used for this procedure, should correlate with the particle content <0.02 mm and be associated with the other selected soil characteristics, affecting the soil strength.
Therefore, the result of the initial grouping was to determine the initial number of data sets and to direct further work aimed at developing a procedure for grouping data.  Therefore, the result of the initial grouping was to determine the initial number of data sets and to direct further work aimed at developing a procedure for grouping data.

The Procedure of Soil Data Grouping
The method, schematically depicted in Figure 4, consists in giving a two-stage order of observations. At the first stage, the data collected (observations, cases) and consisting of the variables (dependent and independent), is divided into four main sets (Z 1 , Z 2 , Z 3 , Z 4 ). Their limits are determined by: the minimum (a min ) and maximum (a max ) values, Q 1 (quartile 1), Q 2 (quartile 2-median) and Q 3 (quartile 3). They were calculated from the set of the numbers describing the order of observations, which was a consequence of the order (obtained by the appropriate ranking and weighting) determined according to the parameter selected for the arrangement (P I ). If more than one parameter was used at the same time, the rank of sums was used. Since the parameters selected to arrange at this stage showed similar correlation coefficients relative to the particles <0.02 mm, the regular type of rank with a weight of 1 was used.

The Procedure of Soil Data Grouping
The method, schematically depicted in Figure 4, consists in giving a two-stage order of observations. At the first stage, the data collected (observations, cases) and consisting of the variables (dependent and independent), is divided into four main sets (Z1, Z2, Z3, Z4). Their limits are determined by: the minimum (amin) and maximum (amax) values, Q1 (quartile 1), Q2 (quartile 2-median) and Q3 (quartile 3). They were calculated from the set of the numbers describing the order of observations, which was a consequence of the order (obtained by the appropriate ranking and weighting) determined according to the parameter selected for the arrangement (PI). If more than one parameter was used at the same time, the rank of sums was used. Since the parameters selected to arrange at this stage showed similar correlation coefficients relative to the particles <0.02 mm, the regular type of rank with a weight of 1 was used.
At the second stage, the main sets were smoothed, consisting of the secondary ordering of the data in accordance with the parameters (PII) other than at the first stage. When arranging the data, only one parameter was used at a time. After the second stage, the Zx sets were divided into four sets (defined by quartiles). Because this operation concerned the subsets, they were marked with the lower case letters, i.e., q1, q2 and q3.
Then, the multiple regression equations (Eq1, Eq2, Eq3, Eq4) were selected to predict the soil penetration resistance on the basis of the data contained between the q1-q3 quartiles, i.e., for the subsets M1, M2, M3 and M4. Such a procedure resulted in obtaining equations for soils with similar features. The data located between amin and q1, as well as q3 and amax, were rejected in the Z1 and Z4 sets. In the first case, these were the plastic soils close to the non-plastic soils; in the second case, these were soils with a very high content of fine particles, which are therefore uncommon.
To describe the dependence of the whole range of the variability of the soils researched, it was necessary to also include the omitted date. Thus, the equations Eq1/2, Eq2/3 and Eq3/4 were also selected. These were obtained for the data from the quadrant interval of the supplementary sets Z1/2, Z2/3 and Z3/4, marked on Figure 4 as M1/2, M2/3 and M3/4, where Z1/2, Z2/3 and Z3/4 were created after the first stage of ordering from the subsequent observations located on the data axis between the medians (q2) of two neighbouring main sets, i.e., Z1 and Z2, Z2 and Z3 also Z3 and Z4.  used to create regression equations (Eq 1 , Eq 1/2 , Eq 2 , Eq 2/3 , Eq 3 , Eq 3/4 , Eq 4 ) to the soil penetration resistance (PR) in relation to ordering parameters P I (stage I) and P II (stage II).
At the second stage, the main sets were smoothed, consisting of the secondary ordering of the data in accordance with the parameters (P II ) other than at the first stage. When arranging the data, only one parameter was used at a time. After the second stage, the Z x sets were divided into four sets (defined by quartiles). Because this operation concerned the subsets, they were marked with the lower case letters, i.e., q 1 , q 2 and q 3 .
Then, the multiple regression equations (Eq 1 , Eq 2 , Eq 3 , Eq 4 ) were selected to predict the soil penetration resistance on the basis of the data contained between the q 1 -q 3 quartiles, i.e., for the subsets M 1 , M 2 , M 3 and M 4. Such a procedure resulted in obtaining equations for soils with similar features. The data located between a min and q 1 , as well as q 3 and a max , were rejected in the Z 1 and Z 4 sets. In the first case, these were the plastic soils close to the non-plastic soils; in the second case, these were soils with a very high content of fine particles, which are therefore uncommon.
To describe the dependence of the whole range of the variability of the soils researched, it was necessary to also include the omitted date. Thus, the equations Eq 1/2 , Eq 2/3 and Eq 3/4 were also selected. These were obtained for the data from the quadrant interval of the supplementary sets Z 1/2 , Z 2/3 and Z 3/4 , marked on Figure 4 as M 1/2 , M 2/3 and M 3/4 , where Z 1/2 , Z 2/3 and Z 3/4 were created after the first stage of ordering from the subsequent observations located on the data axis between the medians (q 2 ) of two neighbouring main sets, i.e., Z 1 and Z 2 , Z 2 and Z 3 also Z 3 and Z 4 .

Model Variables
Values of independent (w w , ρ d ) and dependent (PR) variables of the regression model have been calculated for all cases within individual sites ( Table 2). It can be seen that measurements of the PR were performed at moisture w w close to that for matric potential pF 2. It can also be noticed that the range of PR values for cases within sites was characterized by high variability. The following values show the standard deviation or the relative standard deviation: the results of measurements of w w , ρ d and PR before grouping test. The relative standard deviation reached values around 50% of the measured PR.  Table 3 shows the values of the multiple regression coefficient (R 2 ) for the dependence of the soil PR on the gravimetric moisture content and the dry bulk density obtained during the preliminary grouping tests (stacking) of the observations in respect of the particle content <0.02 mm, with their division for a different number of sets (A; B1-B2; C1-C3; D1-D4). The best effect was achieved by dividing the observations (cases) into four sets (D1-D4), in accordance with Figure 3. Dividing the observations into more sets than 4 caused a decrease of the R 2 value. This shows that the sets D1, D2, D3 and D4 (Table 3) are synonymous with those marked in Figure 4 as Z 1 , Z 2 , Z 3 and Z 4 .

Selection of Parameters for Grouping
The presented method of the data grouping ( Figure 4) required a limited number of grouping parameters used at both stages of possible combinations. Table A2 in the Appendix A contains the calculated values of the grouping parameters. It was assumed that the parameters to be rejected would be less frequently used parameters that highly correlated with the other commonly used parameters and that were of a relatively low variability. P I (stage I) parameters were used to rank (classify) the soils. When choosing P I parameters, attention was paid to the fact that they were positively correlated with <0.02 mm and were calculated on the basis of information about the particle size distribution of soils. In the case of stage II (P II ), parameters were selected for smoothing Z x sets. Therefore, they were calculated not only on the basis of information on the granulometric composition, but also on other selected soil characteristics, affecting the soil strength.
When selecting the parameters for the first stage of ordering (division of data into sets Z 1 , Z 1/2 , Z 2 , Z 2/3 , Z 3 , Z 3/4 , Z 4 ), special attention was paid to the results of the research conducted by Prusinkiewicz and Proszek [24], who consider that the external surface area, in spite of conventionality, is a good expression of the graining of the soil samples researched, reducing all grain size analysis results to only one characteristic value. However, taking into consideration the selection of parameters to smooth subsets Z x (stage II-creating the subsets M 1 , M 1/2 , M 2 , M 2/3 , M 3 , M 3/4 , M 4 ), indicators related to the humus content in the soil were included because it is known that humus affects the soil strength [27,28]. The stability index (S) developed by Pieri [26] was also considered because the relation of humus content to the content of silt or clay fraction in soil is also important for the soil strength. The parameter determining the content of the readily-dispersible clay (RCD) related to the susceptibility of agricultural soils [25] to destruction was also considered. The Trzecki [23] equation taking into consideration the humus content was also tested to smooth the subsets when calculating the field water capacity (WPP z ).
Among the many parameters considered during preliminary work (stage I: ρ B , ρ dB , n B , WPP b , S z , Z D , S D ; stage II: RCD, S, WPP z, P L , L L ), a list of finally selected parameters used for soil grouping during the main works of stages I and II is presented in Table 4. Because 1 to 3 parameters were used during the grouping in stage I, all had to be positively correlated to particles <0.02 mm. Therefore, the reverse value (1/S z ) was used for the average grain diameter. Table 4. Parameters P of stages I and II finally adopted for grouping (see Figure 4).

Parameter
Stage I (sets: Z 1 , Z 1/2 , Z 2 , Z 2/3 , Z 3  1. Total porosity (n B )-in acc. with Brogowski [22] 2. Field water capacity-without the humus content taken into consideration (WPP b )-in acc. with Trzecki [23] 3. Specific surface (Z D ), inverse of soil average grain diameter (1/S z )-in acc. with Prusinkiewicz and Proszek [24] 1. Content of readily-dispersible clay (RCD)-in acc. with Czyż [25] 2. Stability index (S)-in acc. with Pieri [26] 3. Field water capacity-with humus content taken into consideration (WPP z )-in acc. with Trzecki [23] When testing the different combinations of the adopted method of data grouping with the use of the parameters selected for grouping (Table 4), the goal was to find a variant for which the obtained multiple regression equation was characterized by the highest matching to the experimental data, where the adjustment was expressed by the value of the determination coefficient R 2 ( Table 5). It can be seen that the R 2 values obtained for subsets M x with respect to particle content <0.02 mm (combination 0) were on average higher than calculated for sets D1-D4 (Table 3). The use of other grouping parameters and the introduction of grouping stage P II increased the value of R 2 . However, increasing the number of parameters up to 3 of P I stage no longer resulted in a significant increase in the R 2 value. Due to the determination coefficient, i.e., assuming maximum values in M x subsets or close to them, for further considerations-selection of regression equations-the combination number 9 was chosen: P I -WPP b & Z D , P II -WPP z . Table 5. Determination factor values (R 2 ) for particular subsets (M x ) and selected "best" data grouping combinations obtained for dependency of penetration resistance (PR) on moisture gravimetric content and dry bulk density.

Combination Number
Grouping  Figure 5 presents the scopes of changes in the soil parameters selected for the particular data subsets (M x ), obtained after grouping the combination with the number 9. It can be seen that the particle content ranges <0.02, Z p and Z i are similar between subsets (M x ). With the largest range of values noted for subsets M 3/4 or M 4 . The values of parameters of adjacent subsets overlap, which was intended and results from the proposed method of data grouping (Figure 4). The largest ranges of changes in parameter values presented in Figure 5 were noted in relation to the humus content (Z pr ). The maximum Z pr values were on average 6 times higher than the minimum values. The wide range of changes in the Z pr value in individual subsets of M x indirectly justifies the second stage of data grouping (P II ). Figure 5 presents the scopes of changes in the soil parameters selected for the particular data subsets (Mx), obtained after grouping the combination with the number 9. It can be seen that the particle content ranges <0.02, Zp and Zi are similar between subsets (Mx). With the largest range of values noted for subsets M3/4 or M4. The values of parameters of adjacent subsets overlap, which was intended and results from the proposed method of data grouping (Figure 4). The largest ranges of changes in parameter values presented in Figure 5 were noted in relation to the humus content (Zpr). The maximum Zpr values were on average 6 times higher than the minimum values. The wide range of changes in the Zpr value in individual subsets of Mx indirectly justifies the second stage of data grouping (PII).  Table 5): Designations: <0.02, Z p and Z i -soil particle fraction content, respectively: <0.02 mm, 0.05-0.002 mm and <0.002 mm, Z pr -soil humus content.

Characterization of Subsets after Data Grouping
After grouping the cases with the selected combination, soils with different granulometric groups were allocated to individual subsets (Table 6). This result indicates that the subsets obtained for the purpose of forecasting soil strength are not identical to a specific granulometric group. Nevertheless, the direction of changes is noticeable, i.e., from sandy loam soils to clayey soils. Table 6. Soil texture for the particular data subsets (Mx), obtained after grouping with combination number 9 (see Table 5).  SiC-silty loam, SiCL-silty clay loam; The number of cases included in a given granulometric group is given in brackets. Figure 6 shows the obtained values of the independent (ww, ρd) and the dependent (PR) variables for individual subsets of data (Mx). The proposed method of data grouping caused a range of changes of independent variables in individual subsets of Mx. Significantly higher differentiation was obtained for subsets M3/4 and M4. For PR, larger value ranges occurred for subsets M1 and M4. Figure 6. The ranges of changes in the values of the independent (ww, ρd) and the dependent (PR) variables for individual subsets of data (Mx), obtained after soil grouping with combination number 9 (see also Table 5).

Regression Equations
Because of the high variability of the soil environment and, therefore, a large dispersion of the measurement results, the diverging observations (outliers) were searched prior to selecting the regression equations (Eqx) for individual subsets (Mx). The observations exceeding the range of ± 2 standard deviations were excluded ( Table 7). The sign of the regression coefficients indicates the negative influence of the soil moisture on the value of its the penetration resistance. It can therefore be concluded that the derived equations map the tendency of soil strength changes as a function of soil moisture content in a proper manner, in accordance with the current state of knowledge.
The study confirmed that the current soil moisture is more useful for predicting the penetration resistance value than soil dry bulk density. The lack of a statistically significant impact of density (significant at the p = 0.05 levels or less) on the result of PR prediction was probably related to the fact that the research material was plastic soils with moisture similar to pF2 and a high content of clay particles ( Table 2). This assumption is substantiated by the results of studies by Mosaddeghi et al. [7], who found that the soil dry bulk density may be of little use for forecasting soil strength as the content of clay in it increases. Table 7 shows the equations obtained for PR forecasting with (equations Eqx) and without (equations Eqx') dry bulk density as a predictor. Figure 6. The ranges of changes in the values of the independent (w w , ρ d ) and the dependent (PR) variables for individual subsets of data (M x ), obtained after soil grouping with combination number 9 (see also Table 5).

Regression Equations
Because of the high variability of the soil environment and, therefore, a large dispersion of the measurement results, the diverging observations (outliers) were searched prior to selecting the regression equations (Eq x ) for individual subsets (M x ). The observations exceeding the range of ±2 standard deviations were excluded ( Table 7). The sign of the regression coefficients indicates the negative influence of the soil moisture on the value of its the penetration resistance. It can therefore be concluded that the derived equations map the tendency of soil strength changes as a function of soil moisture content in a proper manner, in accordance with the current state of knowledge. Table 7. Regression equations to calculate soil penetration resistance (PR) and their statistical evaluation.
The study confirmed that the current soil moisture is more useful for predicting the penetration resistance value than soil dry bulk density. The lack of a statistically significant impact of density (significant at the p = 0.05 levels or less) on the result of PR prediction was probably related to the fact that the research material was plastic soils with moisture similar to pF2 and a high content of clay particles ( Table 2). This assumption is substantiated by the results of studies by Mosaddeghi et al. [7], who found that the soil dry bulk density may be of little use for forecasting soil strength as the content of clay in it increases. Table 7 shows the equations obtained for PR forecasting with (equations E qx ) and without (equations E qx ') dry bulk density as a predictor.
Considering the variability of the soil environment derived within individual subsets of Mx, the regression equations E qx (Table 7) are characterized by a relatively high rating. The calculated F, p, R 2 and RMSE assessment parameters have high or satisfactory values. All the dependencies received are statistically significant-p < 0.001 (the significance level = 0.05). However, matching the equations to the experimental data measured by the determination coefficient (R 2 ) is satisfactory and only with the equation E q3/4 is it less than 0.50. RMSE values for E qx equations are close to the minimum standard deviations of the PR values measured in the field (see Table 2). Moreover, it shows that the variance inflation factor (VIF) calculated to check the degree of multicollinearity among predictors (the gravimetric moisture content and the dry bulk density) did not exceed the value of 1.6, which is indicative of multicollinearity among predictors [30].
In the case of E qx ' equations (Table 7), it can be seen that the use of humidity only for PR prediction showed that matching the equations to the experimental data measured by the determination coefficient (R 2 ) is also satisfactory and only with the equations E q3 and E q3/4 is it less than 0.50. The calculated RMSE for E qx ' equations have values similar to those obtained for E qx equations.
Due to the variety and number of factors that affect the PR measurement result [31], the equations for forecasting penetrometric resistance, obtained by other authors, are difficult to compare with the results of this work. Nevertheless, it is possible to compare the parameters of the statistical evaluation of equations obtained for the same predictors. The R 2 and RMSE values given in Table 8 are similar to the results obtained by other authors [12,13] using moisture content (w w ) as a predictor of the PR. Table 8. Affiliation criteria for particular series of Eq x ' equations (Table 7) for cases from the verification set-median values of selected soil parameters for particular subsets (M x ) of basic set. For remaining markings-see section "Materials and Methods".

Values of Soil Parameters for Particular Subsets (M x ) Column Number-Equation Number
Considering the assessment of the results presented in Table 7 and the need for simplifying the procedure, further considerations were made only in relation to the equations E qx '.
The regression equations (Table 7) were assessed with the use of the cases included in the validation set. The choice of equations (E qx ') for individual cases of the verification set was made using the criteria (soil parameters), which determine the average values calculated from the medians of two adjacent subsets (M x ) and the extreme values for the subsets M 1 and M 4 (see Table 8). The choice of equations was made in two ways. First, attention was paid to which column contains the values of soil parameters from the validation set. When more than half of the criteria used pointed to a specific column, the choice of the equation was considered complete. In the case where there was no clear indication of the column (equation), the second method was used. Each result of comparing the parameters from the validation set with the affiliation criteria was assigned a number equal to the column number (equation). In this way, a table (matrix) of results was created with numbers that could have values from 1 to 7. The sum of all numbers of the matrix divided by the number of criteria used (parameters compared), and then rounded to whole numbers, indicated the number of the equation.
In an attempt to simplify the selection of regression equations to predict PR values (Table 7), various combinations of criteria (parameters) listed in Table 8 were tested.
The obtained values of the relative prediction errors (δ p ) of the soil PR were also analyzed ( Table 9). The δ p error is the difference between the values measured and predicted divided by the values measured. It may be noted that the mean values δ p are smaller than 20%. It can be seen, taking into account the variability of the property that is PR (Table 2), that for the selection of the regression equation for the cases contained in the validation set it was sufficient to use information about Z D . Although, on average, better results were obtained using 3 or 5 criteria, the obtained values of the forecast error and its standard deviation were at a similar level. The use of up to 7 parameters to choose the equation did not significantly improve the quality of the PR prediction. Table 9. Values of mean relative error of the prognosis (δ p ) for particular equations E qx ' ( Table 7) using selected combinations of criteria listed in Table 8 and data contained in the validation set.
Parameter Used (Table 8) Values of Mean Relative Error of the Prognosis (%) For remaining markings-see section "Materials and Methods"; the standard deviation is given in the brackets.
The presented method of data grouping allowed us to obtain a series of 7 equations for predicting the PR of plastic soils. Applying the selected equation requires only one predictor (the gravimetric moisture content). To choose the equation, it is sufficient to provide information about one parameter characterizing the soil surface (the specific surface). It should be added that in the literature you can find examples of equations that require the use of more predictors [8,32,33].

Methodological Limitations
The grouping method presented in the paper as well as the obtained regression equations for forecasting soil penetrometric resistance are characterized by certain limitations.
Using the proposed method requires collecting a large set of output data. Most authors recommend that one should have at least 10 to 20 times as many observations (cases, respondents) as one has variables [34].
Verification of the proposed method, from the point of view of assessing the values of grouping parameters and variables of the regression model, may be difficult due to the measurement methods used in this work and the conditions for making measurements. The soil bulk density determination was performed using the core sampling method (volumetric cylinder method), which is the most common method used to determine bulk density agricultural soils [35]. Researchers use cylinders of different sizes, which may be affected by the result of the soil properties determined [36][37][38]. The Bouyoucos Casagrande method modified by Prószyński was used to measure the particle size distribution of soil, which is one of the methods commonly used, although automated methods are currently gaining increasing popularity [39].
The humus content was determined with the use of the Tiurin method. This method allows you to calculate the humus content based on the determined amount of organic carbon. The Tiurin method is therefore not a modern method that allows advanced analysis of soil organic matter composition [40][41][42]. Determining the humus content with the use of the Tiurin method was, however, sufficient to calculate the values of such grouping parameters as: RCD, S and WPPz (Table 4). It should be added here that, in the light of recent research, organic matter persistence in soil is seen as a property of the ecosystem [43], which means that the results of this study should be treated rather for local application. It should be emphasized, however, that the parameters used to compile the data are not a closed list (see Section 3.3). Verification of the proposed DGM can be undertaken by using data (cases) for other, available or determinable parameters of selected soils. At the same time, other measuring methods can be used to determine soil properties such as humus content or granulometric composition.
The use of regression equations obtained in this work for PR forecasting also has limitations. First, the equations were obtained for the subsoil of plastic soils with a similar original, i.e., glacial deposit. Secondly, the choice of equations is made using the affiliation criteria for a particular series of Eqx 'equations (Table 8), which are determined, with the exception of P L and L L , on the basis of soil granulometric composition. The use of a method other than the Bouyoucos Casagrande method modified by Prószyński may result in obtaining values of affiliation criteria that will cause the selection of the wrong equation for PR forecasting.

Conclusions
A new data grouping method (DGM) was developed for predicting the penetration resistance of plastic soils. The method is based on the division of the results of measurements into groups with narrow ranges of soil grain variability, taking into account humus content.
The study showed that it is possible to forecast the soil penetration resistance on the basis of two independent variables: gravimetric moisture content and bulk density. Statistical evaluation indicates that the dry bulk density is much less useful for predicting the penetration resistance of plastic soils than soil moisture. The study also showed that it is possible to forecast the soil penetration resistance on the basis of the gravimetric moisture content and the specific surface. Verification of the obtained regression equations showed that the mean relative errors of the prognosis of penetration resistance were less than 20%.
The method is universal, because it is independent of existing soil grain classifications. On the other hand, the DGM method may have some limitations. The method has been verified so far for plastic soils. Moreover, selection of equations for PR forecasting may be sensitive to equations used for soil granulometric composition determination.
Further research will focus on application of the DGM method in relation to other soil properties, for example vane shear stress and the pre-compression stress.
3. The method proposed by Prusinkiewicz and Proszek [24] was used to calculate the grain average diameter (S z ), the specific surface (Z D ) and the dispersion index (S D ). The limit values of fraction ranges, entered in mm units, are converted by the computer (TEXTURE procedures) into the values of the φ scale commonly used in sedimentology according to the Krumbein (1934Krumbein ( , 1936. Cumulative curves in which grain diameters are expressed in units of scale φ can be the basis for calculating several synthetic coefficients of granulation according to Folk and Ward (1957), including S z . The computer gives also the values of the Z D and S D . The computer calculates these values assuming that all soil grains have a spherical shape and that their density is 2.65 g·cm −3 . where: RDC-the quantity of readily dispersible clay (g/100g of soil), CL-clay content %; (fraction <0.002 mm), OM-organic matter content % (or g/100g of soil).
5. The stability index (S) was estimated with the use of the Pieri equation [26]: where: S-stability index, OM-organic matter content %, Zi-clay content %, Zp-silt content %.