Inﬂuential Points in Adaptability and Stability Methods Based on Regression Models in Cotton Genotypes

: The aim of this work was to answer the following question: can inﬂuential points modify the recommendation of genotypes, based on regression methods, in the presence of genotype × environment (G × E)? Therefore, we compared the parameters of the adaptability and stability of three methodologies based on regression in the presence of inﬂuential points. Speciﬁcally, were evaluated methods based on simple, non-parametric and quantile ( τ = 0.50) regressions. The dataset used in this work corresponds to 18 variety trials of cotton cultivars that were conducted in the 2013–2014 and 2014–2015 crop seasons. The evaluated variable was the cotton ﬁber yield (kg/ha). Once we noticed that the effect of G × E interaction is signiﬁcant, the statistical procedures adopted for the adaptability and stability analysis of the genotypes. To verify the presence of a possible inﬂuential point, we used the leverage values, studentized residuals (SR), DFBETAS and Cook’s distance. As a result, the inﬂuential points can modify the recommendation of genotypes, based on regression methods, in the presence of G × E interaction. The non-parametric and quantile ( τ = 0.50) regressions, which are based on median estimators, are less sensitive to the presence of inﬂuential points avoiding misleading recommendations of genotypes in terms of adaptability.


Introduction
The knowledge of the genotype × environment (G × E) interaction component is important for plant breeding programs. If this component is significant, it is possible that superior genotypes in one environment may not be in another [1,2]. Despite its importance, the GE interaction does not provide detailed information on the behavior of each genotype in relation to environmental variations [3][4][5].
The literature presents several adaptability and stability methodologies that allow for the identification and recommendation of superior cultivars in different environments. There are methodologies based on a simple [6], segmented [7] and quantile regression (QR) [8,9]; non-parametric methodologies [10,11], multiple centroid and centroid modified [12,13].
In the presence of influential points, methodologies based on regression may present inadequate estimates, and overestimate or underestimate the adaptability parameter. Once the recommendation of a genotype is made considering a set of environments, the use of a methodology that does not remove a possible influential point and mitigate some possible differential effect is interesting. In order to overcome these issues, Nascimento et al. [11] proposed the use of non-parametric regression [14] for situations in which there are influential points. Despite its usefulness, non-parametric regression does not present well-defined statistical properties, since it is based on the calculation of medians from the data set. Another regression approach to adaptability (the differential response of genotypes to different environmental conditions) and stability (the ability of genotypes present predictable behavior as to different environmental conditions) studies is QR [8]. Unlike traditional regression methods, which use the mean (central value), QR allows the functional relationship between environmental variation and the phenotypic response for any quantile of interest to be explained. In this study, the authors showed the efficiency of the QR to deal with the presence of outliers.
The QR has also been used successfully in several areas of knowledge, such as genomics [15,16] and agricultural sciences [8,9,17]. Unlike the non-parametric regression [14], the QR is based on the sum of the weighted errors to estimate its parameters, in addition to allowing an adjustment of regression models in different parts of the distribution.
In this study, we seek to answer the following question: can influential points modify the recommendation of genotypes in the presence of a G × E interaction? Therefore, the aim of this work was to compare the parameters of adaptability and stability of three methodologies based on regression (linear, non-parametric and quantile regressions) in the presence of influential points. For this, we used data from 18 variety trials of cotton cultivars conducted in Brazil. In addition, a synthetic dataset, obtained from the experimental dataset, was also analyzed to assess the effect of influential points on adaptability and stability analysis and the behavior of the measures used to detect influential points.

Experimental Data
The dataset used in this work corresponds to 18 variety trials of cotton cultivars that were conducted in the 2013-2014 and 2014-2015 crop seasons. The evaluated variable was the cotton fiber yield (FY, kg/ha). The combinations of sites and cropping seasons of Brazilian Cerrado, whose edaphoclimatic characteristics are shown in Table 1, was considered as an environment. The experimental design was randomized complete blocks with 12 treatments with four replicates each. The genotypes used were TMG 41 WS, TMG 43 WS, IMA CV 690, IMA 5675 B2RF, IMA 08 WS, NUOPAL, Delta Pine DP 555 BGRR, DELTA OPAL and Embrapa cultivars BRS 286, BRS 335, BRS 368 RF and BRS 369 RF. Characteristics of each environment are shown in Table 1.
The culture practices were the ones commonly used for growing cotton, including the use of herbicides for weed control and pest control, according to the integrated management of pests recommended for crops in the region. The experimental units consisted of four 5.0-m rows, spaced with 0.90 m between rows, with nine plants per meter in each row. The plots were managed according to the local production recommendations of each test site. In each experimental unit, a random sample of 20 bolls was taken to determine the percentage of fibers using an HVI (high volume instrument). Cotton seed yield was evaluated in the two central rows by mechanically harvesting 4 m of each line, scattering 0.5 m at each end of the plot (border), correcting to 13% of moisture and extrapolating to kilograms per hectare. Finally, cotton fiber yield was obtained by multiplying the cotton seed yield and the percentage of fibers in each experimental unit.

Statistical Analysis
The following joint analysis variance was performed to verify if the effect of GE interaction is significant following the model: where Y ijk is the observation in the kth block, evaluated in the ith genotype and jth environment; µ is the overall mean; E j is the effect of the jth environment, considered as random; B/E jk is the effect of block k within environment j, considered as random; G i is the effect of the ith genotype, considered as fixed; G × E ij is the random effect of the genotype i environment j interaction; and e ijk is the random error associated with the Y ijk observation. Before conducting the joint ANOVA, the homogeneity between all the environments was verified according to [1], in which the ratio between the highest and lowest residual mean square was less than 7. Once we noticed that the effect of GE interaction was significant, the statistical procedures adopted for the adaptability and stability analysis of the genotypes were those proposed by Eberhart and Russell [6], non-parametric [11,14] and quantile [8,18] regressions.
Eberhart and Russell's methodology [6] is based on the following simple linear regression: where Y ij is the observation of the ith (i = 1, 2, . . . , g) genotypes in the jth (j = 1, 2, . . . , a) environment; β ER 0i is the general mean for the ith genotype; β ER 1i is the regression coefficient; ); δ ij is equal to the regression deviation of the ith cultivar in the jth environment; and e ij is the effect of the mean experimental error.
The non-parametric regression estimator for the slope [11,14] is the median of the slopes S ikl = (Y il −Y ik ) (I l −I k ) determined by all pairs of sample points. Therefore,β TH 1i = median{S ikl , 1 ≤ k < l ≤ a}, where a is the number of environments. The estimator of the intercept is given byβ TH 0i = median Y ij −β TH 1i median I j .
The estimates from QR model [8,18] are given by the solution of the following optimization problem: where τ ∈ [0, 1] τ indicates the quantile used and ρ τ is the check function [18]: In order to deal with influential points, the value of τ was defined as equal to 0.5; that is, the median quantile regression.
The stability parameter for the genotypes were defined by R 2 ER R 2 TH and R 2 QR 0.5 for, respectively, Eberhart and Russel [6], non-parametric and quantile regressions. Genotypes that present values of R 2 lower than 0.70 were classified as having a low predictability.
The hypothesis, H 0i : β 1i = 1, was assessed using the t test with m degrees of freedom (m is the number of degrees of freedom of the residual obtained in the joint analysis), whose statistics are given by t = where MSE is the mean square error from the joint analysis variance and r is the number of replicates [1].

Synthetic Data
To assess the effect of influential points on adaptability and stability analysis, the experimental dataset was changed. According to Tukey [19], any data point that fell outside of either 1.5 times the IQR (Interquartile range) below the first or 1.5 times the IQR above the third quartile is considered an outlier. In a regression point of view, an outlier is a data point whose response Y ij does not follow the general trend of the rest of the data. However, an outlier cannot present any problem in the regression fitting, since this point can be near to the mean of the independent variable (I j : Environmental Index). Thus, to define an influential point, other measures are needed.
The leverage is a measure of how far away the values of an observation are from of the mean of this variable [20]. Therefore, a possible influential point was defined considering those points that presented high leverage values and fell outside of either 1.5 times the IQR below the first or 1.5 times the IQR above the third quartile. Specifically, we added the influential points in three genotypes (genotypes 1, 6 and 12, without loss of generality) by changing the observed value by one lower or higher than 1.5 times the interquartile range (IQR). The influential points are lower or higher than 1.5 times the interquartile range (IQR) and were added in environments that presented a lower, medium and higher environmental index (leverage) in the original data set. The situation with a medium leverage value was chosen to verify the influence of an outlier located near to the mean of the independent variable.

Detecting Influential Points
To verify the presence of a possible influential point, we used the leverage values, studentized residuals (SR), DFBETAS and Cook's distance [20].
According to Fox [21], observations that present values higher than 2 × number of independent variables +1 number of environments , 2 √ number of environments , 1 and 2, for, respectively, the leverage, studentized residuals, DFBETAS and Cook's distance, deserve attention. In practice, we carefully analyzed those genotypes that presented values jointly higher than cutoffs for leverage and SR or/and those that presented values higher than thresholds defined for DFBETAS and Cook's distance.

Computational Features
The possible influential points were evaluated by using stats package [22]. The models fitting were carried out by using mblm (to fit the non-parametric regression) and rq (to fit the quantile regression) functions of the packages mblm [23] and quantreg [24] of R software [22], respectively. The Eberhart and Russell [6] methodology was carried out by using genes software [25]. The dataset analyzed, as well the R software codes used, during the current study are available from the corresponding author on reasonable request.

Analysis of Cotton Yields in Different Environments
The joint analysis of variance for the data of yield of 12 cotton genotypes showed differences (p < 0.05) for environments (E), genotypes (G) and the G × E interaction ( Table 2). The significance of the G × E interaction effects indicates the differential performance of genotypes in different environments justifying the use of adaptability and stabilities analysis.

Potential Influential Points on the Experimental Data
According to the boxplot (Figure 1) in the environments 1, 9, 10, 15 and 18, some genotypes presented phenotypic values that differ significantly from other observations, that is, outliers (Figure 1). The leverage values range from 0.06 to 0.31. The two higher leverage values were observed for environments 1 (0.22) and 17 (0.31) ( Figure 1 and Table S1). The further I j is from I, the larger the leverage is ( Figure 1).

Yield Adaptability and Stability from Experimental Data
The genotypes TMG 41 WS (genotype one), TMG 43 WS (genotype two), BRS 286 (genotype nine) and BRS 368 RF (genotype 11) presented discordant classifications in relation to the parameter of adaptability (Table 3). Specifically, the genotypes one and nine (TMG 41 WS and BRS 286) were characterized as having wide adaptability by Eberhart and Russell and non-parametric regression [11,14] methodologies, and were recommended for unfavorable environments by the quantile regression. Genotype two (TMG 43 WS), recommended for unfavorable environments by Eberhart and Russell and non-parametric regressions, was characterized as having wide adaptability by the quantile regression methodology. Differently from the other methods (non-parametric and quantile regressions), genotype 11 (BRS 368 RF) was considered as having wide adaptability by Eberhart and Russell. Although these genotypes presented discordant classification, according to the defined cutoffs (leverage > 0.22 and SR > 2; DFBETAS > 0.47 or CD > 1), they do not present any influential points (Tables S1-S3 and Figures S1-S4). Therefore, the classification considered is that provided by the Eberhart and Russell methodology. On the other hand, genotype five (IMA 08 WS), that presents a possible influential point at environment 17 (SR = 2.71; CD = 1.17), does not present discordant classification in relation to the parameter of adaptability (Tables S1 and S3). Specifically, this genotype was classified as recommended for unfavorable environments by the three methodologies being studied (Table 3). The estimates of absolute studentized residuals (SR), DFBETAS and Cook's distance (CD) range from 0 to 5.03, 0 to 0.21 and from 0 to 1.17, respectively (Tables S1-S3). Considering the environment with leverage values higher than the threshold (0.22), only genotype five (IMA 08 WS) presented values of absolute studentized residuals (2.71) and CD (1.17) higher than the thresholds (2.00 and 1.00), showing that this genotype deserves attention (Tables S1-S3).

Yield Adaptability and Stability from Experimental Data
The genotypes TMG 41 WS (genotype one), TMG 43 WS (genotype two), BRS 286 (genotype nine) and BRS 368 RF (genotype 11) presented discordant classifications in relation to the parameter of adaptability (Table 3). Specifically, the genotypes one and nine (TMG 41 WS and BRS 286) were characterized as having wide adaptability by Eberhart and Russell and non-parametric regression [11,14] methodologies, and were recommended for unfavorable environments by the quantile regression. Genotype two (TMG 43 WS), recommended for unfavorable environments by Eberhart and Russell and nonparametric regressions, was characterized as having wide adaptability by the quantile regression methodology. Differently from the other methods (non-parametric and quantile regressions), genotype 11 (BRS 368 RF) was considered as having wide adaptability by Eberhart and Russell. Although these genotypes presented discordant classification, according to the defined cutoffs (leverage > 0.22 and SR > 2; DFBETAS > 0.47 or CD > 1), they do not present any influential points (Tables S1-S3 and Figures S1-S4). Therefore, the classification considered is that provided by the Eberhart and Russell methodology. On the other hand, genotype five (IMA 08 WS), that presents a possible influential point at environment 17 (SR = 2.71; CD = 1.17), does not present discordant classification in relation to the parameter of adaptability (Tables S1 and S3). Specifically, this genotype was classified Overall, the genotypes present the low stability. The intercept estimated from the Eberhart and Russell method of each genotype (Table 3) is given by the average of the genotypes over environments. On the other hand, non-parametric and quantile regression methodologies use strategies based on the median. The estimated values for the intercept parameter do not present higher differences (Table 3).
The estimates of absolute studentized residuals (SR), DFBETAS and Cook's distance (CD) ranged from 0 to 10.78, 0 to 0.46 and from 0 to 2.04, respectively (Tables S4-S6). Considering these thresholds (Leverage > 0.22 and SR > 2; DFBETAS > 0.47 or CD > 1), genotypes one (TMG 41 WS), five (IMA 08 WS) and 12 (BRS 369 RF) deserve attention and should be analyzed carefully. According to the used thresholds, genotype six (NUOPAL), which was added an influential point, does not present any problem and can be analyzed considering the Eberhart and Russell classification.

Yield Adaptability and Stability from Synthetic Data
Among the genotypes that deserve attention, two of them (genotypes 1-TMG 41 WS and 12-BRS 369 RF) correspond to those to which the influential point was added. These genotypes presented discordant classifications in relation to the parameter of adaptability (Table 4). Genotype one (TMG 41 WS) was recommended for unfavorable environments by Eberhart and Russell; it was characterized as having wide adaptability by non-parametric and quantile regression methodologies (Table 4 and Figure S5). On the other hand, genotype 12 (BRS 369 RF) was characterized as having wide adaptability using this method and was recommended for favorable environments by non-parametric and quantile regression methodologies (Table 4 and Figure S6). The recommendation of the other genotypes can consider the Eberhart and Russell results. Genotype six (NUOPAL) was not affected by the "influential point" (Table 4 and Figure S7).  [6] method (based on standard least squares);β TH 0 andβ TH 1 : intercept and parameter of adaptability of Non-parametric Regression [11,14] method;β QR 0.5 0 : intercept of Quantile Regression [8,24] method;β ER 1 : parameter of adaptability using the Eberhart and Russell [6] method;β TH 1 : parameter of adaptability using the Nonparametric Regression [11,14] method;β QR 0.5 1 : parameter of adaptability using the Quantile Regression [8,18] method; R 2 ER : coefficient of determination using the Eberhart and Russell [6] method; R 2 TH : coefficient of determination using the Non-parametric Regression [11,14] method; R 2 QR 0.5 : coefficient of determination using the Quantile Regression [8,18] method; σ 2 di−ER : regression deviation using the Eberhart and Russell [6] method. : parameter of adaptability using the Quantile Regression [8,18] method; R 2 ER : coefficient of determination using the Eberhart and Russell [6] method; R 2 TH : coefficient of determination using the Non-parametric Regression [11,14] method; R 2 QR 0.5 : coefficient of determination using the Quantile Regression [8,18] method; σ 2 di−ER : regression deviation using the Eberhart and Russell [6] method.
The stability parameter for the genotypes affected by the influential points is defined through R 2 TH or R 2 QR 0.5 for non-parametric and quantile regressions, respectively. Overall, the genotypes where the influential point was added and deserves attention, were classified as having low predictability R 2 TH and R 2 QR 0.5 < 0.70) ( Table 4). These results agree with those obtained using the Eberhart and Russell methodology, which uses the component of variance from the regression deviations as a measure of stability. The intercept's estimates presented similar results (Table 4).
In summary, the presence of the influential points underestimates and overestimates the adaptability parameter estimates obtained using Eberhart and Russell, non-parametric and quantile regression methodologies (Table 5). Table 5. Summary of changes in the classification in relation to the parameter of adaptability obtained using the Eberhart and Russell (1966), non-parametric and quantile Regression methodologies for three early cotton genotypes that deserves attention and should be analyzed carefully according to the measures used to assess influential points.

Discussion
In this study, we aimed to evaluate the effect of influential points in methodologies based on regression for adaptability and stability parameter estimation. The Eberhart and Russell [6] methodology, and non-parametric [11,14] and quantile [8,18] regressions were compared. These methodologies are, respectively, based on standard least squares [6] and median (non-parametric and quantile (τ = 0.5) regressions) estimators. In order to do so, a real data set consisting of a yield of 12 early cotton genotypes evaluated in 18 Brazilian Cerrado environments and synthetic data were used.
According to Draper and John [26], an outlier is an observation that provides a large residual. However, an outlier does not necessarily affect the fitted equation, mainly when the outlier presents a low leverage value. To assess the influence of a specific observation in the estimation process, we used some measures to verify the presence of a possible influential point. In the first one, the leverage was used jointly with the studentized residual. The leverage summarizes the potential influence of Y i on all the fitted values according to Fox [21]. The further I j is from I, the larger the leverage is. In adaptability and stability methods based on regression models, since the independent variable is defined by an environmental index, the environments far away from the mean (zero) can have a considerable impact on the fitted value. Environment 17 (Chapadão do Sul, MS-CHA) presented a leverage value higher than the threshold (0.22) and, therefore, could have impacted the fitted value. However, the leverage presents one issue. A point with high leverage may or may not be influential since the leverage depends only on the independent variable (I j ). Therefore, another measure is needed to detect an influential point. The studentized residual was used jointly with the leverage for detecting possible influential points, since observations that combine higher leverage with a higher studentized residual exert substantial influence on the regression coefficients. The second one was based on measures direct of the impact on each coefficient through deleting each observation. Nascimento et al. [11] defined the variation, in modulus, between the estimators of the slope coefficient estimated using the methods least squares and using the non-parametric regression method as a measure of the direct influence of one point. Nevertheless, it is usual to scale this measure using the coefficient standard error. The DFBETAS measure the standardized change in regression coefficients. According to Fox [21], values of DFBETAS higher than one or two indicate an influential point. However, since in adaptability and stability studies, the number of environments is small, it is recommended that the sizeadjusted threshold proposed by Belsley   . Cook's distance measures the "distance" between the slopes estimated without and with the observation. In general, Cook's distance was more sensitive compared to the DFBETAS. Differently of the DFBETAS, Cook's distance was able to detect the inserted influential points. Thus, it is suggested that more than a simple measure is used to assess possible influential points.
After the evaluation of influential points, the analyses of adaptability and stability were performed. Overall, the methodologies based on medians (non-parametric and quantile regressions) were more appropriate in the presence of influential points (synthetic data). These methods are less sensitive in the presence of influential points (as observed in the synthetic data) compared with least square estimators [11,28]. According to John [29], since the median has a bounded influence function, the effect of an outlier on a sample median is bounded, no matter how far the outlying observation is. Therefore, methodologies based on median estimators seem more adequate in the presence of influential points. Nascimento et al. [11] showed that non-parametric regression [14] reduces the influence of influential points resulting from the presence of genotypes with answers to a certain environment that are too different on the estimation of the adaptability parameter. Barroso et al. [8] showed that the quantile regression (τ = 0.5) methodology provides superior results compared to Eberhart and Russell [6] in the presence of influential points.
These results are similar to those obtained in this work. The use of one methodology that is less sensitive to the presence of influential points can avoid misclassification, as it seemed for genotypes 1 and 12 considering the synthetic data. Genotype five, in which was detected by a potential influential point, does not present discordant classification. Thus, in a context of adaptability and stability analysis, only those genotypes that present a discordant classification between methodologies should be analyzed carefully.
Overall, the effect of an influential point in the adaptability parameter can be summarized considering the results obtained for genotypes 1 (TMG 41 WS) and 12 (BRS 369 RF) from the synthetic dataset. Specifically, environments 17 (CHA) and 8 (CV2), respectively, underestimate and overestimate the adaptability parameter estimates obtained using the Eberhart and Russell methodology. However, to identify an influential point is not a simple task. In an adaptability and stability study context, the research should carefully analyze an environment that greatly affects adaptability parameters. To do that, it is interesting to use, in addition to current experience, the measures presented in this manuscript. Regarding the stability parameter, the genotypes do not present higher differences. The intercept estimator from the Eberhart and Russell method of each genotype is given by the average of the genotypes over the environments. On the other hand, non-parametric and quantile regression methodologies use strategies based on the median. Despite it, the estimated values for the intercept parameter do not present higher differences.
Finally, once the recommendation of a genotype is made in a general way for several environments, any methodology that mitigates a differential effect from a specific environment into a genotype seems interesting.

Conclusions
The influential points can modify the recommendation of genotypes, based on regression methods, in the presence of a G × E interaction. The non-parametric and quantile (τ = 0.5) regressions based on median estimators are less sensitive to the presence of influential points, avoiding misleading recommendations of genotypes in terms of their adaptability in the presence of influential points. The recommendation of genotypes by the Eberhart and Russell methodology in the presence of an influential point can under or overestimate the adaptability parameter estimates causing a loss of information, time and financial resources. On the other hand, when the dataset does not present any problems related to influential points, the Eberhart and Russell methodology should be used in the genotype classification.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/agronomy11112179/s1, Figure S1: Estimated lines by Eberhart and Russell (1966), nonparametric and quantile regressions (τ = 0.50) to the genotype TMG 41 WS, considering the experimental data, Figure S2: Estimated lines by Eberhart and Russell (1966), non-parametric and quantile regressions (τ = 0.50) to the genotype TMG 43 WS, considering the experimental data, Figure S3: Estimated lines by Eberhart and Russell (1966), non-parametric and quantile regressions (τ = 0.50) to the genotype BRS 286, considering the experimental data, Figure S4: Estimated lines by Eberhart and Russell (1966), non-parametric and quantile regressions (τ = 0.50) to the genotype BRS 368 RF, considering the experimental data., Figure S5: Estimated lines by Eberhart and Russell (1966), non-parametric and quantile regressions (τ = 0.50) to the genotype TMG 41 WS, considering the synthetic data, Figure S6: Estimated lines by Eberhart and Russell (1966), non-parametric and quantile regressions (τ = 0.50) to the genotype NUOPAL, considering the synthetic data, Figure S7: Estimated lines by Eberhart and Russell (1966), non-parametric and quantile regressions (τ = 0.50) to the genotype BRS 369 RF, considering the synthetic data, Table S1: Absolute studentized residuals, leverage and environmental index considering the experimental data, Table S2: Difference in adaptability parameter estimated for each genotype with and without the environment (DFBETA) considering the experimental data, Table S3: Cook's distance for the experimental data, Table S4: Absolute studentized residuals, leverage and environmental index considering the synthetic data, Table S5: Difference in adaptability parameter estimate for each genotype with and without the environment (DFBETA) considering the synthetic data, Table S6: Cook's distance for the synthetic data.