1. Introduction
Hybridization among closely related species occurs frequently in nature. Under Mayr’s biological species concept, hybrid species can be defined as organisms formed by cross-fertilization between individuals of different species [1,2]. Hybrid speciation occurs in at least two ways: allopolyploid speciation and diploid (homoploid) hybrid speciation. Allopolyploidy is hybrid speciation between two species that yields a new species carrying the complete diploid chromosome complement of both parents. Diploid hybrid speciation, by contrast, results from a normal sexual event in which each gamete carries a haploid complement of the nuclear chromosomes from its parent, but the gametes that form the zygote come from different species [3]. Thus, in hybrid speciation, the new species may have the same number of chromosomes as its parents (diploid hybridization) or the sum of the chromosome numbers of its parents (polyploid hybridization).
Phylogenetic comparative methods (PCMs) are commonly applied to study correlated trait evolution; most of these methods incorporate a phylogenetic tree to represent the affinity among a group of related species [4,5,6]. However, if evolution involved ancient hybridization, a tree can no longer represent the affinity among species, and a phylogenetic network (a directed acyclic graph coupled with time constraints) should be used instead. Statistical methods that use phylogenetic networks to investigate trait evolution, including the hybridization process, are now appearing in the literature [7,8,9,10]. Note that approaches to phylogenetic analysis typically involve constructing networks from molecular data [11,12], whereas our approach takes a given phylogenetic network with known topology and branch lengths and uses it to study the evolution of traits.
The objective of our research is to examine the evolution of traits in both hybrid and non-hybrid species through the lens of reticulate evolution. Reticulate evolution involves the merging of genetic material from different species, creating hybrid offspring that exhibit a unique combination of traits inherited from their parents. Our study investigates the implications of reticulate evolution for correlated trait evolution in a linear regression framework.
The paper is organized as follows. In Section 2, we model the hybrid on the given phylogenetic network and create a phylogenetic regression model to analyze trait data that accounts for the hybrid information. In Section 3, we propose a heuristic algorithm to build the variance–covariance matrix given a phylogenetic network, together with a maximum likelihood framework for parameter estimation. In Section 4, the novel regression model is applied to study the drought tolerance of sunflowers. The discussion of this work is provided in Section 5, and the conclusions are given in Section 6.
3. Algorithm and Inference
An extended Newick format (eNewick) uses a dedicated syntax to represent a given phylogenetic network in linear form [22]. A phylogenetic network can be transformed into a phylogenetic tree with some replicated nodes, suitably tagged according to the hybrid nodes; traversing the resulting tree in postorder then yields the eNewick description of the phylogenetic network. We modified the function newick2phylog in the ade4 package [23] of the R software to handle this eNewick representation. The function newick2phylog [23] was designed to read phylogenies in Newick format and return an array with three columns, where the first column contains the ancestral nodes and the second and third columns contain the two descendants of the corresponding ancestor. Note that the number of rows (ancestors) in this array is
as a hybrid node requires two incoming ancestors while a species node only has one ancestor. The root is also included in the count. To provide an example, in a
taxa network with one hybrid (
), as in
Figure 1, we have the number of rows equal to 4, which is calculated as
. This is also shown in the following
Table 1.
The algorithm can generate the covariance matrix
by starting from the root, adding a new node in each step, and terminating when the desired matrix of
n species is built. For the tree case, each descendant has a unique ancestor. For the node with the reticulated event, the function reads a descendant such as a hybrid species with two ancestors; in one of the ancestral rows, the descendant will be listed by name, and in the other row, the descendant will have a
attached to the end of the name. After determining the ancestral–descendant relationships, we find the times from the root at which speciation events or hybridization events occur:
,
,
, ⋯, etc. Note that there are
branches, and we build the phylogenetic similarity matrix
up from the root. For times
, there are two species present whose evolution is independent given the root. The relationship matrix up until
is thus a
diagonal matrix with
t on the diagonal. For each event, we adjust the similarity matrix according to Equation (11) for the Brownian motion model to generate the variance–covariance matrix
for
n tips by starting with the root, adding a new node at each speciation or hybridization event, and terminating when the process reaches the tips. A concrete example with detailed illustration is provided in Appendix A.2.
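The root-to-tip construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's Equation (11): it assumes a plain Brownian motion with rate `sigma2`, and it assumes that a hybrid node is a weighted average of its two parents with weights summing to one. The function names, the `(name, parents)` input layout, and the toy network are all hypothetical.

```python
def network_covariance(nodes, sigma2=1.0):
    """Brownian-motion covariance over all nodes of a network.

    nodes: list of (name, parents) in topological order, so every parent
    appears before its children. parents is a list of
    (parent_name, branch_length, weight): empty for the root, one entry with
    weight 1.0 for a tree node, two entries for a hybrid node (assumed
    inheritance weights that sum to 1).
    """
    index = {}
    n = len(nodes)
    V = [[0.0] * n for _ in range(n)]
    for i, (name, parents) in enumerate(nodes):
        index[name] = i
        # Covariance with every earlier node: weighted mix of the parent rows.
        for j in range(i):
            cov = sum(w * V[index[p]][j] for p, _, w in parents)
            V[i][j] = V[j][i] = cov
        # Variance: inherited variance plus the new increment on each edge.
        var = sum(w * w * (V[index[p]][index[p]] + sigma2 * length)
                  for p, length, w in parents)
        # Cross term between the two parents of a hybrid node.
        for a in range(len(parents)):
            for b in range(a + 1, len(parents)):
                pa, _, wa = parents[a]
                pb, _, wb = parents[b]
                var += 2.0 * wa * wb * V[index[pa]][index[pb]]
        V[i][i] = var
    return V, index


def tip_covariance(nodes, tips, sigma2=1.0):
    """Restrict the full node covariance to the listed tip names."""
    V, index = network_covariance(nodes, sigma2)
    return [[V[index[a]][index[b]] for b in tips] for a in tips]


# Hypothetical toy network: root R splits into A and B at time 1;
# H is a 50/50 hybrid of A and B; X, Y, Z are tips at time 2.
example = [
    ("R", []),
    ("A", [("R", 1.0, 1.0)]),
    ("B", [("R", 1.0, 1.0)]),
    ("X", [("A", 1.0, 1.0)]),
    ("Z", [("B", 1.0, 1.0)]),
    ("H", [("A", 0.0, 0.5), ("B", 0.0, 0.5)]),
    ("Y", [("H", 1.0, 1.0)]),
]
print(tip_covariance(example, ["X", "Y", "Z"]))
```

In this toy case the two lineages present just after the root are independent, so their 2 x 2 block is diagonal, which matches the description in the text. The hybrid tip Y covaries with both X and Z, while X and Z remain uncorrelated.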
Our proposed methodology uses a feasible generalized least-squares approach to estimate the model parameters
and
, as well as the regression parameter
, through a joint estimation approach. An alternating search procedure is utilized to simultaneously obtain the estimate for
and the covariance by maximizing the likelihood of the model parameters and minimizing the squared residuals of the regression parameters, as illustrated in Algorithm 1.
Algorithm 1: Procedure for Parameter Estimation.
Require: Predictive traits , and Y, network .
Ensure: Regression estimator , hybrid vigor estimator , and rate estimator .
1: Get ordinary least-square estimates , where , p is the number of covariates.
2: Set .
3: Use the tree traversal algorithm with Equation (11) to construct the variance–covariance matrix .
4: Compute
5: Apply the Nelder–Mead method to search the maximum likelihood and and let in Equation (13).
6: Use to compute the GLS estimate .
7: if
8: return .
9: else
10: if
11: set .
12: Set and
13: if
14: Set and go to step 4.
15: else Go to step 4.
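For orientation, the two quantities that Algorithm 1 alternates between can be written in their standard textbook forms. The notation here is an assumption and is not taken from the paper's Equations (11) and (13): X denotes the design matrix of predictive traits, Y the response vector for the n tips, \beta the regression coefficients, and V the variance–covariance matrix built from the network for the current rate and hybrid parameters.

```latex
% GLS estimate of the regression coefficients for a fixed covariance V
\hat{\beta}(V) = \left( X^{\top} V^{-1} X \right)^{-1} X^{\top} V^{-1} Y

% Gaussian log-likelihood evaluated at (\beta, V)
\ell(\beta, V) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\lvert V \rvert
                 - \frac{1}{2}\,(Y - X\beta)^{\top} V^{-1} (Y - X\beta)
```

Under these assumptions, one pass of the alternating search holds \beta fixed while the likelihood is maximized over the parameters that determine V, then recomputes \hat{\beta}(V) from the updated V.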
4. Empirical Analysis
Hybridization is common in nature, with at least 25% of plant species showing hybridization. Sunflowers are an example of a group that has adapted to a wide range of environmental conditions, including soil types, temperature, and salinity. Studies show that hybridization occurs frequently among sunflowers, resulting in hybrid species. Sunflowers have various uses, including traditional Chinese medicine, edible oil, and soil phytoremediation [24]. The genus Helianthus is the subject of ongoing research on the adaptation of hybrid species to their environment. Sunflowers, in particular, have adapted to tolerate drought and saline conditions in habitats with low precipitation. Selective sweeps in sunflowers have revealed candidate genes for adaptation to drought and salt tolerance [25]. Studies have also shown that sunflowers vary in their tolerance to drought [26].
The study explores the correlation between traits and drought tolerance, with soil moisture, precipitation, and rainfall in the area considered as possible factors that affect the response variable, Y. The precipitation data used as covariates were collected from the WorldClim database [27,28]. The longitude and latitude of the sunflowers were collected from the Global Biodiversity Information Facility (GBIF) database [29], and the R package raster [30,31] was used to download the corresponding data for analysis. To further investigate the adaptation of sunflowers to drought conditions, we applied the proposed phylogenetic regression method, which can analyze trait data from both hybrid and non-hybrid species. The method was applied to a group of common sunflowers, Helianthus annuus, using data from the efloras database [32]. The collected traits include the plant height, petiole, pedicel, hemispherical bract, bract, stalk, leaf, ray flower, disk, corolla, and calyx achene of sunflowers. The predictor variable was the annual precipitation measured at various locations, obtained from the WorldClim database using the raster package. For example, the precipitation data for uncommon species located at
latitude degrees and
longitude degrees were obtained with a setting resolution of
minutes.
Table 2 presents the response traits of sunflowers: annuals, petioles, peduncles, involucres, phyllaries, paleae laminae, ray florets, disc florets, corollas, cypselae, and pappi. The covariate is the annual precipitation (AnnPrec), which represents the yearly precipitation at the locations of the observed sunflowers.
This dataset relates the response traits of sunflowers to the annual precipitation at their growing locations. Such relationships could have implications for plant breeding and cultivation in regions with varying precipitation, so the correlations and trends in the data of Table 2 warrant further investigation.
The network in Figure 4 is modified from [33], where 11 sunflower species are given at the genus level.
To investigate whether precipitation has a significant impact on traits, it is necessary to check whether the regression slope is zero, represented by the null hypothesis
. The results of the analysis using the phylogenetic regression model are presented in
Table 3. The table reports GLS estimates for
, along with its 95% confidence interval, as well as estimates for the rate parameter
and the hybrid parameter
.
Table 3 provides the estimates of hybrid effect
, rate of evolution
, and slope
, along with their corresponding standard errors, for the different response traits in this study. The table also indicates whether the slope estimate is statistically significant at the 5% significance level (the “Significant?” column is set to Yes or No).
For example, for the response trait “Annuals”, the hybrid effect estimate is with a standard error of 0.07, indicating that the response of annuals has moderate hybrid weakness among sunflower species. The rate of evolution estimate is with a standard error of 0.087, indicating that the evolutionary rate of annuals is relatively slow. The slope estimate is with a 95% CI of , suggesting that precipitation has a significant positive effect on the trait. The significance of the effect is indicated by the “Significant?” column, which shows “Yes” for a significant effect based on the 95% confidence interval of the slope estimate.
Similarly, the second-to-last row for the response trait Cypselae indicates that the hybrid effect estimate is (hybrid vigor) with standard error , the rate of evolution estimate is with standard error , and the slope estimate is with a 95% confidence interval . Additionally, the slope estimate is statistically significant (significance set to Yes) for this trait.
In summary, the table provides estimates and corresponding standard errors for the hybrid effect, rate of evolution, and slope, along with their significance levels for different response traits in a study. These estimates can be used to make inferences about the relationship between the variables being studied and the response traits under consideration.
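The link between a 95% confidence interval and the “Significant?” column can be illustrated with a short sketch. It assumes a normal-approximation (Wald) interval, which may differ from how the intervals in Table 3 were computed, and the numbers used are hypothetical rather than values from the table.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% interval: estimate +/- z * standard error (assumed form)."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


def is_significant(interval):
    """The slope differs from zero at the 5% level if the interval excludes 0."""
    lower, upper = interval
    return lower > 0.0 or upper < 0.0


# Hypothetical slope estimate and standard error, for illustration only.
slope_hat, slope_se = 0.8, 0.3
ci = wald_interval(slope_hat, slope_se)
print(ci, "Yes" if is_significant(ci) else "No")
```

An interval that lies entirely above zero corresponds to a significant positive effect of precipitation on the trait, and one that straddles zero corresponds to “No”.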
We further evaluate the correlations among the parameter estimates , and using the 12 sunflower trait datasets; there is a moderate positive correlation () between the rate of evolution () and the regression slope (), suggesting that an increase in the rate of evolution is associated with an increase in the magnitude of the regression slope. There is a moderate negative correlation () between the rate of evolution () and the hybrid effect parameter (), suggesting that an increase in the rate of evolution is associated with a decrease in the magnitude of the hybrid effect parameter. There is a weak negative correlation () between the regression slope () and the hybrid effect parameter (), suggesting that there is a weak relationship between these variables, and as the hybrid effect parameter increases, the regression slope tends to decrease, but the relationship is not particularly strong.
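As a reminder of how such pairwise correlations can be computed across the per-trait estimates, the sketch below implements the Pearson correlation coefficient. The text does not state which correlation coefficient was used, so Pearson is an assumption, and the input vectors are hypothetical placeholders rather than the estimates from Table 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-trait estimates (one entry per trait), for illustration only.
rate_estimates = [0.10, 0.25, 0.40, 0.55]
slope_estimates = [0.02, 0.05, 0.04, 0.09]
print(pearson(rate_estimates, slope_estimates))
```

Each parameter contributes one estimate per trait, so the coefficient summarizes how two sets of estimates vary together across traits.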
We performed a benchmark analysis to evaluate the proposed methodology. The baseline for comparison is a simple linear regression model. A second comparison model is the tree model, which assumes Brownian motion on a phylogenetic tree [34]. Although these existing models are not directly comparable to the network model, the analysis provides baseline estimates against which the performance of the proposed methodology can be compared. The results are shown in Table 4. The first row of the table compares the models using the “Annuals” trait. The tree model has a benchmark ratio of 1.006, indicating that its RMSE is 0.6% higher than that of the linear regression model. Similarly, the network model has a benchmark ratio of 1.077, which means that its RMSE is 7.7% higher than that of the linear regression model. The tree model therefore performs slightly worse than the linear regression model, and the network model performs worse still. This is expected because the network model is more complex. Despite the larger RMSE values obtained from the network model, the values remain reasonable when compared to the baseline model.
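The benchmark ratio discussed above is a ratio of root-mean-square errors, and the following sketch shows how such a ratio translates into a percentage difference. The function names are hypothetical and the fitted values are toy data; only the ratios 1.006 and 1.077 come from the text.

```python
import math


def rmse(observed, fitted):
    """Root-mean-square error between observed and fitted values."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / n)


def benchmark_ratio(observed, fitted_model, fitted_baseline):
    """RMSE of a candidate model divided by the RMSE of the baseline model."""
    return rmse(observed, fitted_model) / rmse(observed, fitted_baseline)


def percent_above_baseline(ratio):
    """Convert a benchmark ratio into a percentage change relative to baseline."""
    return (ratio - 1.0) * 100.0


# Ratios reported in the text for the "Annuals" trait.
for label, ratio in [("tree", 1.006), ("network", 1.077)]:
    print(label, round(percent_above_baseline(ratio), 1))
```

A ratio above 1 means the candidate model has a larger RMSE than the linear regression baseline, and a ratio below 1 would mean a smaller one.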
5. Discussion
The model utilized to examine trait values in phylogenetic networks through hybridization modeling is of fundamental importance and represents an essential tool in the analysis of this type of data. There is room for improvement by using more appropriate representations for the hybrid R based on its parents X and Y to find suitable functions , which would allow us to model events such as horizontal gene transfers or recombination that are biologically different from hybridization and can affect trait values.
We acknowledge that the covariance structure
is complex, which creates difficulties in demonstrating the positive definiteness of the Hessian matrix of the likelihood function. This makes it challenging to ensure that the likelihood is jointly convex in all parameters. However, our regression model meets certain conditions, including having a well-defined likelihood function and satisfying the assumption of non-singularity. Our empirical analysis confirms that our method achieves the global maximum within its domain. This is supported by the fact that
is positive definite for each dataset, as detailed in
Appendix A.1.3.
Several avenues of future research could enhance the model’s capability to analyze phylogenetic network data. First, the model could be extended to include more complex evolutionary processes, such as the Ornstein–Uhlenbeck (OU) model [35] or the early burst model [36]. The OU model could be implemented by introducing a force parameter
to the covariance matrix construction, and the optimization process would then require a multidimensional search. For instance, when implementing the OU process [35], one would need to account for its non-independent increments when constructing the covariance matrix. One could also consider implementing non-Gaussian processes [37] for trait evolution on the network. Second, the algorithm could be generalized to handle hard polytomies by analyzing multifurcating phylogenetic networks for regression analysis [38].
It is also worthwhile to consider situations in which traits follow probability distributions other than the normal distribution, and to evaluate the robustness of the proposed methodology when the normality assumption is not met. In particular, researchers should examine model misspecification problems [39] and study the consequences of non-normal distributions for the efficacy of the model, as has been done in previous studies [40].
Incorporating more parameters into the model would allow richer interactions with the hybrid parameters, particularly in the context of richer models such as the OU and early burst models. Furthermore, future work could explore the integration of discrete character evolution or the joint analysis of both discrete and continuous characters [41,42], as well as extend the proposed approach to accommodate diverse types of trait distributions. Such extensions would contribute to a better understanding of the evolution of biological traits and may have practical applications in fields such as conservation biology and agriculture [43].
6. Conclusions
We propose a phylogenetic regression model that incorporates a network structure to examine trait evolution in the context of reticulation events. Parameters are estimated by maximum likelihood, a widely used method in evolutionary biology that selects the parameter values maximizing the probability of the observed data. The variance–covariance matrix, a crucial component of the model, is built by an algorithm that takes a phylogenetic network in eNewick format as input. The model is applied to investigate the response of traits of the common sunflower, Helianthus annuus, to drought conditions.
Overall, the proposed model and associated methods offer a novel approach to studying trait evolution in the context of reticulation events. By applying the model to the common sunflower and investigating its response to drought conditions, new insights can be gained into the evolutionary patterns of this important species.