A Phylogenetic Regression Model for Studying Trait Evolution on Network

Dwueng-Chwuan Jhwueng

doi:10.3390/stats6010028

Department of Statistics, Feng-Chia University, Taichung 40724, Taiwan

Stats 2023, 6(1), 450–467; https://doi.org/10.3390/stats6010028

This article belongs to the Section Regression Models

Abstract

A phylogenetic regression model that incorporates the network structure allowing the reticulation event to study trait evolution is proposed. The parameter estimation is achieved through the maximum likelihood approach, where an algorithm is developed by taking a phylogenetic network in eNewick format as the input to build up the variance–covariance matrix. The model is applied to study the common sunflower, Helianthus annuus, by investigating its traits used to respond to drought conditions. Results show that our model provides acceptable estimates of the parameters, where most of the traits analyzed were found to have a significant correlation with drought tolerance.

Keywords: regression model; phylogenetic comparative analysis; variance–covariance matrix; reticulate evolution; Brownian motion

1. Introduction

Hybridizations among closely related species have frequently occurred in nature. Under Mayr’s biological species concept, hybrid species can be defined as organisms formed by cross-fertilization between individuals of different species [1,2]. Hybrid speciation occurs in at least two ways: allopolyploid speciation and diploid (homoploid) hybrid speciation. While allopolyploidy is hybrid speciation between two species resulting in a new species that has the complete diploid chromosome complement of both its parents, diploid hybrid speciation results from a normal sexual event in which each gamete has a haploid complement of the nuclear chromosomes from its parent, but gametes that form the zygote come from different species [3]. This means that, in hybrid speciation, the new species may have the same number of chromosomes as its parent (diploid hybridization) or the sum of the number of chromosomes of its parents (polyploid hybridization).

Phylogenetic comparative methods (PCMs) are commonly applied to study correlated trait evolution; most methods were developed by incorporating a phylogenetic tree to represent the affinity among a group of related species [4,5,6]. However, if evolution involved ancient hybridizations, then we cannot simply use the phylogeny to represent the affinity among species, but instead should use the phylogenetic network (which is a directed acyclic graph, coupled with time constraints). Currently, in the literature, we can observe the development of statistical methods using phylogenetic networks to investigate trait evolution including the hybridization process [7,8,9,10]. Note that approaches to phylogenetic analysis typically involve constructing networks using molecular data [11,12], while our approach employs the given phylogenetic network with known topology and branch lengths to study the evolution of traits.

The objective of our research is to examine the evolution of traits in both hybrid and non-hybrid species, specifically through the lens of reticulate evolution. This phenomenon involves the merging of genetic material from different species, resulting in hybrid offspring that exhibit a unique combination of traits inherited from their parents. Our study investigates the implications of reticulate evolution for correlated trait evolution in a linear regression framework.

The paper is organized as follows. In Section 2, we model the hybrid on the given phylogenetic network and create a phylogenetic regression model to analyze trait data that account for the hybrid information. In Section 3, a heuristic algorithm is proposed to build the variance–covariance matrix given a phylogenetic network and we propose a maximum likelihood framework for parameter estimation. In Section 4, the novel regression model is applied to study the drought tolerance of sunflowers. The discussion for this work is provided in Section 5, and the conclusions are given in Section 6.

2. Model

2.1. Relation between the Hybrid and Its Parents

Figure 1 displays a phylogenetic network connecting three species: X, R, and Y. Species R is a hybrid of species X and Y that came into existence at time $t = t_{1}$. The root node O is the common ancestor of the three species.

Figure 1. A three-taxa phylogenetic network. The hybrid species R of X and Y on the tips of the network was formed at $t = t_{1}$.

To model trait evolution with hybridization, we treat the hybrid node on the phylogenetic network by allowing a burst of new variation at the hybridization event. We achieve this by incorporating a hybridization parameter $\tau$. Consider that the trait of the hybrid species is defined in the log scale [7,8] via $\log R = \gamma \log X + (1 - \gamma) \log Y + \log \tau$, where $\gamma$ is the proportion of the hybrid trait inherited from parent X (i.e., $1 - \gamma$ is the proportion of the hybrid trait inherited from parent Y), and $\tau$ is the hybridization parameter, designed to model an increase in the variance of the hybrid species.

On the raw scale, exponentiating this relation gives $R = \tau X^{\gamma} Y^{1 - \gamma}$. As a working setting, we use $\gamma = 0.5$, where the hybrid is assumed to inherit equally from both parents. The arithmetic–geometric mean inequality then establishes that $R = \tau \sqrt{X Y} \leq \frac{\tau}{2} (X + Y)$. As $\tau$ ranges over $(0, \infty)$, $\frac{\tau}{2}$ shares this range with $\tau$. Because quantitative phenotypic traits are inherently non-negative, the condition for the arithmetic–geometric mean inequality is met. By adopting a model in which the variation of the hybrid is computed from an additive operation on X and Y through $\tau$, we establish the relationship between the hybrid species R and its parents X and Y in Equation (1):

$$
R = \tau (X + Y). \tag{1}
$$

By incorporating this additive structure, we provide an approach to modeling hybrid trait evolution. From Equation (1), the affinity among species at time t in the phylogenetic network can be derived as follows. For any other species Z, the affinity between Z and R is

$$
\operatorname{Cov}(R, Z) = \operatorname{Cov}(\tau (X + Y), Z) = \tau \{\operatorname{Cov}(X, Z) + \operatorname{Cov}(Y, Z)\}. \tag{2}
$$

In particular, when $Z = R$, we have

$$
\operatorname{Var}(R) = \operatorname{Var}(\tau (X + Y)) = \tau^{2} \{\operatorname{Var}(X) + \operatorname{Var}(Y) + 2 \operatorname{Cov}(X, Y)\}. \tag{3}
$$

Given a phylogenetic network $N$ of n taxa, one can use Equations (2) and (3) to derive the corresponding similarity matrix $G_{\tau, n} = [g_{\tau, ij}]$, where $g_{\tau, ij}$, $i, j = 1, 2, \dots, n$, describes the affinity between taxa i and j, possibly including hybrid species.

Below, we use the Brownian motion (BM) in modeling trait evolution [5,13,14] with the definition in Equation (1) to construct the model and variance–covariance matrix for a group of related species in Section 2.2.

2.2. Covariance Matrix under the Brownian Motion Model

Under the assumption of the BM process for trait evolution [15], we can define $X := X_{t}$ and $Y := Y_{t}$ as stochastic variables with $X_{t} = X_{0} + \sigma_{x} \epsilon_{t}^{X}$ and $Y_{t} = Y_{0} + \sigma_{y} \epsilon_{t}^{Y}$, where $X_{0} = Y_{0}$ is the ancestral value at the root of the tree, $\sigma_{x}$ and $\sigma_{y}$ are the rate parameters of evolution, and $\epsilon_{t}^{X}$ and $\epsilon_{t}^{Y}$ are the Brownian motion variables with $E[\epsilon_{t}^{X}] = E[\epsilon_{t}^{Y}] = 0$ and $\operatorname{Var}[\epsilon_{t}^{X}] = \operatorname{Var}[\epsilon_{t}^{Y}] = t$ for traits X and Y, respectively.

Given the network with a known topology and branch lengths (times) as shown in Figure 1, and taking a common rate $\sigma_{x} = \sigma_{y} = \sigma$, we have $\operatorname{Var}(X) := \operatorname{Var}(X_{t_{1} + t_{2}}) = \sigma^{2} (t_{1} + t_{2})$, $\operatorname{Var}(Y) := \operatorname{Var}(Y_{t_{1} + t_{2}}) = \sigma^{2} (t_{1} + t_{2})$, and $\operatorname{Cov}(X, Y) = 0$ as X and Y are independent. Since the hybrid R is produced at time $t = t_{1}$, the variation in the hybrid R is decomposed into two parts: one comes from its parents at $t_{1}$ and the other comes from its evolution from $t_{1}$ to $t_{1} + t_{2}$. Hence, we have

$$
\begin{aligned}
\operatorname{Var}(R) &:= \operatorname{Var}(R_{t_{1} + t_{2}}) = \operatorname{Var}(R_{t_{1}}) + \operatorname{Var}(R_{[t_{1}, t_{1} + t_{2}]}) \\
&= \operatorname{Var}(\tau (X_{t_{1}} + Y_{t_{1}})) + \sigma^{2} (t_{1} + t_{2} - t_{1}) \\
&= \tau^{2} \{\operatorname{Var}(X_{t_{1}}) + \operatorname{Var}(Y_{t_{1}}) + 2 \operatorname{Cov}(X_{t_{1}}, Y_{t_{1}})\} + \sigma^{2} t_{2} \\
&= \tau^{2} \sigma^{2} (t_{1} + t_{1} + 2 \cdot 0) + \sigma^{2} t_{2} = (2 \tau^{2} t_{1} + t_{2}) \sigma^{2}.
\end{aligned}
$$

Since evolution on different branches occurs independently, the covariation between the hybrid and its parents is

$$
\begin{aligned}
\operatorname{Cov}(Y, R) = \operatorname{Cov}(X, R) &= \operatorname{Cov}(X_{t_{1}}, R_{t_{1}}) = \operatorname{Cov}(X_{t_{1}}, \tau (X_{t_{1}} + Y_{t_{1}})) \\
&= \tau [\operatorname{Cov}(X_{t_{1}}, X_{t_{1}}) + \operatorname{Cov}(X_{t_{1}}, Y_{t_{1}})] = \tau [\operatorname{Var}(X_{t_{1}}) + 0] = \tau \sigma^{2} t_{1}.
\end{aligned}
$$

Therefore, with Equations (1)–(3), the similarity matrix $G_{\tau, 3}$ corresponding to the network in Figure 1 is

$$
G_{\tau, 3} =
\begin{array}{c|ccc}
 & X & R & Y \\ \hline
X & t_{1} + t_{2} & \tau t_{1} & 0 \\
R & \tau t_{1} & t_{2} + 2 \tau^{2} t_{1} & \tau t_{1} \\
Y & 0 & \tau t_{1} & t_{1} + t_{2}
\end{array}. \tag{4}
$$

Previous work has explained trait evolution on a logarithmic scale, using different parameter notations for the hybrid vigor [7,8,9], while we use $\tau$. These prior methods also account for the hybrid effect. Our proposed approach offers an alternative way of constructing the variance–covariance matrix, which differs from the methods used in the literature. We must acknowledge that our method has a limitation in its ability to handle gene flow, as it can only account for reticulation events. This limitation has been discussed in the literature [8].

2.3. Stepwise Procedure for Constructing the Variance–Covariance Matrix

In this study, we present a novel method for constructing the variance–covariance matrix using a matrix multiplication technique. The proposed approach involves a three-step process, as illustrated in Figure 2.

Figure 2. Evolution scenario for 3-taxa phylogenetic network containing a reticular hybridization event.

Figure 2 describes an evolutionary scenario for a phylogenetic network with three taxa, which involves a reticular hybridization event. The scenario consists of three main steps:

1. Step 1: the root O speciates into two distinct taxa, denoted $X_{t_{1}}$ and $Y_{t_{1}}$.
2. Step 2: a hybrid species, denoted R, is produced by hybridization between X and Y at time $t_{1}$.
3. Step 3: after $t = t_{1}$, the three species X, R, and Y continue to evolve without undergoing any further speciation or hybridization, ultimately reaching the current time point $t = t_{1} + t_{2}$.

Note that this calculation of the covariance matrix is a three-step process [16], and each step can be described using matrix operations.

First, in step 1 in Figure 2, a speciation at the root yields two species $X, Y$ at $t_{1}$ with the covariance in Equation (5):

$$
G_{2} =
\begin{array}{c|cc}
 & X & Y \\ \hline
X & t_{1} & 0 \\
Y & 0 & t_{1}
\end{array}. \tag{5}
$$

Next, in step 2, there is the instantaneous hybridization event at time $t_{1}$. This can be accomplished mathematically by multiplying the $2 \times 2$ matrix $G_{2}$ in Equation (5), which describes the variance in X and Y, by a $3 \times 2$ path matrix $K_{2, \tau}$ on the left and $K_{2, \tau}^{t}$ on the right:

$$
G_{2, \tau} = K_{2, \tau} G_{2} K_{2, \tau}^{t} =
\begin{array}{c|ccc}
 & X & R & Y \\ \hline
X & t_{1} & \tau t_{1} & 0 \\
R & \tau t_{1} & 2 \tau^{2} t_{1} & \tau t_{1} \\
Y & 0 & \tau t_{1} & t_{1}
\end{array}, \tag{6}
$$

where $K_{2, \tau}$ is shown in Equation (7):

$$
K_{2, \tau} =
\begin{array}{c|cc}
 & X & Y \\ \hline
X & 1 & 0 \\
R & \tau & \tau \\
Y & 0 & 1
\end{array}. \tag{7}
$$

Finally, the last step is elongation by adding $t_{2} I_{3}$, where $I_{3}$ is the $3 \times 3$ identity matrix. The corresponding covariance structure is shown in Equation (8):

$$
G_{3, \tau} =
\begin{array}{c|ccc}
 & X & R & Y \\ \hline
X & t_{1} + t_{2} & \tau t_{1} & 0 \\
R & \tau t_{1} & 2 \tau^{2} t_{1} + t_{2} & \tau t_{1} \\
Y & 0 & \tau t_{1} & t_{1} + t_{2}
\end{array}. \tag{8}
$$

Alternatively, standard speciation events, as depicted in Figure 3, can be analyzed using analogous matrix operations.

Figure 3. A phylogenetic tree of 3 taxa. $X, Z, Y$ are taxa; O is the root. X and Z share a common branch of length $t_{1}$. Y is independent of both X and Z.

The instantaneous speciation event shown in Figure 3 at time $t_{1}$ is accomplished by multiplying $G_{2}$ on the left by the following matrix and on the right by its transpose:

$$
K_{2, \tau = 1} =
\begin{array}{c|cc}
 & X & Y \\ \hline
X & 1 & 0 \\
Z & 1 & 0 \\
Y & 0 & 1
\end{array}. \tag{9}
$$

For the tree case with only speciation, as shown in Figure 3, elongation by $t_{2} I_{3}$ then gives the similarity matrix $G_{3}$ in Equation (10):

$$
G_{3} = K_{2, \tau = 1} G_{2} K_{2, \tau = 1}^{t} + t_{2} I_{3} =
\begin{array}{c|ccc}
 & X & Z & Y \\ \hline
X & t_{1} + t_{2} & t_{1} & 0 \\
Z & t_{1} & t_{1} + t_{2} & 0 \\
Y & 0 & 0 & t_{1} + t_{2}
\end{array}. \tag{10}
$$

These operations can be generalized to the case of k existing species whenever the $(k + 1)$th taxon arises by hybridization or speciation. Since the form of $K$ changes depending on whether hybridization or speciation is involved, we adopt the following notation: let $K_{j}$ denote the $(j + 1) \times j$ matrix obtained from the $j \times j$ identity matrix by inserting a row with a one in column j and zeros elsewhere, where column j denotes the taxon involved in the speciation event. Let $K_{j, \tau}$ denote the $(j + 1) \times j$ matrix obtained from the $j \times j$ identity matrix by inserting a row with $\tau$ in columns i and j and zeros elsewhere, where columns i and j denote the taxa involved in the hybridization event. Then, the adjustment from time $t_{1} + \dots + t_{j - 1}$ to time $t_{1} + \dots + t_{j}$ is as given in Equation (11):

$$
G_{j, \tau} = K_{j - 1} G_{j - 1, \tau} K_{j - 1}^{t} + t_{j} I_{j}, \tag{11}
$$

where, for hybridization, $K_{j - 1} = K_{j - 1, \tau}$, and for speciation, $K_{j - 1} = K_{j - 1, \tau = 1}$, whose inserted entries equal one rather than $\tau$ (compare Equation (9) with Equation (7)).
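The recursion in Equation (11) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's R implementation: the function names (`path_matrix`, `step`, `build_G`), the event encoding, and the values $t_{1} = 0.6$, $t_{2} = 0.4$, $\tau = 0.8$ are hypothetical. For simplicity, the new taxon is appended as the last row instead of being inserted next to its parents, which only permutes rows and columns.

```python
def path_matrix(j, parents, weight):
    """(j+1) x j matrix: identity plus one appended row with `weight` in each parent column."""
    K = [[1.0 if r == c else 0.0 for c in range(j)] for r in range(j)]
    new_row = [0.0] * j
    for c in parents:
        new_row[c] = weight
    K.append(new_row)
    return K


def step(G, parents, weight, t_next):
    """One application of Equation (11): K G K^t + t_next * I."""
    j = len(G)
    K = path_matrix(j, parents, weight)
    KG = [[sum(K[r][m] * G[m][c] for m in range(j)) for c in range(j)]
          for r in range(j + 1)]
    out = [[sum(KG[r][m] * K[c][m] for m in range(j)) for c in range(j + 1)]
           for r in range(j + 1)]
    for i in range(j + 1):
        out[i][i] += t_next  # elongation of every lineage by t_next
    return out


def build_G(t1, events, tau):
    """events: list of (kind, parent_columns, t_next); kind is 'speciation' or 'hybridization'."""
    G = [[t1, 0.0], [0.0, t1]]  # two independent lineages from the root, Equation (5)
    for kind, parents, t_next in events:
        weight = tau if kind == "hybridization" else 1.0
        G = step(G, parents, weight, t_next)
    return G


# Illustrative (hypothetical) values; taxa order is X, Y, then the new taxon.
t1, t2, tau = 0.6, 0.4, 0.8
G_net = build_G(t1, [("hybridization", (0, 1), t2)], tau)   # X, Y, R
G_tree = build_G(t1, [("speciation", (0,), t2)], tau)       # X, Y, Z
```

Up to the reordering of taxa, `G_net` reproduces the entries of Equation (8) and `G_tree` those of Equation (10). Networks with more taxa would be handled by passing a longer event list, assuming the events are supplied in time order.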

Our proposed methodology can indeed handle a general ultrametric phylogenetic network with an arbitrary number of hybrid nodes, as later demonstrated in the case study with 13 species and 3 hybrid species in Section 4, as well as the six-taxa network with two hybrids presented in Appendix A.2.

2.4. The Statistical Model and Likelihood Function

Under the regression model framework, let $Y = (y_{1}, y_{2}, \dots, y_{n})^{t}$ be the trait values for n species, some of which are possibly old hybrids. Let $X = [\mathbf{1}, X_{1}, X_{2}, \dots, X_{p}]$ be the $n \times (p + 1)$ design matrix built from the p covariate traits, where $\mathbf{1} = (1, 1, \dots, 1)^{t} \in \mathbb{R}^{n}$ is the vector of 1s, and we have

$$
Y = X \beta + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^{2} G_{\tau}). \tag{12}
$$

Let $\theta = (\tau, \sigma, \beta)$; the negative log-likelihood function given the traits $Y, X$ and network $N$ is

$$
- \log L(\theta \mid Y, X, N) = \frac{n}{2} \log (2 \pi) + \frac{n}{2} \log \sigma^{2} + \frac{1}{2} \log | G_{\tau} | + \frac{1}{2 \sigma^{2}} (Y - X \beta)^{t} G_{\tau}^{-1} (Y - X \beta). \tag{13}
$$

The generalized least-squares estimate is shown in Equation (14):

$$
\hat{\beta} = (X^{t} G_{\tau}^{-1} X)^{-1} X^{t} G_{\tau}^{-1} Y. \tag{14}
$$
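Equations (13) and (14) can be evaluated with a Cholesky factorization of $G_{\tau}$, which yields both the log-determinant and the solves against $G_{\tau}^{-1}$. The following is a minimal standard-library sketch, not the paper's R code; the helper names (`cholesky`, `chol_solve`, `gls_beta`, `neg_log_lik`) and the toy data are assumptions introduced for illustration.

```python
import math


def cholesky(A):
    """Lower-triangular L with A = L L^t, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^t) x = b by forward, then backward substitution."""
    n = len(L)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gls_beta(X, y, G):
    """Equation (14): (X^t G^-1 X)^-1 X^t G^-1 y."""
    L = cholesky(G)
    cols = [list(c) for c in zip(*X)]
    Ginv_cols = [chol_solve(L, c) for c in cols]   # columns of G^-1 X
    Ginv_y = chol_solve(L, y)
    A = [[sum(a * b for a, b in zip(ci, gj)) for gj in Ginv_cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, Ginv_y)) for ci in cols]
    return chol_solve(cholesky(A), rhs)


def neg_log_lik(beta, sigma, X, y, G):
    """Equation (13), with log|G| taken from the Cholesky diagonal."""
    n = len(y)
    L = cholesky(G)
    resid = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
    quad = sum(r * s for r, s in zip(resid, chol_solve(L, resid)))
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return (0.5 * n * math.log(2.0 * math.pi) + 0.5 * n * math.log(sigma ** 2)
            + 0.5 * log_det + quad / (2.0 * sigma ** 2))


# Hypothetical toy data: the matrix of Equation (8) with t1 = 0.6, t2 = 0.4, tau = 0.8.
G = [[1.0, 0.48, 0.0], [0.48, 1.168, 0.48], [0.0, 0.48, 1.0]]
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]]   # intercept and one covariate
y = [1.0, 2.0, 3.0]                        # lies exactly on y = 2x
beta_hat = gls_beta(X, y, G)
nll = neg_log_lik(beta_hat, 1.0, X, y, G)
```

Because the toy response lies exactly on a line, the GLS fit recovers an intercept of 0 and a slope of 2 for any positive definite $G_{\tau}$, which makes the example easy to check by hand.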

As the model assumes a Gaussian process distribution, the estimation of model parameters can be conducted through maximum likelihood inference, utilizing the Nelder–Mead optimization method in the R software [17]. One of the primary benefits of Nelder–Mead optimization is that it can be used in a variety of problem settings without requiring knowledge of the derivatives of the objective function. In our specific likelihood function, the covariance matrix contains the embedded parameter $\tau$.

We use maximum likelihood analysis to estimate the hybridization parameter $\tau$ by minimizing the negative log-likelihood function, where $| G_{\tau} |$ is the determinant of $G_{\tau}$. We set the bound for $\tau$ as $[0, 10]$ for the purpose of optimization. We use the golden section method to search for the maximum likelihood estimator (MLE), that is, the minimizer of the negative log-likelihood function, for the Brownian motion model.
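The golden section search over the bounded interval $[0, 10]$ can be sketched as follows. This is a generic textbook version in plain Python, offered as an illustration only; the function name `golden_section_min`, the tolerance, and the quadratic stand-in objective are assumptions. In practice the objective would be the negative log-likelihood of Equation (13) viewed as a function of $\tau$, and the method assumes that function is unimodal on the interval.

```python
import math


def golden_section_min(f, lo, hi, tol=1e-7):
    """Minimize a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # 1 / golden ratio
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]: shrink from the right, reuse c as the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]: shrink from the left, reuse d as the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


# Hypothetical stand-in objective with its minimum at tau = 1.2.
def toy_objective(tau):
    return (tau - 1.2) ** 2


tau_hat = golden_section_min(toy_objective, 0.0, 10.0)
```

Each iteration reuses one of the two interior evaluations, so only one new call to the objective is needed per step, which matters when every evaluation requires rebuilding and factorizing $G_{\tau}$.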

Let $J(\tau, \sigma) \equiv - \log L(\tau, \sigma \mid \beta, Y, X, N)$. By taking the second-order partial derivatives of $J(\tau, \sigma)$ with respect to $\tau$ and $\sigma$, the Hessian matrix can be obtained:

$$
H(\tau^{*}, \sigma^{*}) = \left. \frac{\partial^{2} J(\tau, \sigma)}{\partial \tau \, \partial \sigma} \right|_{(\tau, \sigma) = (\tau^{*}, \sigma^{*})} = \Sigma_{\tau^{*}, \sigma^{*}}^{-1}, \tag{15}
$$

which is useful to compute the variance of the parameters $\tau, \sigma$ for further inference.

For the Gaussian random variable (here, the Brownian motion), the second derivatives of the objective function are constant for $(\tau, \sigma)$ because the objective function is a quadratic function of $(\tau, \sigma)$. Therefore, the Hessian matrix can be computed without obtaining the mean vector $(\tau^{*}, \sigma^{*})$. We apply the R function hessian [18] to compute the Hessian matrix.

It is known that under regularity conditions (smoothness of the likelihood function) [19], the estimator $\hat{\beta}$ (by iterating a finite number of times) is asymptotically distributed as $\sqrt{n} (\hat{\beta} - \beta) \overset{d}{\to} N(0, V)$, where n is the number of taxa and $\sigma^{2} X^{t} G_{\tau}^{-1} X$ converges to $V$ in probability. It is assumed that the response variable Y is continuous and that the error terms $\epsilon$ are normally distributed with a mean of 0 and a covariance matrix of $\sigma^{2} G_{\tau}$, which means that the Brownian motion assumption is applied to each tip variable $y_{i}$ in the response trait vector Y. The predictor variables $X_{i}$ are non-stochastic and fixed. Based on these assumptions, the likelihood of the linear regression model is given by Equation (13), and to show that it meets the regularity conditions, several properties must be satisfied. The likelihood function must be well-defined, non-negative, continuous in $\beta$ and $\sigma^{2} G_{\tau}$, and differentiable with respect to $\beta$, $\sigma^{2}$, and $\tau$ separately. These properties are satisfied because the likelihood function is a product of non-negative terms, the exponential function is always positive, and the sum of continuous functions is continuous. Additionally, the derivatives of the likelihood function with respect to $\beta$ and $\sigma^{2}$ are continuous. However, we note that whether the likelihood function in Equation (13) satisfies the regularity conditions depends on the range of the parameter $\tau$, both for the network models proposed here and for those in the literature [7,8]. The derivative of the likelihood function with respect to $\tau$ involves the inverse of the covariance matrix, $G_{\tau}^{-1}$, which depends on the network structure. First, $G_{\tau}$ is symmetric as a covariance matrix. If $G_{\tau}$ is a positive definite matrix, the derivative of the likelihood with respect to $\tau$ will be continuous. The inference can then be used to assess the regression effect, subject to this condition on $\tau$. In the empirical analysis, we verify the positive definite property of $G_{\tau}$.

According to Varga and Nabben [20] and Nabben and Varga [21], if the covariance matrix $G_{\tau}$ is an ultrametric matrix, meaning that it satisfies certain inequalities (i.e., $G_{\tau}[i, i] > \max \{G_{\tau}[i, k], k \neq i\}$ for all $i = 1, 2, \dots, n$, and $G_{\tau}[i, j] \geq \min \{G_{\tau}[i, k], G_{\tau}[k, j]\}$ for all i, j, and k), then the derivative of the likelihood function with respect to $\tau$ will be continuous. This is because ultrametricity implies the stronger condition of the triangle inequality, which ensures that the matrix is always positive definite and has no negative eigenvalues. To ensure that all regularity conditions are met, it would be ideal to determine the parameter space for $\tau$ that would make $G_{\tau}$ ultrametric before analysis. However, this strict condition depends on the given network and cannot be solved analytically in general. For example, in the case of the three-taxon network, as shown in Equation (8), the parameter space for $\tau$ would need to be constrained to $\tau \in \{\tau : t_{1} + t_{2} > \tau t_{1};\ 2 \tau^{2} t_{1} + t_{2} > \tau t_{1}\}$ to meet the ultrametric condition.
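For the three-taxon case, the two diagonal inequalities in this constraint set and positive definiteness can be checked numerically for a given $\tau$. The sketch below is illustrative only: the function names and the values $t_{1} = 0.6$, $t_{2} = 0.4$ are hypothetical, the check covers only the diagonal-dominance inequalities listed above (not the full ultrametric definition), and positive definiteness is tested with Sylvester's criterion on the leading principal minors.

```python
def three_taxon_G(t1, t2, tau):
    """Similarity matrix of Equation (8), taxa ordered X, R, Y."""
    return [[t1 + t2, tau * t1, 0.0],
            [tau * t1, 2.0 * tau ** 2 * t1 + t2, tau * t1],
            [0.0, tau * t1, t1 + t2]]


def satisfies_diagonal_condition(G):
    """G[i][i] > max over k != i of G[i][k], for every row i."""
    n = len(G)
    return all(G[i][i] > max(G[i][k] for k in range(n) if k != i)
               for i in range(n))


def is_positive_definite_3x3(G):
    """Sylvester's criterion: all three leading principal minors are positive."""
    m1 = G[0][0]
    m2 = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    m3 = (G[0][0] * (G[1][1] * G[2][2] - G[1][2] * G[2][1])
          - G[0][1] * (G[1][0] * G[2][2] - G[1][2] * G[2][0])
          + G[0][2] * (G[1][0] * G[2][1] - G[1][1] * G[2][0]))
    return m1 > 0 and m2 > 0 and m3 > 0


# Hypothetical branch lengths; tau = 0.8 lies inside the constraint set, tau = 2.0 outside.
t1, t2 = 0.6, 0.4
inside = three_taxon_G(t1, t2, 0.8)
outside = three_taxon_G(t1, t2, 2.0)
```

With these illustrative values the diagonal condition holds at $\tau = 0.8$ and fails at $\tau = 2.0$ (where $\tau t_{1} = 1.2$ exceeds $t_{1} + t_{2} = 1$), while the matrix is positive definite in both cases, so a direct positive definiteness check is a useful complement to the constraint set.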

3. Algorithm and Inference

An extended Newick format (eNewick) uses unique syntax to represent a given phylogenetic network in linear form [22]. A phylogenetic network can be transformed into a phylogenetic tree with some replicated nodes, adequately tagged according to the hybrid nodes; traversing the resulting tree in postorder then yields the eNewick description of the phylogenetic network. We modified the function newick2phylog in the ade4 package [23] of the R software to handle the eNewick format. The function newick2phylog [23] was designed to read in phylogenies in Newick format and return an array with three columns, where the first column contains the ancestral nodes and the second and third columns contain the two descendants of the corresponding ancestor. Note that the number of rows (ancestors) in this array is $n - 1 + 2k$, where k is the number of hybrid nodes, as a hybrid node requires two incoming ancestors while a species node only has one ancestor. The root is also included in the count. For example, in an $n = 3$ taxa network with one hybrid ($k = 1$), as in Figure 1, the number of rows equals 4, calculated as $3 - 1 + 2 \times 1$. This is also shown in Table 1.

Table 1. Ancestral–descendant relationship corresponding to Figure 1.

The algorithm can generate the covariance matrix $G_{n, \tau}$ by starting from the root, adding a new node in each step, and terminating when the desired matrix of n species is built. For the tree case, each descendant has a unique ancestor. For a node with a reticulation event, the function reads a descendant such as a hybrid species with two ancestors; in one of the ancestral rows, the descendant will be listed by name, and in the other row, the descendant will have a "_1" attached to the end of the name. After determining the ancestral–descendant relationships, we find the times from the root at which speciation events or hybridization events occur: $t_{1}$, $t_{1} + t_{2}$, $t_{1} + t_{2} + t_{3}$, etc. Note that there are $n - 1$ branches, and we build the phylogenetic similarity matrix $G_{n, \tau}$ up from the root. For times $t < t_{1}$, there are two species present whose evolution is independent given the root. The relationship matrix up until $t_{1}$ is thus a $2 \times 2$ diagonal matrix with t on the diagonal. For each event, we adjust the similarity matrix according to Equation (11) for the Brownian motion model, generating the variance–covariance matrix $G_{n, \tau}$ for n tips by starting with the root, adding a new node at each speciation or hybridization event, and terminating when the process reaches the tips. A concrete example with detailed illustration is provided in Appendix A.2.

Our proposed methodology uses a feasible generalized least-squares approach to estimate the model parameters $\tau$ and $\sigma$, as well as the regression parameter $\beta$, through a joint estimation approach. An alternating search procedure is used to obtain the estimate $\hat{\beta}$ and the covariance simultaneously by maximizing the likelihood of the model parameters and minimizing the squared residuals of the regression parameters, as illustrated in Algorithm 1.

Algorithm 1: Procedure for Parameter Estimation.

Require: Predictive traits $X = [X_{1}, X_{2}, \dots, X_{p}]$ and $Y$, network $N$.

Ensure: Regression estimator $\hat{\beta}$, hybrid vigor estimator $\hat{\tau}$, and rate estimator $\hat{\sigma}$.

1. Get ordinary least-squares estimates $\hat{\beta}_{0} = (X^{t} X)^{-1} X^{t} Y$ and $\hat{\sigma}_{0} = \sqrt{\frac{n - p}{n} \hat{\epsilon}^{t} \hat{\epsilon}}$, where $\hat{\epsilon} = Y - X \hat{\beta}_{0}$ and p is the number of covariates.
2. Set $\tau_{0} = 0.1$.
3. Use the tree traversal algorithm with Equation (11) to construct the variance–covariance matrix $G_{\tau}$.
4. Compute $\ell_{0} = - \log L(\tau_{0}, \hat{\sigma}_{0} \mid \hat{\beta}_{0}, Y, X, N)$.
5. Apply the Nelder–Mead method to search for the maximum likelihood estimates $\hat{\tau}$ and $\hat{\sigma}$, and let $\ell_{1} = - \log L(\hat{\tau}, \hat{\sigma} \mid \hat{\beta}_{0}, Y, X, N)$ in Equation (13).
6. Use $\hat{\tau}$ to compute the GLS estimate $\hat{\beta}^{'} = (X^{t} G_{\hat{\tau}}^{-1} X)^{-1} X^{t} G_{\hat{\tau}}^{-1} Y$.
7. if $\| \hat{\beta}^{'} - \hat{\beta}_{0} \|^{2} < 10^{-5}$
8. return $\hat{\tau}, \hat{\sigma}, \hat{\beta}^{'}$.
9. else
10. if $\ell_{1} < \ell_{0}$
11. set $\tau_{0} = \hat{\tau}$, $\sigma_{0} = \hat{\sigma}$.
12. Set $\ell_{10} = - \log L(\hat{\tau}, \hat{\sigma} \mid \hat{\beta}_{0}, Y, X, N)$ and $\ell_{11} = - \log L(\hat{\tau}, \hat{\sigma} \mid \hat{\beta}^{'}, Y, X, N)$.
13. if $\ell_{11} < \ell_{10}$
14. Set $\hat{\beta}_{0} = \hat{\beta}^{'}$ and go to step 4.
15. else go to step 4.

4. Empirical Analysis

Hybridization is common in nature, with at least 25% of plant species showing hybridization. Sunflowers are an example of a group that has adapted to a wide range of environmental conditions, including soil types, temperature, and salinity. Studies show that hybridization frequently occurs among sunflowers, resulting in hybrid species. Sunflowers have various uses, including traditional Chinese medicine, edible oil, and soil phytoremediation [24]. The genus Helianthus is the subject of ongoing research on the adaptation of hybrid species to their environment. Sunflowers, in particular, have adapted to tolerate drought and salty conditions in habitats with lower precipitation levels. Selective sweeps in sunflowers have revealed candidate genes for adaptation to drought and salt tolerance [25]. Studies have also shown that sunflowers vary in their tolerance to drought [26].

The study focused on exploring the correlation between traits and drought tolerance, with soil moisture, precipitation, and rainfall in the area considered as possible factors that affect the response variable, Y. The precipitation data used as the covariates were collected from the WorldClim database [27,28]. The geographical data of the longitude and latitude of sunflowers were collected from the Global Biodiversity Information Facility (GBIF) database [29], and the R package raster [30,31] was used to download the corresponding data for analysis. To further investigate sunflowers’ adaptation to drought conditions, a phylogenetic regression method was proposed, which can analyze trait data from both hybrid and typical species in the evolutionary mechanism. This method was applied to study a group of common sunflowers, Helianthus annuus, using data from the efloras database [32]. The collected traits include the plant height, petiole, pedicel, hemispherical bract, bract, stalk, leaf, ray flower, disk, corolla, and calyx achene of sunflowers. The predictor variable used in the study was the annual precipitation amount measured in various locations, which was obtained using the raster package from the WorldClim database. For example, the precipitation data for uncommon species located at latitude $38.68$ degrees and longitude $-110.54$ degrees were obtained at a resolution of $0.5$ minutes.

The presented data in Table 2 showcase the response traits of sunflowers, including various characteristics such as annuals, petioles, peduncles, involucres, phyllaries, paleae laminae, ray florets, disc florets, corollas, cypselae, and pappi. The covariate trait in question is the annual precipitation (AnnPrec), which represents the yearly precipitation levels at the location of the observed sunflowers.

Table 2. Sunflowers and their traits. Each column represents a sunflower species, while each row records the trait collected from the database.

This dataset offers valuable insights into the relationship between the response traits of sunflowers and the annual precipitation levels in their growing location. Such findings could have significant implications for plant breeding and cultivation in regions with varying levels of precipitation. As such, a thorough analysis of the presented data can provide critical information that can contribute to the development of more robust and resilient plant species in the future. In light of this, further investigation and exploration of the data presented in Table 2 are warranted, as they may reveal essential correlations and trends that can deepen our understanding of sunflowers and their responses to varying levels of precipitation.

The network in Figure 4 is modified from [33], where 11 sunflower species are given at the genus level.

Figure 4. Sunflower network regraphed from [33]. Species on the tips from the leftmost (labeled with the number 1) to the rightmost (labeled with the number 11) are 1. praecox, 2. debilis, 3. neglectus, 4. petiolaris, 5. anomalus, 6. deserticola, 7. paradoxus, 8. annuus, 9. argophyllus, 10. bolanderi, and 11. exilis, where deserticola, anomalus, and paradoxus are hybrids of petiolaris and annuus. The eNewick format is (1:0,((2:0.84,((3:0.23,4:0.23)16:0.55,(5:0.32,(6:0.15,(7:0.15)13)12:0.18:0.16)17:0.45)20:0.06)21:0.1,(((13,8:0.15:0.35)14:0.2,9:0.35)18,(10:0.16,11:0.16)15)19:0.43)22:0.06)23:0.

To investigate whether precipitation has a significant impact on traits, it is necessary to test whether the regression slope is zero, represented by the null hypothesis $H_{0} : \beta_{1} = 0$. The results of the analysis using the phylogenetic regression model are presented in Table 3. The table reports GLS estimates for $\beta$, along with its 95% confidence interval, as well as estimates for the rate parameter $\sigma$ and the hybrid parameter $\tau$.

Table 3. The table provides estimates and corresponding standard errors for the hybrid effect ($\hat{\tau}$), the rate of evolution ($\hat{\sigma}$), and the slope ($\hat{\beta}$) for each of the 12 response traits of sunflowers under a network relationship. The slope estimate represents the effect of precipitation on the particular trait, with a positive value indicating a positive relationship and a negative value indicating a negative relationship. The 95% confidence interval (CI) for the slope estimate provides a range of plausible values for the true effect of precipitation on the trait.

Table 3 provides the estimates of the hybrid effect $\hat{\tau}$, rate of evolution $\hat{\sigma}$, and slope $\hat{\beta}$, along with their corresponding standard errors, for the different response traits in the study. The table also indicates whether the slope estimate is statistically significant (Yes or No) at the 5% significance level.

For example, for the response trait “Annuals”, the hybrid effect estimate is $\hat{\tau} = 0.841$ with a standard error of 0.07, indicating that the response of annuals has moderate hybrid weakness among sunflower species. The rate of evolution estimate is $\hat{\sigma} = 0.133$ with a standard error of 0.087, indicating that the evolutionary rate of annuals is relatively slow. The slope estimate is $\hat{\beta} = 0.243$ with a 95% CI of $(0.038, 0.449)$, suggesting that precipitation has a significant positive effect on the trait. The significance of the effect is indicated by the “Significant?” column, which shows “Yes” for a significant effect based on the 95% confidence interval of the slope estimate.

Similarly, the second-to-last row, for the response trait “Cypselae”, shows a hybrid effect estimate of $\hat{τ} = 1.221$ (hybrid vigor) with a standard error of 0.11, a rate of evolution estimate of $\hat{σ} = 0.053$ with a standard error of 0.034, and a slope estimate of $\hat{β}_{1} = 0.071$ with a 95% confidence interval of $(0.038, 0.105)$. The slope estimate is again statistically significant (“Yes”) for this trait.

In summary, Table 3 reports, for each of the 12 response traits, the estimated hybrid effect and rate of evolution with their standard errors, and the estimated slope with its 95% confidence interval and significance. These estimates support inference about the relationship between precipitation and each response trait.
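The decision rule behind the “Significant?” column can be sketched in a few lines. The snippet below is an illustration only, not the author's script: it copies the slope confidence intervals from Table 3 and flags a trait as significant when its 95% interval excludes zero. The names `SLOPE_CI` and `excludes_zero` are hypothetical.

```python
# Slope confidence intervals copied from Table 3:
# trait -> (lower bound, upper bound, reported "Significant?" entry)
SLOPE_CI = {
    "Annuals": (0.038, 0.449, "Yes"),
    "Petioles": (-0.22, 0.038, "No"),
    "Peduncles": (0.165, 0.737, "Yes"),
    "Involucres": (0.106, 0.141, "Yes"),
    "Phyllaries": (0.041, 0.096, "Yes"),
    "Paleae": (-0.094, -0.079, "Yes"),
    "Laminae": (-0.064, 0.026, "No"),
    "Ray florets": (-0.076, -0.001, "Yes"),
    "Disc florets": (-0.137, 0.124, "No"),
    "Corollas": (0.036, 0.06, "Yes"),
    "Cypselae": (0.038, 0.105, "Yes"),
    "Pappi": (-0.05, 0.012, "No"),
}


def excludes_zero(lower, upper):
    """Reject H0: beta_1 = 0 at the 5% level when the 95% CI excludes zero."""
    return lower > 0 or upper < 0


# Flags derived from the intervals, and the flags as printed in Table 3.
derived = {
    trait: "Yes" if excludes_zero(lower, upper) else "No"
    for trait, (lower, upper, _) in SLOPE_CI.items()
}
reported = {trait: flag for trait, (_, _, flag) in SLOPE_CI.items()}
```

Comparing `derived` with `reported` is a quick consistency check between the interval column and the significance column of the table.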

We further evaluate the correlations among the parameter estimates $\hat{τ}$, $\hat{σ}$, and $\hat{β}_{1}$ across the 12 sunflower trait datasets. There is a moderate positive correlation ($0.73$) between the rate of evolution ($σ$) and the regression slope ($β_{1}$), suggesting that a higher rate of evolution is associated with a larger regression slope. There is a moderate negative correlation ($-0.66$) between the rate of evolution ($σ$) and the hybrid effect parameter ($τ$), suggesting that a higher rate of evolution is associated with a smaller hybrid effect parameter. There is a weak negative correlation ($-0.19$) between the regression slope ($β_{1}$) and the hybrid effect parameter ($τ$): the regression slope tends to decrease as the hybrid effect parameter increases, but the relationship is not strong.
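For readers who want to recompute such correlations, the sketch below applies Pearson's correlation coefficient to the point estimates printed in Table 3. Two assumptions are made here: the text does not say which correlation coefficient was used, and the table entries are rounded, so the recomputed values need not agree with the reported ones digit for digit. The helper name `pearson` is hypothetical.

```python
from math import sqrt

# Point estimates copied from Table 3, in row order (Annuals ... Pappi).
TAU = [0.841, 0.787, 0.801, 1.068, 0.937, 1.005,
       1.031, 0.812, 0.779, 1.065, 1.221, 1.149]
SIGMA = [0.133, 0.106, 0.157, 0.038, 0.048, 0.026,
         0.061, 0.057, 0.106, 0.032, 0.053, 0.051]
BETA1 = [0.243, -0.091, 0.451, 0.123, 0.068, -0.086,
         -0.019, -0.038, -0.006, 0.048, 0.071, -0.019]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_sigma_beta = pearson(SIGMA, BETA1)  # rate of evolution vs. slope
r_sigma_tau = pearson(SIGMA, TAU)     # rate of evolution vs. hybrid effect
r_beta_tau = pearson(BETA1, TAU)      # slope vs. hybrid effect
```

The signs of the three coefficients are the feature to compare with the discussion above.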

We performed a benchmark analysis to evaluate the proposed methodology. Two models serve as comparisons: a simple linear regression model, used as the baseline, and the tree model, which assumes Brownian motion on a phylogenetic tree [34]. Although these existing methods are not directly comparable with the network model, the analysis provides a baseline against which its performance can be judged. The results are shown in Table 4, where the benchmark ratio of a model is its RMSE divided by the RMSE of the linear regression baseline. For the “Annuals” trait in the first row, the tree model has a benchmark ratio of 1.006, meaning that its RMSE is 0.6% higher than that of the linear regression model, and the network model has a benchmark ratio of 1.077, meaning that its RMSE is 7.7% higher. Thus the tree model performs slightly worse than the linear regression model, and the network model performs worse still. This is expected because the network model is more complex. Despite the larger RMSE values obtained from the network model, they remain reasonable when compared to the baseline model.
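The benchmark ratio is simple arithmetic, sketched below for the “Annuals” row of Table 4. This is an illustration under the assumption that the ratio is the model RMSE divided by the baseline RMSE, as the column headers of Table 4 indicate; because the RMSE columns are rounded, a recomputed ratio can differ from the tabulated one in the last digit. The function name `benchmark_ratio` is hypothetical.

```python
def benchmark_ratio(rmse_model, rmse_baseline):
    """RMSE of a model relative to the baseline (values above 1 mean a larger error)."""
    return rmse_model / rmse_baseline


# RMSE values for the "Annuals" trait, copied from Table 4.
annuals = {"linear": 0.608, "tree": 0.612, "network": 0.655}

tree_ratio = benchmark_ratio(annuals["tree"], annuals["linear"])
network_ratio = benchmark_ratio(annuals["network"], annuals["linear"])

# Percentage by which the network model's RMSE exceeds the baseline RMSE.
network_excess_percent = 100 * (network_ratio - 1)
```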

Table 4. The benchmark analysis involves the use of 12 traits, with RMSE1 computed via a simple linear regression baseline model, RMSE2 computed via the tree model [34], and RMSE3 computed via the proposed network model. The fourth and fifth columns of the table present the benchmark ratio for each model.

5. Discussion

The model used to examine trait values on phylogenetic networks through hybridization modeling is an essential tool for analyzing this type of data. There is room for improvement: more appropriate representations of the hybrid R in terms of its parents X and Y, that is, suitable functions $R = f(τ, X, Y)$, would allow us to model events such as horizontal gene transfer or recombination, which are biologically different from hybridization and can also affect trait values.

We acknowledge that the covariance structure $G_{τ}$ is complex, which creates difficulties in demonstrating the positive definiteness of the Hessian matrix of the likelihood function. This makes it challenging to ensure that the likelihood is jointly convex in all parameters. However, our regression model meets certain conditions, including having a well-defined likelihood function and satisfying the assumption of non-singularity. Our empirical analysis confirms that our method achieves the global maximum within its domain. This is supported by the fact that $G_{\hat{τ}}$ is positive definite for each dataset, as detailed in Appendix A.1.3.
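One standard way to check positive definiteness numerically is to attempt a Cholesky factorization, which succeeds exactly when a symmetric matrix is positive definite. The sketch below applies this test to the numerical six-taxa matrix of Equation (A7) in Appendix A.2. It is an assumption-laden illustration: the check linked in Appendix A.1.3 may use a different method, and the names `symmetric_from_upper` and `is_positive_definite` are hypothetical.

```python
from math import sqrt

# Upper triangle of the matrix in Equation (A7), row by row from the diagonal.
A7_UPPER = [
    [1.0, 0.175, 0.175, 0.0, 0.0, 0.0],
    [0.825, 0.525, 0.175, 0.1125, 0.05],
    [0.825, 0.175, 0.1125, 0.05],
    [1.0, 0.3, 0.1],
    [0.8, 0.3],
    [1.0],
]


def symmetric_from_upper(upper):
    """Fill a full symmetric matrix from its upper triangle."""
    n = len(upper)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            full[i][i + offset] = full[i + offset][i] = value
    return full


def is_positive_definite(matrix):
    """Return True when a Cholesky factorization of the symmetric matrix exists."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:  # a non-positive pivot rules out positive definiteness
                    return False
                lower[i][j] = sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


G = symmetric_from_upper(A7_UPPER)
g_is_pd = is_positive_definite(G)
```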

In order to enhance the current model’s capability to analyze phylogenetic network data, several future research avenues could be pursued. Firstly, the model could be extended to include more complex evolutionary processes, such as the Ornstein–Uhlenbeck (OU) model [35] or the early burst model [36]. The OU model could be implemented by introducing a force parameter $α$ to the covariance matrix construction, and the optimization process would require a multidimensional search. For instance, if implementing the OU process [35], one would need to take the non-independent increment condition into account to construct the covariance matrix. One can also consider implementing non-Gaussian processes [37] in the network for trait evolution. Secondly, the algorithm could be generalized to handle the hard polytomy by analyzing multifurcating phylogenetic networks for regression analysis [38].

It is also worthwhile to consider traits that follow distributions other than the normal distribution and to evaluate the robustness of the proposed methodology when the assumption of normality is not met. In particular, researchers should examine model misspecification problems [39] and study the consequences of non-normal distributions for the efficacy of the model, as has been done in previous studies [40].

Incorporating more parameters into the model would allow richer interaction with the hybrid parameter, particularly in the context of richer models such as the OU and early burst models. Furthermore, future work could explore the integration of discrete character evolution or the joint analysis of both discrete and continuous characters [41,42], as well as extend the proposed approach to accommodate diverse types of trait distributions. The development of such extensions would contribute to a better understanding of the evolution of biological traits, and may have practical applications in fields such as conservation biology and agriculture [43].

6. Conclusions

A phylogenetic regression model that incorporates a network structure to examine trait evolution in the context of reticulation events is proposed. Maximum likelihood estimation is utilized to estimate parameters, and an algorithm is developed to build the variance–covariance matrix using a phylogenetic network in eNewick format as input. This model is applied to investigate the response of common sunflower, Helianthus annuus, traits to drought conditions.

Parameter estimation is conducted through maximum likelihood, a widely used method in evolutionary biology, which allows for the estimation of model parameters that maximize the probability of the observed data. Additionally, an algorithm is developed to build the variance–covariance matrix, a crucial component of the model, using a phylogenetic network in eNewick format as input.

Overall, the proposed model and associated methods offer a novel approach to studying trait evolution in the context of reticulation events. By applying the model to the common sunflower and investigating its response to drought conditions, new insights can be gained into the evolutionary patterns of this important species.

Funding

This research and APC were funded by the National Science and Technology Council, Taiwan. MOST 111-2118-M-035-004-.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

I would like to express my sincere gratitude to the reviewers for their valuable feedback, which significantly enhanced the quality and rigor of the analysis. Additionally, I am grateful to Elizabeth Houseworth and Yo-Lun Tsai for their inspiration and support in the early stages of this work.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Appendix A.1. Script and Data files

All files in the manuscript can be accessed at http://tonyjhwueng.info/phyreghyb (accessed on 10 March 2023).

Appendix A.1.1. Model

1. BM: http://tonyjhwueng.info/phyreghyb/bmhydRegV3.r (accessed on 10 March 2023).

Appendix A.1.2. Sunflower Precipitation Dataset

The data for each sunflower can be accessed by executing the R script at the following link:

1. Precipitation data script: http://tonyjhwueng.info/phyreghyb/worldclim (accessed on 10 March 2023).

Appendix A.1.3. Figures and Tables

1. Figure 1: http://tonyjhwueng.info/phyreghyb/3taxanetwork.pptx (accessed on 10 March 2023).
2. Figure 2: http://tonyjhwueng.info/phyreghyb/3taxanetworkstep.pptx (accessed on 10 March 2023).
3. Figure 3: http://tonyjhwueng.info/phyreghyb/3taxatree.pptx (accessed on 10 March 2023).
4. Figure 4: http://tonyjhwueng.info/phyreghyb/sfnet.pptx (accessed on 10 March 2023).
5. Figure A1: http://tonyjhwueng.info/phyreghyb/sixtaxanetwork.pptx (accessed on 10 March 2023).
6. Table 2: http://tonyjhwueng.info/phyreghyb/precdatatolatex.html (accessed on 10 March 2023).
7. Table 3: http://tonyjhwueng.info/phyreghyb/RegSunflower.html (accessed on 10 March 2023).
8. Positive definiteness of $G_{\hat{τ}}$: https://tonyjhwueng.info/phyreghyb/pdultracheck.html (accessed on 10 March 2023).
9. Table 4: https://tonyjhwueng.info/phyreghyb/AnnPrecAllResponseTrait.html (accessed on 10 March 2023).

Appendix A.2. Demonstration of Algorithm under Brownian Motion Model

Consider the phylogenetic network given in Figure A1. There are 6 extant taxa, 2 hybridization events, and 9 ancestral nodes in the network.

Figure A1. A six-taxa phylogenetic network where 2, 3, and 5 are the hybrid descendants. The eNewick format for the network topology is ((1, ((2, 3) 7) 12) 11, ((12, (4, (5) 9) 8) 13, (9, 6) 10) 14) 15.

The ancestral–descendant data gathered from the eNewick2phylog function and modified are shown as follows:

Ancestor	[15]	[14]	[12]	[11]	[13]	[9]	[8]	[10]	[7]
Descendants	[11,14]	[13,10]	[7]	[1],[12]	[12,8]	[5]	[4,9]	[9,6]	[2,3]
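The ancestor and descendant lists above can be recovered from the topology string with a small recursive-descent parser. The sketch below is a stand-in written for illustration, not the eNewick2phylog function mentioned in the text. It assumes the label-after-parenthesis form used in Figure A1, without branch lengths, and the names `parse_enewick` and `hybrid_nodes` are hypothetical. A node that appears as a child of two different parents is reported as a hybrid.

```python
import re

TOPOLOGY = "((1, ((2, 3) 7) 12) 11, ((12, (4, (5) 9) 8) 13, (9, 6) 10) 14) 15"


def parse_enewick(text):
    """Return {ancestor: [descendants]} for a topology-only eNewick string."""
    tokens = re.findall(r"[(),]|[^(),\s]+", text)
    pos = 0
    children = {}

    def node():
        nonlocal pos
        kids = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            kids.append(node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                kids.append(node())
            pos += 1  # consume ")"
        label = tokens[pos]  # the label follows the closing parenthesis
        pos += 1
        if kids:  # a repeated hybrid label with no subtree is only a reference
            children[label] = kids
        return label

    node()
    return children


def hybrid_nodes(children):
    """Return {node: [parents]} for nodes listed under more than one ancestor."""
    parents = {}
    for ancestor, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(ancestor)
    return {kid: ps for kid, ps in parents.items() if len(ps) > 1}


children = parse_enewick(TOPOLOGY)
hybrids = hybrid_nodes(children)
```

Labels are kept as strings; `children["13"]`, for instance, holds the descendants of ancestor [13].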

From this, we can determine the event times and which times lead to which descendants as follows:

Node	[1]	[2]	[3]	[4]	[5]	[6]	[7]	[8]
Length	$t_{3} + t_{4} + t_{5}$	$t_{5}$	$t_{5}$	$t_{4} + t_{5}$	$t_{4} + t_{5}$	$t_{4} + t_{5}$	$t_{3} + t_{4}$	$t_{3}$

Node	[9]	[10]	[11]	[12]	[13]	[14]	[15]
Length	0	$t_{2} + t_{3}$	$t_{1} + t_{2}$	0	$t_{2}$	$t_{1}$	0

We also identify the sequence of temporary similarity matrices built up from the root to the tips in terms of the nodes at each event (speciation or hybridization):

$$[15] \to [11, 14] \to [11, 13, 10] \to [11, 12, 13, 10] \to [1, 7, 8, 10] \to [1, 7, 8, 9, 10] \to [1, 7, 4, 5, 6] \to [1, 2, 3, 4, 5, 6].$$

This sequence records both kinds of events. A speciation replaces the ancestor node with its two descendants (e.g., $[15]$ is replaced by $[11, 14]$). A hybridization inserts the hybrid node between its parents (e.g., $[11, 13, 10] \to [11, 12, 13, 10]$ indicates that [12] is a hybrid and is inserted between [11] and [13]).

For the first similarity matrix, we have

$$
G_{2} = \begin{array}{c|cc} & 11 & 14 \\ \hline 11 & t_{1} & 0 \\ 14 & - & t_{1} \end{array} \tag{A1}
$$

Going from [11,14] → [11,13,10] involves a straightforward speciation event, and the new similarity matrix becomes

$$
G_{3} = \begin{array}{c|ccc} & 11 & 13 & 10 \\ \hline 11 & t_{1} + t_{2} & 0 & 0 \\ 13 & - & t_{1} + t_{2} & t_{1} \\ 10 & - & - & t_{1} + t_{2} \end{array} \tag{A2}
$$

Going from [11,13,10] → [11,12,13,10] involves a hybridization. The variance of the hybrid [12] can be calculated from $G_{3}$ with the following formula:

$$\mathrm{Var}([12]) = \mathrm{Var}(τ([11] + [13])) = τ^{2}\{\mathrm{Var}([11]) + \mathrm{Var}([13]) + 2\,\mathrm{Cov}([11], [13])\}.$$

Moreover, the covariance between the hybrid species [12] and any other species can be obtained from the formula

$$\mathrm{Cov}([12], Z) = \mathrm{Cov}(τ([11] + [13]), Z) = τ\{\mathrm{Cov}([11], Z) + \mathrm{Cov}([13], Z)\}, \quad Z = 11, 13, 10.$$
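As a worked check, the entries of $G_{3}$ in (A2) can be substituted into these two formulas. From (A2), the variances of [11] and [13] are both $t_{1} + t_{2}$, their covariance is 0, and the covariance between [13] and [10] is $t_{1}$. The substitution below is a sketch of that arithmetic only:

```latex
\begin{aligned}
\mathrm{Var}([12]) &= τ^{2}\{(t_{1} + t_{2}) + (t_{1} + t_{2}) + 2 \cdot 0\} = 2 τ^{2} (t_{1} + t_{2}), \\
\mathrm{Cov}([12], [11]) &= τ\{(t_{1} + t_{2}) + 0\} = τ (t_{1} + t_{2}), \\
\mathrm{Cov}([12], [13]) &= τ\{0 + (t_{1} + t_{2})\} = τ (t_{1} + t_{2}), \\
\mathrm{Cov}([12], [10]) &= τ\{0 + t_{1}\} = τ t_{1}.
\end{aligned}
```

These four values are the entries in the row and column of the hybrid [12] in $G_{4, τ}$.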

All other elements of $G_{4, τ}$ carry over unchanged from $G_{3}$. Therefore, the covariance matrix for species $[11]$, $[12]$, $[13]$, and $[10]$ at $t = t_{1} + t_{2}$ is

$$
G_{4, τ} = \begin{array}{c|cccc} & 11 & 12 & 13 & 10 \\ \hline 11 & t_{1} + t_{2} & τ (t_{1} + t_{2}) & 0 & 0 \\ 12 & - & 2 τ^{2} (t_{1} + t_{2}) & τ (t_{1} + t_{2}) & τ t_{1} \\ 13 & - & - & t_{1} + t_{2} & t_{1} \\ 10 & - & - & - & t_{1} + t_{2} \end{array}
$$

We elongate from [11,12,13,10] → [1,7,8,10], which adds $t_{3}$ to each diagonal entry, to obtain

$$
G_{4, τ}^{'} = \begin{array}{c|cccc} & 1 & 7 & 8 & 10 \\ \hline 1 & t_{1} + t_{2} + t_{3} & τ (t_{1} + t_{2}) & 0 & 0 \\ 7 & - & t_{3} + 2 τ^{2} (t_{1} + t_{2}) & τ (t_{1} + t_{2}) & τ t_{1} \\ 8 & - & - & t_{1} + t_{2} + t_{3} & t_{1} \\ 10 & - & - & - & t_{1} + t_{2} + t_{3} \end{array} \tag{A3}
$$

The next event from [1,7,8,10] → [1,7,8,9,10] is another hybridization. The $5 \times 5$ matrix $G_{5, τ}$ for species $[1, 7, 8, 9, 10]$ is constructed by inserting the hybrid $[9]$ between its parents $[8]$ and $[10]$:

$$
G_{5, τ} = \begin{array}{c|ccccc} & 1 & 7 & 8 & 9 & 10 \\ \hline 1 & t_{1} + t_{2} + t_{3} & τ (t_{1} + t_{2}) & 0 & 0 & 0 \\ 7 & - & t_{3} + 2 τ^{2} (t_{1} + t_{2}) & τ (t_{1} + t_{2}) & τ^{2} (2 t_{1} + t_{2}) & τ t_{1} \\ 8 & - & - & t_{1} + t_{2} + t_{3} & τ (2 t_{1} + t_{2} + t_{3}) & t_{1} \\ 9 & - & - & - & τ^{2} (4 t_{1} + 2 t_{2} + 2 t_{3}) & τ (2 t_{1} + t_{2} + t_{3}) \\ 10 & - & - & - & - & t_{1} + t_{2} + t_{3} \end{array} \tag{A4}
$$

We elongate from [1,7,8,9,10] to [1,7,4,5,6], which adds $t_{4}$ to each diagonal entry, to obtain

$$
G_{5, τ}^{'} = \begin{array}{c|ccccc} & 1 & 7 & 4 & 5 & 6 \\ \hline 1 & \sum_{k=1}^{4} t_{k} & τ \sum_{k=1}^{2} t_{k} & 0 & 0 & 0 \\ 7 & - & \sum_{k=3}^{4} t_{k} + 2 τ^{2} \sum_{k=1}^{2} t_{k} & τ \sum_{k=1}^{2} t_{k} & τ^{2} (2 t_{1} + t_{2}) & τ t_{1} \\ 4 & - & - & \sum_{k=1}^{4} t_{k} & τ (2 t_{1} + \sum_{k=2}^{3} t_{k}) & t_{1} \\ 5 & - & - & - & t_{4} + τ^{2} (4 t_{1} + 2 \sum_{k=2}^{3} t_{k}) & τ (2 t_{1} + \sum_{k=2}^{3} t_{k}) \\ 6 & - & - & - & - & \sum_{k=1}^{4} t_{k} \end{array} \tag{A5}
$$

where $\sum_{k=1}^{4} t_{k} = t_{1} + t_{2} + t_{3} + t_{4}$.

The final step from [1,7,4,5,6] → [1,2,3,4,5,6] involves a speciation event, in which node [7] splits into [2] and [3]. The final similarity matrix $G_{6, τ}$ is given as

$$
G_{6, τ} = \begin{array}{c|cccccc} & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline 1 & \sum_{k=1}^{5} t_{k} & τ \sum_{k=1}^{2} t_{k} & τ \sum_{k=1}^{2} t_{k} & 0 & 0 & 0 \\ 2 & - & 2 τ^{2} \sum_{k=1}^{2} t_{k} + \sum_{k=3}^{5} t_{k} & 2 τ^{2} \sum_{k=1}^{2} t_{k} + \sum_{k=3}^{4} t_{k} & τ \sum_{k=1}^{2} t_{k} & τ^{2} (2 t_{1} + t_{2}) & τ t_{1} \\ 3 & - & - & 2 τ^{2} \sum_{k=1}^{2} t_{k} + \sum_{k=3}^{5} t_{k} & τ \sum_{k=1}^{2} t_{k} & τ^{2} (2 t_{1} + t_{2}) & τ t_{1} \\ 4 & - & - & - & \sum_{k=1}^{5} t_{k} & τ (2 t_{1} + \sum_{k=2}^{3} t_{k}) & t_{1} \\ 5 & - & - & - & - & τ^{2} (4 t_{1} + 2 \sum_{k=2}^{3} t_{k}) + \sum_{k=4}^{5} t_{k} & τ (2 t_{1} + \sum_{k=2}^{3} t_{k}) \\ 6 & - & - & - & - & - & \sum_{k=1}^{5} t_{k} \end{array} \tag{A6}
$$

If we assign branch lengths by setting $t_{1} = 0.1$, $t_{2} = 0.25$, $t_{3} = 0.15$, $t_{4} = 0.2$, $t_{5} = 0.3$, the eNewick format with branch lengths input into the R program will be as follows.

Input: network = c("((1:0.65, ((2:0.3, 3:0.3) 7:0.35) 12:0) 11:0.35, ((12:0, (4:0.5, (5:0.5) 9:0) 8:0.15) 13:0.25, (9:0, 6:0.5) 10:0.4) 14:0.1) 15:0")

Output: The similarity matrix for the species $[1, 2, 3, 4, 5, 6]$ on the tips of the network is

$$
G_{6, τ} = \begin{array}{c|cccccc} & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline 1 & 1 & 0.175 & 0.175 & 0 & 0 & 0 \\ 2 & - & 0.825 & 0.525 & 0.175 & 0.1125 & 0.05 \\ 3 & - & - & 0.825 & 0.175 & 0.1125 & 0.05 \\ 4 & - & - & - & 1 & 0.3 & 0.1 \\ 5 & - & - & - & - & 0.8 & 0.3 \\ 6 & - & - & - & - & - & 1 \end{array} \tag{A7}
$$

The covariance matrix is a 6 by 6 matrix; only the upper triangle is shown because the matrix is symmetric.
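The stepwise construction in this appendix can be sketched compactly in code. The version below is an illustration, not the author's R script: the helper names (`split`, `hybridize`, `elongate`, `relabel`, `six_taxa_matrix`) are hypothetical, the order of events is hard-coded from the sequence of node lists given above rather than parsed from the eNewick string, and $τ = 0.5$ is an inferred value, chosen because it reproduces the entry $0.175 = τ (t_{1} + t_{2})$ in (A7); the text does not state it.

```python
def split(labels, G, parent, kids):
    """Speciation: replace `parent` by two descendants that share its history."""
    i = labels.index(parent)
    order = list(range(i + 1)) + list(range(i, len(labels)))  # index i appears twice
    new_G = [[G[r][c] for c in order] for r in order]
    return labels[:i] + list(kids) + labels[i + 1:], new_G


def hybridize(labels, G, p1, p2, hybrid, tau):
    """Hybridization: insert hybrid = tau * (p1 + p2) just before its second parent."""
    i, j = labels.index(p1), labels.index(p2)
    row = [tau * (G[i][k] + G[j][k]) for k in range(len(labels))]  # Cov(hybrid, Z)
    var = tau ** 2 * (G[i][i] + G[j][j] + 2 * G[i][j])             # Var(hybrid)
    new_G = [r[:j] + [row[k]] + r[j:] for k, r in enumerate(G)]
    new_G.insert(j, row[:j] + [var] + row[j:])
    return labels[:j] + [hybrid] + labels[j:], new_G


def elongate(G, t):
    """Independent evolution for time t adds t to every variance."""
    return [[v + (t if r == c else 0.0) for c, v in enumerate(row)]
            for r, row in enumerate(G)]


def relabel(labels, mapping):
    """Rename nodes whose single descendant continues the same lineage."""
    return [mapping.get(x, x) for x in labels]


def six_taxa_matrix(tau, times):
    """Follow [15] -> [11,14] -> ... -> [1,2,3,4,5,6] for the network in Figure A1."""
    t1, t2, t3, t4, t5 = times
    labels, G = [15], [[0.0]]
    labels, G = split(labels, G, 15, (11, 14))
    G = elongate(G, t1)
    labels, G = split(labels, G, 14, (13, 10))
    G = elongate(G, t2)
    labels, G = hybridize(labels, G, 11, 13, 12, tau)
    labels = relabel(labels, {11: 1, 12: 7, 13: 8})
    G = elongate(G, t3)
    labels, G = hybridize(labels, G, 8, 10, 9, tau)
    labels = relabel(labels, {8: 4, 9: 5, 10: 6})
    G = elongate(G, t4)
    labels, G = split(labels, G, 7, (2, 3))
    G = elongate(G, t5)
    return labels, G


# tau = 0.5 is an assumption inferred from the numbers in (A7).
labels, G = six_taxa_matrix(0.5, (0.1, 0.25, 0.15, 0.2, 0.3))
```

With these inputs the resulting `G` can be compared entry by entry, up to floating-point rounding, with the matrix in (A7).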

References

1. Rieseberg, L.H.; Carney, S.E. Plant hybridization. New Phytol. 1998, 140, 599–624.
2. Mitchell, N.; Owens, G.L.; Hovick, S.M.; Rieseberg, L.H.; Whitney, K.D. Hybridization speeds adaptive evolution in an eight-year field experiment. Sci. Rep. 2019, 9, 6746.
3. Bock, D.G.; Kantar, M.B.; Rieseberg, L.H. Population Genomics of Speciation and Adaptation in Sunflowers; Springer: Berlin/Heidelberg, Germany, 2020.
4. Harmon, L.J.; Weir, J.T.; Schulte, L.A. Phylogenies and Comparative Methods in Ecology and Evolution; University of California Press: Berkeley, CA, USA, 2005.
5. Harvey, P.H.; Pagel, M.D. Comparative methods for explaining adaptations. Nature 1991, 351, 619–624.
6. Clutton-Brock, T.H. Phylogenetic Perspectives on the Evolution of Mammalian Social Behavior; University of Chicago Press: Chicago, IL, USA, 2010.
7. Bastide, P.; Solis-Lemus, C.; Kriebel, R.; Sparks, K.W.; Ané, C. Phylogenetic comparative methods on phylogenetic networks with reticulations. Syst. Biol. 2018, 67, 800–820.
8. Jhwueng, D.C.; O’Meara, B. Trait evolution on phylogenetic networks. bioRxiv 2015, 023986.
9. Teo, B.; Rose, J.P.; Bastide, P.; Ané, C. Accounting for within-species variation in continuous trait evolution on a phylogenetic network. bioRxiv 2022, 490814.
10. Jacquemyn, H.; Merckx, V.; Brys, R.; Tyteca, D.; Cammue, B.P.; Honnay, O.; Lievens, B. Analysis of network architecture reveals phylogenetic constraints on mycorrhizal specificity in the genus Orchis (Orchidaceae). New Phytol. 2011, 192, 518–528.
11. Solís-Lemus, C.; Bastide, P.; Ané, C. PhyloNetworks: A package for phylogenetic networks. Mol. Biol. Evol. 2017, 34, 3292–3298.
12. Solís-Lemus, C.; Ané, C. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 2016, 12, e1005896.
13. Felsenstein, J. Phylogenies and the comparative method. Am. Nat. 1985, 125, 1–15.
14. Revell, L.J. Phylogenetic signal and linear regression on species data. Methods Ecol. Evol. 2010, 1, 319–329.
15. Ané, C. Analysis of comparative data with hierarchical autocorrelation. Ann. Appl. Stat. 2008, 2, 1078–1102.
16. Jhwueng, D.C. Some Problems in Phylogenetic Comparative Methods. Ph.D. Thesis, Indiana University, Bloomington, IN, USA, 2010.
17. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022.
18. Gilbert, P.; Varadhan, R. numDeriv: Accurate Numerical Derivatives; R Package Version 2016.8-1.1, CRAN Repository. 2019. Available online: https://cran.r-project.org/web/packages/numDeriv/index.html (accessed on 21 February 2023).
19. Wald, A. Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 1949, 20, 595–601.
20. Varga, R.S.; Nabben, R. On symmetric ultrametric matrices. In Numerical Linear Algebra; De Gruyter: Berlin, Germany, 1993; pp. 193–199.
21. Nabben, R.; Varga, R.S. A linear algebra proof that the inverse of a strictly ultrametric matrix is a strictly diagonally dominant Stieltjes matrix. SIAM J. Matrix Anal. Appl. 1994, 15, 107–113.
22. Cardona, G.; Rosselló, F.; Valiente, G. Extended Newick: It is time for a standard representation of phylogenetic networks. BMC Bioinform. 2008, 9, 532.
23. Dray, S.; Dufour, A.B. The ade4 package: Implementing the duality diagram for ecologists. J. Stat. Softw. 2007, 22, 1–20.
24. Tsai, Y.L. Regression Analysis of Hybrid Species’s Trait Data. Master’s Thesis, Feng-Chia University, Taichung, Taiwan, 2016.
25. Kane, N.C.; Rieseberg, L.H. Selective sweeps reveal candidate genes for adaptation to drought and salt tolerance in common sunflower, Helianthus annuus. Genetics 2007, 175, 1823–1834.
26. Koziol, L.; Rieseberg, L.H.; Kane, N.; Bever, J.D. Reduced drought tolerance during domestication and the evolution of weediness results from tolerance–growth trade-offs. Evol. Int. J. Org. Evol. 2012, 66, 3803–3814.
27. Fick, S.E.; Hijmans, R.J. WorldClim 2: New 1-km spatial resolution climate surfaces for global land areas. Int. J. Climatol. 2017, 37, 4302–4315.
28. Cerasoli, F.; D’Alessandro, P.; Biondi, M. Worldclim 2.1 versus Worldclim 1.4: Climatic niche and grid resolution affect between-version mismatches in Habitat Suitability Models predictions across Europe. Ecol. Evol. 2022, 12, e8430.
29. GBIF. Global Biodiversity Information Facility Database. Rumex acetosella L. 2010. Available online: https://www.gbif.org/ (accessed on 31 January 2023).
30. Hijmans, R.J.; Van Etten, J.; Mattiuzzi, M.; Sumner, M.; Greenberg, J.; Lamigueiro, O.; Bevan, A.; Racine, E.; Shortridge, A. Raster Package in R Version. 2023. Available online: https://cran.r-project.org/web/packages/raster/raster.pdf (accessed on 10 March 2023).
31. van Etten, R.J.H.J. Raster: Geographic Analysis and Modeling with Raster Data; R Package Version 2.0-12, CRAN Repository. 2012. Available online: https://cran.r-project.org/web/packages/raster/index.html (accessed on 21 February 2023).
32. Brach, A.R.; Song, H. eFloras: New directions for online floras exemplified by the Flora of China Project. Taxon 2006, 55, 188–192.
33. Gross, B.; Rieseberg, L. The ecological genetics of homoploid hybrid speciation. J. Hered. 2004, 96, 241–252.
34. Ho, L.S.T.; Ane, C. A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Syst. Biol. 2014, 63, 397–408.
35. Hansen, T.F. Stabilizing selection and the comparative analysis of adaptation. Evolution 1997, 51, 1341–1351.
36. Harmon, L.J.; Losos, J.B.; Jonathan Davies, T.; Gillespie, R.G.; Gittleman, J.L.; Bryan Jennings, W.; Kozak, K.H.; McPeek, M.A.; Moreno-Roark, F.; Near, T.J.; et al. Early bursts of body size and shape evolution are rare in comparative data. Evol. Int. J. Org. Evol. 2010, 64, 2385–2396.
37. Blomberg, S.P.; Rathnayake, S.I.; Moreau, C.M. Beyond Brownian motion and the Ornstein-Uhlenbeck process: Stochastic diffusion models for the evolution of quantitative characters. Am. Nat. 2020, 195, 145–165.
38. Jhwueng, D.C.; Liu, F.C. Effect of Polytomy on the Parameter Estimation and Goodness of Fit of Phylogenetic Linear Regression Models for Trait Evolution. Diversity 2022, 14, 942.
39. McCulloch, C.E.; Neuhaus, J.M. Prediction of random effects in linear and generalized linear models under model misspecification. Biometrics 2011, 67, 270–279.
40. Sheng, Y.; Yang, C.; Curhan, S.; Curhan, G.; Wang, M. Analytical methods for correlated data arising from multicenter hearing studies. Stat. Med. 2022, 41, 5335–5348.
41. Caetano, D.S.; O’Meara, B.C.; Beaulieu, J.M. Hidden state models improve state-dependent diversification approaches, including biogeographical models. Evolution 2018, 72, 2308–2324.
42. Grundler, M.C.; Rabosky, D.L. Complex ecological phenotypes on phylogenetic trees: A hidden Markov model for comparative analysis of multivariate count data. Syst. Biol. 2019, 69, 1200–1211.
43. Boyko, J.D.; O’Meara, B.C.; Beaulieu, J.M. A Novel Method for Jointly Modeling the Evolution of Discrete and Continuous Traits. Evolution 2023, 77, 836–851.

Figure 1. A three-taxa phylogenetic network. The hybrid species R of X and Y on the tips of the network was formed at $t = t_{1}$.

Figure 2. Evolution scenario for 3-taxa phylogenetic network containing a reticular hybridization event.

Figure 3. A phylogenetic tree of 3 taxa. X, Z, and Y are taxa; O is the root. X and Z share a common branch of length $t_{1}$. Y is independent of both X and Z.

Figure 4. Sunflower network redrawn from [33]. Species on the tips, from the leftmost (labeled with the number 1) to the rightmost (labeled with the number 11), are 1. praecox, 2. debilis, 3. neglectus, 4. petiolaris, 5. anomalus, 6. deserticola, 7. paradoxus, 8. annuus, 9. argophyllus, 10. bolanderi, and 11. exilis, where deserticola, anomalus, and paradoxus are hybrids of petiolaris and annuus. The eNewick format is (1 : 0, ((2 : 0.84, ((3 : 0.23, 4 : 0.23) 16 : 0.55, (5 : 0.32, (6 : 0.15, (7 : 0.15) 13) 12 : 0.18 : 0.16) 17 : 0.45) 20 : 0.06) 21 : 0.1, (((13, 8 : 0.15 : 0.35) 14 : 0.2, 9 : 0.35) 18, (10 : 0.16, 11 : 0.16) 15) 19 : 0.43) 22 : 0.06) 23 : 0.

Table 1. Ancestral–descendant relationship corresponding to Figure 1.

Rows	Parent	Descendant 1	Descendant 2
1	O	$X_{t_{1}}$	$Y_{t_{1}}$
2	$X_{t_{1}}$	X	$R_{t_{1}}$
3	$Y_{t_{1}}$	Y	$R_{t_{1}}$
4	$R_{t_{1}}$	R	R

Table 2. Sunflowers and their traits. Each column represents a sunflower species, while each row records the trait collected from the database.

	Praecox	Debilis	Neglectus	Petiolaris	Anomalus	Deserticola	Paradoxus	Annuus	Argophyllus	Bolanderi	Exilis
AnnPrec	1796.71	978.25	148.00	384.80	393.25	154.00	229.00	459.62	695.25	444.67	829.00
Annuals	95.00	7.00	27.50	15.50	34.50	7.25	21.00	13.50	35.00	5.50	2.90
Petioles	1.35	115.00	4.00	29.50	16.00	25.00	7.75	17.50	15.50	30.00	4.75
Peduncles	2.85	1.85	140.00	9.50	25.00	12.00	30.00	9.50	34.00	26.00	150.00
Involucres	6.25	4.50	3.00	120.00	3.00	9.50	17.00	19.50	6.00	17.50	20.00
Phyllaries	75.00	5.25	3.75	2.25	42.50	3.10	6.50	23.50	17.00	7.50	27.50
Paleae	9.50	25.00	7.15	6.80	3.25	25.00	3.50	2.00	19.00	17.00	8.50
Laminae	20.00	10.00	25.00	5.75	4.50	2.05	165.00	3.75	15.00	17.50	20.00
Ray florets	8.50	25.00	16.00	50.00	5.25	3.50	2.70	200.00	11.00	11.00	27.50
Disc florets	25.00	10.00	37.50	23.50	150.00	6.50	4.50	2.75	200.00	6.00	5.00
Corollas	25.00	27.50	10.50	25.00	17.50	150.00	7.00	5.00	2.35	105.00	2.50
Cypselae	8.00	21.00	14.00	10.00	17.00	14.50	75.00	6.00	4.00	2.35	65.00
Pappi	1.60	8.00	17.50	14.50	9.75	17.00	11.50	50.00	5.00	3.25	2.20

Table 3. Estimates of the hybrid effect ($\hat{τ}$) and the rate of evolution ($\hat{σ}$) with their standard errors, and of the slope ($\hat{β}_{1}$) with its 95% confidence interval (CI), for each of the 12 response traits of sunflowers under a network relationship. The slope estimate represents the effect of precipitation on the particular trait, with a positive value indicating a positive relationship and a negative value indicating a negative relationship. The CI gives a range of plausible values for the true effect of precipitation on the trait.

Response Trait	$\hat{τ}$ ( ${se}_{\hat{τ}}$ )	$\hat{σ}$ ( ${se}_{\hat{σ}}$ )	$\hat{β_{1}}$ (CI)	Significant?
Annuals	0.841(0.07)	0.133(0.087)	0.243(0.038, 0.449)	Yes
Petioles	0.787(0.157)	0.106(0.069)	−0.091(−0.22, 0.038)	No
Peduncles	0.801(0.176)	0.157(0.103)	0.451(0.165, 0.737)	Yes
Involucres	1.068(0.038)	0.038(0.025)	0.123(0.106, 0.141)	Yes
Phyllaries	0.937(0.042)	0.048(0.031)	0.068(0.041, 0.096)	Yes
Paleae	1.005(0.035)	0.026(0.017)	−0.086(−0.094, −0.079)	Yes
Laminae	1.031(0.055)	0.061(0.04)	−0.019(−0.064, 0.026)	No
Ray florets	0.812(0.046)	0.057(0.037)	−0.038(−0.076, −0.001)	Yes
Disc florets	0.779(0.056)	0.106(0.069)	−0.006(−0.137, 0.124)	No
Corollas	1.065(0.052)	0.032(0.021)	0.048(0.036, 0.06)	Yes
Cypselae	1.221(0.11)	0.053(0.034)	0.071(0.038, 0.105)	Yes
Pappi	1.149(0.159)	0.051(0.033)	−0.019(−0.05, 0.012)	No

Table 4. The benchmark analysis involves the use of 12 traits, with RMSE1 computed via a simple linear regression baseline model, RMSE2 computed via the tree model [34], and RMSE3 computed via the proposed network model. The fourth and fifth columns of the table present the benchmark ratio for each model.

	RMSE1	RMSE2	RMSE3	$\frac{RMSE 2}{RMSE 1}$	$\frac{RMSE 3}{RMSE 1}$
Annuals	0.608	0.612	0.655	1.006	1.077
Petioles	0.558	0.559	0.565	1.002	1.013
Peduncles	0.718	0.727	0.730	1.013	1.017
Involucres	0.227	0.234	0.246	1.033	1.087
Phyllaries	0.276	0.276	0.277	1.001	1.005
Paleae	0.164	0.165	0.171	1.009	1.046
Laminae	0.250	0.267	0.265	1.065	1.059
Ray florets	0.307	0.308	0.357	1.004	1.162
Disc florets	0.665	0.676	0.768	1.017	1.154
Corollas	0.125	0.128	0.147	1.020	1.175
Cypselae	0.213	0.219	0.332	1.026	1.555
Pappi	0.175	0.183	0.239	1.046	1.367

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
