Exploring the Entropy Complex Networks with Latent Interaction

In the present work, we study the introduction of a latent interaction index, examining its impact on the formation and development of complex networks. This index takes into account both observed and unobserved heterogeneity per node in order to overcome the limitations of traditional compositional similarity indices, particularly when dealing with large networks comprising numerous nodes. In this way, it effectively captures specific information about participating nodes while mitigating estimation problems based on network structures. Furthermore, we develop a Shannon-type entropy function to characterize the density of networks and establish optimal bounds for this estimation by leveraging the network topology. Additionally, we demonstrate some asymptotic properties of pointwise estimation using this function. Through this approach, we analyze the compositional structural dynamics, providing valuable insights into the complex interactions within the network. Our proposed method offers a promising tool for studying and understanding the intricate relationships within complex networks and their implications under parameter specification. We perform simulations and compare the results with the formation of Erdös–Rényi and Barabási–Albert-type networks and with Erdös–Rényi and Shannon-type entropy. Finally, we apply our models to the detection of microbial communities.


Introduction
Complex Network Analysis (CNA) is a crucial field that spans various disciplines, addressing network dynamics [1-3]. In this context, the focus is on unraveling the complexities of network structure, specifically in the dynamics of link formation. The study delves into three fundamental network attributes: the homophily effect, unobserved heterogeneity, and persistence measures. Homophily, which denotes the tendency of nodes to connect with similar nodes, is a well-established phenomenon in real-world social networks [4]. However, many node characteristics influencing linking decisions remain unobservable, outnumbering observable ones. To address this challenge, a fixed effect approach to account for unobserved heterogeneity is introduced [5-7], as well as persistence measures as tools for quantifying time series data dependence [8]. These measures hold significant implications for various processes, for example, in information diffusion and ecological networks (see refs. [1,9-11]).
While complex networks offer powerful modeling capabilities, they also present significant challenges. One major hurdle is the lack of a comprehensive metric to effectively measure unobserved heterogeneity, especially in understanding interconnected components [12]. This metric should consider interaction abundance and latent nature, aligning with existing frameworks [13,14]. Furthermore, metrics assessing heterogeneity, both observed and unobserved, are intricately linked to the ways in which nodes are aggregated. From the above, an acute dilemma arises when unobserved heterogeneity is treated as an incidental parameter, independent of node aggregation. In such cases, the parameter vector dimension grows with network size, leading to non-standard estimation challenges, where classical results regarding the properties of maximum likelihood estimates (MLEs) no longer apply [15]. Additionally, certain models (see, for example, refs. [7,16,17]) disregard interdependencies in network formation. These limitations prompt essential questions: Can we devise a test for evaluating the link formation interdependence hypothesis? Is it feasible to extend the model's scope to incorporate these interdependencies? How can we address the challenges posed by complex network structures and their inherent uncertainties for a deeper understanding of link formation dynamics?
In this work, we introduce a novel framework to tackle these complex network analysis challenges, where our approach (i) incorporates a discrete latent interaction index that integrates parametric and semiparametric components, shedding light on network formation dynamics.
The realm of network models is diverse, ranging from classical ones [18-20] to the more recent advancements [6,7,16]. In the first group of models (I), random networks [18] aim to probabilistically study graph properties as the number of random connections increases, reflecting the disordered nature of link arrangements between different nodes. We start with the hypothesis that the proposed latent interaction index displaces the possibility of randomness in link formation. We conduct statistical significance tests based on this hypothesis. Additionally, the Watts-Strogatz model [20] presents a rewiring model that often exhibits high clustering coefficients in "small" networks. On the other hand, the Barabási-Albert model [19] relies on two ingredients: growth and preferential attachment. The idea is that by mimicking the dynamic mechanisms that assemble the network, we can reproduce the system's topological properties as observed here. The second group of models (II) has been limited to studying static nonlinear dyadic models and their asymptotic properties. Because the number of individual parameters is proportional to the number of nodes, a problem of incidental parameters results in asymptotic bias [6]. While the estimator is consistent, asymptotic bias is relevant for inference. We provide a model test based on the prevalence of transitive triads (i.e., node triples where links are transitive). Observed heterogeneity has also been incorporated through dyadic models that expand on this model, just as a probit or logit model generalizes a simple Bernoulli statistical model, which can be used in directed or undirected settings [21]. It is possible to extend the Erdös-Rényi model to incorporate other features [5].
Our proposed model seeks to bridge these two groups (I and II), offering a comprehensive approach to network analysis by incorporating the strengths of both.
Additionally, (ii) we present an entropy function that dynamically accounts for these components, providing insights into parameters related to persistence and homophily. The estimates derived from this entropy function provide valuable information for characterizing these parameters. This framework enhances our comprehension of link formation within dynamic networks, enabling us to explore the influence of these components on network formation and evolution [22-24].
To provide a comprehensive view, it is important to note that each entropy metric used in network analysis offers unique insights into network characteristics and its various components. However, it is widely acknowledged within the field that not all of these metrics can be universally applied to all categories of networks. In fact, this wealth of research is dispersed across numerous disciplines [1,17,22,23,25,26], making it challenging to identify the available metrics and understand the specific contexts in which they are applicable. Additionally, this dispersion complicates our ability to determine areas in need of further development.
These entropy metrics often depend on probability distributions based on various factors, such as node degrees [27,28], the degree and strength of node neighbors [23,29,30], or degrees associated with subgraphs of nodes [31]. Path-based metrics, considering sequences of linked nodes and repetitions of nodes and edges, are also common [32-34]. Moreover, entropy metrics explore other factors like closeness and information functionals [35,36]. Some metrics rely on probability distributions, including Bayes posterior probability, although specific calculation methods may not always be clear [37]. Notably, Wang et al. [38] introduced a combined metric, where the first part is calculated as the sum of closeness centrality and the clustering coefficient.
Ecological research has a long-standing tradition of studying co-occurrence and co-abundance patterns. These patterns often signify non-random species co-occurrence, indicating that interactions play a significant role in community structure, either by fostering aggregation or by promoting avoidance or exclusion, thus influencing the overall community dynamics. Macro-ecological interaction networks illustrate that such patterns bolster community robustness and functionality, crucial for comprehending community dynamics and productivity [25,39]. Microorganisms engage in diverse relationships, encompassing both antagonistic and cooperative interactions. With the advancements in sequencing technologies, we now have access to substantial datasets for analysis. This allows for the construction of co-occurrence networks using correlation coefficients or similar metrics. However, interpreting these networks, especially in microbial surveys with poorly understood organism behaviors, presents significant challenges [11,17,40].
The complexity of microbial communities makes it challenging to validate community-wide interactions due to the multitude of species and limited experimental approaches. Consequently, modeling microbial populations using simplified growth and interaction rules offers an alternative approach to simulate the dynamics of these intricate multispecies communities. In this study, we consider the model proposal as an application for identifying microbial networks. Concretely, we apply our dynamic network formation model to an 18S rRNA gene amplicon dataset. The original dataset comprises 19 samples, and we observe a total of 3831 OTU (Operational Taxonomic Unit) entries. These observations are obtained through Lagrangian sampling as part of a study conducted by Hu et al. [40].
This work starts by providing an introduction to our notations, delineating the symbols and conventions used throughout this study. The organization of this paper is as follows: In Section 2, we present the proposed model; then, in Section 3, we introduce the entropy function. In Section 4, simulation results are presented, and in Section 5 we apply the model to microbial network identification. Section 6 provides the conclusions and discussions, while all proofs of the theorems and the elimination of fixed effects are presented in Appendix A.
Notation 1. A network G = (V, E) is an ordered pair of sets V and E, where V is a finite nonempty set of elements called nodes, and the set E is composed of two-element subsets {ij} of V called edges. If i and j are connected, {ij} constitutes a dyad, and j is a neighbor of i.

Structural Model
We consider a dynamic group interaction scenario consisting of a large population of connected nodes. We let $i = 1, \ldots, N$ index a random sample of size $N$ from this population, observed at times $t = 1, \ldots, T$. Each node $i$ has a profile $(X_{i,t}, A_i)$, where $X_{i,t}$ is an aggregated vector of the observed time-varying characteristics and $A_i$ contains unobserved information, assumed to be time-invariant. We let $\mathrm{Supp}(X_{i,t})$ be a compact subset of $\mathbb{R}^{\dim(X_{i,t})}$, and $A_i$ is distributed compactly and continuously on the same support, conditional on $X_{i,t} = x$ for all $x \in \mathrm{Supp}(X_{i,t})$.
Linking decisions are a binary choice that depends solely on the characteristics of the two nodes connected by the link. We observe relationships between nodes through the indicator variable $C_{ij,t} \sim \mathrm{Bernoulli}(p_{ij,t})$, where $C_{ij,t} = 1$ if node $j$ interacts (success) with node $i$ at time $t$ and $C_{ij,t} = 0$ (failure) otherwise. The parameter $p_{ij,t}$ can be interpreted as the detection rate of the interaction between nodes $i$ and $j$. Connections are undirected (i.e., $C_{ij,t} = C_{ji,t}$), and self-ties are ruled out (i.e., $C_{ii,t} = 0$ for all $t$). For each $t = 1, \ldots, T$, there is a corresponding $N \times N$ socio-matrix $C = (C_{ij,t})_{i \neq j}$ that captures the interaction dynamics between nodes $i$ and $j$ across all time steps.
We parameterize the latent interaction structure according to the probability of each link $C_{ij,t}$ (Equation (1)), where $\mathbf{1}(\cdot)$ denotes the indicator function. The $q$-dimensional vector $\alpha^0 = (\alpha^0_1, \ldots, \alpha^0_q)$, with $\|\alpha^0\| < 1$ and $1 \le q \le t - 1$, captures the autocorrelation or cumulative nonlinear persistence of the time series [8]. The variable $X_{ij,t} : \mathrm{Supp}(X_{i,t}) \times \mathrm{Supp}(X_{j,t}) \to \mathbb{R}^{\dim(\beta)}$ is a known transformation of $(X_{i,t}, X_{j,t})$. This function is symmetric, so that $X_{ij,t} = X_{ji,t}$. For example, if $X_{i,t}$ and $X_{j,t}$ are location coordinates, $X_{ij,t}$ is equal to the "distance" between $i$ and $j$. This choice was implemented under the consideration that nodes only form connections if they are close enough [7,16,21]. The vector $\beta^0$ is an unknown model parameter that parameterizes homophily preferences. The parameter vector $\theta^0$ collects the homophily and persistence parameters, and the term $D_{ij,t}$ denotes the memory effect of connections that nodes $i$ and $j$ have had in common up to time $t$. The variable $A_{ij}$ is a component that varies with unobserved attributes by node pairs, as in Graham's model [7], and $\varepsilon_{ij,t}$ represents an idiosyncratic component that is assumed to be independent and identically distributed over time. Moreover, this component is assumed to be independent across pairs, although not necessarily identically distributed; its joint distribution is $F(\varepsilon_{12,1}, \ldots, \varepsilon_{12,T}, \ldots, \varepsilon_{(N-1)N,1}, \ldots, \varepsilon_{(N-1)N,T})$.
It is important to note that Equation (1) captures in a parsimonious way three forces that researchers consider important for bond formation [41]. First, linkages are state dependent; that is, the linkage returns for $i$ and $j$ are higher in the current period if they were also connected in previous periods. Second, there are returns to "triadic closure": profit is higher if transitive aspects are considered in the interaction between nodes. In addition, Rule (1) is more general than the special case with $D_{ij,t} = 0$ and $\alpha^0_p = 0$ for all $p = 1, \ldots, q$, which would imply that only direct entailments are important, not autocorrelation and particular incentives for interaction.
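As a reading aid, the latent index rule described in this paragraph can be sketched in the following form. This is a hypothetical reconstruction from the components named in the text (indicator function, persistence terms, homophily term, memory effect, pair-level unobserved component, idiosyncratic shock) and from the observed component of link surplus quoted later under Assumption 2 (iii). The sign convention on $\varepsilon_{ij,t}$, the unit coefficient on $D_{ij,t}$, and the threshold at zero are assumptions, not the authors' stated Equation (1).

```latex
% Hypothetical sketch of the latent index rule; signs, the unit coefficient
% on D_{ij,t}, and the zero threshold are assumptions.
C_{ij,t} = \mathbf{1}\Big(
    \sum_{p=1}^{q} \alpha^0_p \, C_{ij,t-p}
    + X_{ij,t}' \beta^0
    + D_{ij,t}
    + A_{ij}
    - \varepsilon_{ij,t} \ge 0
\Big)
```

Under this form, setting $D_{ij,t} = 0$ and $\alpha^0_p = 0$ for all $p$ leaves only the homophily term and the unobserved component, which matches the special case discussed in the text.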
The degree of a node is defined as the number of links it possesses, which can be represented as the sum of connections it has with other nodes, and denoted as $C_{i+,t} = \sum_{j \neq i} C_{ij,t}$. The network's degree sequence is obtained by summing the rows (or columns) of the adjacency matrix, resulting in an $N \times 1$ vector $C_+ = (C_{1+,t}, \ldots, C_{N+,t})'$. For parameter values $\theta \in \mathrm{int}(\Theta)$ and $A = ((A_i)_{i=1,\ldots,N}, (A_j)_{j=1,\ldots,N}) \in \mathrm{Supp}(A)$, we define the link probability $p_{ij,t}(\theta, A)$. With the information presented above, we are now able to outline the principal assumptions that significantly influence our work:
Assumption 1. Equations (1) and (2) specify a dynamic model of node interactions, together with the conditional likelihood of each link $C_{ij,t} = c_{ij,t}$.
Here, Assumption 1 implies that the idiosyncratic component of link surplus, $\varepsilon_{ij,t}$, is a standard logistic random variable that is independently and identically distributed across pairs of nodes. The assumption that links are formed independently of each other based on agent attributes may hold in some situations but not in others. Specifically, Equation (1) and Assumption 1 are suitable for scenarios where link formation is predominantly bilateral. This is particularly relevant in certain types of friendship and trade networks, as well as in models of specific types of conflicts between nation-states [42,43]. In these contexts, the incorporation of unobserved node characteristics into the link formation model represents a significant and useful generalization relative to many commonly used models.
The objective pursued here is to study the identification and estimation problems posed by the specification in Equation (1) and Assumption 1. This setup encompasses a useful class of empirical examples and represents a natural starting point for a formal statistical analysis. In this context, early methodological work focused on introducing unobserved correlated heterogeneity into static choice models [44,45]. Subsequent work incorporated the possibility of state dependence in choice [46].
The estimated values of the parameters are the solution to the population conditional maximum likelihood problem (Equation (6)) for every $N$ and $T$. Here, $E_a$ denotes the expectation with respect to the distribution of the data conditional on the unobserved effects.
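The two quantities in this paragraph, the degree sequence and a logistic link probability, can be illustrated with a minimal Python sketch. The function names `link_probability` and `degree_sequence` and the small example matrix are hypothetical and introduced only for illustration. The logistic form follows from the standard logistic shock in Assumption 1, while the exact argument of the paper's link probability is not reproduced here, so the latent index is taken as a given number.

```python
import math

def link_probability(index):
    """Logistic CDF at a latent index value (assumes a standard logistic shock).

    Written in a numerically stable way so large |index| does not overflow.
    """
    if index >= 0:
        return 1.0 / (1.0 + math.exp(-index))
    e = math.exp(index)
    return e / (1.0 + e)

def degree_sequence(adj):
    """Row sums of a symmetric 0/1 adjacency matrix with a zero diagonal."""
    return [sum(row) for row in adj]

# Hypothetical 4-node undirected network with edges {0,1}, {0,2}, {1,2}, {2,3}.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
degrees = degree_sequence(adj)
```

Because the network is undirected, the degrees sum to twice the number of edges, which gives a quick sanity check on any simulated socio-matrix.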

Assumption 2.
(i) Asymptotics: We consider limits of sequences where $N/T$ approaches a constant value $c$ as both $N$ and $T$ tend to infinity, where $c$ is a finite number greater than zero.
(ii) Sampling: Conditional on $a$, $Y_{ij,t} = (C_{ij,t}, X_{ij,t})$ is independent across dyads, where $\mathcal{A}$ is the σ-field generated by $(Y_{ij,t}, Y_{ij,t-1}, \ldots)$ and $\mathcal{B}$ is the σ-field generated by $(Y_{ij,t}, Y_{ij,t+1}, \ldots)$.
(iii) Compact support: The support of $X_{ij,t}$ is a compact subset of $\mathbb{R}^{\dim(\beta)}$.
(iv) Concavity: For all $N$, $T$, $(\theta, a_N) \mapsto L_{NT}(\theta, a_N)$ is strictly concave over $\mathbb{R}^{\dim \theta + N}$.
Just for completeness, Assumption 2 (i) defines the large-T asymptotic framework and is the same as in Hahn and Kuersteiner [47]. The relative rate exactly balances the order of the bias and variance, producing a non-degenerate asymptotic distribution. Assumption 2 (ii) imposes neither identical distribution nor stationarity over the time series dimension, conditional on the unobserved effects, unlike most of the large-T panel literature [47]. Additionally, it is used to bound covariances and moments in the application of the Laws of Large Numbers (LLN); as we see below, it could be replaced by other conditions that guarantee the applicability of these results. Assumption 2 (iii) is standard in the context of nonlinear estimation problems [48]. It implies that the observed component of link surplus, $\sum_{p=1}^{q} \alpha_p c_{ij,t-p} + \beta x_{ij,t} + d_{ij,t}$, has bounded support. This simplifies the proofs of the main theorems, especially those of the ML estimator. Furthermore, (iv) imposes smoothness and moment conditions on the log-likelihood function and its derivatives. These conditions guarantee that the higher-order stochastic expansions of the fixed effect estimator that we use to characterize the asymptotic bias are well-defined, and that the remaining terms of these expansions are bounded. In addition, this guarantees that all the elements of $X_{ij,t}$ have cross-sectional and time series variation. It also guarantees that $\theta$ is the unique solution to the population problem (given by Equation (6)), that is, all the parameters are point identified. The existence and uniqueness of the solution to the population problem are guaranteed by Assumption 2, including the concavity of the objective function in all parameters.
Together with the above, and denoting $p_{ij,t} = p_{ij,t}(\theta^0, a^0_N)$, Parts (iii) and (iv) of Assumption 2, in combination with $\mathrm{Supp}(A_i)$ being a compact subset of $\mathbb{R}$, imply that $p_{ij,t}(\theta, a_N) \in (\kappa, 1 - \kappa)$ for some $0 < \kappa < 1$ and for all $\theta$ and $a_N \in \mathrm{Supp}(A)$. An implication of this fact is that $(C_{ij,t} - p_{ij,t}) \log(p_{ij,t}(\theta, a_N))$ is a bounded random variable. A more involved argument shows that it is possible to estimate the difference between $C_{ij,t}$ and $p_{ij,t}$ with uniform accuracy.
With the aforementioned assumptions in place, we can now discuss the primary theorems used throughout the work. Theorem 1 suggests that as more data are collected (increasing $N$) and a broader time horizon is considered (increasing $T$), the difference between latent variables and observed probabilities becomes relatively small and remains bounded. This interpretation may be relevant for assessing the accuracy or validity of a latent model in relation to real observations within a network. The term $\ln(NT)$ in the upper bound can be interpreted as a measure of the uncertainty associated with the difference between latent variables and observations. As $N$ and $T$ grow, uncertainty decreases.
The following theorem is related to a generalized form of the Law of Large Numbers (LLN) adapted to the context of complex networks. Theorem 2 holds under Assumptions 1 and 2, a finiteness condition that holds for all $t$, and the requirement that $\mathcal{F} = \{c_{ij,t'} : t' < t\}$ is a filtration with respect to $\mathcal{A}$. In the LLN, the average of random variables is expected to converge to the expected value as the sample size grows. In this case, the sum of certain probability functions $l^{\theta,a_N}_{ij,t}$ for all dyads in the network converges in probability towards a sum of probabilities associated with the dyads. Convergence in probability implies that as the network size (or the number of dyads) grows, the conditional expectation of the discrete choice probabilities approaches the expected value of those probabilities for all dyads. This can have significant implications in the theory of complex networks. One example is the stability of emergent patterns: if the result holds, it implies that as the network grows, emergent patterns in discrete choices may become more stable and predictable, providing a deeper understanding of collective behavior in the network [49].

Exploring the Entropy
Combining Assumption 1 and conditioning on $X_{ij,t}$, $D_{ij,t}$ and $A_{ij}$, we write $l^{\theta,a_N}_{ij,t}$ for the log-likelihood contribution of link $\{ij\}$ (Equation (8)). Since entropy characterizes the logarithm of the number of different nodes that can be separated in the stochastic dynamics of the network [37,50], we use Equation (8) to provide a new node interaction detection rate. We note that by the asymptotic equipartition property (AEP) (see, e.g., ref. [51]), we have $l^{\theta,a_N}_{ij,t}$ converging in probability to the entropy of $C$, denoted as $H(C)$, where $C$ represents the socio-matrix of the network. Formally, in Equation (9), the variable $k$ ranges from 1 to $+\infty$, indicating that all possible configurations of connections that do not exist between nodes $i$ and $j$ are considered. The series term of the form $(\cdot)^k / k$ represents the probability of there not being a connection between nodes $i$ and $j$ at time step $t$. Therefore, Equation (9) combines the influences of both existing and non-existing connections at each time step to compute the entropy of the dynamic network. It is crucial to note that Equation (9) comprehensively encompasses the charging capability of the logistic distribution, a facet that some propositions tend to disregard [52]. For the sake of completeness, Figure 1 shows the behavior of the entropy $H(C)$ for values of $N$ nodes.
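The role of the series in $k$ can be illustrated with a minimal Python sketch. It relies on the standard identity $\log(1 - p) = -\sum_{k \ge 1} p^k / k$ for $0 \le p < 1$, which is how a sum over $k$ from 1 to $+\infty$ with terms of the form $(\cdot)^k / k$ can encode the contribution of absent links. The function names and the averaging over dyads are assumptions made for illustration; this is a generic Bernoulli (Shannon-type) entropy, not the paper's Equation (9).

```python
import math

def log1m_series(p, terms=200):
    """Truncated series -sum_{k=1}^{terms} p^k / k, approximating log(1 - p) for 0 <= p < 1."""
    return -sum(p ** k / k for k in range(1, terms + 1))

def bernoulli_entropy(p):
    """Entropy in nats of one Bernoulli(p) link indicator (0 at the endpoints)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def network_entropy(probs):
    """Average Bernoulli entropy over dyad-level link probabilities.

    The averaging is an illustrative normalization, not the paper's.
    """
    return sum(bernoulli_entropy(p) for p in probs) / len(probs)

# Hypothetical link probabilities for three dyads.
example_entropy = network_entropy([0.2, 0.5, 0.8])
```

The entropy per dyad is largest at $p = 0.5$ and symmetric around it, so a network whose link probabilities are pushed away from one half (for example by strong homophily or persistence) has lower entropy under this sketch.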
The following theorems establish the consistency of $\hat{\theta}$ (Equation (4)). Theorem 3 is stated under Assumptions 1 and 2. Theorem 2 provides a foundation for drawing inferences about the parameter vector encompassing homophily and nonlinear persistence. However, attaining asymptotic normality, for reasons we elaborate on, cannot be guaranteed. The consistency argument for models with only individual effects is based on partitioning the log-likelihood into the sum of individual log-likelihoods that depend on a fixed number of parameters: the model parameter and the corresponding individual effect. The individual log-likelihood maximizers are then consistent estimators of all parameters as the individual samples become large, according to standard arguments. This approach does not work with a network structure because there is no partition of the data that is only affected by a fixed number of parameters and whose size grows with sample size [6].
To achieve asymptotic normality over the observed component, we first need to control for the unobserved component and second to establish consistency of the estimated entropy function, which depends on both components. We assess node performance and select a group of exogenous nodes to serve as a "testing ground". To achieve this, we examine the conditional expectation of $C_{ik,t}$ and $C_{jk,t}$, conditioning on the observable characteristics of node $k$ and on the characteristics of nodes $i$ and $j$ based on $X_{ij,t}$ and $A_{ij}$. We denote by $H_{ij,t}(x_{k,t}, a_{ij})$ the corresponding conditional expectation, which involves the difference $C_{ik,t} - C_{jk,t}$. Following Parzen's estimation [53] and Rosenblatt's remarks [54], we define a dyadic extension for monadic data, $\delta_{ij}$. Here, $K(x)$ is a density function that satisfies the usual regularity conditions and integrates to one ($\int K(x)\,dx = 1$). The bandwidth $h(N)$ is assumed to be a positive, deterministic sequence that tends to zero as $N \to \infty$.
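For readers unfamiliar with the Parzen-Rosenblatt construction referenced here, a minimal Python sketch of the standard one-dimensional kernel density estimator follows. It shows only the monadic building block $\hat{f}(x) = \frac{1}{n h} \sum_i K\big((x - x_i)/h\big)$. The Gaussian kernel, the fixed bandwidth, and the sample values are assumptions chosen for illustration, and the paper's dyadic extension is not reproduced.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: non-negative and integrates to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_density(x, sample, h):
    """Parzen-Rosenblatt estimate f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h)."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)

# Hypothetical sample and bandwidth; in the text h(N) shrinks to zero as N grows.
sample = [-1.0, 0.0, 1.0]
density_at_zero = parzen_density(0.0, sample, h=0.5)
```

Because the kernel integrates to one, the estimate itself integrates to one for any positive bandwidth; a smaller `h` makes the estimate more concentrated around the sample points.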
There are at least two approaches to the estimation of unobserved heterogeneity (fixed effects). The first takes a computational perspective [6,55]. For these purposes, the solution of program (6) for $\theta$ is the same as the solution of the program that imposes $\iota_N' a_N = 0$, with $\iota_N$ a vector of $N$ ones, directly as a constraint on the optimization, which is invariant to normalization. This constrained program has good computational properties because its objective function is concave and smooth in all the parameters. The second alternative arises from Parzen's estimation of a density function [53]. This alternative is also efficient for the estimation of unobserved heterogeneity. The problem of estimating a probability density function over the unobserved component is sometimes similar to the problem of estimating maximum likelihood parameters. However, in a network setting, it is more similar to estimating the spectral density function of a stationary process [53]. Focusing on the second alternative, the following argument shows that it is possible to estimate unobserved heterogeneity with a given probability of occurrence. We consider $\iota_{\dim \theta}$ as a vector of $\dim(\theta)$ ones. We let $L : \mathbb{R} \to \mathbb{R}$ be a Lipschitz, differentiable, symmetric kernel function, and $\theta$ as in Theorem 2.
Theorem 4 is stated under Lemma 1. Chatterjee, Diaconis, and Sly [56] demonstrated the uniform consistency of the estimator $\hat{A}_l(\theta)$ in the model that does not incorporate dyad-level covariates. The key to this theorem is the following: In sparse network sequences, we effectively witness $N - 1$ linking decisions made by each node, which means that we observe whether node $i$ links to every other node $j$. This unique feature of the problem allows for consistent estimation of $\hat{A}_l(\theta)$ for each node. The argument becomes tedious because of the interdependence of the linking decisions in the sequences of nodes $i$ and $j$. However, this dependence is weak, only arising via the presence of $C_{ij,t}$ in both link sequences. Establishing asymptotic normality of $\hat{\theta}$ is also involved. This is because the sampling properties of $\hat{\theta}$ are influenced by the estimation error in $\hat{A}_l(\theta)$. This influence generates a bias in the limit distribution of $\hat{\theta}$. This bias is similar to that which arises in large-N, large-T joint fixed effects estimation of non-linear panel data models [47].
To state the form of the limit distribution, we let $\hat{H}(C)$ and $H^0(C)$ be the entropy computed over the parameter vectors $\hat{\theta}$ and $\theta^0$, respectively. Our objective is to estimate the quantity $H(C)$ within the family of networks $\mathcal{C}$ that contains nodes $i$ and $j$. Our estimator is expected to provide a reliable estimate of $H(C)$. Here, we state the following result, Theorem 5, which holds under Assumptions 1 and 2. The inequality in Theorem 5 demonstrates that our estimator $\hat{H}(C)$ enjoys uniform consistency within the class $\mathcal{C}$. In simpler terms, it implies that, as our sample size $N$ and time period $T$ increase, the maximum absolute difference between our estimator and the true value $H(C)$ across all sets $C \in \mathcal{C}$ becomes small. The probability that the bound involving $\frac{1}{NT} \log NT$ holds is stated to be $1 - O((NT)^{-1})$, meaning that it holds with high probability as the size of the network and the number of time steps grow large. This result provides an upper bound on the discrepancy between the estimated and true entropy, ensuring the reliability of the estimation in the context of the class of networks $\mathcal{C}$.
We are now in a position to state Theorem 6, which holds under Assumptions 1 and 2. To converge to a normal distribution, the difference between the estimator $\hat{\theta}$ and the true value $\theta^0$ has to be bias-corrected and scaled proportionally to the number of nodes $N$ and time $T$. In the dense network setting considered here, $\theta^0$ is estimated based on the observed linking decisions about $N(N - 1)$ potential links. Therefore, the rate of convergence $\sqrt{NT}$ is the conventional parametric rate corresponding to the sample size [5,7].
We conclude this section with some functional properties of the entropy function, given by Theorem 7, which holds under Assumptions 1 and 2. Part (ii) of the theorem concerns two filtrations $\mathcal{F}$ and $\mathcal{F}'$ with respect to $\mathcal{A}$, each of the form $\{c_{ij,t'} : t' < t\}$.
Theorem 7 states that the entropy $H(C)$ of the dynamic network $C$ is bounded by the mutual information between successive states of the filtrations $\mathcal{F}$ and $\mathcal{F}'$. This means that as the states of the network become more predictable and related to each other, the entropy decreases, implying greater structure and order in the network. Conversely, if the states are more independent and random, the entropy increases, reflecting a more chaotic and less predictable structure in the network.

Benchmark and Simulations
In this section, we study the finite sample performance of the procedures in Monte Carlo simulations, implemented in Matlab. We compared the development and robustness of our network formation model with Erdös-Rényi [18] and Barabási-Albert [19]-type networks. The Barabási-Albert network was generated with a connection probability of 0.5 and five new links in each period. These comparisons were made with the metrics of degree distribution, clustering coefficient, and entropy value. The experiment was based on the latent index formation rule with $\beta^0 = 0.5$, $\alpha^0$ a random vector with a norm of less than one, and $X_i \in \{-1, 1\}$, $i = 1, \ldots, N$, being independent and identically distributed random variables simulated by $X_i = 1 - 2 \cdot \mathbf{1}\{i \text{ is even}\}$, with network sizes of 100, 150, 200, 250 and 500. For larger sample sizes, the behavior of the entropy function is, on average, similar. With this specification, nodes with an even index prefer links to nodes with an even index over links to nodes with an odd index, and vice versa for nodes with an odd index. Through 1000 repetitions of the experiment, we show the reproducibility and dynamics of the constructed networks. A 15-step time experiment was proposed. The descriptive characteristics of the network formation are shown in Table 1. Based on Table 1, we can perform a comparative analysis between the three generated networks.

1. Mean Degree: The mean degree represents the average number of connections that nodes have in the network. In the simulated network, the mean degree decreases as the network size increases, suggesting that nodes tend to be less connected to each other. This could be influenced by the parameters of the network generation model, such as α, β, and p, which affect the probability of forming new connections at each time step. On the other hand, the Erdös-Rényi and Barabási-Albert networks maintain their mean degree relatively constant, indicating that their connection generation process is not strongly influenced by network size.

2. Standard Deviation of Degree: The standard deviation of the degree measures the variability in the number of connections that nodes have in the network. In the simulated network, the standard deviation of the degree tends to decrease as the size of the network increases, implying that node degrees become more homogeneous. This could be a desirable feature in some contexts, as it indicates that the simulated network tends to have a more uniform degree distribution, which is associated with greater robustness and stability in its structure.

3. Clustering Coefficient: The clustering coefficient measures the proportion of connections that exist between the neighbors of a given node. In the simulated network and Erdös-Rényi networks, the clustering coefficient tends to decrease as the size of the network increases. This suggests that nodes tend to be less interconnected compared to smaller networks. On the other hand, in the Barabási-Albert network, the clustering coefficient remains at one, indicating that neighboring nodes are highly connected. This result is characteristic of Barabási-Albert scale-free networks, where new nodes tend to preferentially connect to existing nodes with higher degrees, resulting in high clustering among the neighbors of each node.
Regarding the convergence order, the simulated network exhibits a behavior intermediate between Erdös-Rényi and Barabási-Albert networks in terms of mean degree and clustering coefficient. While Erdös-Rényi networks are more homogeneous and less clustered, and Barabási-Albert networks are more heterogeneous and highly clustered, the simulated network shows intermediate characteristics, making it suitable for representing systems that contain elements of both tendencies.
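The three descriptive metrics discussed above can be computed with a few lines of code. The following is a minimal sketch rather than the Matlab code used in the experiments. It assumes a logit link probability of the form $\Lambda(\beta_0 X_i X_j + a_i + a_j)$ with fixed effects set to zero by default; this functional form and the names `simulate_network`, `degree_stats`, and `average_clustering` are illustrative assumptions.

```python
import math
import random


def simulate_network(n, beta0=0.5, a=None, seed=0):
    """Draw one undirected network from a logit latent-index rule.

    ASSUMPTION: the link probability is logistic(beta0 * X_i * X_j + a_i + a_j).
    The fixed effects a_i default to zero; the paper's exact rule may differ.
    """
    rng = random.Random(seed)
    # X_i = 1 - 2 * 1{i is even} for i = 1, ..., n, as in the text.
    x = [1 - 2 * (i % 2 == 0) for i in range(1, n + 1)]
    a = a if a is not None else [0.0] * n
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            index = beta0 * x[i] * x[j] + a[i] + a[j]
            if rng.random() < 1.0 / (1.0 + math.exp(-index)):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def degree_stats(adj):
    """Return (mean degree, population standard deviation of the degree)."""
    degrees = [len(neighbors) for neighbors in adj]
    mean = sum(degrees) / len(degrees)
    var = sum((d - mean) ** 2 for d in degrees) / len(degrees)
    return mean, math.sqrt(var)


def average_clustering(adj):
    """Average local clustering: links among neighbors over possible links."""
    total = 0.0
    for neighbors in adj:
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute zero
        nb = sorted(neighbors)
        links = sum(
            1 for p in range(k) for q in range(p + 1, k) if nb[q] in adj[nb[p]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


if __name__ == "__main__":
    for size in (100, 150, 200):
        network = simulate_network(size, seed=size)
        print(size, degree_stats(network), average_clustering(network))
```

Because the sketch omits the dynamic component and the fixed effects, its output is not expected to reproduce Table 1; it only illustrates how the metrics are obtained from an adjacency structure.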
Concerning entropy, we validated the behavior of the entropy H(C) across the same network sizes over three time periods. We compared the results with Erdös-Rényi entropy [57] and Shannon entropy [58]. Table 2 summarizes the results obtained from 1000 simulations. The analysis shows that H(C) performs consistently well across the network sizes and time periods considered. It attains values competitive with Shannon entropy and significantly outperforms Erdös-Rényi entropy. The results indicate that H(C) is a reliable and effective measure to capture the information flow in network dynamics. The lower values obtained by H(C) compared to Shannon entropy suggest that it provides a more informative representation of the network's complexity. Furthermore, the increasing trend of H(C) with network size suggests that it captures the growing complexity of larger networks, and that larger networks tend to have more structure and order. Overall, these findings support the usefulness of H(C) as an entropy measure for analyzing network dynamics and information flow. Higher Shannon entropy values indicate greater diversity or complexity within the networks. In this context, Shannon entropy decreases as the size of the network increases, which implies greater self-organization and less uncertainty within larger networks.
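To make the entropy comparison concrete, the following minimal sketch computes a Shannon-type entropy over dyads. It is not the definition of H(C) given in the paper. It assumes, following the normalization described in the proof of Theorem 7, that the link probabilities are rescaled into weights $q_{ij}$ that sum to one; the function name `dyad_entropy` is hypothetical.

```python
import math


def dyad_entropy(link_probs):
    """Shannon entropy of normalized dyad weights q = p / sum(p).

    ASSUMPTION: normalizing link probabilities over dyads is an illustrative
    choice and may differ from the exact construction of H(C).
    """
    total = sum(link_probs)
    if total <= 0:
        raise ValueError("at least one link probability must be positive")
    entropy = 0.0
    for p in link_probs:
        if p > 0:  # terms with q = 0 contribute nothing to the sum
            q = p / total
            entropy -= q * math.log(q)
    return entropy


if __name__ == "__main__":
    probs = [0.62, 0.38, 0.62, 0.38, 0.62, 0.62]
    # For M dyads, Shannon entropy lies between 0 and log(M).
    print(dyad_entropy(probs), math.log(len(probs)))
```

The upper bound $\log M$ is reached when all dyads carry equal weight, and the lower bound 0 when a single dyad carries all the weight; this is the standard Shannon bound, not the topology-based bound derived in the paper.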

Empirical Application
In this section, we apply our dynamic network formation model (Equation (1)) to the 18S rRNA gene amplicon dataset from a study by Hu et al. [40]. This application focuses on the identification of microbial networks. Seawater samples were collected from a depth of 15 m every 4 h following a Lagrangian sampling schematic in an anticyclonic eddy in the North Pacific Subtropical Gyre, as a part of the Simons Collaboration on Ocean Processes and Ecology (SCOPE, http://scope.soest.hawaii.edu/) cruise efforts in July 2015. Some species with taxonomical classification of RNA OTUs are shown in Table 3.

Analytical Processes
We examined the influence of species richness, focusing on the relative rather than absolute frequency of OTUs. This simple choice forms the primary homophilic structure governing interactions among species taxa in this microbial context, where species engage based on their relative abundances. Subsequently, we applied the Community Louvain algorithm to identify the microbial communities participating in various interactions during each sampling period. To validate the algorithm's findings, we conducted null modularity calculations with 1000 replicates to assess the statistical significance and distinctiveness of the identified communities within the networks. Additionally, we considered community uniformity and similarity across the sampling periods. To confirm sample dissimilarity, we conducted multiple ANOVA tests and employed the Jaccard test. Our analysis encompassed sensitivity, interaction intensity, and the effect of observable and non-observable parameters on microbial diversity. Owing to computational cost, we evaluated six samples. Samples were collected using 10 L Niskin bottles mounted on a CTD rosette at 6 a.m., 10 a.m., 2 p.m., 6 p.m., 10 p.m., and 2 a.m. Corresponding temperature, salinity, dissolved oxygen, and chlorophyll a data were derived from the same CTD casts. The input data are presented in the form of sequential count tables, where each column represents a sample, and each row represents a taxonomic designation (OTU or transcription ID) with sequence count or read coverage abundance per taxon. Global singletons (where a single OTU appears with a frequency of 1 in the entire dataset sequence) are removed. Out of a total of 3831 taxa observed, 1779 are eliminated.
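The preprocessing described above (removal of global singletons, then conversion to relative frequencies) can be sketched as follows. This is an illustrative sketch and not the pipeline used for the dataset; the count table layout (a dictionary from taxon name to a list of per-sample counts) and the taxon names are hypothetical.

```python
def remove_global_singletons(counts):
    """Drop taxa whose total count over all samples is exactly 1."""
    return {taxon: row for taxon, row in counts.items() if sum(row) != 1}


def relative_abundance(counts):
    """Convert each sample (column) into proportions that sum to 1."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[s] for row in counts.values()) for s in range(n_samples)]
    return {
        taxon: [row[s] / totals[s] if totals[s] else 0.0 for s in range(n_samples)]
        for taxon, row in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical toy table: rows are taxa, columns are three samples.
    table = {"OTU_a": [3, 0, 1], "OTU_b": [0, 1, 0], "OTU_c": [1, 1, 2]}
    kept = remove_global_singletons(table)
    print(relative_abundance(kept))
```

In this sketch the proportions are computed after the singletons are removed, so each sample sums to one over the retained taxa; whether normalization precedes or follows the filtering in the original analysis is not stated in the text.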

Results
Incorporating the details outlined above, along with the dynamic network formation model (1), we present the following results.

Calculating Sensitivity and Specificity, Effect of Interaction Intensity
The interaction network and the co-occurrence network were compared to determine the sensitivity and specificity of the constructed co-occurrence network in detecting direct (first-order) interactions [25]. For this calculation, a true positive (TP) was indicated by the presence of an edge in the co-occurrence network that had the same sign as in the interaction network (when using association metrics with sign). A false positive (FP) represented an edge in the co-occurrence network that was not present in the interaction network. A false negative (FN) denoted an edge in the interaction network that was absent in the co-occurrence network. A true negative (TN) was the absence of an edge in both the interaction and co-occurrence networks. Sensitivity was defined as TP/(TP + FN), and specificity was defined as TN/(TN + FP). In cases where two species interacted with each other with different signs, the interaction with the larger absolute value was taken as the sign of the net interaction. In addition, we calculated precision as TP/(TP + FP) and the F1 score as 2 × (precision × sensitivity)/(precision + sensitivity).
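The definitions above translate directly into code. The sketch below compares two signed adjacency matrices with entries in {-1, 0, +1} over dyads i < j. The function name `edge_metrics` and the handling of edges that are present in both networks with opposite signs (they are left out of all four counts) are assumptions, since the text does not specify that case.

```python
def edge_metrics(interaction, cooccurrence):
    """Sensitivity, specificity, precision, and F1 for a co-occurrence network.

    Both inputs are symmetric matrices (lists of lists) with entries in
    {-1, 0, +1}; only dyads with i < j are counted.
    """
    tp = fp = fn = tn = 0
    n = len(interaction)
    for i in range(n):
        for j in range(i + 1, n):
            truth, pred = interaction[i][j], cooccurrence[i][j]
            if pred != 0 and pred == truth:
                tp += 1  # edge recovered with the same sign
            elif pred != 0 and truth == 0:
                fp += 1  # edge reported but absent in the interaction network
            elif pred == 0 and truth != 0:
                fn += 1  # true interaction that was missed
            elif pred == 0 and truth == 0:
                tn += 1  # absent in both networks
            # ASSUMPTION: opposite-sign edges are not counted in any cell.

    def ratio(num, den):
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    truth = [[0, 1, -1, 0], [1, 0, 0, 1], [-1, 0, 0, 0], [0, 1, 0, 0]]
    found = [[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]]
    print(edge_metrics(truth, found))
```

Note that the F1 score uses only TP, FP, and FN, so it is insensitive to the number of true negatives, which typically dominate in sparse networks.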
The similarity of species had a large effect on network sensitivity (see Table 4). Though specificity remained high at similarities ranging from 89% to 90%, the sensitivity increased through this range with increasing similarity. Samples with relatively high similarity in species membership were therefore useful for constructing sensitive networks. Many real microbial communities have a lower percentage of shared taxa, but this is largely due to undersampling of rare species [59]. The F1-score, which reflects the balance between precision and recall in measuring species interaction or co-occurrence, consistently indicates strong performance throughout the day. At 6 a.m., the F1-score of 0.637 indicates a good balance between precision and recall. Because the F1-score is computed from true positives, false positives, and false negatives only, this means that the method balances the recovery of true interactions (recall) against the avoidance of spurious ones (precision) at this time; the identification of absent interactions is reflected in specificity rather than in the F1-score. At the other time points (10 a.m., 2 p.m., 6 p.m., 10 p.m., and 2 a.m.), the F1-scores range from 0.635 to 0.644. These values suggest that the employed method effectively identifies species interactions, with a particularly noteworthy performance during the nighttime hours at 2 a.m. Overall, the F1-score results highlight the method's robustness in assessing species interactions across different times of the day.

Effect of Interaction Intensity in the Communities
Once the co-occurrence networks between species are constructed, we investigate the community structure that these interactions generate. In each sampling instance, we identify microbial communities based on the interaction of the corresponding species. These interaction networks of communities evolve with each sampling, both in terms of the number of communities and the composition of these communities. This identification is carried out at seven taxonomic levels. The original dataset comprises eight taxonomic levels, as described in Table 3. The sampling time reveals preferences in the interactions among certain communities. For instance, some of the microbial communities tend to be more inclined to interact during the day, likely due to the increased presence of the 18S rRNA gene within their taxonomy. Tukey-Kramer tests were conducted in this sampling. All tests resulted in p < 0.001, rejecting the null hypothesis that the means of the compared communities are equal. The randomness test is performed on the degree distribution at all sampling points. In all of these, we find a p-value < 0.001, indicating that the biological network formation does not follow a random structure. The modularity test based on 1000 permutations yields a p-value < 0.001. This indicates that the formation of these communities is robust and the interactions are strongly cohesive at each sampling.
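A permutation test for modularity of the kind mentioned above can be sketched as follows. The sketch takes a community assignment as given (for example, one produced by a Louvain implementation) and compares its Newman modularity with the modularity obtained after randomly shuffling the community labels across nodes. The choice of label shuffling as the null model is an assumption; the text does not state which null model was used, and edge rewiring would be an equally plausible alternative.

```python
import random


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    Q = sum over communities c of [L_c / m - (d_c / (2 m))^2], where L_c is
    the number of within-community links and d_c the total degree of c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree, within = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            within[community[u]] = within.get(community[u], 0) + 1
    degree_sum = {}
    for node, d in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + d
    return sum(
        within.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


def permutation_p_value(edges, community, reps=1000, seed=0):
    """One-sided p-value of the observed Q against label-shuffled partitions."""
    rng = random.Random(seed)
    observed = modularity(edges, community)
    nodes = list(community)
    labels = [community[v] for v in nodes]
    hits = 0
    for _ in range(reps):
        rng.shuffle(labels)  # ASSUMPTION: null model keeps community sizes fixed
        if modularity(edges, dict(zip(nodes, labels))) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (reps + 1)


if __name__ == "__main__":
    # Two triangles joined by a single edge.
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    toy_comm = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
    print(modularity(toy_edges, toy_comm), permutation_p_value(toy_edges, toy_comm))
```

With only six nodes, the toy example cannot yield a very small p-value, because few distinct label arrangements exist; the networks analyzed in the paper are much larger.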

Effects of Parameters on Microbial Diversity
Microbial communities in different environments can vary widely in their composition and structure. Though the experimenter cannot necessarily influence ecological parameters, it is valuable to know which factors may cause problems in co-occurrence network inference. We considered the effect of species richness, community evenness, and similarity of communities across sampling sites.
Our analysis suggests that community evenness does not directly affect co-occurrence network sensitivity and specificity. However, it may have an indirect effect because uneven communities require increased sampling depth in order to detect the real species richness, and if this is inadequate, then the number of detected species (i.e., the effective richness) is reduced. The diversity of communities between different sites can be calculated via a variety of metrics [60]. We used a simple and intuitive metric to quantify the similarity of communities at different sampling sites: the average percentage of species shared between any two sites (i.e., the Jaccard similarity). The similarity of communities had a large effect on network sensitivity. The Jaccard index for all community networks is 0.017, indicating a dynamic configuration in the networks and thus in the microbial structure. This is of utmost importance due to the intrinsic biological complexity of genomic structure, considering that some of the taxonomic properties of the 18S rRNA gene are more expressive at certain times of the day.
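The similarity metric described above, the average share of species common to any two sites, can be sketched in a few lines. The function names and the toy species sets are hypothetical, and the convention for two empty sets is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two species sets."""
    a, b = set(a), set(b)
    union = a | b
    # ASSUMPTION: two empty sets are treated as identical (similarity 1).
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(samples):
    """Average Jaccard similarity over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    sites = [{"sp1", "sp2"}, {"sp2", "sp3"}, {"sp1", "sp2"}]
    print(mean_pairwise_jaccard(sites))
```

A value close to zero, such as the 0.017 reported above, means that on average only a small fraction of the taxa in any two samples is shared.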

Effect of the Non-Observable
The communities evaluated so far have not been in a steady state, which is representative of many complex communities [61,62]. Therefore, we investigated the ways in which the variability in unobservable site properties influences the inference of each network of communities. To achieve this, we introduced random variations in the carrying capacity of each species at each site, which can be interpreted as an introduction of between-site heterogeneity. This addition of inter-site heterogeneity, where each species has varying advantages, introduced "noise" to the dataset. Nevertheless, we mitigated the impact of the unobservable factors using Theorem 4.
Table 5 shows the variation of network statistics as the level of heterogeneity changes. We observed that the number of microbial communities varies depending on the time of day and the formula used for unobserved heterogeneity. At 6 a.m. and 10 p.m., the number of communities was lower when the formula $\frac{N-i}{N-1}\log(\log(N))$ was applied, which could indicate a higher cohesion among communities at those hours. In contrast, at 2 p.m., regardless of the formula, a constant number of communities was maintained, suggesting a more robust structure. Regarding the average node degree in the networks, there was no clear pattern of increase or decrease based on the time of day or the formulation of unobserved heterogeneity. The values fluctuated under all conditions, implying natural variability in microbial interactions. Finally, the density of the networks showed significant variations. For example, at 10 a.m. and 2 p.m., the density was relatively low, implying a lower proportion of possible connections in these networks. In contrast, at 6 p.m., a higher density was observed, suggesting greater interconnection among microbial species at that time.
Table 5. Effect of the non-observable on microbial communities.
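As an illustration of the heterogeneity specification mentioned above, the following sketch evaluates the node effects under the assumption that the formula reads $a_i = \frac{N-i}{N-1}\log(\log(N))$ for $i = 1, \ldots, N$. The function name `fixed_effects` is hypothetical, and only the log(log(N)) scaling named in the text is shown.

```python
import math


def fixed_effects(n):
    """Node effects a_i = (N - i) / (N - 1) * log(log(N)), i = 1, ..., N.

    ASSUMPTION: this reading of the formula is inferred from the text.
    Requires n >= 3 so that log(log(n)) is positive.
    """
    if n < 3:
        raise ValueError("n must be at least 3")
    scale = math.log(math.log(n))
    return [(n - i) / (n - 1) * scale for i in range(1, n + 1)]


if __name__ == "__main__":
    effects = fixed_effects(100)
    print(effects[0], effects[-1])
```

Under this reading, the effects decrease linearly from log(log(N)) for the first node to zero for the last, so the spread of unobserved heterogeneity grows slowly with the network size.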

Discussion and Conclusions
Motivated to explore the field of CNA, we study the introduction of a latent interaction index, addressing the limitations inherent in traditional compositional similarity indices, taking into account both observed and unobserved heterogeneity per node, particularly in the context of large and complex networks.
This index addresses a limitation in network formation, namely interdependence. The study of complex network formation in the presence of interdependencies is one of the focal points of recent theoretical and empirical research on networks [5,7,16]. However, with the exception of Graham's [7] and Dzemski's [5] models, none of these papers incorporate unobserved correlated heterogeneity within the modeling framework, unlike the approach used here. The results obtained through the development of this index (Theorems 1 and 2) demonstrate uniform consistency with respect to the homophily parameter vector and fixed effects. This assures us that the proposed index yields statistically replicable results, in line with the principles of the law of large numbers and its applicability across various domains [2,25].
Together with the above, we formulate a Shannon-type entropy measure to quantify network density. We further establish optimal bounds for this measure by utilizing insights from network topology. Additionally, we present asymptotic properties of pointwise estimation using this entropy function. This analytical approach allows us to scrutinize the compositional structural dynamics, offering valuable insights into the intricate interactions within the network. Here, it is important to note the relevance of the dyads contributing to this measure, as a consequence of both observed and unobserved factors. In contrast to some studied entropy measures that do not take these characteristics into account [23,27,29,34,36], it might be more useful and comprehensive for future research in various fields to conduct a deeper exploration of what other factors and dimensions could potentially influence the contributions of dyads in the network and, consequently, network entropy.
The results indicate that as network states become more predictable and interconnected, network entropy decreases. This decrease in entropy signifies a greater degree of structure and order within the network. Conversely, when network states exhibit greater independence and randomness, entropy increases, reflecting a more chaotic and less predictable network structure. These findings align with previous research on the interplay between network structure and entropy [13,14,63].
The application of the Shannon-type entropy function provides a robust measure for quantifying network complexity. By establishing optimal bounds for entropy estimation based on network topology, we ensure the accuracy of our analysis and enhance our ability to distinguish networks with varying complexity levels. This contributes to a more nuanced understanding of network dynamics and interactions. Simulations and comparisons with Erdös-Rényi and Barabási-Albert-type networks, in addition to the utilization of Erdös-Rényi and Shannon-type entropy, further validate the effectiveness of our proposed method. Our results demonstrate that the proposed index successfully distinguishes between networks with different degrees of complexity, even outperforming classical models in certain cases [18,19].
Despite the inherent complexity of microbiological data [40,61], our method offers a promising avenue for studying and comprehending the intricate relationships within these interaction networks and their implications under various parameter specifications.The ecological results presented here are currently under discussion with experts in the field.However, we acknowledge the possibility of simplifications and extensions of the model proposed here.
The theoretical results presented in this article allow us to formulate two statements. First, the interdependence structure in forming complex networks should not be independent of the objective parameters and unobservable node effects. This would enable researchers to discover causal relationships based on these parameters and the network formation itself, complementing some of the discussed network models [37,38]. Second, entropy measures on network structures could be more robust and consistent if only the dyads influencing their structure were considered. It is well known that biased estimates in entropy measures of networks arise from the influence of false dyads on the system [64]. The entropy metric presented here is based solely on the contributing dyads of the network.
In conclusion, this approach enables us to capture both observed and unobserved heterogeneity per node, providing a more comprehensive understanding of interactions within ecological communities and other intricate networks. The proposed latent interaction index proves to be an invaluable tool for characterizing the structural dynamics of networks. Additionally, it is feasible to design a test to evaluate interdependencies in link formation. It is more plausible to assume that these interdependencies establish a bounded degree between pairwise interactions [21,24]. The proposed model provides feasibility and evidence of how to incorporate these interdependencies, which in many cases are probabilities conditioned on triads (groups of three nodes) [5,7]. It is worth noting that these probabilities introduce a bias in the linkage decision [6]. While work has been extensive in reducing this bias in mono-nodal estimation [16,46,47], little is known about multi-nodal structures. This inherent uncertainty led to the introduction of the entropy function studied here. It possesses the property of reflecting parameter estimates as a function of the true parameters, meaning that the estimated entropy converges to the true entropy. This finding could be a valuable contribution to the challenge of multi-nodal estimates.
Proof of Theorem 1. From Hoeffding's inequality, $\forall \varepsilon > 0$, where $\kappa \in (0, 1)$ such that $p_{ij,t}(\theta, a_N) \in (\kappa, 1 - \kappa)$. Setting $\varepsilon = \ln(NT)$, we have Here, 2 exp

Proof of Theorem 2. We note that The last equation can be written as Via Theorem 4.2.2 from [48], we have

Proof of Theorem 3. Since $\hat{\theta}$ is based on contribution $l_{ij,t}^{\theta, a_N}$, which, by Theorem 2, converges in probability to a monotonic transformation of vector $\theta_0$, from Theorem 4.2.1 in [48], this implies that $\lim_{N,T \to +\infty} \hat{\theta} = \theta_0$ and, therefore, the variance of $\hat{\theta} - \theta_0$ goes to zero as $N, T$ approach positive infinity. We let $\delta > 0$ be a fixed small constant. Then, via Chebyshev's inequality, we have lim This means that the probability of $\hat{\theta}$ being within a small neighborhood of $\theta_0$ approaches 1 as $N, T$ become large. To determine the rate of convergence, we can express this difference as $\hat{\theta} - \theta_0 = O((NT)^{-1/2})$. This indicates that the convergence rate is at least by applying Hoeffding's inequality twice: First, we note that Since $X_{il,t}$, $X_{jl,t}$, and $\delta_{ij}(x_l)$ are bounded for all $l = 1, \ldots, N$, by mean value expansion of for all $\varepsilon > 0$, ∀θ, θ ∈ Θ. Now, from the Cauchy-Schwarz inequality, with $\sup_{X,t} X_{il,t} X_{jl,t} \theta = X(\theta)$.
On the other hand, Finally, by mean-value expansion for the logit distribution Λ, we have where in the last equality we use Equation (A2).
Proof of Theorem 5. By Theorem 2 (Law of Large Numbers) and Theorem 3, we have

Proof of Theorem 6. From Theorem 3 and the first-order condition associated with the concentrated log-likelihood, a mean value expansion yields where $s_{ijt,\theta}(\theta, a_N)$ denotes the $\{ij\}$th dyad's contribution to the score of the maximum likelihood estimator associated with vector θ. After applying the result for the Hessian of the concentrated log-likelihood derived immediately above, we obtain Tedious calculations, along with the calculations immediately above, produce Applying the Central Limit Theorem to the second addend, we have

Proof of Theorem 7. (i) Here, we note that allowing us to define a new vector over the parameters, i.e., $q_t = (q_{ij,t})_{i=1,\, j=i+1}^{N-1,\, N}$ as From this definition arises the fact that $0 \le q_{ij,t} \le 1$; then, Returning to the entropy with respect to vector $q_t$, we have

Here, $K(\cdot)$ is a kernel density function that gives appropriate weight to link $\{ij\}$, while $\sigma_n$ is a bandwidth that shrinks as $n$ increases. The asymptotic theory requires that the kernel density be chosen so that a number of regularity conditions can be satisfied. If $\Pr(X_{ij,t+1} = X_{ij,s+1}) > 0$ (e.g., discrete covariates or controlled experiments) and $X_{ij,t+1} - X_{ij,s+1}$ has sufficient variation conditional on $X_{ij,t+1} = X_{ij,s+1}$, then the $K(\cdot)$ function can be replaced by the indicator function $\mathbf{1}(X_{ij,t+1} - X_{ij,s+1} = 0)$, and the resulting estimator has the usual $(NT)^{-1/2}$ rate of convergence. However, if the regressors are continuous or have high dimensions, then the estimator, while still consistent and asymptotically normal, has a convergence rate slower than $(NT)^{-1/2}$. Also, this rate falls as the number of covariates increases.

Figure 1. Entropy function H(C) for values of N nodes. Here, $\beta_0$ is a scalar equal to 0.5 and $\alpha_0$ is a random vector of length 10 with a norm of less than 1.

Figure 2. Identifying interaction in communities in sampling networks.

Table 4. Metric values of the sample of the network.