Computationally Reconstructed Interactome of Bradyrhizobium diazoefficiens USDA110 Reveals Novel Functional Modules and Protein Hubs for Symbiotic Nitrogen Fixation

Symbiotic nitrogen fixation is an important part of the nitrogen biogeochemical cycles and the main nitrogen source of the biosphere. As a classical model system for symbiotic nitrogen fixation, rhizobium-legume systems have been studied elaborately for decades. Details about the molecular mechanisms of the communication and coordination between rhizobia and host plants is becoming clearer. For more systematic insights, there is an increasing demand for new studies integrating multiomics information. Here, we present a comprehensive computational framework integrating the reconstructed protein interactome of B. diazoefficiens USDA110 with its transcriptome and proteome data to study the complex protein-protein interaction (PPI) network involved in the symbiosis system. We reconstructed the interactome of B. diazoefficiens USDA110 by computational approaches. Based on the comparison of interactomes between B. diazoefficiens USDA110 and other rhizobia, we inferred that the slow growth of B. diazoefficiens USDA110 may be due to the requirement of more protein modifications, and we further identified 36 conserved functional PPI modules. Integrated with transcriptome and proteome data, interactomes representing free-living cell and symbiotic nitrogen-fixing (SNF) bacteroid were obtained. Based on the SNF interactome, a core-sub-PPI-network for symbiotic nitrogen fixation was determined and nine novel functional modules and eleven key protein hubs playing key roles in symbiosis were identified. The reconstructed interactome of B. diazoefficiens USDA110 may serve as a valuable reference for studying the mechanism underlying the SNF system of rhizobia and legumes.


Introduction
In most cases, rhizobia are a group of Gram-negative soil bacteria within the family Rhizobiaceae. They can colonize roots of legumes, establish symbiotic relationship with them and perform nitrogen fixation. This kind of symbiotic nitrogen-fixing (SNF) system can convert inorganic nitrogen to organic nitrogen and constitutes an important part of the biogeochemical cycle [1]. B. diazoefficiens USDA110, (formerly known as B. japonicum USDA110) [2], a strain of Bradyrhizobium, which is an important genus of rhizobia, can establish a symbiotic relationship with Glycine max (soybean). The G. max-B. diazoefficiens system has been an important model for the study of the SNF system [3,4]. In addition, given the efficient SNF ability of B. diazoefficiens USDA110 [5], it is widely used in agriculture and environmental engineering [6,7]. Owing to its importance, various biological characteristics of B. diazoefficiens have been extensively studied for decades, especially its SNF mechanism, such as the information exchange between B. diazoefficiens and G. max during the process

Domain-Based Method
In recent years, the domain-based method has been widely used in PPI prediction because the occurrence of PPI is usually caused by the interaction between one domain X of protein A and another domain Y of protein B, and the interaction between proteins A and B can be inferred if domains X and Y have verified interaction. Here, domaindomain interactions (DDIs) were obtained at both the sequence and structure levels. The PPIs in databases of BioGRID, DIP, IntAct and HPRD_Release9_062910 ( Figure 1) were utilized to infer DDIs. Besides, DDIs were also acquired from the three-dimensional interacting domains (3DID) database [47]. Protein domains in B. diazoefficiens USDA110 and in BioGRID, DIP, IntAct and HPRD_Release9_062910 databases were recognized and obtained based on the domain definitions of the Pfam database and HMMER program (score > 20, e-value < 1 × 10 −5 , coverage > 0.9) [48,49]. A part of DDIs were inferred from experimental PPIs in databases based on domain sequences with score > 0.5, and another part of DDIs were obtained from 3DID database based on the domain structure with score > 0. For the former part, if the proteins in a PPI pair were regarded to have domains based on protein sequences (score > 0.5) and the PPI had relevant experimental verification in at least one of these databases (BioGRID, DIP, IntAct and HPRD), the corresponding DDI was inferred. As shown in Figure 1, the intersection of these two parts, as well as the domains of score > 1 from PPI databases, and the domains of score > 2 from 3DID database, were used in order to improve accuracy and data coverage.

Quality Assessment of the Reconstructed PPI Network
First, the PPIs were verified by an iLoop server (http://sbi.imim.es/iLoops.php, 5 May 2017) which utilizes local structure features to characterize protein interactions [50,51]. In our study, 1000 PPIs were randomly selected and submitted to the iLoop server for validation. Second, considering the interacting proteins are likely to appear in the same subcellular localization, the PPIs were verified by subcellular colocalization of the proteins. Sequences of the 5638 proteins in the B. diazoefficiens USDA110 interactome were submitted to PSORT 3.0 (http://www.psort.org/psortb, 16 May 2017) [52] and the location information of each protein was obtained. Third, the two proteins in a PPI tend to have similar functions. Based on this point, the functional similarity of each interacting protein pair in the reconstructed PPI network, and the randomized PPI network of the same topology, were calculated according to a published algorithm G-SESAME [53]. The Gene Ontology (GO) function annotations of B. diazoefficiens USDA110 proteins were obtained from the GO database [54], and the Wilcoxon rank-sum test was performed to compare the functional similarity difference between the reconstructed PPI network and the random PPI network. Last, the correlation of transcriptional profiles between two genes from a PPI pair was also used to examine the reliability of the network. Here, the SNF subnetwork of the global network was checked based on time-sequenced transcriptome data [11]. The Pearson Correlation Coefficient (PCC) was calculated for the transcription profiles of every interacting pair in the reconstructed PPI network and the randomized PPI network of the same topology, and then Wilcoxon rank-sum test was used to show the difference. All of the methods used to verify the interactions are yes-no methods, meaning that they only give information on whether or not the two proteins have an interaction.

COG Function Enrichment Analysis of PPIs
We assorted PPI numbers according to the Clusters of Orthologous Groups of proteins (COG) annotations as the functional network. Referring to 1000 random networks of the same topology generated from the functional network, the COG functional network was presented in the form of heat map. Z-scores were calculated to determine colors in the heat map by the following formula: where A ij represents the actual number of interacting protein pairs between COG category i and category j, and < Rnd ij > and σ Rnd ij represent the mean and standard deviation of interacting protein pairs between category i and category j summarized from the 1000 random networks, respectively.

Reconstruction and Validation of the PPI Sub-Networks in FL and SNF Conditions
B. diazoefficiens USDA110 cells cultivated in aerobic peptone salts-yeast extract medium were considered as the FL condition. The processed microarray data of the FL B. diazoefficiens USDA110 were downloaded from NCBI-GEO database [55] (GSM210269-GSM210283). In order to reduce false positives, a gene was deemed as expressed only when more than 80% of the replicates were of p-value ≤ 0.06 [11]. Then, the average of the signals in selected replicates was regarded as the expression level [11]. The transcriptome and proteome data of bacteroids under the SNF state were obtained directly from the Supplementary Materials of a previous study [14]. The FL and SNF subnetworks were reconstructed by selecting PPIs whose components were expressed under the FL and SNF conditions, respectively, indicated by transcriptome and proteome data [11,14]. The transcription level difference of an interacting protein pair i and j was normalized according to the following formula [56]: where Signal i and Signal j are the transcription levels of gene i and gene j, respectively.

Reconstruction of the Genome-Scale PPI Network of B. diazoefficiens USDA110
The PPIs in B. diazoefficiens USDA110 were predicted by integration of an "Interolog" and a domain-based method ( Figure 1). The reconstructed PPI network was visualized by using Cytoscape 3.2.1 [57] (Figure S1A), which contains 5638 proteins and 60,839 PPIs. The Clusters of Orthologous Groups of proteins (COG) functional categories [58] of the nodes (proteins) in the PPI network were determined, and their proportions are shown in Figure S1B. Proteins related to 'amino acid transport and metabolism (E)' accounted for the largest proportion (over 12%), while the proteins associated with 'cell cycle control, cell division, chromosome partitioning (D)' only accounted for 0.5%.

Quality Assessment of the PPI Network
Reliability of the reconstructed PPI network was assessed from four different perspectives: local structural features, subcellular localization, functional similarities and transcriptional correlations of genes. For local structural features, 1000 PPIs were selected randomly and submitted to the iLoop server by which the interactions were predicted. Due to the limitation of available structural templates (in the server's dependent databases), 47.4% of the 1000 PPIs had at least one protein without corresponding structural features and thus could not be assessed. For the remaining PPIs, 46.4% were confirmed, whereas only 1% were identified as noninteracting pairs (Figure 2A), which proved that the reconstructed PPI network is reliable based on local structural features. represents the directly confirmed PPIs; "Non-interacting pairs" represents the directly non-confirmed PPIs; "No structural features" represents that no structural features (loop or domain) could be assigned to either one or both proteins; "No interaction signatures" represents that the pair of proteins had structural features but no interacting features. (B) Validated by subcellular co-localization. "Co-localized pairs" represents that the two interacting proteins were co-localized; "Nonco-localized pairs" represents that the two interacting proteins were not co-localized; "Pairs with unknown locations" represents that at least one protein in an interacting protein pair had no subcellular location information. (C) The difference of distribution of functional similarity between the reconstructed network and the randomized network of the same topology. Higher frequencies appear in the reconstructed network than the random network when functional similarity becomes larger. (D) The distribution difference of PCC of gene transcription profiles between the reconstructed network and the randomized network of the same topology. Higher frequencies appear in the reconstructed network than the random network when PCC of transcription profiles becomes larger.
Proteins that interact with each other are likely to appear in the same subcellular location. Here, we ascertained the subcellular location of the 5638 proteins in the reconstructed PPI network. According to the results, more than half of the PPIs (51.1%) were confirmed to be colocalized in the cell. Besides, over 25% PPI pairs had a component whose location was unknown ( Figure 2B), and for these PPI pairs some of them must be colocalized; therefore, the proportion of positive results was actually larger than 51.1%.
Previous studies have demonstrated that interacting proteins tend to have relatively higher functional similarities [38,59], which can be used to confirm the reliability of the reconstructed PPI network. Based on the semantic similarities of gene ontology (GO) annotations [54], the functional similarities of the proteins in PPI pairs of the reconstructed PPI network and corresponding random networks of the same topology were calculated and compared ( Figure 2C). This showed that the functional similarities in the reconstructed PPI network were significantly (p-value < 2.2 × 10 −16 from Wilcoxon rank-sum test) higher than those of the random networks, which proved that the reconstructed PPI network is reliable based on functional similarities of the proteins in PPI pairs.
It has been reported that interacting proteins have similar expression patterns [60]. Based on the temporal transcriptome profiles of the bacteroid of B. diazoefficiens USDA110 [11], the PCC of normalized transcription data of the proteins in PPI pairs of the reconstructed PPI network, and corresponding random networks with the same topology, were calculated and compared ( Figure 2D). The PCC values of the reconstructed PPI network were significantly (p-value < 2.2 × 10 −16 from Wilcoxon rank-sum test) higher than those of the random networks. When PCC is close to 1.0, the difference is more significant, which means that the reconstructed PPI network contained much more PPI pairs whose components kept fairly consistent transcriptional patterns than random networks. Hence, the reconstructed PPI network is reliable based on coexpression of genes.

Properties of the Reconstructed PPI Network
The topological parameters of the PPI network were calculated and analyzed with the "Network Analysis" plugin in Cytoscape [61]. Degree distribution of the PPI network of B. diazoefficiens USDA110 follows the power law (y = 2143.3x − 1.348 , R 2 = 0.76, Figure 3A), which shows the characteristic of scale-free network, such as many typical biological networks [62]. The number of subnetworks is 46 ( Figure S1A), and the largest sub-network contains 5556 proteins and 60,730 interactions. The node degree distribution is shown in Figure 3B. The clustering coefficient is always used to evaluate the modularity of a node in the PPI network by examining the connectivity among the neighbors of the current node. A higher clustering coefficient suggests that the protein is more likely to participate in a close-connected module. The distribution of the clustering coefficient of the PPI network had two peaks ( Figure 3C), suggesting that the topological parameters of series of small sub-networks were very different from those of the largest one [63]. The average shortest path length was around 3 and 4 ( Figure 3D), which is consistent with the results of previous studies [64].
In the power law fitting formula, the degree exponent γ of the reconstructed PPI network was calculated as 1.348 by the maximum likelihood estimate. Studies have shown that if the degree exponent is smaller than 2, the control of the whole network requires a small set of essential nodes, which could be identified by minimum dominating set (MDS) [23,65]. By solving an integer-based linear programming problem [65], a MDS of the B. diazoefficiens USDA110 PPI network was determined. In order to eliminate the bias caused by self-interactions, they were removed from the reconstructed PPI network, and the refined network (including 5611 proteins and 59,125 PPIs) was used to identify the MDS. The determined MDS of the PPI network contained 427 nodes (less than 10% of the total number). Furthermore, COG enrichment analysis was performed to examine the functional distribution of these essential nodes. As shown in Table S1, the proteins in MDS are significantly enriched (Fisher's exact test, p < 0.01) in 'Energy production and conversion (C)', 'Lipid transport and metabolism (I)', 'Posttranslational modification, protein turnover, chaperones (O)', 'Intracellular trafficking, secretion, and vesicular transport (U)' and 'Defense mechanisms (V)'.

Comparison of PPIs among Rhizobia
In the reconstructed PPI network of B. diazoefficiens USDA110, 5199 proteins with COG annotations (involving 55,668 PPIs) were selected and analyzed. A z-score index, which quantitatively reflects the degree of enrichment of PPIs in different combinations of COG categories, was defined and calculated [41,66]. Same calculations were performed on the PPI networks of M. loti (2377 PPIs among 1408 proteins) [39], S. meliloti (856 PPIs among 320 proteins) [67] and B. diazoefficiens USDA110 under the SNF state (10,473 PPIs among 1777 proteins). As shown in Figure 4, proteins of the same functional categories are more likely to interact with each other in B. diazoefficiens USDA110 and S. meliloti, while this feature is not obvious in M. loti. Moreover, the PPIs related to the proteins of category O (post translational modification, protein turnover, chaperones) are the mainstay in the PPI network of B. diazoefficiens USDA110, because this protein category interacts with almost all the other protein categories. In contrast, the enrichment of PPIs related to category O in M. loti and S. meliloti is not obvious. This result may indicate that the PPIs in rhizobia are correlated with growth rate, and the slow growth of B. diazoefficiens USDA110 requires more protein modification and repair that are performed by the proteins in category O. Another noticeable point is the interaction between protein categories J (translation, ribosomal structure and biogenesis) and U (intracellular trafficking, secretion, and vesicular transport), enriched in the B. diazoefficiens (SNF) and S. meliloti PPI networks, which are both subnetworks for symbiotic nitrogen fixation, but not so in B. diazoefficiens and M. loti, which are both global PPI networks. This result may suggest that protein biogenesis and cellular material transportation are correlated with each other through PPI in the SNF state in rhizobia. This enhances our understanding on the roles of material supply in the mechanism of symbiotic nitrogen fixation [68]. Common functional modules, such as signal pathways and protein complexes, are relatively conservative among different species [69,70]. An algorithm called GASOLINE is available for the comparison of biological networks to find conserved functional modules [71]. We used this algorithm to compare the PPI networks of several rhizobia for both the global network and the subnetwork under the SNF state. The global PPI networks of B. diazoefficiens USDA110 and M. loti had 26 pairs of conserved functional modules (Density threshold = 0.5, Sigma = 2) with a particularly high-quality score (Table S2). For example, the largest module with the highest score (module 25 in Table S2, six proteins with ISC score 0.81) included the hybrid sensors and regulators of the two-component system for chemotaxis. The protein interactions within this module and the correspondence relationships between the two compared species are shown in Figure 5A. Since these species have special requirements for swimming in soil [72], this module is considered to play a regulatory role in flagellum movement when rhizobia meet attractants in rhizosphere. As shown in Figure 5A, eight PPIs in this module match well between B. diazoefficiens USDA110 and M. loti. Considering the two additional PPIs (blr2194-bll1199, blr2192-bll1199) in B. diazoefficiens USDA110 and the correspondence relationship, it can be hypothesized that Q98HL7-Q98L88 and Q98PD0-Q98L88 in M. loti are also interacting protein pairs that might have been missed in the experiments in [39]. For the SNF state, 10 pairs of conserved functional modules were obtained (Density threshold = 0.7, Sigma = 3) by comparing the PPI networks of B. diazoefficiens USDA110 and S. meliloti (Table S3). For example, the module with the highest score (module 1 in Table S3, 4 proteins with ISC score 0.88) includes ubiquinol-cytochrome enzyme (complex III) subunit and cytochrome c oxidase (complex IV) subunit. The protein interactions within this module and the correspondence relationships between the two species are shown in Figure 5B. Since nitrogen fixation usually requires additional energy in terms of ATP and reductant, cytochrome c oxidase may produce the required ATPs to provide the extra energy consumption [73]. Bhargava et al. proved that the activity of cytochrome c oxidase was enhanced in the process of nitrogen fixation [74]. The existence of the above PPI module ( Figure 5B), identified from B. diazoefficiens USDA110 and S. meliloti, further validates the importance of cytochrome c oxidase in nitrogen fixation. Previous studies have shown that the process of electron transfer between complex III and complex IV needs cytochrome c, but the specific process is still unknown [75]. In our study, the subunit components of complex III (blr0151, blr0150; Q92Z14, Q92U26) and the subunit components of complex IV (blr1170, blr1175; Q92RH0, Q92RG5) all have interactions ( Figure 5B). Furthermore, in the 3DID database [47], the domains (cox1, cox2, cox3) of subunits in complex III and complex IV were identified to be interacting. Based on this evidence, it can be inferred that complex III and complex IV may have direct interaction under some states to improve the efficiency of electron transfer and release more energy for nitrogen fixation.
In addition, with the reconstructed network of B. diazoefficiens USDA110, by network comparison with M. loti using GASOLINE and by consulting GO and KEGG databases, the functions of proteins belonging to COG category S ("function unknown") were inferred (Table S4), which is a direct application of our reconstructed PPI network of B. diazoefficiens USDA110.

Analysis and Comparison of the PPI Networks under FL and SNF States
As the type strain of Bradyrhizobium rhizobia, B. diazoefficiens USDA110 has two typical physiological states: the FL cell and the SNF bacteroid. Therefore, it is meaningful to investigate the differences and similarities of the reconstructed PPI networks in these two states. Subnetworks representing the FL and SNF states were obtained by integrating transcriptome and proteome data [11,14] into the global PPI network of B. diazoefficiens USDA110. The subnetwork of the FL state contains 3650 proteins and 31,541 PPIs, while that of the SNF state contains 1777 proteins and 10,473 PPIs. In order to compare the topological differences between these two networks, five local metrics were calculated for each node. Interestingly, although the proportion of common proteins in the FL and SNF networks exceeds 80% of the SNF network, all the topological metrics are significantly different. The results ( Table 1) show that the median of nodes' degrees is larger for the FL network due to its larger size, and that all the other metrics suggest the SNF network is more compact than the FL network, similar to the situation of the metabolic network of B. diazoefficiens USDA110 [76]. Previous studies have shown that the two components of interacting protein pairs in the PPI network should have similar transcription levels [77], which may reflect the trade-off between the pairing efficiency and synthesis cost. Therefore, the normalized transcription level difference between the two components of each protein pair in the PPI network was calculated and analyzed based on the transcriptome data of B. diazoefficiens USDA110 [11,14,56]. First, the distribution of normalized transcription difference of the protein pairs in FL and SNF networks (real PPIs, the "PPIs" group) was compared with that of all pairwise combinations of proteins in the corresponding network (all possible protein pairs, the "Control" group), respectively. As shown in Figure 6A, the distribution of the "PPIs" group is lower than that of "Control" group in both the FL and SNF states (FL: p-value < 2.2 × 10 −16 , median: 0.40 vs. 0.42, n: 31,541 vs. 13,322,500; SNF: p-value = 3.42 × 10 −4 , median: 0.37 vs. 0.38, n: 10,473 vs. 3,157,729; compared by onesided Wilcoxon rank sum test, α = 0.01), which means that the transcription levels of the interacting proteins are more consistent. Furthermore, similar comparisons were performed in each COG group in the FL and SNF networks. The results showed that for the protein pairs belonging to the COG groups of C (energy production and conversion) and M (cell wall/membrane/envelope biogenesis), the normalized transcription differences of the FL state were significantly less than those of the SNF state, while for the COG groups of G (carbohydrate transport and metabolism), J (translation, ribosomal structure and biogenesis) and Q (secondary metabolites biosynthesis, transport and catabolism), the trend was the opposite ( Figure 6B, Table S5).

Analysis of the Core-Sub-PPI-Network for Symbiotic Nitrogen Fixation
Symbiotic nitrogen fixation is a complex physiological process involving many kinds of perfectly cooperated molecular machines. Functional genes related to symbiotic nitrogen fixation such as nod, nif and fix, and their regulation, have been studied intensively in the field of nitrogen-fixing bacteria, but their relationships with other physiological activities and relevant coordination are not very clear. Based on the reconstructed PPI network, functional modules of the core-sub-network related to SNF were identified and their cooperating relationships were analyzed.
First, a dataset of 128 SNF-associated proteins in B. diazoefficiens USDA110 was determined by genome annotation and literature mining (Table S6). Based on these SNFassociated proteins, corresponding PPIs were filtered out and the SNF core-sub-network (SCSNW) was determined, which contains 195 proteins and 441 PPIs (Supplementary Materials). Using a network clustering tool based on the Markov chain clustering (MCL) algorithm in the clusterMaker plug-in for Cytoscape, called MCL Cluster [78,79], a series of network modules of the SCSNW were obtained. As shown in Figure S2, most of the SNF associated-proteins are hubs of modules, which validates that the modules identified by MCL clustering are functionally meaningful. Then, nine key modules were selected and their functions were deduced by gene annotations from different databases such as KEGG and UniProtKB, as well as by literature mining. As shown in Table 2, all of the modules are associated with different aspects of the phenotype of SNF bacteroid. Furthermore, the whole nitrogen fixing system also depends on the coordination of the modules which can be achieved through the PPIs between their components. In addition to these PPIs, functional modules can also be connected through other hubs, which can be named 'Tie of Modules' (TOM). Hubs with specific properties ( Figure S3) were selected as TOMs. Through the 11 TOMs, more diversified coordination could be established between functional mod-ules ( Figure 7A). Then the functions of PPIs binding TOMs and functional modules were deduced by genome annotation and literature mining (Table S7).  As shown in Figure 7B, we found protein NwsA, a two-component hybrid sensor and regulator coded by gene blr4773, connects three functional modules (m2, m4, m6) and interacts with several key hubs of regulatory cascades in the SNF bacteroid. As to these regulators, NodV and NodW regulate the expression of other nod genes (such as nodYABC) that initiate nodulation [80,81], the FixLJ-FixK2-FixK1 cascade activates genes essential for microoxic respiration in symbiosis (such as fixNOPQ and fixGHIS) and further regulatory genes for nitrogen fixation (rpoN1, nnrR, and fixK1) [82][83][84], NifA controls expression of nitrogen fixation genes (such as nif HDK) [83,85] and NtrC partly controls the regulation of genes related to nitrogen metabolism (such as ginII) [86,87]. Studies have reported that cross talk between the FixLJ-FixK2-FixK1 cascade and RegSR-NifA cascade might allow switching between different expression patterns of genes essential for N 2 fixation [83,88]. Therefore, NwsA might play a critical role in the cross talk, and probably performs an important coordination function between nodulation, nitrogen fixation and nitrogen assimilation in the symbiotic nitrogen fixation system. This waits for experimental validation. Just like NwsA, other TOMs might also be important for symbiotic nitrogen fixation through interactions with related proteins, although further information needs to be acquired.

Conclusions
In this work, the protein interactome (60,839 PPIs among 5638 proteins) of B. diazoefficiens USDA110 was computationally reconstructed by combining Interolog and domainbased methods, and its reliability was validated from four perspectives. The reconstructed interactome of B. diazoefficiens USDA110 was compared with those of other rhizobia (M. loti and S. meliloti) and it was inferred that the slow growth of B. diazoefficiens USDA110 may require more protein modification and repair via protein-protein interaction. By network comparison, 36 conserved functional modules were identified, and the functions of related proteins were annotated. By integrating the reconstructed interactome with transcriptome and proteome data, the subnetworks representing FL and SNF states of B. diazoefficiens USDA110 were derived. Based on the SNF subnetwork and the SNF-associated proteins, nine novel functional modules and eleven protein hubs (named TOMs which connect functional modules) were identified and analyzed to further our understanding of the molecular mechanism of symbiotic nitrogen fixation.  The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. We thank Bao-Hai Hao and Youguo Li for valuable discussion.

Conflicts of Interest:
The authors declare no conflict of interest.