Genome-wide approaches to identify genes that are differentially expressed under different conditions of interest have become a standard approach to investigating mechanisms involved in biological processes. The analysis pipeline used in such studies generally leads quickly to some form of gene ontology analysis, in order to identify biological functions that are associated with the differentially regulated genes. A complementary approach would be to analyse differentially regulated genes in relation to predicted conformational properties of the proteins they encode but such an approach has not been reported.
Characterisation of differentially regulated genes in relation to predicted or known conformational properties of the proteins they encode would be of interest in the light of recent discoveries showing overall relationships between conformational properties and different types of protein functionality or mechanism of action [1
]. For example, the catalytic domains of enzymes are generally ordered globular conformations while transcription factors are characterised by a preponderance of intrinsic disorder leading to ensembles of many alternative conformational forms [2
]. It is now clear that about half the proteins in eukaryotes contain at least one extended (>30 amino acid residues) intrinsically disordered region (IDR) and some proteins are completely disordered [3
Interestingly, IDRs occur more frequently in regulatory proteins and disease-related proteins [4
]. We recently identified genes that are differentially expressed in mantle cell lymphoma (MCL) cells that adhere to stromal cells with which they are co-cultured compared to non-adherent MCL cells in the same culture [6
]. The differentially regulated gene set defined in this in vitro model system showed substantial overlap with genes that are differentially regulated in the lymph node microenvironment of MCL and chronic lymphoblastic leukaemia (CLL) patients. Retention of lymphoma cells in microenvironments is thought to lead to minimal residual disease, in which a subpopulation of cancer cells receives survival signals from normal cells in microenvironments, thus allowing them to survive during treatment and to subsequently cause disease relapse. In vitro, minimal residual disease is mimicked by cell adhesion mediated drug resistance whereby, for example, lymphoma cells residing in close proximity to stromal cells manifest an enhanced level of resistance to cytostatic drugs [7
]. Thus, differentially regulated genes in co-cultured adherent lymphoma cells are likely to represent processes important for cell adhesion mediated drug resistance and minimal residual disease.
In our recent study, we identified 1050 genes that were differentially regulated in MCL cells adhered to stromal cells compared to non-adherent MCL cells in the same co-culture. The four main functional themes characterised by the differentially regulated gene set were cell adhesion, anti-apoptosis and B-cell signalling/immune-modulation, associated with up-regulated genes in adherent cells, as well as early mitotic processes, associated with down-regulated genes [6
]. Here we test whether the differentially regulated gene set or its subsets encode proteins that differ in IDR properties compared to non-regulated genes.
To determine whether there might be a difference in the frequency of IDRs (defined as predicted IDRs ≥ 30 amino acid residues in length) in proteins encoded by adhesion-regulated genes (adsu, n
= 1009) compared to other proteins (nadsu, n
= 17,612), we calculated the percentage of IDRs in adhesion-related proteins for each IDR predictor (Figure 1
, blue line) and compared it to the proportion of genes in the adhesion-regulated gene set (5.4%, Figure 1
, red line). For all predictors, the proportion of predicted IDRs associated with the adhesion gene set exceeded the frequency expected based on the proportion of proteins in the set. For many predictors, Figure 1
also shows a tendency towards a larger number of longer IDRs in proteins encoded by the adhesion gene set at the expense of shorter IDRs.
To determine whether the enhanced frequency of IDRs in proteins encoded by adhesion-regulated genes was significant, we used a resampling approach to test whether the IDR frequency associated with the 1009 adhesion-regulated genes lay outside the distribution of frequencies generated by 1000-fold resampling of 1009 genes from the control gene set (n
= 17,612). A z
-score and associated p
-value was generated for data from each predictor. As shown in Table 1
(adsu vs. nadsu), the enrichment of IDRs in proteins encoded by adhesion-regulated genes (adsu) was significant for all predictors.
Next, we tested whether the enrichment of IDRs associated with the adhesion-regulated gene set could be ascribed to subsets of the adhesion-regulated genes. Comparison of proteins encoded by genes manifesting a greater degree of regulation (fold change ≥ 1.3) relative to the remaining regulated genes showed fewer IDRs in more highly regulated genes compared to less highly regulated genes for all predictors and with lower levels of significance compared to the comparison of regulated and non-regulated genes (data not shown). Thus, there is an enrichment of IDRs in adhesion-regulated genes but the enrichment is not related to the extent of their regulation. Comparison of the up-regulated subset (adsu_up, change >1) relative to the down-regulated subset (adsu_down, change <1), on the other hand, showed an enhanced enrichment of IDRs in proteins encoded by the adsu_down subset compared to the enhancement levels in Figure 1
, with high levels of significance (Table 1
, adsu_down vs. adsu_up). Thus, the enrichment in IDRs in proteins encoded by adhesion-regulated genes is mainly associated with proteins encoded by down-regulated genes.
We next investigated whether the length of IDRs in proteins encoded by adsu genes tends to be longer than in other proteins (nadsu). As expected, IDR length is not normally distributed, as indicated by the consistently higher value of the mean compared to the median (Table 2
), as well as tests of normality (data not shown). Thus, a Mann–Whitney test was used to test the significance of differences in IDR length between groups. Table 2
shows that some predictors (notably PV2, PrDOS and VSL2b) predict longer IDRs in proteins encoded by adsu genes that in other proteins (nadsu), but for other predictors the difference is less significant or lacking in statistical support. IDRs encoded by adsu_down genes were significantly longer than IDRs encoded by adsu_up genes for all predictors. Thus, IDRs in proteins encoded by genes that are down-regulated in adherent cells tend to be both more frequent and longer than IDRs in other proteins.
We next addressed how IDRs are distributed among the proteins encoded by the adsu_down gene set in relation to proteins associated with the adsu_up and nadsu gene sets (Table 3
The proportion of completely disordered proteins was higher for the adsu_down sets than for proteins encoded by the other gene sets, as was the proportion of proteins containing at least one IDR. The median proportion of the protein sequences that were predicted as IDR was higher for the adsu_down group, irrespective of whether all proteins were considered or only proteins containing IDRs. Table 3
shows data for the VSL2b predictor but other predictors generally produced a similar result, especially PrDOS and PV2. The frequency of IDR-containing proteins with different IDR proportions for the different gene sets is compared graphically in Figure 2
For proteins encoded by nadsu and adsu_up, the relative frequency declines progressively as the proportion of IDR per protein increases. Contrastingly, a more even distribution of relative frequencies is seen for adsu_down proteins, with relatively fewer low-IDR content proteins and an increased proportion of high-IDR content proteins. Interestingly, the protein length-normalized number of IDRs per protein is somewhat lower for proteins encoded by adsu_down genes, compared to adsu_up and nadsu genes (Figure 2
B). Thus, the greater IDR content of adsu_down encoded genes tends to be associated with fewer and longer IDRs when only IDR-containing proteins are analyzed.
To further investigate differences in IDR lengths between groups, we plotted the length of the longest IDR in each protein as a function of protein length to compare adsu_down and adsu_up encoded proteins (Figure 3
adsu_down encoded proteins are characterized by both longer protein length and longer length of the longest IDR (VSL2b). There are 14 adsu_down encoded proteins with IDRs longer than 1000 residues and these are also among proteins with the longest IDRs for most other predictors (notably PrDOS and PV2). The IDR score profiles for the 6 proteins that are reproducibly found in the top 14 proteins with longest IDRs by the VSL2b, PV2 and PrDOS predictors (red text in Figure 3
A) are shown in Figure 4
Consistent with Figure 3
A, most of the proteins are predicted to be disordered throughout most of their length. Some contain extended regions with close to maximal intrinsic disorder scores (e.g., ZC3H13), while others are characterized by fluctuating levels of intrinsic disorder (e.g., MKI67). Some proteins contain both patterns in different regions of the protein (e.g., BOD1L1). Many of the proteins have short regions that are predicted to be ordered and that could correspond to folded protein domains. The different types of predicted conformation could inform about mechanisms involved in the function of proteins encoded by down-regulated genes in relation to up-regulated genes (see Discussion).
The main finding of this work is that proteins encoded by genes that are down-regulated in lymphoma cells upon adhering to stromal cells, typically found in microenvironments that increase cancer-cell survival, tend to have more frequent and longer regions of predicted intrinsically disordered conformation than proteins encoded by up-regulated genes or other expressed genes in the same cells. Our previous work has shown that many proteins encoded by down-regulated genes in adherent cells are involved in early stages of mitosis [6
]. The present results complement this observation by suggesting that proteins encoded by the down-regulated gene set tend to function by mechanisms that are associated with intrinsically disordered regions. A secondary finding is that many of the proteins encoded by down-regulated genes are larger than proteins encoded by up-regulated genes.
Intrinsically disordered protein regions can be broadly divided into regions that are always disordered and disordered regions that form one or more ordered conformations in particular molecular environments, such as during coupled binding and folding interactions with partner proteins [8
]. Some IDRs have been shown to bind partners in the disordered state via multi-valent interactions, mediated by short linear motifs that are distributed along the length of the IDR [5
]. However, IDRs have other functions in addition to interaction with partners. One such function is mediation of phase transitions in cells that allow for compartmentalization of cellular regions in so-called “membrane-less organelles” that include nucleoli, nuclear speckles, P-bodies and chromatin [12
]. These kinds of functional mechanisms might be associated with the IDRs that have consistently close-to-maximal prediction scores over extended regions of proteins encoded by down-regulated genes, as exemplified by some of the proteins in Figure 3
and Figure 4
The clearest example of a protein that is predicted to be maximally disordered throughout most of the protein sequence is ZC3H13. Interestingly, ZC3H13 is part of the WTAP complex, which is involved in RNA splicing and processing and is localized in nuclear speckles [17
]. It is likely that such speckles result from phase transition processes and it is possible that the disordered region of ZC3H13 is important for speckle formation or ZC3H13 localization to the speckle. In fact, many documented types of so-called proteinaceous membrane-less organelles are located in the nucleus and include chromatin in addition to nuclear speckles, nucleoli and many other bodies [14
]. The MDC1 protein (Figure 3
and Figure 4
) contains a central region predicted to be completely disordered, flanked by less disordered/structured regions, which are known to mediate binding to several partner proteins at chromatin regions containing double-stranded DNA breaks [18
]. Thus, MDC1 has been regarded as a “scaffold” protein responsible for spreading of DNA-repair factors over the damaged chromatin region and it is tempting to speculate that the central disordered region could play a role in phase-transitions. Other proteins in Figure 3
and Figure 4
that have extensive regions predicted to be completely disordered and that work in a chromatin environment are YLPM1, involved in regulating telomerase activity, and BOD1L1, a protein that protects stalled DNA replication forks.
MKI67 is predicted to be disordered (with varying score) throughout almost its entire length (see Figure 4
). Interestingly, MKI67 orchestrates formation of the perichromosomal layer, which coats the condensed chromosomes during mitosis in order to prevent chromosome aggregation [19
]. In mitotic mammalian cells, the nuclear membrane and nucleolus are broken down and nucleolar proteins including the known phase-transition proteins, Nucleophosmin and Fibrillarin, that drive nucleolus formation in interphase cells [20
], are also found in the mitotic perichromosomal layer. This fact, taken together with the RNA-binding activity associated with MKI67, suggests that the perichromosomal layer may be formed by phase transition phenomena. Interestingly, higher expression of MKI67 is a negative prognostic marker for MCL patients [21
In the IDR class that conditionally adopts ordered conformations in some molecular contexts, the ordered conformations are characterized by varying degrees of “fuzziness”, defined as the existence of a heterogeneous range of ordered conformations in the context of, for example, interaction with a single partner [22
]. Many proteins that conditionally adopt ordered conformations contain pre-structure motifs (PreSMos), defined as short protein regions within IDRs that have a weak propensity for secondary structure formation leading to formation of unstable secondary structure elements in a minority sub-population of IDR-containing proteins [23
]. PreSMos become stabilized during coupled binding and folding, and form part of the folded protein conformation that is seen in complexes with partner proteins. Protein regions encoded by down-regulated genes that show alternating sub-regions of higher and lower intrinsic disorder scores might correspond to these kinds of IDR since the short regions with lower intrinsic disorder scores may represent PreSMos. The CENPE and CENPF proteins are characterized by disordered regions interspersed with regions with lower disorder scores that could represent regions containing PreSMos. This would be consistent with the multiple interactions made by these proteins within the kinetochore structure that binds to the centromeric chromatin of chromosomes during mitosis. TNRC6A is a member of the GW182 family of scaffold proteins that are important for organization of proteins needed for RNA-mediated gene silencing and are found in P-bodies that are formed by a phase transition process [24
Although somewhat speculative, the preceding sections suggest mechanisms by which some of the large proteins with large amounts of intrinsic disorder might contribute the propagation of lymphoma cells in suspension as well as how their down-regulation could lead to reduced proliferation of lymphoma cells adhered to stromal cells. Reduced proliferation is known to increase the survival of cancer cells during chemotherapy, which primarily targets proliferating cells [7
]. Further, the cell cycle arrest that occurs in adherent MCL cells [26
] would be expected to reduce the need for apoptotic responses and we previously showed that adherence to stromal cells is associated with up-regulation of anti-apoptotic genes [6
We have shown that predicted intrinsic disorder can be used to interrogate proteins encoded by transcriptome data and that identification of gene sets encoding proteins with characteristic predicted disorder properties can provide information relevant for understanding the mechanisms underlying the functionality of groups of proteins. This approach complements the commonly used gene ontology analysis approach, which primarily gives information about the cellular components or processes that are characteristic for the function of protein sets. Both approaches provide information that can be used for hypothesis building and the design of further experiments.
In this work, we have only analyzed predicted protein disorder as a conformational characteristic. There are other predictors that could be used to expand the approach in the future and new predictors are continuously being developed as more is learned about how protein functionality is coupled to the conformational flexibility of proteins. Examples are the s2D predictor [27
], which predicts secondary structure elements in relation to random coil regions, and Dynamine [28
], which predicts the rigidity of the peptide backbone throughout protein sequences, as well as the ANCHOR [29
] and MoRFpred [30
] predictors, which predict protein interaction sites. More recently developed predictors include prediction of protein regions involved in phase transitions [31
], prediction of decomposed residue-by-residue solvation free energy [32
] and prediction of residue-by-residue compactness/secondary structure [33
]. Thus, it is easy to see that a battery of predictors could be used to reveal many different conformational aspects of protein sets encoded by groups of differentially regulated genes identified in transcriptome data. Databases like the Database of Disordered Protein Prediction (D2
] or the more recently developed MobiDB [35
], which contain collections of prediction data from different sources, will be useful tools for this purpose.