Understanding Time-Evolving Citation Dynamics across Fields of Sciences

Scholarly publications draw collective attention beyond disciplines, leading to highly skewed citation distributions in sciences. Uncovering the mechanisms of such disparate popularity is very challenging, since a wide spectrum of research fields are not only interacting and influencing one another but also time-evolving. Accordingly, this study aims to understand citation dynamics across STEM fields in terms of latent affinity and novelty decay, which is based upon Bayesian inference and learning of the Affinity Poisson Process model (APP) with bibliography data from the Web of Science database. The approaches shown in the study can shed light on predicting and interpreting popularity dynamics in diverse application domains, by considering the effect of time-varying subgroup interactions on diffusion processes.


Introduction
Individual information items compete for our attention and generate cascades of varied sizes via information diffusion processes. Forecasting popularity growth over time is significant in a variety of application domains such as online social networking, e-commerce, marketing, risk management, and public policy, in order to establish timely strategies and exercise efficient control [1]. Accordingly, there have been attempts to predict an individual item's future diffusion trend among diverse communities in online social media [2][3][4][5], academia [6,7], or a nation [8]. However, despite recent advancements in popularity prediction, most prior work has neglected the effects of subgroup interactions on diffusion processes. That is, different social groups interact and exert disproportionate influences on each item's popularity with their own sets of interests and motives [9]. For instance, a scholarly paper's citation volume is not only influenced by its own field's attention but also dependent on time-evolving interrelationships with other fields in science, due in part to interdisciplinary collaborations, research funds, and scientific movements. In this context, our prior work [9] proposed the Affinity Poisson Process (APP) as a general framework for modeling popularity dynamics across subpopulations in a complex social system.
This study aims to show how we can interpret the rich context of popularity disparity in scholarly publications and to understand time-evolving citation dynamics across fields of sciences, based upon Bayesian inference and learning of the APP. For interpretation, three important counter-balancing factors are investigated: (1) latent affinity between different research communities (fields), considering the effect of intra- and inter-field interactions on the popularity growth of publications, (2) heterogeneous preferential attachment reflecting different cumulative popularity within each research field, and (3) field-level time decay capturing fading attention to publications, varying from field to field. Such rich context, attributed to subpopulation-level affinity, makes the interpretation applicable to a broad range of real-world diffusion scenarios.
For this study, interdisciplinary citation volumes of individual journals are predicted with the APP and two baselines, using bibliography data from the Web of Science database [10]. This data covers 108 subjects in STEM fields during the two decades between 1991 and 2011. For macro-level analysis, the 108 subjects are grouped into higher-level research subfields and fields, by referring to the classifications of academic programs by the National Research Council (NRC) [11]. The target NRC fields are: (1) Agricultural Sciences (AS), (2) Biological & Health Sciences (BH), (3) Engineering (EG), and (4) Physical & Mathematical Sciences (PM). In experiments on real data, the APP reduces prediction error by 15% and 27% relative to the two baselines. Based on parameter estimation, interdisciplinary citation flow in STEM fields is examined in accordance with the counter-balancing factors. In general, the four NRC fields become more interdisciplinary over time, but with time-varying intra- and inter-field affinity. In terms of novelty decay, the EG, PM, AS, and BH fields are aging in that order. The AS and PM fields exhibit slower decay of earlier publications, while the EG field shows the opposite trend, i.e., faster decay of earlier publications. In particular, the BH field shows consistent aging across publication years.
To the best of the author's knowledge, the present work is the first to incorporate the effect of subfield interactions on the interdisciplinary popularity growth of publications across fields of sciences, which has been neglected in previous studies. This study can help reveal attention-space dynamics across subpopulations, applicable to a wide range of diffusion scenarios in the real world.
In the rest of this paper, Section 2 begins with the reviews of related work. Section 3 explains the background of our proposed framework for diffusion processes across a heterogeneous social system. Section 4 conducts experiments on real data for predicting interdisciplinary citation volumes of individual journals. Section 5 interprets citation dynamics across fields of sciences based on parameter estimation, and finally Section 6 concludes this study with future directions.

Related Work
The explosive growth of scientific papers makes it challenging to keep track of all relevant publications. The consequent selective attention not only decays differently across sciences [12], but also leads to heavy-tailed distributions of citation volumes [13] with core-periphery linkage structures in the literature [14]. As collaboration increasingly plays a crucial role [12,15], its patterns have been investigated from diverse angles, such as distinct modes in the distribution of collaborators [16], the growth of interorganizational collaboration and its role in driving field evolution [17], and multiuniversity research teams beyond geographical and disciplinary boundaries [18]. In the context of such skewed popularity and boundless collaborations in sciences, this study focuses on understanding interdisciplinary citation flow in STEM fields, mainly based upon (1) the estimation of latent influence on knowledge propagation and (2) the prediction of citation volumes.
In terms of latent influence, inferring the propagation trajectories of individual items has been one of the essential topics in diffusion studies. For example, causal relationships are estimated by information-theoretic measures at micro [19] and macro levels [20]. Also, both external and internal influences have been quantified to infer their effect on the diffusion of Web posts [21,22] or scholarly papers [23]. However, most of the previous studies need prior knowledge of current social network structures, which increases dependency on data and thus limits applicability to real-world scenarios.
On the other hand, predicting the popularity of individual items helps understand the underlying diffusion process, i.e., how it has driven drastic popularity disparity [24]. For understanding the dynamics of collective attention, the spread of information has been considered as a point process for modeling random events in time and predicting its popularity, based upon Poisson processes [9,22,25] and Hawkes processes [2,8,26]. As a generative probabilistic model, a point process can be easily incorporated into the Bayesian framework to account for Bayes factors and model selection [27,28] via filtering theory [29], or for composite factors [4,6,7,30] such as time relaxation and cumulative attention space. Under the Bayesian framework, all these point process approaches not only improve prediction power but also remove the dependency on domain-specific knowledge by considering the lasting impact of an individual item [1].
However, they all ignore the effect of interactions between subgroups on diffusion processes, which results in inaccurate predictions and insufficient context of popularity distributions. In this study, both perspectives are covered based upon our prior work [9] on generative temporal processes, without dependency on social network structures but with disproportionate influence of different subgroups as a latent factor.

Background
A fundamental assumption of citation flow is that most bodies of information are organized into categories or fields of varying levels of concreteness and abstraction. In academia, information is situated in abstract areas like physics, chemistry, and computer science. These areas are then composed of research fields, such as when computer science entails artificial intelligence and software engineering. In particular, this study targets publications in STEM fields for understanding interdisciplinary citation dynamics across major branches of sciences.

Interdisciplinary Citation Flow
Academic publications are increasingly cited across the borders of various disciplines, which implies that research is becoming more interdisciplinary and collaborative between scholars from different scientific fields [1,9,31]. Accordingly, Figure 1 illustrates the concept of underlying interdisciplinary citations across diverse fields of sciences. As the figure shows, an individual paper p_i is cited by publications from its own and/or distinct fields, which is affected by time-evolving intrafield attention as well as interfield affinity. In this regard, our framework [9], the Affinity Poisson Process (APP), has been proposed to model diffusion processes across a heterogeneous social system and provide interpretable insights by incorporating latent affinity between subpopulations. This model has shown high performance in predicting citation volumes of publications in computer science, reducing the error by an additional 50% compared to the state-of-the-art baselines. That is mainly because the proposed model considers the effect of subgroup interactions on the popularity growth of individual items. In the next section, the main idea of the APP is introduced.
Figure 2a shows an example in which a paper p_1 accumulates citations over time, where each citation can be considered as the arrival of a citing paper at its publication time. Prior work [6] modeled such arrivals as a Poisson process with one intensity function. That is, paper citations are treated as indistinguishable and homogeneous regardless of a citing paper's research field. However, as research becomes more interdisciplinary, an individual paper is likely cited by both internal and external publications, from the same and other research fields respectively. Thus, the single horizontal timeline in Figure 2a cannot differentiate the varying citation intensities generated by numerous fields over time.
In other words, the prior work has neglected disproportionate influences of different fields on individual papers' citations, exposing limitations of understanding interdisciplinary citation dynamics across fields of sciences.

Affinity Poisson Process
(a) Paper citations as an arrival process (b) Interdisciplinary citation framework as multiple arrival processes
Figure 2. Citations across different fields. (a) Each square p_i represents the i-th paper in its field (specified at top and color-coded). The paper p_1 in the cited field f_1 receives citations from multiple citing fields (f_1, f_2, and f_3) over time. Here, t_n (n = 1, ..., 9) denotes the publication time of the n-th citing paper. (b) The citations are decomposed into different timelines according to the citing fields. As the superposition of Poisson processes is also a Poisson process, this is a generalization of the citation arrival process in (a) by considering interdisciplinary citation flow.
In this respect, a new framework has been proposed to model popularity dynamics by incorporating the heterogeneous nature of a social system consisting of subgroups interacting with one another (i.e., intra- and interfield interactions) and to view paper citations as the superposition of multiple Poisson processes for different citing fields [9]. As shown in Figure 2b, the paper citations are now decomposed into different timelines depending on the research fields of the citing papers, and each timeline is modeled as an independent Poisson process. Since the superposition of Poisson processes is still a Poisson process [32], the new framework extends the prior work [6] from a homogeneous citation process to cross-field popularity dynamics.
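The superposition property used above can be illustrated with a minimal simulation sketch. It assumes constant (homogeneous) per-field rates purely for simplicity, whereas the APP intensities are time-varying; the field names, rate values, and function names are hypothetical and not taken from the paper.

```python
import random


def simulate_poisson_arrivals(rate, horizon, rng):
    """Arrival times of a homogeneous Poisson process on [0, horizon]."""
    times = []
    t = rng.expovariate(rate)  # first inter-arrival gap
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def superpose(timelines):
    """Merge per-field timelines into one time-sorted list of (time, field)."""
    merged = [(t, field) for field, times in timelines.items() for t in times]
    return sorted(merged)


rng = random.Random(0)
# Hypothetical citation rates (citations per unit time) for three citing fields.
field_rates = {"f1": 2.0, "f2": 0.5, "f3": 1.0}
horizon = 1000.0

timelines = {
    field: simulate_poisson_arrivals(rate, horizon, rng)
    for field, rate in field_rates.items()
}
merged = superpose(timelines)

# The merged stream should behave like a Poisson process whose rate is
# the sum of the per-field rates.
total_rate = sum(field_rates.values())
empirical_rate = len(merged) / horizon
print(f"sum of field rates: {total_rate}, empirical merged rate: {empirical_rate:.3f}")
```

Reading the merged list back by its field labels recovers the per-field timelines of Figure 2b, which is the decomposition the framework relies on.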
Problem Statement. In more detail, let us first define a set of research fields F in sciences and focus on the paper citations in one specific field f_cited ∈ F. Suppose that the i-th paper (i = 1, ..., I) in the cited field f_cited, published at time t = 0, has received N^f_i citations from a citing field f ∈ F during a time period [0, T]. Citations that come from the same field are called internal citations (i.e., f_cited = f); otherwise they are called external citations (i.e., f_cited ≠ f). Then, paper i's citation timestamps from each citing field f, D [...]
Figure 3. Latent affinity ξ_f has hyperparameters α and β. Empty circles represent unknown random variables, a solid circle denotes observed data, and dark dots indicate parameters. Note that only the graphical model of the cited field is presented.
The parameter values of the APP model are estimated using Bayesian inference. Based on the model formulation in this section, the likelihood of observing the paper citation histories is first calculated. Then, by imposing a conjugate prior, the posterior distribution of the latent affinity ξ_f is computed. Accordingly, Figure 3 illustrates the corresponding graphical model of the APP for expressing the overall conditional dependence structure between random variables. Detailed approaches to Bayesian inference and parameter learning are explained in Appendices B and C, respectively.
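As a rough illustration of the conjugate update mentioned above, the sketch below assumes that the prior on the affinity is a Gamma distribution with shape α and rate β, which is the standard conjugate prior for a Poisson rate. The source only states that a conjugate prior with hyperparameters α and β is used, so the Gamma choice, the function names, and the numbers are assumptions. If the intensity has the form ξ · g(t), the likelihood is proportional to ξ^N exp(−ξ G), where N is the number of observed citations and G is the integral of g over the observation window.

```python
def gamma_posterior(alpha, beta, n_events, exposure):
    """Posterior (shape, rate) of a Gamma(alpha, beta) prior on a Poisson rate.

    n_events: number of observed citations N.
    exposure: integrated baseline intensity G over the observation window.
    """
    return alpha + n_events, beta + exposure


def gamma_mean(shape, rate):
    """Mean of a Gamma distribution in the shape/rate parameterization."""
    return shape / rate


# Hypothetical numbers: weak prior, 8 observed citations, exposure of 4.0.
prior = (2.0, 1.0)
posterior = gamma_posterior(*prior, n_events=8, exposure=4.0)
print("posterior shape and rate:", posterior)
print("posterior mean affinity:", gamma_mean(*posterior))
```

With little data the posterior mean stays close to the prior mean α/β, which is the usual argument for why a prior can stabilize estimates for sparsely cited items.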

Sym.       Description
F          set of all research fields in science
f_cited    research field of the cited paper, f_cited ∈ F (usually omitted for brevity)
f          research field of the citing paper, f ∈ F
I          total number of papers in the cited field
[...]      citation intensity of paper i at time t for the citations received from the citing field f
ξ_f        affinity of the citing field f towards the cited field
[...]      aging effect of paper i in the citing field f after time t since its publication

Popularity Prediction
Bibliographic data in STEM fields are now applied to the APP, and the prediction performance is compared with two baselines (APP without a prior and RPP) as done in our previous work [9]. Based on the Bayesian inference and learning in Appendices B and C, estimated parameter values with real data are interpreted in the next section for understanding interdisciplinary citation dynamics across fields of sciences.

Data Statistics
The Web of Science (WoS) data [10] is investigated for the two decades between 1991 and 2011, since publication data consistent across all target fields is only available during this period. The data contains publication records from a wide range of academic areas, each of which consists of a paper profile (e.g., title, keywords, publication year, venue, and associated subjects) and citation relationships (e.g., cited and citing articles). This study focuses on STEM field publications covering 108 subjects, where some disciplines such as Biology and Engineering are more fine-grained. For less biased, macro-level observations of interdisciplinary citation flows in sciences, these 108 subjects are grouped into higher-level research areas, i.e., 37 subfields and 4 fields, by referring to the classifications of academic programs conducted by the National Research Council (NRC) [11]. The four NRC fields are Agricultural Sciences, Biological & Health Sciences, Engineering, and Physical & Mathematical Sciences, and the detailed correspondence between the WoS and NRC classifications is presented in Tables A2-A4 in Appendix F.
As data preprocessing steps, journals whose publication records are available during the entire data period (i.e., from 1991 to 2011) are first selected. In order to secure citation histories of at least 10 years, individual articles published between 1991 and 2000 are then selected. Fundamental statistics of the target publications are presented in Table 2. For a macro view of cross-discipline citation flows, an individual journal's citation intensity (i.e., λ_i(t) in Equation (3)) is estimated by collecting citation timestamps and decomposing them into multiple timelines according to the citing articles' associated NRC subfields (i.e., the horizontal timelines in Figure 2b). Note that 878 journals have complete citation histories during the data period, and an individual journal's citation volume is predicted separately for each publication year between 1991 and 2000. Thus, prediction is conducted 8780 times (878 journals × 10 years) in total.
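The decomposition step described above can be sketched as follows. This is only an illustration under assumed inputs: the record layout (pairs of citation time and citing subfield), the function name, and the example values are hypothetical and do not reflect the actual WoS schema.

```python
from collections import defaultdict


def decompose_by_field(citations):
    """Split (time, citing_field) records into one sorted timeline per field."""
    timelines = defaultdict(list)
    for time, field in citations:
        timelines[field].append(time)
    return {field: sorted(times) for field, times in timelines.items()}


# Hypothetical citation records for one journal: years since publication
# and the NRC subfield of the citing article.
citations = [
    (1.2, "Physics"),
    (0.4, "Genetics"),
    (2.5, "Physics"),
    (3.1, "Statistics"),
]
timelines = decompose_by_field(citations)
print(timelines)

# Number of prediction runs reported in the text: 878 journals x 10 years.
n_predictions = 878 * 10
print(n_predictions)
```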

Prediction of Interdisciplinary Citation Volumes
Popularity prediction tests are conducted with real data, and the prediction results are compared between our proposed model (APP) and the baselines (APP without a prior and RPP), as done in [9]. Individual journals' citation volumes are predicted by training the proposed and baseline models with citation histories of at least 10 years. The first two plots in Figure 4 illustrate the prediction errors in MAPE (a) as the length of the training period increases (from 10 to 20 years) and (b) as the number of test years varies (from 1 to 10 years) after training with 10-year citation histories. In both cases, our proposed model outperforms the baseline models, improving by 15% over APP without a prior and by 27% over RPP on average when comparing prediction performance with the shortest citation histories (10 years). Figure 4c compares the distributions of citation sizes between real data and the prediction results from our proposed model. As shown in this figure, the predicted citation sizes are distributed quite similarly to those from real data. That is, the APP explains the popularity dynamics of individual journals well across fields of sciences, not limited to specific ones.
Figure 5 shows example prediction results from our model. In the figure, the three journals are from different NRC subfields, but they all receive interdisciplinary attention from different subfields. As the figure shows, the APP not only predicts an individual journal's total number of citations but also separately predicts the citations received from each citing subfield.
Figure 5. Example prediction results (axes: Year, Citation Volume).
Based on the parameter estimation, citation dynamics are analyzed in the next section in terms of affinity network, affinity density, and novelty decay across subfields of sciences.
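For reference, the error measure used in the comparison above, the mean absolute percentage error (MAPE), can be computed as in the following minimal sketch. Skipping items with zero actual citations is a design choice made here to avoid division by zero; how the study handles such cases is not stated, and the example numbers are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error over items with nonzero actual value."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    terms = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    if not terms:
        raise ValueError("no items with nonzero actual value")
    return sum(terms) / len(terms)


# Hypothetical citation volumes for two journals.
actual_volumes = [100, 200]
predicted_volumes = [110, 180]
error = mape(actual_volumes, predicted_volumes)
print(f"MAPE: {error:.3f}")
```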

Affinity Map
Model parameters, such as the latent affinity and aging effect between every pair of subfields, are estimated using citation histories of more than 10 years for each journal's publications from 1991 to 2000, up to the final data year, 2011. That is, a subfield's accumulated affinity and the novelty decay of its publications across subfields are all inferred so that we can examine time-evolving citation dynamics between different publication years.
Based on the estimation, Figure 6 presents affinity maps of all 37 NRC subfields with the affinity network (first column) and affinity density (second column) for the publication years 1991, 1993, 1995, 1997, and 2000. As shown in the figure, network clusters (color-coded) and densities vary by publication year, which implies that the affinity between every pair of subfields is time-evolving.
Figure 6. Affinity maps for the publication years 1991, 1993, 1995, 1997, and 2000. Networks in the first column are clustered based on the normalized cut, where each node and link indicate a subfield and affinity. The distance between two nodes reflects the strength of affinity (closer distance for stronger affinity). Density maps in the second column present interdisciplinarity, where nodes with higher interdisciplinarity and density are highlighted in red.

Affinity Network
The first column of Figure 6 illustrates networks, each of which consists of nodes for NRC subfields and links for directed pairwise affinity. Here, the distance between nodes reflects the strength of affinity between two subfields (i.e., closer distance for stronger affinity). The size and color of a node present a subfield's interdisciplinarity and its associated cluster based on the normalized cut on the network [33]. As shown in the figure, the BH subfields become more distant from each other but closer to other NRC fields over time. On the other hand, the PM subfields are locally clustered in earlier years but become closer to each other and also to other fields over time. Specifically, computer science, applied mathematics, statistics, and engineering science (blue nodes) are more isolated from their own NRC fields at the beginning but become closer to their own and different fields, while oceanography, earth science, civil engineering, and astrophysics (purple nodes) exhibit consistent membership over time.
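To make the normalized cut criterion concrete, the following sketch evaluates Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V) on a toy affinity matrix and finds the best two-way split by exhaustive search. It assumes a symmetric weight matrix in which every node has positive total weight; the directed affinities of the paper would first have to be symmetrized, and practical normalized cut clustering relies on a spectral relaxation rather than brute force. The matrix values and function names are hypothetical.

```python
from itertools import combinations


def ncut_value(weights, group_a):
    """Normalized cut value of the partition (group_a, remaining nodes)."""
    n = len(weights)
    a = set(group_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


def best_bipartition(weights):
    """Exhaustively search two-way partitions; feasible only for tiny graphs."""
    n = len(weights)
    best_score, best_group = None, None
    for size in range(1, n // 2 + 1):
        for group in combinations(range(n), size):
            score = ncut_value(weights, group)
            if best_score is None or score < best_score:
                best_score, best_group = score, set(group)
    return best_score, best_group


# Toy symmetric affinities: subfields 0-1 and 2-3 are tightly linked,
# with one weak link between the two pairs.
toy_weights = [
    [0.0, 5.0, 0.0, 0.0],
    [5.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 5.0],
    [0.0, 0.0, 5.0, 0.0],
]
score, group = best_bipartition(toy_weights)
print(f"best normalized cut: {score:.4f}, one side of the split: {sorted(group)}")
```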

Affinity Density
These generated networks are also presented with density maps in the second column in Figure 6, where a node with higher interdisciplinarity and density is closer to red. As shown in the figure, dense areas are changing over time, such as genetics and biochemistry/biophysics in 1991, physics and applied mathematics in 1993, computer sciences and animal sciences in 1995, cell & developmental biology in 1997, and nanoscience in 2000.
Accordingly, Figure 7 presents the keyword distributions of the highly interdisciplinary subfields from the density maps in 1991, 1995, and 2000. Keywords are extracted from titles and abstracts of journal articles published in the corresponding year, which are collected from the Web of Science [10] by querying highly cited journal names of the same publication year within each subfield. In addition, keywords are color-coded by co-occurrence-based clusters. Overall, about 13% of keywords are commonly used in a citing subfield with high affinity, while less than 5% of keywords are common between subfields with low or no affinity. That is, our estimated affinity between two subfields reflects a close relation in their research. In more detail, Table A1 shows example keywords which are commonly used between the subfields in Figure 7 and the top three subfields with the highest affinities for each selected subfield. The keywords of citing subfields in Table A1 are among the 300 most used keywords, which are collected from individual papers' keyword records in the Web of Science data.
Overall, NRC subfields become more interdisciplinary across the four NRC fields, but the affinity between subfields is not static but time-evolving, collectively leading to highly popular subfields in every publication year.
(a) Genetics (1991) (b) Computer Science (1995) (c) Nano Science (2000)
Figure 7. Keyword distributions of the highly interdisciplinary subfields from the density maps in 1991, 1995, and 2000 in Figure 6. Keyword colors represent clusters based on co-occurrences.
Figure 8 shows the novelty decay of the four NRC fields' publications from 1991 to 2000. As shown in the figure, the Agricultural Sciences (AS) and PM fields exhibit slower decay of earlier publications, while the Engineering (EG) field shows the opposite, i.e., faster decay of earlier publications. That is, publications in the EG field are forgotten more quickly than those in the AS and PM fields, which can be interpreted as citing fields tending to keep up with the latest technological breakthroughs of the EG field. On the other hand, the BH field's publications show similar time decay patterns between different publication years, and their aging is the slowest among the NRC fields. This implies that aging in the BH field is consistent regardless of publication year and that its publications are forgotten less across the fields than other fields' publications. For instance, articles aged more than two years have a similar likelihood of being cited no matter when they were published.
Figure 9 summarizes the overall affinity maps for all publication years between 1991 and 2000. As shown in Figure 9a, the 37 NRC subfields can in general be grouped into four clusters based on the affinity between subfields, and these clusters differ from the four NRC fields. That is, different subfields are not only locally clustered but also globally interrelated with each other beyond disciplines and NRC fields. As Figure 9b illustrates, biochemistry/biophysics, genetics, and ecology in the BH field, animal science in the AS field, physics, chemistry, computer science, and applied mathematics in the PM field, and electrical engineering, oceanography, and nanoscience in the EG field have been highly interdisciplinary in STEM fields.
In terms of novelty decay in Figure 9c, the EG, PM, AS, and BH fields are aging in that order. The slowest aging of the BH field reflects a larger average number of citations as well as of citing subfields than in the other NRC fields.
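The keyword comparison above can be sketched as a simple set overlap. The exact denominator used for the reported percentages is not stated in the text, so the share relative to the cited subfield's keyword set is an assumption, and the keyword sets and function name below are hypothetical.

```python
def shared_keyword_share(cited_keywords, citing_keywords):
    """Fraction of the cited subfield's keywords that also appear in the citing subfield."""
    cited = set(cited_keywords)
    citing = set(citing_keywords)
    if not cited:
        return 0.0
    return len(cited & citing) / len(cited)


# Hypothetical keyword sets for a cited subfield and one citing subfield.
cited_keywords = {"gene", "protein", "dna", "cell"}
citing_keywords = {"protein", "cell", "laser"}
share = shared_keyword_share(cited_keywords, citing_keywords)
print(f"shared keyword share: {share:.2f}")
```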

Holistic View
Figure 9. (a) Affinity Network (b) Affinity Density

Conclusions
As the Affinity Poisson Process incorporates the effects of subpopulation-level interactions on diffusion processes, it not only enables the prediction of citation volumes of individual publications but also helps to reveal interdisciplinary citation dynamics across subfields of sciences, such as time-evolving latent affinity and novelty decay. Based on Bayesian inference and learning, the main findings are summarized below.
• Affinity between subfields is time-evolving, and overall NRC subfields become more interdisciplinary across the four NRC fields over time; the BH subfields become more distant from each other but closer to other NRC fields over time, while the PM subfields are locally clustered in earlier years but become closer to other fields over time.
• In terms of novelty decay, the AS and PM fields exhibit slower aging for earlier publications, while the EG field shows the opposite, i.e., faster decay of earlier publications. The BH field shows consistent aging across publication years and the slowest time decay among the four NRC fields.
• Overall, the 37 NRC subfields are not only locally clustered but also globally interrelated with each other beyond disciplines and NRC fields. Highly interdisciplinary subfields for each NRC field are: biochemistry/biophysics, genetics, and ecology in the BH field; animal science in the AS field; physics, chemistry, computer science, and applied mathematics in the PM field; and electrical engineering, oceanography, and nanoscience in the EG field. In terms of novelty decay, the EG, PM, AS, and BH fields are aging in that order.
Note that this study focuses on how to infer affinity between given fields by employing predefined metadata. As scientific fields are evolving, it is very challenging to classify research fields and identify the associated field of an individual publication, which is beyond the scope of this paper. Nevertheless, clustering the affinity network can provide a new way of classifying time-evolving fields in academia.
Overall, this way of interpreting dynamics is generally applicable to a broad range of real-world diffusion scenarios by providing rich dynamics across subgroups of a population. One direction for future work is to improve the framework with other latent factors that account for recurring popularity in a complex system. Another direction is to define latent affinity as a time-varying function in order to explicitly model its evolving patterns and obtain more accurate results, based on more recent data collection.

Conflicts of Interest:
The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A.1. Integral of Lognormal Distributions
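Then, the integral of a lognormal distribution is
\[ \int_0^t \frac{1}{s\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln s-\mu)^2}{2\sigma^2}\right) ds = \Phi(\tau), \]
where τ ≡ (ln t − µ)/σ and Φ(·) denotes the cumulative distribution function of the standard normal distribution. The partial derivatives of the integral of a lognormal distribution with respect to the parameters are
\[ \frac{\partial \Phi(\tau)}{\partial \mu} = -\frac{\phi(\tau)}{\sigma}, \qquad \frac{\partial \Phi(\tau)}{\partial \sigma} = -\frac{\tau\,\phi(\tau)}{\sigma}, \]
where φ(·) denotes the probability density function of the standard normal distribution.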
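As a sanity check of the standard identities for the lognormal integral (the integral up to t equals Φ(τ), with partial derivatives −φ(τ)/σ and −τφ(τ)/σ with respect to µ and σ), the following sketch compares the analytic derivatives against central finite differences. The function names and parameter values are illustrative assumptions, not part of the original appendix.

```python
import math


def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    """Probability density function of the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def lognormal_integral(t, mu, sigma):
    """Integral of the lognormal density from 0 to t, i.e. Phi(tau)."""
    tau = (math.log(t) - mu) / sigma
    return std_normal_cdf(tau)


def lognormal_integral_grad(t, mu, sigma):
    """Analytic partial derivatives of Phi(tau) with respect to mu and sigma."""
    tau = (math.log(t) - mu) / sigma
    d_mu = -std_normal_pdf(tau) / sigma
    d_sigma = -tau * std_normal_pdf(tau) / sigma
    return d_mu, d_sigma


# Hypothetical evaluation point and finite-difference step.
t, mu, sigma, h = 2.0, 0.5, 0.8, 1e-5
d_mu, d_sigma = lognormal_integral_grad(t, mu, sigma)
d_mu_num = (lognormal_integral(t, mu + h, sigma)
            - lognormal_integral(t, mu - h, sigma)) / (2 * h)
d_sigma_num = (lognormal_integral(t, mu, sigma + h)
               - lognormal_integral(t, mu, sigma - h)) / (2 * h)
print(f"d/dmu: analytic {d_mu:.6f}, numeric {d_mu_num:.6f}")
print(f"d/dsigma: analytic {d_sigma:.6f}, numeric {d_sigma_num:.6f}")
```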

Appendix A.2. Logarithm of Lognormal Distributions
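The logarithm of the lognormal distribution is
\[ \ln\!\left[\frac{1}{t\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln t-\mu)^2}{2\sigma^2}\right)\right] = -\ln t - \ln\sigma - \tfrac{1}{2}\ln(2\pi) - \frac{\tau^2}{2}. \]
Thus, its partial derivatives with respect to the parameters are
\[ \frac{\partial}{\partial \mu}\left(-\ln\sigma - \frac{\tau^2}{2}\right) = \frac{\tau}{\sigma}, \qquad \frac{\partial}{\partial \sigma}\left(-\ln\sigma - \frac{\tau^2}{2}\right) = \frac{\tau^2 - 1}{\sigma}, \]
where τ = (ln t − µ)/σ.
Table A1 shows the keywords of citing subfields, which are among the 300 most used keywords. They are collected from individual papers' keyword records in the Web of Science data.
Table A1. For the selected subfields in Figure 7, example keywords are presented from the 300 most used ones in a citing subfield. For each cited subfield, the top three citing subfields are selected, showing the highest affinities for that cited subfield, and they are listed in descending order of affinity.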

Cited NRC Subfield
Citing

Appendix F. Metadata
We grouped 108 subject areas, provided by the Web of Science data, into higher level research areas, 37 subfields and 4 fields, by referring to the classifications of academic programs, conducted by the National Research Council (NRC) in 2011 [11].