Seasonal Entropy, Diversity and Inequality Measures of Submitted and Accepted Papers Distributions in Peer-Reviewed Journals

This paper presents a novel method for finding features in the analysis of variable distributions stemming from time series. We apply the methodology to the case of submitted and accepted papers in peer-reviewed journals. We provide a comparative study of editorial decisions for papers submitted to two peer-reviewed journals: the Journal of the Serbian Chemical Society (JSCS) and this MDPI Entropy journal. We cover three recent years for which the fate of submitted papers (about 600 papers to JSCS and 2500 to Entropy) is completely determined. Instead of comparing the number distributions of these papers as a function of time with respect to a uniform distribution, we analyze the relevant probabilities, from which we derive the information entropy. It is argued that such probabilities are indeed more relevant for authors than the actual number of submissions. We tie this entropy analysis to the so-called diversity of the variable distributions. Furthermore, we emphasize the correspondence between the entropy and the diversity with inequality measures, like the Herfindahl-Hirschman index and the Theil index, itself being in the class of entropy measures; the Gini coefficient, which also measures the diversity in ranking, is calculated for further discussion. In this sample, the seasonal aspects of the peer review process are outlined. It is found that the use of such indices, which are nonlinear transformations of the data distributions, allows us to distinguish features and evolutions of the peer review process as a function of time, as well as to compare the non-uniformity of distributions. Furthermore, t- and z-statistical tests are applied in order to measure the significance (p-level) of the findings, that is, whether papers are more likely to be accepted if they are submitted during a few specific months or during a particular "season"; the predictability strength depends on the journal.


Introduction
Authors who submit (by their own assumption) high-quality papers to scholarly journals are interested in knowing whether there are factors which may increase the probability that their papers be accepted.
Concerning the number of submitted manuscripts, it was observed that the acceptance rate in JSCS was the highest if papers were submitted in January and February; it was significantly lower if the submission occurred in December. In the case of Entropy, the highest rejection rate was for papers submitted in December and March, thus with a January-February peak; the lowest acceptance rate was for manuscripts submitted in June or December, the highest rate being for those sent in spring months, February to May. One recognizes a journal-dependent seasonal shift of the features. Notice that we adapt the word "seasonal": even though changes in seasons occur on the 21st of various months, we approximate the season transition as occurring on the 1st day of the following month.
Here, we propose another line of approach in order to study the submission, acceptance, and rejection (number and rate) diversity based on probabilities, with emphasis on the conditional probabilities, thereafter measuring the entropy and other characteristics of the distributions. Indeed, the entropy is a measure of disorder, and one of several ways to measure diversity. Researchers have their own preferences [7,8] in measuring diversity. In what follows, we adapt the classical measure of diversity, as used in ecology, but other cases of interest pertaining to information science [9,10] can be mentioned.
Let us recall that the general equation of diversity is often written in the form [11,12]

$$ {}^{q}D = \left( \sum_{i=1}^{N} p_i^{\,q} \right)^{1/(1-q)} \tag{1} $$

in which $p_i = z_i / \sum_i z_i$, and $z_i$ is the measured variable. For q = 1, ${}^{q}D$ reduces to the exponential of the Shannon entropy [13,14], to which we will only stick here. Several inequality measures are commonly used in the literature: in the class of entropy related measures, one finds the exponential entropy [15], which measures the extent of a distribution, and the Theil index [16], which emerges as the most popular one [17,18], besides the Herfindahl-Hirschman index [19], measuring "concentrations". Finally, upon ranking the measured variable according to size, the Gini coefficient [20] is a classical indicator of non-uniform distributions.
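As a minimal sketch, the diversity of order q can be evaluated numerically as below, assuming the standard Hill form $(\sum_i p_i^q)^{1/(1-q)}$ with the q = 1 case taken as the exponential of the Shannon entropy. The function name and the monthly counts are hypothetical and serve only as an illustration; they are not data from either journal.

```python
import math

def hill_diversity(z, q):
    """Diversity of order q for non-negative measurements z (e.g., monthly counts).

    Assumes the standard form (sum_i p_i**q) ** (1 / (1 - q)), p_i = z_i / sum(z).
    """
    total = sum(z)
    p = [v / total for v in z if v > 0]  # zero counts contribute nothing
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of the Shannon entropy
        shannon = -sum(pi * math.log(pi) for pi in p)
        return math.exp(shannon)
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# Hypothetical monthly counts, for illustration only
uniform = [50] * 12
skewed = [120, 100, 20, 20, 20, 20, 20, 20, 20, 40, 80, 120]

print(hill_diversity(uniform, 1))  # 12 "effective months" for a flat distribution
print(hill_diversity(skewed, 1))   # fewer effective months when counts are uneven
print(hill_diversity(skewed, 2))   # order 2 is the inverse of the sum of p_i**2
```

The value can be read as an effective number of equally populated months: it equals 12 only when every month carries the same weight.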
The Theil index [16] is defined by

$$ Th = \frac{1}{N} \sum_{i=1}^{N} \frac{z_i}{\langle z \rangle} \ln\!\left( \frac{z_i}{\langle z \rangle} \right), \qquad \langle z \rangle = \frac{1}{N} \sum_{i=1}^{N} z_i . $$

It seems obvious that the Theil index can be expressed in terms of the negative entropy, indicating the deviation from the maximum disorder entropy, ln(N),

$$ Th = \ln(N) - S, \qquad S = -\sum_{i=1}^{N} p_i \ln p_i . $$

The exponential entropy [15] is

$$ \exp(S) = \exp\!\left( -\sum_{i=1}^{N} p_i \ln p_i \right) . $$

The Herfindahl-Hirschman index (HHI) [19] is an indicator of the "concentration" of variables, here the "amount of competition" between the months. The higher the value of HHI, the smaller the number of months with a large value of (submitted, or accepted, or accepted if submitted) papers in a given month. Formally, adapting the HHI notion to the present case,

$$ HHI = \sum_{i=1}^{N} \left( \frac{z_i}{\sum_{j=1}^{N} z_j} \right)^{2} . $$

Notice that $HHI = \sum_{i=1}^{N} p_i^2$.

The Gini coefficient Gi [20] has been widely used as a measure of income [21] or wealth inequality [22,23]; nowadays, it is widely used in many other fields. In brief, defining first the Lorenz curve L(r) as the percentage contributed by the bottom r of the variable population to the total value $\sum_r z_r$ of the measured (and now ranked) variable $z_r$, i.e., $p_r = z_r / \sum_r z_r$, one obtains the Gini coefficient as twice the area between this Lorenz curve and the diagonal line in the [r, L(r)] plane; such a diagonal represents perfect equality, whence Gi = 0 corresponds to perfect equality of the $z_r$ variables.
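The inequality measures recalled above can be sketched numerically as follows. This is an illustration under stated assumptions, not the procedure used for the tables: the Theil index is computed as ln(N) minus the Shannon entropy, the Gini coefficient is obtained from a trapezoidal approximation of the area under the Lorenz curve, and all function names and input values are hypothetical.

```python
import math

def probabilities(z):
    total = sum(z)
    return [v / total for v in z]

def shannon_entropy(z):
    # Natural logarithm; terms with p == 0 contribute 0
    return -sum(p * math.log(p) for p in probabilities(z) if p > 0)

def theil_index(z):
    # Deviation from the maximum-disorder entropy ln(N)
    return math.log(len(z)) - shannon_entropy(z)

def exponential_entropy(z):
    return math.exp(shannon_entropy(z))

def hhi(z):
    return sum(p * p for p in probabilities(z))

def gini(z):
    # Twice the area between the Lorenz curve and the diagonal,
    # using the trapezoid rule on the ranked values
    ranked = sorted(z)
    n, total = len(ranked), sum(ranked)
    cumulative, previous, area = 0.0, 0.0, 0.0
    for v in ranked:
        cumulative += v
        current = cumulative / total
        area += (previous + current) / (2 * n)
        previous = current
    return 1.0 - 2.0 * area

# Hypothetical monthly counts, for illustration only
flat = [50] * 12
one_month_only = [0] * 11 + [600]

for name, z in (("flat", flat), ("one month only", one_month_only)):
    print(name, theil_index(z), exponential_entropy(z), hhi(z), gini(z))
```

For 12 months, a flat distribution gives HHI = 1/12 and Gi = 0, whereas a distribution concentrated on a single month gives HHI = 1 and Gi = 11/12 with this discrete estimate.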
Having set up the framework and presented the definition of the indices to be calculated, we indicate quantities of interest and turn to the data and data analysis, in Sections 2 and 3, respectively. Their discussion and comments on the present study, together with a remark on its limitations, are found in the conclusion Section 4.

Definitions
In order to develop the method measuring the disorder of the time series, let us recall the necessary data. The raw data can be found in Reference [6]. For completeness, let the time series of submitted and of accepted papers if submitted during a given month to JSCS and to Entropy be recalled through Figure A1 for the years in which the full data is available, that is, for which the final decisions have been made on the submitted papers.
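Let us introduce notations:
• the number of monthly submissions in a given month (m = 1, ..., 12) in year (y) is called $N_s^{(m,y)}$;
• the number of papers accepted among those submitted in month m of year y is called $N_a^{(m,y)}$;
• the corresponding monthly probabilities are $q_s^{(m,y)} = N_s^{(m,y)} / \sum_m N_s^{(m,y)}$ and $q_a^{(m,y)} = N_a^{(m,y)} / \sum_m N_a^{(m,y)}$;
• the conditional probability of acceptance for a paper submitted in month m of year y is $p_{(a|s)}^{(m,y)} = N_a^{(m,y)} / N_s^{(m,y)}$.

Thereafter, one can deduce the relevant "monthly information entropies"
• $S_s^{(y)} = -\sum_m q_s^{(m,y)} \ln q_s^{(m,y)}$,
• $S_a^{(y)} = -\sum_m q_a^{(m,y)} \ln q_a^{(m,y)}$,
• $S_{(a|s)}^{(y)} = -\sum_m p_{(a|s)}^{(m,y)} \ln p_{(a|s)}^{(m,y)}$,
in order to pinpoint whether the yearly distributions are disordered.

Moreover, we can discuss the data by not only comparing different years, but also the cumulated data per month in the examined time interval, as if all years are "equivalent": $C_s^{(m)} = \sum_y N_s^{(m,y)}$, and similarly for the accepted papers, $C_a^{(m)} = \sum_y N_a^{(m,y)}$, leading to the ratio between cumulated monthly data, $q_{(a|s)}^{(m)} = C_a^{(m)} / C_s^{(m)}$, and to the corresponding "monthly cumulated entropy", $S_{(a|s)} = -\sum_m q_{(a|s)}^{(m)} \ln q_{(a|s)}^{(m)}$, which will be called the "conditional entropy". Relevant values are given in Tables 1-4 both for JSCS and for Entropy. The diversity and the inequality index values are given in Table 5. Most of the results stem from the use of a free online software [24].

Table 3. Conditional probability $p_{(a|s)}^{(m,y)}$; the sum of such probabilities is given; we also report the here so-called "conditional entropy" (c.entr.), either $S_{(a|s)}^{(y)}$ or $S_{(a|s)}$. The distribution total (sum), mean, standard deviation, confidence interval, t- and z-test with p-significance level, are also reported.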
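The quantities defined in this section can be obtained from monthly counts along the following lines. This is a sketch under explicit assumptions: the counts are invented for illustration (they are not the JSCS or Entropy data), the dictionary keys and function names are hypothetical, and the "conditional entropy" is taken as the sum of -p ln p over the acceptance ratios without renormalizing them to unit sum.

```python
import math

def entropy(probs):
    """Return -sum(p ln p); terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical monthly counts (January..December), NOT the journals' data
submitted = {
    2014: [20, 18, 15, 14, 16, 15, 12, 10, 17, 22, 21, 20],
    2015: [22, 19, 16, 15, 15, 14, 13, 11, 18, 20, 23, 24],
}
accepted = {
    2014: [10, 9, 6, 5, 6, 6, 5, 4, 7, 9, 8, 7],
    2015: [11, 9, 7, 6, 6, 5, 5, 4, 8, 8, 9, 8],
}

def yearly_measures(n_s, n_a):
    """Probabilities and entropies for one set of 12 monthly counts."""
    q_s = [v / sum(n_s) for v in n_s]           # share of submissions per month
    q_a = [v / sum(n_a) for v in n_a]           # share of acceptances per month
    p_cond = [a / s for a, s in zip(n_a, n_s)]  # accepted if submitted in month m
    return {
        "q_s": q_s, "q_a": q_a, "p_cond": p_cond,
        "S_s": entropy(q_s), "S_a": entropy(q_a),
        # Assumption: p_cond is used as is, without renormalization
        "S_cond": entropy(p_cond),
    }

def cumulated_measures(sub_by_year, acc_by_year):
    """Same quantities after summing each month over all years."""
    c_s = [sum(col) for col in zip(*sub_by_year.values())]
    c_a = [sum(col) for col in zip(*acc_by_year.values())]
    return yearly_measures(c_s, c_a)

for year in submitted:
    m = yearly_measures(submitted[year], accepted[year])
    print(year, round(m["S_s"], 4), round(m["S_a"], 4), round(m["S_cond"], 4))

total = cumulated_measures(submitted, accepted)
print("cumulated", round(total["S_cond"], 4), "ln(12) =", round(math.log(12), 4))
```

Summing counts over years before taking the ratio weights each year by its number of submissions, which differs from averaging the yearly ratios.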

Data
First, notice that the 3-year long time series is not in itself part of the main aim of the paper; this is because we intend to compare data with an equivalent number of degrees of freedom, that is, 11, for all studied cases. Nevertheless, for completeness and in order not to distract readers from our framework, we provide relevant figures in Appendix A, together with a note on the corresponding discrete Fourier transform. A short note, also in the Appendix, recalls the meaning of the (p-) significance level.
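For readers who wish to reproduce the kind of discrete Fourier analysis mentioned above, a minimal sketch follows. It assumes a plain discrete Fourier transform of the mean-subtracted 36-month series and reports the two largest amplitudes; the synthetic series, the normalization by the series length, and the function names are our own illustrative choices, not the procedure behind Table A1.

```python
import cmath
import math

def dft_amplitudes(series):
    """Return (frequency in 1/month, period in months, amplitude) for k = 1..n/2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # drop the zero-frequency component
    rows = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        rows.append((k / n, n / k, abs(coeff) / n))
    return rows

def two_largest(series):
    return sorted(dft_amplitudes(series), key=lambda row: row[2], reverse=True)[:2]

# Hypothetical 36-month series: a yearly cycle plus a weaker 3-month cycle
series = [50 + 10 * math.cos(2 * math.pi * t / 12)
             + 4 * math.cos(2 * math.pi * t / 3) for t in range(36)]

for frequency, period, amplitude in two_largest(series):
    print(f"f = {frequency:.4f} per month, period = {period:.1f} months, "
          f"amplitude = {amplitude:.2f}")
```

With only 36 points, the resolvable periods are 36/k months for k = 1, ..., 18, so a yearly period corresponds to k = 3 and a trimester period to k = 12.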

Analysis
The relevant values for the various indices, given in Tables 1-4, both for JSCS and for Entropy, serve the following analysis. We consider 3 aspects: (i) a posteriori features findings, (ii) non-linear entropy indices, and (iii) forecasting aspects.

A posteriori features findings
Browsing through Table 1, it can be noticed that the probability of submission is lower during the February-May months for JSCS, but is rather high for the fall and winter months. For Entropy, the highest probability of submission also occurs in October-December, and is preceded by a low rate of submissions, the lowest being in February and in August, arguably at vacation times. Let us recall that the extremum entropy (for "perfect disorder") is here ln(12) ≈ 2.4849.
Apparently, this submission evolution pattern is reflected (see Table 2) in the acceptance rate, except for JSCS, which has a low acceptance rate for papers submitted in winter 2014. For Entropy, the lowest acceptance rate occurs for papers submitted during the August-September months, say at the end of summer.
Statistical tests, for example, χ², can be provided to ensure the validity of these findings for percentages, while taking into account the number of observations. In all cases, such a test demonstrates that the distributions are far from uniform, suggesting that one look further for the major deviations. See a discussion of other tests in Section 3.2.3.
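A χ² comparison of monthly counts with a uniform distribution can be sketched as follows. The counts are hypothetical, and the decision rule simply compares the statistic with the tabulated 5% critical value for 11 degrees of freedom (19.675); this illustrates the type of test, not the computation reported here.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic of observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Tabulated 0.95 quantile of the chi-square distribution with 11 degrees of freedom
CRITICAL_11_DOF_5PCT = 19.675

# Hypothetical monthly counts, for illustration only
nearly_flat = [52, 48, 50, 51, 49, 50, 47, 53, 50, 50, 49, 51]
seasonal = [120, 100, 20, 20, 20, 20, 20, 20, 20, 40, 80, 120]

for name, counts in (("nearly flat", nearly_flat), ("seasonal", seasonal)):
    stat = chi_square_uniform(counts)
    verdict = "reject uniformity" if stat > CRITICAL_11_DOF_5PCT else "compatible with uniformity"
    print(f"{name}: chi2 = {stat:.2f} -> {verdict} at the 5% level")
```

Because the statistic scales with the total number of observations, the same percentages can be significant for a large sample and not for a small one.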
However, the $q_a^{(m,y)}$ values only measure the probability of monthly acceptances without considering the number of submissions in a given month. It is in this respect more appropriate to look at the conditional probabilities, $q_{(a|s)}^{(m)}$, as in Table 3. For JSCS, the highest values of $q_{(a|s)}^{(m)}$ are found for winter months: $q_{(a|s)}^{(m)}$ has a notable maximum in January, and the lowest values occur in spring and summer, from March till August. There is a shift of such a pattern for Entropy: the highest conditional probabilities occur during spring time, except in 2016.
The corresponding values of the monthly entropy, for the given years and for the cumulated distributions, are found in Table 4. All values of the entropy are remarkably close to 4.1, both for JSCS and Entropy, suggesting some sort of universality. One can notice that the entropy steadily increases as a function of time both for JSCS and Entropy, the growth rate being about twice as large for the latter journal. This is somewhat surprising, since one should expect an averaging effect in the case of Entropy because of the multidisciplinarity of the topics involved. Comparing such values indicates that the distributions are indeed far from uniform. (The slight difference between the last lines of Tables 3 and 4, displaying the "conditional entropy", is merely due to rounding errors.)
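To give a sense of scale for such entropy values, the sketch below evaluates the sum of -p ln p over 12 months when every month has the same acceptance ratio. It assumes, as the reported sums of the conditional probabilities suggest, that these ratios are not renormalized to unit sum; the rates used are hypothetical.

```python
import math

def unnormalized_entropy(probs):
    """Sum of -p ln p over probabilities that need not add up to 1."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical flat acceptance ratios, identical for the 12 months
for rate in (0.3, 0.4, 0.5, 0.6):
    print(rate, round(unnormalized_entropy([rate] * 12), 3))

# Each term -p ln p is largest at p = 1/e, hence the bound for 12 months
upper_bound = 12 / math.e
print("upper bound:", round(upper_bound, 3))
```

Under this assumption, a flat ratio of 0.5 gives 6 ln 2 ≈ 4.159, and no set of 12 ratios can exceed 12/e ≈ 4.415, which differs from the ln(12) bound that applies to normalized monthly probabilities.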

Non-Linear Entropy Indices
The diversity and inequality measures are given in Table 5. The diversity index ${}^{1}D$ is remarkably similar for both journals (∼11) for the submitted papers and accepted papers distributions. The similarity holds also for the HHI (≈ 0.087), although it is a little lower for the Entropy journal (≈ 0.085). The diversity index for the conditional probability distributions is, however, rather different: both increase as a function of time, indicating an increase in concentrations in favor of relevant months. The rate of increase is much higher for Entropy than for JSCS.
The inequality between months is rather weak, as seen in the Gini coefficient. However, there is a factor ∼2 in favor of JSCS, which we interpret as being due to the greater specificity of JSCS, implying a smaller involved community and specially favored topics. This numerical observation reinforces what can be deduced from the Theil index, which leads to the same conclusion.

Forecasting Aspects
Considering the rather small sizes of both samples (not our fault!), it is of interest to discuss the significance of the findings, in some sense in view of suggesting some "strategy" after the "diagnosis". The notions of "false positives" and "false negatives", as in medical testing, can be applied in our framework.
In brief, a "false positive" occurs as an error when a test result improperly indicates the presence (high probability) of an outcome, when in reality it is not present; conversely, a "false negative" is an error in which a test result improperly indicates no presence of a condition (the result is negative), when in reality it is present. This corresponds to rejecting (or accepting) a null hypothesis, for example, in econometrics. Thus, two statistical tests have been used for such a discussion: (i) the Student t-test and (ii) the z-test. Recall that the former is used when the variance (or standard deviation) of the sample and test distributions is not known, and the latter when it is known. Such characteristics are given in Tables 1-4 for each relevant quantity. For completeness, we also give the confidence interval [µ − 2σ, µ + 2σ]. It is easily seen that there is no outlier. This observation would lead us, like other authors, to claim that there is no anomaly in the monthly numbers and subsequent percentages, in contradistinction with the χ² values and tests. We should here point out that the Student t-test leads to a p-value < 0.0001, a quite significant result. Concentrating our attention on the (monthly and annual) conditional probabilities $N_a/N_s$, the z-test gives the significance reported in Table 4. The p-values (so-called α, or error of type I) in hypothesis testing indicate that the correct conclusion is to reject the null hypothesis and to consider the existence of "false positives". This is essentially due to the sample size. It is remarkable that the order of magnitude differs for JSCS and for Entropy.
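The [µ − 2σ, µ + 2σ] outlier check and a one-sample z-test can be sketched as follows. The monthly ratios, the hypothesized mean, and the "known" standard deviation are hypothetical, and the use of the sample standard deviation for the interval is an assumption of this sketch rather than a statement about how the tables were produced.

```python
import math
from statistics import NormalDist, mean, stdev

def two_sigma_interval(values):
    """Return (mu - 2 sigma, mu + 2 sigma) using the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return mu - 2 * sigma, mu + 2 * sigma

def outliers(values):
    low, high = two_sigma_interval(values)
    return [v for v in values if v < low or v > high]

def z_test_two_sided(values, hypothesized_mean, known_sigma):
    """One-sample z statistic and two-sided p-value for a known standard deviation."""
    z = (mean(values) - hypothesized_mean) / (known_sigma / math.sqrt(len(values)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical monthly acceptance ratios N_a / N_s, for illustration only
ratios = [0.50, 0.50, 0.40, 0.36, 0.38, 0.40, 0.42, 0.40, 0.41, 0.41, 0.38, 0.35]

print("interval:", two_sigma_interval(ratios))
print("outliers:", outliers(ratios))
print("z, p:", z_test_two_sided(ratios, hypothesized_mean=0.45, known_sigma=0.05))
```

A month can lie inside the two-sigma interval while the set of 12 values still differs significantly from a hypothesized mean, since the z statistic tests the mean, not individual months.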

Conclusions
The data on the number of submitted papers is relevant for editors and, more so nowadays, for publishers due to the automatic handling of papers. The relative number of accepted papers is less significant in that respect, but the conditional probability of having an accepted paper if it is submitted in a given month is very relevant for authors. Authors expect a fast and (hopefully) positive response from journals, and are probably interested in discovering the best timing for their submission in order to avoid possible editor overload and a negative effect at a particular moment. For these authors, the possible seasonal bias issue is expected to be relevant, as they would like to know whether a specific month of submission will increase the chance that their paper will be accepted. Thus, the probability of acceptance, the so-called "acceptance rate", is the relevant variable to be studied. Instead of χ² tests or observing the "confidence interval" on monthly distributions, we have proposed a new line of approach: considering the diversity and inequality in the distributions of papers submitted, accepted, or accepted if submitted in a given month through information indices, like the Shannon entropy [25], the diversity index, the Gini coefficient and the Herfindahl-Hirschman index.
From these case studies, a seasonal bias seems stronger in the specialized (JSCS) journal. The features are emphasized because we use a nonlinear transformation of the data, through information concepts, which have had their usefulness demonstrated in many other fields [26]. In the present cases, the seasonal bias effects are observed. The overall significance and the universality features might have to be re-examined if more data were available. Indeed, the p-values (so-called α, or error of type I) in hypothesis testing indicate that the correct conclusion is to consider the existence of "false positives".
Our outlined findings suggest intrinsic behavioral hypotheses for future research. Complementary aspects must be used as ingredients in order to understand whether some seasonal bias occurs [27,28]. One has to take into account the scientific work environment, besides the journal favored topics.
Author Contributions: All authors (M.A., O.N. and A.D.) equally contributed their best to all aspects of this paper; conceptualization, methodology; formal analysis; investigation; resources; data curation; writing-original draft preparation; writing revised version, and editing; visualization.
Funding: This research received no external funding.
Acknowledgments: M.A. greatly thanks the MDPI Entropy Editorial staff for gathering and cleaning up the raw data, and in particular Yuejiao Hu, Managing Editor. Thanks also go to the reviewers and Entropy editor.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Series Data
The time series of submitted and accepted papers if submitted during a given month to JSCS and to Entropy are given in Figure A1. The distributions are markedly non-uniform. Nevertheless, with such a short series, one can observe some periods more important than others. One can also observe that Entropy, a rather new journal, is attracting more submissions since 2015, and has an increased rejection rate. Some "parallelism" in the numbers of submitted and accepted if submitted papers in a given month seems apparent for JSCS.

The two largest amplitudes of frequency f (in month$^{-1}$), or periods, resulting from a Fourier analysis of the 3-year time series for $N_s$ papers submitted or $N_a$ accepted if submitted during a given month to JSCS and Entropy are given in Table A1. The year period is, in 3 cases, one of the two most important ones; the trimester period is the most important for submitted papers to JSCS, and the next largest for $N_a$ to JSCS, indicating the more relevant timing for the journal, more prone toward academic authors than Entropy.

Table A1. The two largest amplitudes of frequency f (in month$^{-1}$), or periods, resulting from a Fourier analysis of the 3-year time series for papers $N_s$ submitted or $N_a$ accepted if submitted during a given month to JSCS and Entropy, as displayed in Figure A1.

Computational notes
https://www.medcalc.org/calc/test_-one_-mean.php
This procedure calculates the difference of an observed mean with a hypothesized value. A significance value (p-value) and 95% Confidence Interval (CI) of the observed mean are reported. The p-value is the probability of obtaining the mean observed for the sample if the null hypothesis holds true.

The p-value is calculated using the one sample t-test, with t calculated as:

$$ t = \frac{\bar{x} - k}{\sigma / \sqrt{N}} $$

where $\bar{x}$ is the observed sample mean, N the sample size, the hypothesized mean is k, and the standard deviation is σ. In the present context, the hypothesized mean corresponds to that of the uniform distribution. Recall that the p-value is the area of the t distribution, with N − 1 degrees of freedom, which falls outside ± t.
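The one-sample t-test described above can be sketched with the standard library alone, as below. The p-value is obtained here by numerically integrating the Student t density with Simpson's rule, which is an implementation choice of this sketch and not necessarily what the online calculator does; the sample values and the hypothesized mean k are hypothetical.

```python
import math
from statistics import mean, stdev

def t_statistic(values, k):
    """t = (sample mean - k) / (sample standard deviation / sqrt(N))."""
    return (mean(values) - k) / (stdev(values) / math.sqrt(len(values)))

def t_pdf(x, dof):
    """Probability density of the Student t distribution with dof degrees of freedom."""
    norm = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return norm * (1 + x * x / dof) ** (-(dof + 1) / 2)

def two_sided_p(t, dof, steps=20000):
    """Area of the t distribution falling outside [-|t|, |t|] (Simpson's rule, even steps)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = 2 * a / steps
    total = t_pdf(-a, dof) + t_pdf(a, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-a + i * h, dof)
    return max(0.0, 1.0 - total * h / 3)

# Hypothetical monthly values and hypothesized mean k, for illustration only
values = [0.50, 0.50, 0.40, 0.36, 0.38, 0.40, 0.42, 0.40, 0.41, 0.41, 0.38, 0.35]
k = 0.45

t = t_statistic(values, k)
print("t =", round(t, 3), "p =", round(two_sided_p(t, len(values) - 1), 4))
```

For 12 monthly values there are 11 degrees of freedom, for which |t| ≈ 2.201 corresponds to a two-sided p-value of 0.05.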