Statistical Indicators of the Scientific Publications Importance: A Stochastic Model and Critical Look

A model of scientific citation distribution is given. We apply it to understand the role of the Hirsch index as an indicator of scientific publication importance in Mathematics and some related fields. The proposed model is based on a generalization of such well-known distributions as geometric and Sibuya laws. Real data analysis of the Hirsch index and corresponding citation numbers is given.

Keywords:

citation distribution; Hirsch index; geometric distribution; Sibuya distribution

1. Introduction

A rather large number of indexes have been proposed that supposedly measure the significance of an author’s scientific publications. Among the most popular are:

(i1): the total number of citations of a particular author [1,2,3];
(i2): Hirsch index of the author [4] (see also [5]).

It is these two indexes that we consider in the proposed work.

The definition of the numerical value of the index (i1) is clear from its name.

Recall the definition of the Hirsch index (see [4]). The Hirsch index h is the largest number h such that h of the author’s articles have been cited at least h times each. This index was introduced in [4], where its properties were explained. In our opinion, these properties do not correspond to the purpose of the index. However, we describe both the positive and the negative sides of the Hirsch index only after constructing citation models for scientific articles. One of them has already been stated in our preprint [6].
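The definition can be made concrete with a short computation. The following is a minimal sketch in Python, assuming that the citation counts of an author's papers are available as a plain list; the function name `h_index` and the sample counts are illustrative and are not taken from the data analyzed in this paper.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the first `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here on
    return h


# Hypothetical citation counts of five papers.
print(h_index([10, 8, 5, 4, 3]))
```

Sorting in decreasing order means the loop can stop at the first paper whose citation count falls below its rank.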

2. Citation Model Construction

We now turn to the construction of the author’s citation model. It will be considered as a composite of two models. The first of them describes the process by which an author publishes an article that will be cited, and the second describes the process of citing such an article.

Let us make some assumptions, which we discuss later.

Assumption 1.

Let the probability that a manuscript is rejected or never cited be q, and let the decisions on the publication of different manuscripts be taken independently.

Then it is clear that the probability that the scientist will have exactly k cited papers equals

q {(1 - q)}^{k}

,

k = 0, 1, \dots

. In other words, the number of publications of a scientist has a geometric distribution with parameter q. This distribution allows the number of an author’s publications to be arbitrarily large. However,

{(1 - q)}^{k}

tends to zero rather fast as

k \to \infty

and, therefore, the mean value of the number of publications is not too large. The generating function of this distribution has the form

Q (z) = \frac{q}{1 - (1 - q) z} .

(1)

Of course, here we assume that all the journals to which the author sends manuscripts have the same review system, i.e., all of them accept the manuscripts of this author with the same probability

1 - q

. More realistic is the situation with a random parameter q:

Q (z) = \int_{0}^{1} \frac{q}{1 - (1 - q) z} d Ξ (q),

where

Ξ

is a probability distribution on

[0, 1]

interval and then

I P {X = k} = I E (q {(1 - q)}^{k})

.

Let us go back to (1). How large may be the time spent by a scientist to publish a corresponding number of papers? Of course, this time is a random variable T and we are interested in its distribution. The usual assumption on the working time is its exponential distribution with parameter

λ = I E T

and the Laplace transform

φ (t) = 1 / (1 + λ t)

. Suppose that the time needed for the publication of the j-th paper is

T_{j}

, and

T_{1}, T_{2}, \dots

are independent and identically distributed as T random variables. Then the time needed for all publications has the Laplace transform

\sum_{k = 1}^{\infty} φ^{k} (t) q {(1 - q)}^{k - 1} = \frac{1}{1 + λ t / q},

i.e., it has exponential distribution with the parameter

λ / q

.
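The identity for the Laplace transform is easy to check numerically. The sketch below truncates the series at a finite number of terms (an assumption made only for the computation) and compares it with the closed form; the function names and the parameter values are hypothetical.

```python
def total_time_laplace(t, lam, q, terms=2000):
    """Truncated series sum_{k>=1} phi(t)^k * q * (1 - q)^(k - 1)."""
    phi = 1.0 / (1.0 + lam * t)          # Laplace transform of one exponential time
    return sum(phi**k * q * (1.0 - q)**(k - 1) for k in range(1, terms + 1))


def closed_form(t, lam, q):
    """Laplace transform of an exponential law with mean lam / q."""
    return 1.0 / (1.0 + lam * t / q)


lam, q = 1.5, 0.2                        # illustrative values
for t in (0.0, 0.5, 2.0):
    print(t, total_time_laplace(t, lam, q), closed_form(t, lam, q))
```

The two columns should agree up to rounding, since the neglected terms are of order (1 - q) raised to the truncation index.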

It is natural to assume that each cited publication will produce some number of citations. Of course, the likelihood that the article will be quoted again depends on the number of previous citations.

Assumption 2.

Assume the probability that an article having

k - 1

(

k \geq 1

) citations will receive no new citations equals

p / k^{γ}

where p is the probability that the article will not be quoted for the first time. The parameter γ is responsible for the speed of convergence of the rejection probability to zero.

Consequently, the likelihood that the article will be quoted exactly k times equals

p / k^{γ} \prod_{j = 1}^{k - 1} (1 - p / j^{γ})

. For the case of

γ = 1

, the generating probability function for the number of citations of this article is

1 - {(1 - z)}^{p}

. The corresponding distribution function is named after Sibuya [7]. Below we consider the case of arbitrary positive

γ

. The corresponding study has general mathematical interest. Therefore, we provide it in a number of sections below.

3. Distribution of Citation Number of a Paper

Let us consider an ordered sequence of experiments

{E_{n}; n = 1, 2, \dots}

, where an event A may appear in each of the experiments with the probability

p_{n}

. Define a random variable X as the number of the first experiment in which A appears. We suppose that X is an improper random variable in the sense that it may take infinite value (that is, the event A will never appear). For the case

I P {X = \infty} = 0

we say that X is a proper random variable. It is clear that, since we define any product from 1 to 0 to be 1,

I P {X = n} = p_{n} \cdot \prod_{k = 1}^{n - 1} (1 - p_{k})

(2)

and

I P {X = \infty} = lim_{n \to \infty} \prod_{k = 1}^{n - 1} (1 - p_{k}) .

Particular cases are:

The probabilities $p_{n} = p$ are constant. So (2) is

$I P {X = n} = p \cdot {(1 - p)}^{n - 1}, I P {X = \infty} = 0$

(3)

corresponding to the classical geometric distribution. Its tail is

$I P {X \geq n} = {(1 - p)}^{n - 1}, n = 1, 2, \dots$

Clearly, the tail and probabilities (3) decrease exponentially fast as n tends to infinity.
The probabilities are given by $p_{n} = p / n$ , where p is a number from the interval $(0, 1)$ . Equation (2) becomes

$I P {X = n} = \frac{p}{n} \cdot \prod_{k = 1}^{n - 1} (1 - \frac{p}{k}) .$

(4)

According to (4) X is a proper random variable and has, in this case, the Sibuya distribution with parameter $p \in (0, 1)$ with the following tail

$I P {X \geq n} = \frac{Γ (n - p)}{Γ (n) \cdot Γ (1 - p)} \sim \frac{1}{Γ (1 - p) \cdot n^{p}}$

which has a heavy power-law asymptotic as $n \to \infty$ . Such a distribution does not have a finite mean value. It is not difficult to see that

$I P {X = n} \sim p / (n^{p + 1} \cdot Γ (1 - p)), n \to \infty .$
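The Sibuya probabilities and their tail can be evaluated directly. The following sketch, with hypothetical function names, computes the probabilities from the product formula and the tail from the gamma-function expression, using log-gamma values from the standard library to avoid overflow for large n.

```python
import math


def sibuya_pmf(n, p):
    """P{X = n} = (p / n) * prod_{k=1}^{n-1} (1 - p / k)."""
    prob = p / n
    for k in range(1, n):
        prob *= 1.0 - p / k
    return prob


def sibuya_tail(n, p):
    """P{X >= n} = Gamma(n - p) / (Gamma(n) * Gamma(1 - p))."""
    return math.exp(math.lgamma(n - p) - math.lgamma(n) - math.lgamma(1.0 - p))


p = 0.5                                   # illustrative parameter in (0, 1)
for n in (1, 2, 5, 10):
    # The probability should equal the difference of consecutive tails.
    print(n, sibuya_pmf(n, p), sibuya_tail(n, p) - sibuya_tail(n + 1, p))

# Ratio of the tail to its power-law approximation 1 / (Gamma(1 - p) * n^p).
n = 10**6
print(sibuya_tail(n, p) * math.gamma(1.0 - p) * n**p)
```

The last printed ratio should be close to one, which illustrates the power-law decay of the tail with exponent p.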

The presented distributions can be regarded as a kind of “extreme points” from the perspective of the tail behavior of a proper random variable X. Hence, it is natural to study, roughly speaking, the cases “between them”; for example, to consider the situations when

p_{n} = p / n^{γ}

, with

p \in (0, 1)

and

γ > 0

. As it was mentioned above, the parameter

γ

is responsible for the speed of convergence of the rejection probability to zero.

4. Main Result on Citation Number Distribution

We study the asymptotic behavior of the probabilities (2) for

p_{n} = p / n^{γ}

with

γ \geq 0

. In addition to the previously discussed values

γ = 0

or

γ = 1

, we distinguish the following two cases:

(A): $0 < γ < 1$ ;
(B): $γ > 1$ .

Let us consider the case (A). We have

I P {X = n} = \frac{p}{n^{γ}} \cdot \prod_{k = 1}^{n - 1} (1 - \frac{p}{k^{γ}}) .

(5)

Consider the product on the right-hand side of (5) in more detail.

\prod_{k = 1}^{n - 1} (1 - \frac{p}{k^{γ}}) = exp \{\sum_{k = 1}^{n - 1} log (1 - p / k^{γ})\} = exp \{- \sum_{k = 1}^{n - 1} \sum_{j = 1}^{\infty} \frac{p^{j}}{j k^{γ j}}\}

= exp \{- \sum_{j = 1}^{\infty} \frac{p^{j}}{j} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}}\} = exp \{- \sum_{j = 1}^{[1 / γ] + 1} \frac{p^{j}}{j} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}}\} exp \{- \sum_{j = [1 / γ] + 2}^{\infty} \frac{p^{j}}{j} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}}\} .

(6)

Here

[1 / γ]

is the integer part of

1 / γ

. It is not difficult to see that

exp \{- \sum_{j = [1 / γ] + 2}^{\infty} (p^{j} / j) \sum_{k = 1}^{n - 1} k^{- γ j}\}

has a finite positive limit as

n \to \infty

. This limit may depend on p and

γ

. Let us denote it by

C_{1} = C_{1} (γ, p)

. Therefore,

\prod_{k = 1}^{n - 1} (1 - \frac{p}{k^{γ}}) \sim C_{1} exp \{- \sum_{j = 1}^{[1 / γ] + 1} \frac{p^{j}}{j} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}}\} as n \to \infty .

(7)

Relations (5) and (7) give us

I P {X = n} \sim C_{1} \cdot \frac{p}{n^{γ}} \cdot exp \{- \sum_{j = 1}^{[1 / γ] + 1} \frac{p^{j}}{j} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}}\} as n \to \infty .

(8)

For

0 < γ j < 1

the following asymptotic representation is known

\sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}} = \frac{n^{1 - γ j}}{1 - γ j} + ζ (γ j) + o (1) as n \to \infty,

(9)

where

ζ (u)

is Riemann zeta function. Further considerations depend on properties of the number

γ

.

(i): Suppose that $1 / γ$ is not integer. Then $γ \cdot [1 / γ] < 1$ and

$\sum_{j = 1}^{[1 / γ] + 1} \frac{p^{j}}{j} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}} = \sum_{j = 1}^{[1 / γ]} \frac{n^{1 - γ j}}{1 - γ j} \frac{p^{j}}{j} + \sum_{j = 1}^{[1 / γ]} ζ (γ j) \frac{p^{j}}{j} + \frac{p^{[1 / γ] + 1}}{[1 / γ] + 1} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ ([1 / γ] + 1)}} + o (1) .$

(10)

However, $γ ([1 / γ] + 1) > 1$ and, therefore,

$lim_{n \to \infty} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ ([1 / γ] + 1)}} = \sum_{k = 1}^{\infty} \frac{1}{k^{γ ([1 / γ] + 1)}} < \infty .$

From this and (10) it follows

$I P {X = n} \sim C_{2} \cdot \frac{p}{n^{γ}} \cdot exp \{- \sum_{j = 1}^{[1 / γ]} \frac{n^{1 - γ j}}{1 - γ j} \cdot \frac{p^{j}}{j}\},$

(11)

where $C_{2}$ depends on p and $γ$ only.
(ii): Suppose that $1 / γ$ is a positive integer. Then $γ [1 / γ] = 1$ , $γ ([1 / γ] + 1) = 1 + γ$ and

$\sum_{j = 1}^{[1 / γ] + 1} \frac{p^{j}}{j} \sum_{k = 1}^{n - 1} \frac{1}{k^{γ j}} = \sum_{j = 1}^{[1 / γ] - 1} \frac{n^{1 - γ j}}{1 - γ j} \frac{p^{j}}{j} + \sum_{j = 1}^{[1 / γ] - 1} ζ (γ j) \frac{p^{j}}{j} + \frac{p^{[1 / γ]}}{[1 / γ]} \sum_{k = 1}^{n - 1} \frac{1}{k} + \frac{p^{[1 / γ] + 1}}{[1 / γ] + 1} \sum_{k = 1}^{n - 1} \frac{1}{k^{1 + γ}} + o (1) .$

(12)

It is known that

$lim_{n \to \infty} \sum_{k = 1}^{n - 1} \frac{1}{k^{1 + γ}} = \sum_{k = 1}^{\infty} \frac{1}{k^{1 + γ}} < \infty$

and

$\sum_{k = 1}^{n - 1} \frac{1}{k} = log (n) + γ_{e} + o (1),$

where $γ_{e}$ is Euler’s constant. Therefore,

$I P {X = n} \sim C_{3} \cdot \frac{p}{n^{γ + p^{[1 / γ]} / [1 / γ]}} \cdot exp \{- \sum_{j = 1}^{[1 / γ] - 1} \frac{n^{1 - γ j}}{1 - γ j} \cdot \frac{p^{j}}{j}\} as n \to \infty .$

(13)

Now we see that the asymptotic behavior of the probability

I P {X = n}

in the case (A) is given by (11) and (13). From the relations (11) and (13) it follows

I P {X = \infty} = lim_{n \to \infty} \prod_{k = 1}^{n - 1} (1 - p / k^{γ}) = 0,

so that X is a proper random variable.

Denote by

b_{m} = \prod_{k = 1}^{m - 1} (1 - p / k^{γ}) .

For the distribution tail

T_{m}

we have

T_{m} = \sum_{n = m}^{\infty} I P {X = n} = (b_{m} - b_{m + 1}) + \dots + (b_{s} - b_{s + 1}) + \dots = b_{m} .

Particularly,

\sum_{n = 1}^{\infty} I P {X = n} = 1 .

If

1 / γ

is not a positive integer, then

T_{m} = \prod_{k = 1}^{m - 1} (1 - p / k^{γ}) \sim C_{4} \cdot exp \{- \sum_{j = 1}^{[1 / γ]} \frac{m^{1 - γ j}}{1 - γ j} \cdot \frac{p^{j}}{j}\}, as m \to \infty,

(14)

where

C_{4}

depends on p and

γ

. Similarly, for the case of integer

1 / γ

,

T_{m} \sim C_{5} \cdot \frac{p}{m^{p^{[1 / γ]} / [1 / γ]}} \cdot exp \{- \sum_{j = 1}^{[1 / γ] - 1} \frac{m^{1 - γ j}}{1 - γ j} \cdot \frac{p^{j}}{j}\} as m \to \infty .

(15)

Let us consider the case (B). We have

I P {X = n} = \frac{p}{n^{γ}} \cdot \prod_{k = 1}^{n - 1} (1 - \frac{p}{k^{γ}}),

(16)

where

γ > 1

. Transform the product in the right-hand-side:

\begin{matrix} b_{n} = \prod_{k = 1}^{n - 1} (1 - \frac{p}{k^{γ}}) = exp \{\sum_{k = 1}^{n - 1} log (1 - p / k^{γ})\} \\ = exp \{- \sum_{k = 1}^{n - 1} \sum_{j = 1}^{\infty} p^{j} / (j k^{γ j})\} \geq exp \{- \sum_{k = 1}^{n - 1} \sum_{j = 1}^{\infty} p^{j} / k^{γ j}\} \\ = exp \{- \sum_{k = 1}^{n - 1} p / (k^{γ} - p)\} \geq exp \{- \sum_{k = 1}^{\infty} p / (k^{γ} - p)\} . \end{matrix}

The series under the exponential sign converges because

γ > 1

. Since the sequence b_{n} is decreasing and bounded from below, its limit exists, and from the last relation we see that

I P {X = \infty} = lim_{n \to \infty} b_{n} \geq exp \{- \sum_{k = 1}^{\infty} p / (k^{γ} - p)\} > 0,

(17)

and X is an improper random variable.

Therefore, for conditional probabilities we have

I P {X = n | X < \infty} \sim C_{6} \frac{p}{n^{γ}} as n \to \infty,

(18)

where

C_{6}

depends on p and

γ

only.
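The behavior in case (B) can be illustrated numerically. The sketch below, with hypothetical function names and illustrative parameter values, evaluates the partial products b_n for a large n as an approximation of I P {X = ∞}. It also evaluates exp{- Σ p / (k^γ - p)}, which bounds the limit from below because - log (1 - x) ≤ x / (1 - x) for 0 ≤ x < 1. Finally, it shows that n^γ I P {X = n} / p, which equals b_n, stabilizes as n grows.

```python
import math


def survival_product(n, p, gamma):
    """b_n = prod_{k=1}^{n-1} (1 - p / k**gamma)."""
    b = 1.0
    for k in range(1, n):
        b *= 1.0 - p / k**gamma
    return b


def lower_bound(n, p, gamma):
    """exp(-sum_{k=1}^{n-1} p / (k**gamma - p)), a lower bound for b_n."""
    return math.exp(-sum(p / (k**gamma - p) for k in range(1, n)))


p, gamma = 0.5, 2.0                       # illustrative values with gamma > 1
n_large = 200_000                         # truncation level, an assumption
print("approximate P{X = infinity}:", survival_product(n_large, p, gamma))
print("lower bound:", lower_bound(n_large, p, gamma))

for n in (10, 100, 1000):
    pmf_n = p / n**gamma * survival_product(n, p, gamma)
    print(n, pmf_n * n**gamma / p)        # approaches the limit of b_n
```

For γ = 2 the infinite product has the closed form sin(π√p) / (π√p), which gives an independent check of the truncated value.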

Summarizing, we obtain the following theorem

Theorem 1.

For the considered experiment scheme with probabilities given in (5) the following statements are true:

If $γ = 0$ then $I P {X = n} = p {(1 - p)}^{n - 1}$ , $n = 1, 2, \dots$ .
If $0 < γ < 1$ and $1 / γ$ is not a positive integer then

$I P {X = n} \sim C_{2} \cdot \frac{p}{n^{γ}} \cdot exp \{- \sum_{j = 1}^{[1 / γ]} \frac{n^{1 - γ j}}{1 - γ j} \cdot \frac{p^{j}}{j}\} as n \to \infty .$

(19)

If $0 < γ < 1$ and $1 / γ$ is a positive integer then

$I P {X = n} \sim C_{3} \cdot \frac{p}{n^{γ + p^{[1 / γ]} / [1 / γ]}} \cdot exp \{- \sum_{j = 1}^{[1 / γ] - 1} \frac{n^{1 - γ j}}{1 - γ j} \cdot \frac{p^{j}}{j}\} as n \to \infty .$

(20)
If $γ = 1$ then

$I P {X = n} \sim p / (n^{p + 1} Γ (1 - p)), n \to \infty .$

(21)
If $γ > 1$ then

$I P {X = n | X < \infty} \sim C_{6} \frac{p}{n^{γ}} as n \to \infty,$

(22)

and

$I P {X = \infty} \geq exp \{- \sum_{k = 1}^{\infty} p / (k^{γ} - p)\} > 0 .$

(23)

All constants $C_{1}, \dots, C_{6}$ depend on the parameters p and γ only.

One of the reviewers of the first version of the paper advised us to study the form of the constants in some particular cases. We are very grateful to him for the advice. Below we consider the case

γ \in (1 / 2, 1)

. In this case

[1 / γ] = 1

so that the sum under the exponential sign in (19) contains only one summand. Calculations similar to those given above lead to the following expression

I P {X = n} = \frac{p}{n^{γ}} exp \{- \frac{p}{1 - γ} n^{1 - γ} - \sum_{k = 1}^{\infty} \frac{p^{k}}{k} ζ (k γ) + o (1)\} .

In other words, the constant

C_{2}

has form

C_{2} = exp \{- \sum_{k = 1}^{\infty} \frac{p^{k}}{k} ζ (k γ)\} > 0 .

However, precise calculation of all the other constants is rather difficult. We do not need these constants for the aims of this paper and omit any further calculations of constants.

5. Comments

Theorem 1 shows that for

0 \leq γ < 1

, the tail of the corresponding distribution is not heavy. Namely, the distribution has finite moments of all positive orders. However, the tail becomes heavier with growing

γ \in [0, 1)

. In the case of

γ \in [0, 1]

the distribution is unimodal with mode equal to 1. For the values

γ \in [1, \infty)

, the distribution has a power-type tail, which is heavier than the ones occurring for

γ \in [0, 1)

. In the case

γ \in [1, 2)

the conditional distribution under condition

X < \infty

does not have the finite mean. However, for growing values of

γ \in [1, \infty)

the tails of conditional distributions look to be less heavy. In the case of

γ \in [1, \infty)

the conditional distribution has mode at 1.

6. The Case of Growing $p_{n}$

Above, we considered the case where the probability of the event A decreases as the experiment number increases. For completeness, consider the case where this probability increases.

Namely, suppose that in (2)

p_{n} = 1 - q / n^{γ}

for

q \in (0, 1)

and

γ > 0

. Then

I P {X = n} = (1 - q / n^{γ}) \prod_{k = 1}^{n - 1} \frac{q}{k^{γ}} = \frac{q^{n - 1}}{{((n - 1)!)}^{γ}} - \frac{q^{n}}{{(n!)}^{γ}} .

(24)

It is clear that

I P {X = \infty} = 0

, and the tail of the distribution

T_{m} = \frac{q^{m - 1}}{{(Γ (m))}^{γ}}

is a quickly decreasing function of m. Of course, distribution of X has finite moments of all orders and it may have a mode not only at 1.
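Formula (24) can be checked against the direct product, and the location of the mode can be inspected for particular parameters. The following is a small sketch with hypothetical function names and illustrative values of q and γ.

```python
import math


def pmf_direct(n, q, gamma):
    """P{X = n} = (1 - q / n**gamma) * prod_{k=1}^{n-1} q / k**gamma."""
    prob = 1.0 - q / n**gamma
    for k in range(1, n):
        prob *= q / k**gamma
    return prob


def pmf_closed(n, q, gamma):
    """Closed form (24): q^(n-1) / ((n-1)!)^gamma - q^n / (n!)^gamma."""
    return (q ** (n - 1) / math.factorial(n - 1) ** gamma
            - q**n / math.factorial(n) ** gamma)


q, gamma = 0.9, 0.5                       # illustrative values
probs = [pmf_closed(n, q, gamma) for n in range(1, 61)]
print("total mass up to n = 60:", sum(probs))
print("mode:", 1 + probs.index(max(probs)))
for n in (1, 2, 3, 4):
    print(n, pmf_direct(n, q, gamma), pmf_closed(n, q, gamma))
```

For these particular values the largest probability is not attained at n = 1, in line with the remark that the mode need not be at 1.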

7. Back to the Distribution of Citation Number of One Author

We suppose now that the distribution of citation number of one paper has the form (5):

I P {X = n} = \frac{p}{n^{γ}} \cdot \prod_{k = 1}^{n - 1} (1 - \frac{p}{k^{γ}}), n = 1, 2, \dots

with

γ > 0

. Corresponding probability generating function is

P (z) = \sum_{n = 1}^{\infty} z^{n} I P {X = n} .

(25)

As was mentioned above, the number of cited papers is distributed according to the geometric law with probability generating function (1):

Q (z) = \frac{q}{1 - (1 - q) z}, q \in (0, 1) .

The probability generating function of the citation number of one author equals the composition of

P

and Q, i.e., it is

P (Q (z))

. It is clear that the tail of corresponding distribution is not heavy for

γ \in [0, 1)

, it is heavy for

γ = 1

, and the distribution is improper for

γ > 1

.

Although the case of improper distribution seems to be not realistic, we discuss it for some particular cases below, after consideration of proper cases

γ \in [0, 1]

.

Recall that the case

γ \in (0, 1)

leads to the light tailed distributions while

γ = 1

leads to the laws with the heavy tail. The choice between models with light or heavy tails can only be made based on real data. Below we analyze some data of this kind.

7.1. Analyzing Data from Scholar Google “Mathematics”

Let us give the data for the part “Mathematics” on 16 February 2020 (see Table 1). The data concern the 10 authors with the largest numbers of citations. We do not give the names of these scientists. The table shows:

Table 1. Citations “Mathematics”.

The serial number of the author;
The total number of citations by the author;
Hirsch Index;
The number of citations of the most popular work (By the most popular work we understand the work of this author having the largest number of citations among the works of this scientist);
Ratio of citations to squared Hirsch index;

Table 1 shows the first scientist has 2.76 times more citations than the second. In other words, the maximum of the observations is essentially greater than previous one. This observation leads us to think that the corresponding distribution has heavy tails (see [8,9]). As we have seen, it is possible for the case

γ = 1

only.

7.2. Analyzing Data from Scholar Google “Biostatistics"

Let us give the data for the part “Biostatistics" on 16 February 2020 (see Table 2). The structure of Table 2 is the same as that of Table 1.

Table 2. Citations “Biostatistics".

Table 2 shows the first scientist has 1.59 times more citations than the second. Although this ratio is smaller than in the case of Table 1, it is large enough to support our hypothesis on the presence of a heavy tail.

We do not give the data on the part “Statistics” but mention the situation is similar to that of the Table 1 and Table 2.

7.3. Final Model for the Distribution of Citations

From the considerations of the two previous subsections, it follows that the most natural way to describe the distribution of citations is to choose

γ = 1

. This means

P (z) = 1 - {(1 - z)}^{p}, Q (z) = \frac{q}{1 - (1 - q) z}

and the probability generation function of citations distribution is given by

R (z) = P (Q (z)) = 1 - {(1 - \frac{q}{1 - (1 - q) z})}^{p} .

Denote by Y the number of citations of a given scientist. It is clear that

I P {Y = n}

may be found as the n-th coefficient of expansion

R (z)

in power series. We have

\begin{matrix} R (z) & = 1 - {(1 - q)}^{p} {(1 - z)}^{p} {(1 - (1 - q) z)}^{- p} \\ & = 1 - {(1 - q)}^{p} \sum_{s = 0}^{\infty} {(- 1)}^{s} (\sum_{m = 0}^{s} \binom{- p}{m} \binom{p}{s - m} {(1 - q)}^{m}) z^{s} \\ & = 1 - {(1 - q)}^{p} + {(1 - q)}^{p} \sum_{s = 1}^{\infty} {(- 1)}^{s + 1} \binom{p}{s} {}_{2} F_{1} (p, - s, 1 + p - s, 1 - q) z^{s}, \end{matrix}

where

_{2} F_{1}

is a hypergeometric function. Therefore,

\begin{matrix} I P {Y = 0} & = 1 - {(1 - q)}^{p}; \\ I P {Y = s} & = {(- 1)}^{s + 1} {(1 - q)}^{p} \binom{p}{s} {}_{2} F_{1} (p, - s, 1 + p - s, 1 - q), s = 1, 2, \dots \end{matrix}

(26)

It is possible to verify that

I P {Y = 0} > I P {Y = 1} > I P {Y = s}

for all integers

s \geq 2

. Therefore, with maximal probability we meet a scientist with no citations. If we restrict ourselves to scientists having at least one citation, then the highest probability corresponds to authors with one citation.
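The coefficients of R(z) can also be obtained numerically, without evaluating the hypergeometric function, by multiplying the binomial series of (1 - z)^p and (1 - (1 - q) z)^{-p} term by term. The sketch below does this for illustrative values of p and q; the function name `citation_pmf` is hypothetical.

```python
def citation_pmf(p, q, n_max):
    """Coefficients of R(z) = 1 - (1-q)^p (1-z)^p (1-(1-q)z)^(-p) up to z^n_max."""
    c = 1.0 - q
    a = [1.0]   # coefficients of (1 - z)^p
    b = [1.0]   # coefficients of (1 - c z)^(-p)
    for j in range(1, n_max + 1):
        a.append(a[-1] * (j - 1 - p) / j)
        b.append(b[-1] * c * (p + j - 1) / j)
    scale = c**p
    pmf = [1.0 - scale]                   # P{Y = 0}
    for s in range(1, n_max + 1):
        conv = sum(a[j] * b[s - j] for j in range(s + 1))
        pmf.append(-scale * conv)         # P{Y = s}, s >= 1
    return pmf


p, q = 0.5, 0.3                           # illustrative values
probs = citation_pmf(p, q, 50)
print(probs[:4])
print("partial sum up to s = 50:", sum(probs))
```

The partial sums approach one only slowly, which is the numerical counterpart of the heavy tail of the distribution of Y.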

The Laplace transform of the distribution of Y has form

R (e^{- t}) = 1 - {(1 - \frac{q}{1 - (1 - q) e^{- t}})}^{p}, t \in [0, \infty) .

Its asymptotic as

t \to 0

is

1 - R (e^{- t}) \sim {(\frac{1 - q}{q})}^{p} \cdot t^{p}, as t \to + 0 .

(27)

This relation shows that the random variable Y has moments of order less than p and does not have moments of higher order. Because

p < 1

the variable Y has infinite mean. In practice, this means that some scholars have a very large number of citations. These citations refer to publications by a relatively small number of scholars. Of course, the data in Table 1 and Table 2 are in agreement with these statements. It is important that the model is built on the assumption of the same capabilities of scientists. Even so, we must observe a greater variability in the number of citations of their publications. Thus, the difference in the number of citations can be purely random and not say anything about the real contribution of the scientist into corresponding science field.

Of course, the proposed model is very idealistic, since it does not take into account the real difference in the capabilities of scientists, as well as in their equipping with the necessary tools and equipment. Taking into account the noted differences is likely to lead to the need to consider mixtures of the proposed distributions with different parameters p and q. However, such a complication will not make it possible to distinguish scientists with a large contribution to science from those with a smaller impact.

Surely, the arguments presented for the choice of

γ = 1

are rather crude, i.e., in reality, it may happen that

γ

is close to unity. In this case the distribution tail is not heavy, but over a very large (but finite) interval it is close to a heavy one. So, qualitatively, our conclusions will remain unchanged.

Based on the foregoing, we conclude that it is practically senseless to use the number of citations of a scientist’s work to assess his contribution to science.

7.4. Remarks on the Model with $γ > 1$

In this subsection, we try to justify the possibility of using models with γ greater than one. As already noted, in this model the probability

I P {Y = \infty}

is not equal to zero. It is unlikely that this corresponds to the situation with the consideration of all scientists working in this field of science. However, a very long citation process (ideally, endless) is quite possible in the case of the most prominent scientists. For example, in the field of Mathematics, the works of Professor Andrei Nikolaevich Kolmogorov (1903–1987) continue to be cited. Over the past 15 years, they have been cited about 30,000 times, although more than 30 years have passed since the death of their author. It is highly probable that the citation process for these works will continue for a long time.

In addition, the concept of citation is somewhat arbitrary in our opinion. For example, in Mathematics, some theorems or other objects bear the names of scientists who were related to their preparation. Does the mention of these theorems and the corresponding names in some articles mean their citation? For example, many articles and books mention the Gaussian distribution without reference to the corresponding publication by Gauss. Is this mention a quotation? It seems to us that such kind of nominal results are not counted in determining the citation index. However, they certainly indicate the scientific significance of the result. It is very likely that for accounting for citations of this kind, models with a

γ

greater than 1 may be required.

8. Hirsch Index

Recall that the definition of the Hirsch index was given in the Introduction. Hirsch states that the proposed index h is intended to rank authors of articles in the field of Physics. At the same time, it is noted that the index can be used in other fields of science. Since the number of citations is used in determining the index h, it seems plausible that h is associated with this number. Hirsch notes that the number of citations is given by

N = κ h^{2}

. He wrote: “I find empirically that

κ

ranges between 3 and 5” (We change notations of Hirsch. Namely, his a is our

κ

.). Further, Hirsch wrote: “

κ > 5

is very atypical value”.
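The coefficient κ is simply the ratio of the total number of citations to the squared Hirsch index. A minimal sketch, assuming hypothetical values for an author (not taken from the tables of this paper), computes κ and checks it against the range from 3 to 5 quoted above.

```python
def kappa(total_citations, h):
    """Ratio N / h**2 of the total number of citations to the squared Hirsch index."""
    if h <= 0:
        raise ValueError("the Hirsch index must be positive")
    return total_citations / h**2


def in_hirsch_range(value, low=3.0, high=5.0):
    """True if the ratio lies in the empirical range quoted from Hirsch."""
    return low <= value <= high


# Hypothetical author with 12,000 citations and Hirsch index 40.
k = kappa(12_000, 40)
print(k, in_hirsch_range(k))
```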

Below we show that the Hirsch statements presented here are doubtful. In addition, the use of this index seems unreasonable.

Let’s start by analyzing the data in Table 1 and Table 2. Recall that column 5 gives the corresponding values of

κ

. Table 1 does not contain any

κ \leq 6

while Table 2 has only one value not exceeding 5, namely

κ = 4.69

. Other values of

κ

are “very atypical”, especially for Table 1. Table 2 contains 2 values of

κ \in (5, 6)

. Therefore, at least for such fields as “Mathematics” and “Biostatistics”, Hirsch’s conclusion about the “typical” form of proportionality between the number of citations of an author and the square of corresponding Hirsch’s index seems to be incorrect. However, was Hirsch right in the field of “Physics"?

8.1. Data in “Physics”

Now we give the data on field “Physics”, arranging them into a table in the same way as for Table 1.

Again, Table 3 has only one

κ \leq 5

, namely

κ = 4.88

. However, there are six values

κ \in (5, 6)

. The κ values for the “Physics” area look smaller than for the “Biostatistics” area and significantly smaller than for the “Mathematics” area. The value of the Hirsch index for Physics has much less variability than for Biostatistics and Mathematics. The differences in citation numbers are much greater for Mathematics than in the case of Physics.

Table 3. Citations “Physics".

So, we see that Hirsch’s understanding of the situation in Physics is closer to reality than in the case of Biostatistics and, especially, Mathematics.

8.2. Data Comparison

Continue the analysis of the data in Table 1, Table 2 and Table 3.

The average value of the Hirsch index in the case of Table 1 is 99.3 with a standard deviation of 66.45. The same indicators for Table 2 are 153.8 and 47.97, and for Table 3—198.2 and 21.73. We see that the standard deviation of the Hirsch index in the case of Mathematics is three times greater than in the case of Physics. On the contrary, the average value of the index is maximum in the case of Physics and minimum in the case of Mathematics. This shows that if Hirsch index is useful in the field of Physics, then its usefulness in the field of Mathematics is doubtful. Probably, it is true for Biostatistics too.

Authors with a higher Hirsch index are often inferior to others in the number of citations of the most popular works. For example, in Table 1, Author 1, having the highest Hirsch index, is inferior to Authors 2, 4, 5, 6 and 7 in the number of citations of the most popular work. In this case, Author 1 wrote his most cited work with co-authors, while author 2 did without co-authors.

It is clear that the Hirsch index does not exceed the number of cited publications of the author, which has a geometric distribution. Thus, the distribution of the Hirsch index has a light tail. Since the number of citations has a heavy tail, it is more variable than the Hirsch index. However, these two indicators are stochastically strongly related. Indeed, for the data in Table 1, the sample correlation coefficient between these indicators is

ρ_{1} = 0.94

. On the other hand, the correlation coefficient between the Hirsch index and the number of citations of the most popular works is

ρ_{2} = - 0.23

. This coefficient indicates a weak relationship between the indicators, and it is negative. In other words, a large Hirsch index is most likely not found among authors with highly cited individual articles. For Table 2, the values of the correlation coefficients are

ρ_{1} = 0.702

,

ρ_{2} = 0

, and for Table 3

ρ_{1} = 0.36

,

ρ_{2} = - 0.57

.

The increase in the Hirsch index with a decrease in the number of citations of the most popular work may result from the division of the work into a series of publications. However, when assessing the quality of a scientist’s contribution, one should take into account that the publication of a series of articles instead of one may be caused not by a desire to increase the number of publications, but, for example, by a gradual insight into the essence of the problem under consideration. Such insight often requires a very long time, i.e., publication of a series of articles is justified. It should be noted that the publication of a series of articles naturally leads to an increase in the number of self-citations. This increase cannot be considered as a flaw of the author and does not mean attempts to artificially increase the number of citations. At the same time, the presence of a series of publications (which increases the Hirsch index) cannot be considered as preferable to one highly cited work.

The presence of higher values of the Hirsch index in Physics compared to Mathematics can be explained by the use in modern Physics of expensive equipment in experimental Physics and/or the results obtained on it in theoretical Physics. Often this equipment is used by some laboratory or scientific group, and then transferred to another or others. After some time, this equipment again becomes available to the first group. Thus, new experimental facts arrive intermittently, and during the break they are processed and published. A theoretical analysis of the observed facts is also taking place. Then comes new information related to new experiments. Therefore, the very flow of information (both experimental and theoretical) contributes to the publication of not a single article, but a series of articles. This circumstance leads to an increase in the Hirsch index with a relative decrease in the number of citations of popular works.

A similar situation is absent in Pure Mathematics, so there are far fewer reasons for a series of articles to appear there. Separate works appear, which often cover a substantial part of the problem under consideration. They generate a stream of citations of this particular work rather than of a series of works. Thus, the Hirsch index becomes smaller than it would be if a series of articles were published instead of this one, but the most popular work receives more citations than each individual work in the series.

So, the use of the Hirsch’s index has some basis in the field of Physics, but it is not related to what is happening in Mathematics.

For some areas of Applied Mathematics, a situation may be observed that is intermediate between what is happening in Physics and in Pure Mathematics.

However, it is not clear to us why the Hirsch index should not be replaced with two indexes. The first of these could be the number of all citations, and the second the number of citations of the most popular work. The Hirsch index is stochastically quite closely linked to the number of all citations, so it and this number are “interchangeable”. However, after the termination of the work of a scientist in a given field of science, the number of his publications does not increase and, therefore, the Hirsch index remains limited, while the number of citations can continue to grow without bound. This is exactly what happens with the works of the most outstanding scientists of the past.

9. Distribution of the Hirsch Index

In this section, we obtain the probability distribution of the Hirsch index.

We introduce some notation. It is clear that the Hirsch index is a random variable. Let us denote it by $H$ and its values by $h$. Our aim here is to determine the probabilities $\mathbb{P}\{H = h\}$. In order for the event $\{H = h\}$ to occur, it is necessary and sufficient that:

(a): no fewer than $h$ works were published;
(b): $h$ of the published works are cited at least $h$ times each, and the rest are cited fewer than $h$ times.

Suppose that $l$ works are published, and $l \geq h$. The probability of this event is $q (1 - q)^{l}$. Recall that the probability that a published work will be cited $k$ times equals $\frac{p}{k} \prod_{j = 1}^{k - 1} (1 - p / j)$. Therefore, the probability that a published work will be cited at least $h$ times equals

$$\sum_{k = h}^{\infty} \frac{p}{k} \prod_{j = 1}^{k - 1} \left(1 - \frac{p}{j}\right) = \frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p)},$$

where $\Gamma$ is the Euler gamma function.

The probability that a published work will be cited fewer than $h$ times is

$$1 - \frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p)}.$$
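The closed form for the tail probability can be checked numerically. The sketch below, which assumes $0 < p < 1$, compares the gamma-function expression with one minus the finite sum of the probabilities of being cited $1, \ldots, h - 1$ times. The function names are ours, and the gamma ratio is evaluated on the logarithmic scale with `math.lgamma` to avoid overflow for large $h$.

```python
import math


def citation_pmf(k, p):
    """P{a published work is cited k times} = (p/k) * prod_{j=1}^{k-1} (1 - p/j)."""
    prod = 1.0
    for j in range(1, k):
        prod *= 1.0 - p / j
    return (p / k) * prod


def tail_closed_form(h, p):
    """Gamma(h - p) / (Gamma(h) * Gamma(1 - p)), computed via log-gamma."""
    return math.exp(math.lgamma(h - p) - math.lgamma(h) - math.lgamma(1.0 - p))


def tail_by_complement(h, p):
    """P{at least h citations} as 1 minus the probabilities of 1, ..., h-1 citations."""
    return 1.0 - sum(citation_pmf(k, p) for k in range(1, h))


for h in (1, 2, 5, 50):
    print(h, tail_closed_form(h, 0.5), tail_by_complement(h, 0.5))
```

The complement is used instead of the infinite sum because the tail decays only like $h^{-p}$, so a truncated direct sum would converge very slowly.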

Thus, the probability that $l$ papers are published and the Hirsch index $H$ takes the value $h$ is

$$q (1 - q)^{l} \binom{l}{h} \left(\frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p)}\right)^{h} \left(1 - \frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p)}\right)^{l - h}.$$

Now we see that

$$\mathbb{P}\{H = h\} = \sum_{l = h}^{\infty} q (1 - q)^{l} \binom{l}{h} \left(\frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p)}\right)^{h} \left(1 - \frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p)}\right)^{l - h} = \left(\frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p) - \Gamma(h - p)}\right)^{h} \cdot q \cdot \frac{\mu^{h}}{(1 - \mu)^{h + 1}},$$

where

$$\mu = \left(1 - \frac{\Gamma(h - p)}{\Gamma(h) \, \Gamma(1 - p)}\right) \cdot (1 - q).$$
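The summation step can be spelled out as follows. This is only a sketch: the shorthand $a$ for the probability of at least $h$ citations is introduced here for brevity and is not part of the original notation, and we assume $0 < a < 1$. The only tool used is the negative binomial series $\sum_{m \geq 0} \binom{m + h}{h} x^{m} = (1 - x)^{-(h + 1)}$ for $|x| < 1$.

```latex
% Shorthand (introduced for this sketch only):
%   a = \Gamma(h-p) / (\Gamma(h)\,\Gamma(1-p)),   \mu = (1-a)(1-q).
\begin{align*}
\sum_{l=h}^{\infty} q (1-q)^{l} \binom{l}{h} a^{h} (1-a)^{l-h}
  &= q \, a^{h} (1-q)^{h} \sum_{m=0}^{\infty} \binom{m+h}{h} \bigl[(1-q)(1-a)\bigr]^{m} \\
  &= q \, a^{h} (1-q)^{h} \, \frac{1}{(1-\mu)^{h+1}} \\
  &= \Bigl(\frac{a}{1-a}\Bigr)^{h} \cdot q \cdot \frac{\mu^{h}}{(1-\mu)^{h+1}} .
\end{align*}
```

The last line uses $a^{h} (1 - q)^{h} = \bigl(a / (1 - a)\bigr)^{h} \mu^{h}$, and $a / (1 - a)$ is exactly the ratio $\Gamma(h - p) / \bigl(\Gamma(h) \, \Gamma(1 - p) - \Gamma(h - p)\bigr)$.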

So, the random variable $H$ has the following distribution:

$$\mathbb{P}\{H = h\} = (1 - \nu) \cdot \nu^{h},$$

where

$$\nu = \frac{(1 - q) \, \Gamma(h - p)}{q \, \Gamma(h) \, \Gamma(1 - p) + (1 - q) \, \Gamma(h - p)}.$$

Note that this distribution is not geometric, because the value of $\nu$ depends on $h$.
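As a numerical sanity check of this closed form, the following sketch compares it with a direct, truncated evaluation of the sum over the number $l$ of published works. It assumes $0 < p < 1$, $0 < q < 1$ and $h \geq 1$. The function names and the truncation point `l_max` are our own choices, not part of the model.

```python
import math


def tail_prob(h, p):
    """P{a published work is cited at least h times}, via log-gamma."""
    return math.exp(math.lgamma(h - p) - math.lgamma(h) - math.lgamma(1.0 - p))


def hirsch_pmf_closed(h, p, q):
    """Closed form (1 - nu) * nu**h, where nu depends on h."""
    a = tail_prob(h, p)
    nu = (1.0 - q) * a / (q + (1.0 - q) * a)
    return (1.0 - nu) * nu ** h


def hirsch_pmf_series(h, p, q, l_max=500):
    """Truncated sum over l of q (1-q)^l C(l,h) a^h (1-a)^(l-h)."""
    a = tail_prob(h, p)
    total = 0.0
    for l in range(h, l_max + 1):
        total += q * (1.0 - q) ** l * math.comb(l, h) * a ** h * (1.0 - a) ** (l - h)
    return total


for h in (1, 3, 5):
    print(h, hirsch_pmf_closed(h, 0.5, 0.3), hirsch_pmf_series(h, 0.5, 0.3))
```

The terms of the series decay geometrically in $l$, so a moderate truncation point is enough for the parameter values used here; for $q$ close to zero a larger `l_max` would be needed.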

Next, we are interested in estimating the tail of the distribution of $H$. To do this, we estimate the asymptotic behavior of $\nu$. An application of the Stirling formula easily gives

$$\nu = \nu(h) \sim \frac{1 - q}{q \, \Gamma(1 - p)} \cdot \frac{1}{h^{p}}, \quad h \to \infty.$$
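The asymptotic relation for $\nu(h)$ can be illustrated numerically. The sketch below, with function names of our own choosing and the illustrative values $p = q = 0.5$, evaluates the exact expression for $\nu$ and the asymptotic approximation, and prints their ratio for growing $h$.

```python
import math


def nu_exact(h, p, q):
    """nu = (1-q) G(h-p) / (q G(h) G(1-p) + (1-q) G(h-p)), with G the gamma function."""
    a = math.exp(math.lgamma(h - p) - math.lgamma(h) - math.lgamma(1.0 - p))
    return (1.0 - q) * a / (q + (1.0 - q) * a)


def nu_asymptotic(h, p, q):
    """Leading term (1-q) / (q G(1-p)) * h**(-p)."""
    return (1.0 - q) / (q * math.gamma(1.0 - p)) * h ** (-p)


for h in (10, 10**3, 10**6):
    ratio = nu_exact(h, 0.5, 0.5) / nu_asymptotic(h, 0.5, 0.5)
    print(h, ratio)
```

The ratio tends to one, but only at a rate of order $h^{-p}$, so the approximation is rough for small values of $h$.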

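This formula immediately leads us to an asymptotic expression for the logarithm of the probability $\mathbb{P}\{H = h\}$ as $h \to \infty$. Indeed, $\log \mathbb{P}\{H = h\} = \log(1 - \nu) + h \log \nu$ and $\log \nu \sim -p \log h$, so

$$\log \mathbb{P}\{H = h\} \sim -p \cdot h \cdot \log h, \quad h \to \infty.$$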

It follows that the probability of the event $\{H = h\}$ decreases faster than any exponential function as $h \to \infty$. Of course, the tail of the distribution of $H$ also decreases faster than any exponential function. Therefore, this distribution has moments of all orders. Note that the distribution of the number of citations of articles by this author has an infinite mean value. So, if an author has a fairly large number of citations, then the ratio of the number of citations to the square of the Hirsch index can be arbitrarily large. This fact contradicts Hirsch’s claim that $\kappa$ is bounded.
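To make the ratio concrete, the sketch below computes the number of citations divided by the square of the Hirsch index, assuming that $\kappa$ denotes exactly this ratio. The input pairs are taken from three rows of the first table at the end of the article, under the assumption (ours) that its second column is the total number of citations and its third column is the Hirsch index.

```python
def kappa(total_citations, hirsch_index):
    """Ratio of the total number of citations to the square of the Hirsch index."""
    if hirsch_index <= 0:
        raise ValueError("the Hirsch index must be positive")
    return total_citations / hirsch_index ** 2


# (total citations, Hirsch index) pairs; the column reading is an assumption.
rows = [(448557, 270), (138820, 64), (84918, 48)]
ratios = [round(kappa(n, h), 2) for n, h in rows]
print(ratios)
```

Under this reading, the rounded ratios agree with the values 6.15, 33.89 and 36.86 listed in the last column of that table, and they already differ by a factor of about six within ten authors.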

Author Contributions

Conceptualization, L.B.K.; investigation, L.B.K., Y.V.K. and Z.E.V. The authors have equally contributed to the writing, editing and style of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

The study was partially supported by grant GAČR 19-04412S (Lev Klebanov).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Garfield, E. Citation Indexes for Science. Science 1955, 122, 108–111.
2. Garfield, E. Citation Index in Sociological and Historical research. Curr. Contents 1969, 9, 42–46.
3. Garfield, E. The evolution of the Science Citation Index. Int. Microbiol. 2007, 10, 65–69.
4. Hirsch, J.E. An index to quantify an individual’s scientific research output. Proc. Natl. Acad. Sci. USA 2005, 102, 16569–16572.
5. Richter, M. Was misst der h-Index (nicht)?—Kritische Überlegungen zu einer populären Kennzahl für Forschungsleistungen. WiSt Wirtsch. Stud. 2018, 47, 64–68.
6. Klebanov, L.B. One look at the rating of scientific publications and corresponding toy-models. arXiv 2017, arXiv:1706.01238v1.
7. Sibuya, M. Generalized Hypergeometric, Digamma and Trigamma Distributions. Ann. Inst. Statist. Math. 1979, 31, 373–390.
8. Klebanov, L.B.; Antoch, J.; Karlova, A.; Kakosyan, A.V. Outliers and related problems. arXiv 2017, arXiv:1701.06642v1.
9. Volchenkova, I.V.; Klebanov, L.B. Characterization of the Pareto distribution by the properties of neighboring order statistics. Zap. Nauchnih Semin. POMI 2019, 486, 63–70. (In Russian)

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

1	2	3	4	5
1.	448,557	270	28,303	6.15
2.	162,457	98	44,406	16.92
3.	159,123	147	26,929	7.36
4.	138,820	64	110,393	33.89
5.	101,662	59	35,640	29.20
6.	99,206	78	41,647	16.31
7.	85,288	59	55,293	24.50
8.	84,918	48	18,901	36.86
9.	77,319	98	11,715	8.05
10.	73,989	72	17,153	14.27

1	2	3	4	5
1.	478,691	227	66,611	9.29
2.	301,786	132	59,613	17.32
3.	253,221	208	26,127	5.85
4.	223,038	218	10,184	4.69
5.	199,143	169	23,447	6.97
6.	178,855	117	39,271	13.07
7.	150,695	105	42,485	13.67
8.	119,199	111	20,666	9.67
9.	108,648	140	20,842	5.54
10.	100,491	111	30,315	8.16

1	2	3	4	5
1.	326,718	206	25,605	7.70
2.	259,321	223	7,275	5.21
3.	240,376	200	15,651	6.01
4.	232,057	206	26,535	5.47
5.	231,746	218	15,589	4.88
6.	227,530	206	15,684	5.36
7.	217,495	144	35,746	10.49
8.	200,565	191	11,807	5.50
9.	198,735	190	7,497	5.50
10.	197,679	198	25,649	5.04
