A Bibliometric Analysis of COVID-19 across Science and Social Science Research Landscape

Abstract: The lack of knowledge about the COVID-19 pandemic has encouraged extensive research in the academic sphere, reflected in the exponentially growing scientific literature. While the state of COVID-19 research reveals it is currently in an early stage of developing knowledge, a comprehensive and in-depth overview is still missing. Accordingly, the paper's main aim is to provide an extensive bibliometric analysis of COVID-19 research across the science and social science research landscape, using innovative bibliometric approaches (e.g., Venn diagram, Biblioshiny descriptive statistics, VOSviewer co-occurrence network analysis, Jaccard distance cluster analysis, text mining based on binary logistic regression). The bibliometric analysis considers the Scopus database, including all relevant information on COVID-19 related publications (n = 16,866) available in the first half of 2020. The empirical results indicate the domination of health sciences in terms of number of relevant publications and total citations, while physical sciences and social sciences and humanities lag behind significantly. Nevertheless, there is evidence of COVID-19 research collaboration within and between different subject area classifications with a gradual increase in importance of non-health scientific disciplines. The findings emphasize the great need for a comprehensive and in-depth approach that considers various scientific disciplines in COVID-19 research so as to benefit not only the scientific community but also evidence-based policymaking as part of efforts to properly respond to the COVID-19 pandemic.


Introduction
The world has seen two large-scale outbreaks of disease since the 2000s. Emerging in 2003 and 2012, respectively, these are Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS), which posed a threat around the world and claimed thousands of lives [1]. In December 2019, a new strain of coronavirus (SARS-CoV-2), not previously identified in humans, emerged in Wuhan City, in the Hubei province of China. The virus soon spread across countries, with the number of cases and deaths related to COVID-19 quickly exceeding the numbers of the two other coronaviruses (SARS-CoV-1 and MERS-CoV). This rapid spread of COVID-19 around the world led the World Health Organization (WHO) to declare it a pandemic on 11 March 2020 [2]. The COVID-19 pandemic is a typical public health emergency. Its high infection rate means it is a huge threat to global public health [3][4][5]. However, its rapid proliferation has not only affected the lives of many people on the planet, but also disrupted patterns of social and economic development, bringing incalculable social and economic losses [6]. Within just six months of the outset of the COVID-19 pandemic (by 1 July 2020), some 10.3 million cases and 0.5 million deaths were registered at the global level [7]. International institutions have therefore announced the global economy is now in recession as bad or

Materials and Methods
Comprehensive bibliometric data on COVID-19-related research were obtained in two consecutive phases, as presented in Figure 1. The first phase involved identifying all relevant documents from 1 January 2020 to 1 July 2020 in the Scopus database on document information, a database widely recognized in previous research [10,14,31,35]. The applied search query extended previous narrowly-defined queries [33,34] by including a broad range of COVID-19 related keywords: "novel coronavirus 2019", "coronavirus 2019", "COVID 2019", "COVID19", "COVID 19", "COVID-19", "SARS-CoV-2", "HCoV-19", "2019-nCoV" and "severe acute respiratory syndrome coronavirus 2". The keyword search was set to include titles, abstracts and keywords, and only documents in the English language were considered for the review process. With this search query, a total of 21,400 documents were identified as relevant in COVID-19 research. Interestingly, an identical search query run on 1 June 2020 had produced 13,480 documents, meaning the number of documents increased by 58.8% within one month. This implies that interest in COVID-19 research is growing rapidly. Because Scopus limits exports to 20,000 records at a time, the unique academic work identifier assigned in the Scopus bibliographic database (EID) was utilized to obtain basic citation metadata for the 21,400 documents (author(s); document title; year; source title; volume, issue, pages; citation count; source and document type).
Moreover, because Scopus limits exports of detailed document information (citation information, bibliographical information, abstract and keywords, funding details, other information) to 2000 records at a time, the EID was also used to split the relevant documents into smaller blocks of data. The data were exported in comma-separated values (csv) format. Finally, these blocks of data were merged to create a full dataset containing 21,400 documents.
The second phase involved supplementing the presented Scopus database on document information with Scopus CiteScore metrics, exported in csv format from the Scopus Sources page, which contain source-related information (e.g., citations, rankings, source-normalized impact per paper (SNIP) etc.). These two data sets were merged by using the International Standard Serial Number (ISSN). The merging process revealed that some documents from Scopus had no match in Scopus CiteScore metrics (n = 4534), meaning they were not considered in the bibliometric analysis. The biggest proportion of these documents were articles (61.1%), followed by letters (11.5%), reviews (10.0%), notes (8.3%), editorials (7.6%) and other (1.5%). The screening process thus resulted in a database of 16,866 documents. The data preparation process, i.e., obtaining, merging and cleaning the relevant data, was facilitated by the Python programming language using the Pandas and Numpy libraries [37]. Python code used in the analysis is available and documented at GitHub repository: https://github.com/covid-bib/bibliometric. An in-depth bibliometric analysis then followed, allowing for an innovative literature review approach and significantly upgrading traditional literature review techniques.
Namely, a structured literature review is a traditional approach to analyzing and reviewing scientific literature, providing an in-depth overview of the content. However, this approach suffers from several limitations associated with subjective factors, time-consumption and efficiency. The application of modern bibliometric approaches reduces these limitations and entails an effective way of handling extensive collections of scientific literature [38]. Thus far, bibliometric studies on COVID-19 research applied well-established bibliometric approaches by utilizing VOSviewer (see Hamidah et al. [32]), SciMAT (see Herrera-Viedma et al. [13]) and basics of machine learning (see De Felice and Polimeni [39]). Still, bibliometric studies mostly overlook the fact that scientific disciplines overlap strongly, resulting in these studies making similar findings and conclusions and producing a lack of knowledge in less-explored areas [33]. Therefore, in order to supplement existing research and assess the state of current COVID-19 research across different research landscapes (health sciences, life sciences, physical sciences and social sciences and humanities), innovative bibliometric approaches are relied on in this paper. The bibliometric analysis was performed by considering the Scopus hierarchical classification of documents based on the All Science Journal Classification scheme (ASJC) and in-house experts' opinions. Accordingly, the documents were classified in three hierarchically arranged groups: (1) subject area categories; (2) subject area classifications; and (3) fields.
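Before turning to the individual approaches, the two-phase data preparation described earlier in this section can be illustrated with a minimal sketch. It is not the code from the authors' repository: the function names `chunk` and `merge_on_issn`, the record layout, and all identifiers, ISSNs and SNIP values are hypothetical, and plain Python dictionaries stand in for the Pandas data frames the paper reports using.

```python
def chunk(items, size):
    """Split a list into consecutive blocks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_on_issn(documents, sources):
    """Join document rows with source metrics on ISSN.

    Returns (matched, unmatched); unmatched documents have no source entry
    and would be excluded from the analysis.
    """
    metrics_by_issn = {source["issn"]: source for source in sources}
    matched, unmatched = [], []
    for doc in documents:
        source = metrics_by_issn.get(doc["issn"])
        if source is None:
            unmatched.append(doc)
        else:
            matched.append({**doc, **source})
    return matched, unmatched


# Hypothetical identifiers standing in for the 21,400 Scopus EIDs.
eids = ["EID-%05d" % n for n in range(21_400)]
blocks = chunk(eids, 2_000)          # export-sized blocks of at most 2000
print(len(blocks), len(blocks[-1]))  # number of blocks, size of the last one

# Toy records; ISSNs and SNIP values are invented for illustration.
documents = [
    {"eid": "EID-00000", "issn": "1111-1111"},
    {"eid": "EID-00001", "issn": "2222-2222"},
    {"eid": "EID-00002", "issn": "9999-9999"},  # no CiteScore entry
]
sources = [
    {"issn": "1111-1111", "snip": 1.2},
    {"issn": "2222-2222", "snip": 0.8},
]
matched, unmatched = merge_on_issn(documents, sources)
print(len(matched), len(unmatched))
```

With 21,400 identifiers and a block size of 2000, the split yields 11 blocks, the last holding the remaining 1400 records. The unmatched list plays the role of the 4534 documents without a CiteScore match.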
On this basis, the following bibliometric approaches were applied. First, for descriptive analysis, including a Venn diagram for detecting the overlap of scientific disciplines, the Biblioshiny application [40] and the Python library Pyvenn [41] were used. Second, in order to depict relations among keywords and fields, a co-occurrence network analysis was performed with VOSviewer, a software tool for constructing and visualizing bibliometric networks [42]. Moreover, to examine relationships between different subject area classifications within COVID-19 research, a cluster analysis was undertaken based on the Jaccard distance (JD) (Jaccard index subtracted from 1). The Jaccard distance measures dissimilarity between two fields (subject-area classifications). In other words, it counts the number of documents that belong to exactly one of the two fields and divides this number by the number of documents that belong to at least one of them. In terms of measurement, the Jaccard distance ranges from 0 to 1, with 0 suggesting perfect overlap and 1 indicating no overlap [43]. The Jaccard distance is calculated with the Python library Scipy [44], while the clustermap is designed using the Python visualization libraries Matplotlib and Seaborn [41,45].
The Scopus database classifies its documents into 27 subject area classifications (SAC). Excluding the multidisciplinary SAC, the remaining 26 SAC, $C_1, C_2, \ldots, C_{26}$, are covered in our data set. We can define the similarity between two SAC $C_i$ and $C_j$ using the Jaccard coefficient (see Equation (1)):
$$J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|} \qquad (1)$$
where $C_i$ = {documents that are classified to SAC $C_i$}. The Jaccard coefficient counts the number of documents that belong to both $C_i$ and $C_j$ (the cardinality of the intersection, $|C_i \cap C_j|$) and divides this number by the number of documents that are classified to $C_i$ or $C_j$ (the cardinality of the union, $|C_i \cup C_j|$). In the paper, the Jaccard coefficient is further used for clustering of SAC.
Since clustering algorithms use dissimilarities (instead of similarities), $J(C_i, C_j)$ is replaced by the Jaccard distance $JD(C_i, C_j)$, i.e., by subtracting the Jaccard coefficient from 1 (see Equation (2)):
$$JD(C_i, C_j) = 1 - J(C_i, C_j) \qquad (2)$$
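The two definitions can be checked on a toy example. The sketch below uses plain Python sets rather than the Scipy routine the paper reports using, and the document identifiers and set contents are invented for illustration. It also confirms the verbal description given earlier: the distance equals the number of documents in exactly one of the two sets divided by the number in at least one.

```python
def jaccard_coefficient(a, b):
    """Share of documents in both sets among documents in at least one set."""
    union = a | b
    if not union:
        raise ValueError("Jaccard coefficient is undefined for two empty sets")
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Dissimilarity: one minus the Jaccard coefficient."""
    return 1.0 - jaccard_coefficient(a, b)


# Hypothetical document identifiers for two subject-area classifications.
engineering = {"d1", "d2", "d3", "d4"}
chemical_engineering = {"d3", "d4", "d5"}

j = jaccard_coefficient(engineering, chemical_engineering)   # 2 shared / 5 total
jd = jaccard_distance(engineering, chemical_engineering)

# Equivalent view: documents in exactly one set over documents in at least one.
exactly_one = len(engineering ^ chemical_engineering)
at_least_one = len(engineering | chemical_engineering)
print(j, jd, exactly_one / at_least_one)
```

A distance matrix built this way for all pairs of classifications is the kind of input a hierarchical clustering routine expects.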
Finally, to predict a document's subject area based on its abstract, a text-mining-based classification was used [46]. For this purpose, binary logistic regression was selected as a prediction model. Accordingly, four different binary logistic models were tested, one for each individual subject area, with the binary dependent variable having the value of 1 if a document belongs to the individual subject area and 0 if the document belongs to the remaining subject areas. Based on the results of fitting the model to the data, the binary logistic regression also provides information on which words are most characteristic for a particular subject area (which discriminate the most between two subject areas). This approach requires documents to have a full abstract. Text mining was performed with the Natural Language Toolkit (NLTK), a Python package for natural language processing [47]. In the first phase, pre-processing is performed (abstracts are converted to lowercase, accents are removed, and tokenization is based on word punctuation). WordNet lemmatization is then applied [48], and the set of extracted words is further filtered with a list from nltk.corpus and manually-added stop words [49]. To construct features (bag of words), the "term frequency-inverse document frequency (tf-idf)" method was employed. The class TfidfVectorizer from sklearn.feature_extraction.text [50] was used with the following parameters: sublinear term frequency (tf) scaling, smooth inverse document frequency (idf) weights, unicode transformation format (utf)-8 encoding, l2 normalization, minimum document frequency = 1, maximum document frequency = 10. To extract new features for classification, a search for unigrams (single words) and bigrams (sequences of two words) was performed. The top 100 features are created and then used as predictors (independent variables) in a binary logistic model.
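To make the weighting scheme concrete, the following sketch computes tf-idf vectors by hand for three invented mini-abstracts. It is an illustration, not the authors' pipeline: it follows the formulas documented for scikit-learn's TfidfVectorizer under sublinear tf scaling, smooth idf and l2 normalization, but it uses whitespace tokenization and skips lemmatization, stop-word filtering and the document-frequency limits. The function names `ngrams` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all unigrams and bigrams (up to n_max) as space-joined strings."""
    terms = []
    for n in range(1, n_max + 1):
        terms.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return terms


def tfidf_vectors(docs):
    """Compute l2-normalized tf-idf weights with sublinear tf and smooth idf."""
    tokenized = [ngrams(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in tokenized for term in set(terms))
    vectors = []
    for terms in tokenized:
        counts = Counter(terms)
        weights = {
            term: (1.0 + math.log(count))                           # sublinear tf
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)       # smooth idf
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Invented mini-abstracts; "patient" occurs in all of them.
docs = [
    "patient health care",
    "patient vaccine protein",
    "patient lockdown air",
]
vectors = tfidf_vectors(docs)
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(f"{term:15s} {weight:.3f}")
```

In the first document, "patient" receives a lower weight than "health" because it appears in every document, which is the property that lets tf-idf features discriminate between subject areas.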
Binary logistic regression was used to empirically verify whether it is possible to predict the subject area of a document from its abstract. For every subject area ($S_1$ = Health Sciences, $S_2$ = Life Sciences, $S_3$ = Physical Sciences, $S_4$ = Social Sciences and Humanities) we define an indicator variable $Y_i$ which takes the value 1 (a document is classified to subject area $S_i$) or 0 (otherwise). The variables $Y_1$, $Y_2$, $Y_3$ and $Y_4$ are further treated as separate dependent variables for logistic regression models. For the predictor variables we used p = 300 terms extracted from documents' abstracts. The values of the predictor variables ($X_1$ = term "acute", $X_2$ = term "admission", ..., $X_p = X_{300}$ = term "year") are TF-IDF statistics (top 300 terms were included). The models estimate the conditional probabilities $P(Y_i = 1 \mid X_1, X_2, \ldots, X_p)$ that a document is classified to subject area $S_i$. The formula of the binary logistic regressions used in the paper corresponds to the standard binary logistic form:
$$P(Y_i = 1 \mid X_1, X_2, \ldots, X_p) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}$$
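As a rough illustration of how a fitted binary logistic model turns tf-idf values into a probability, the sketch below evaluates the logistic function for one document. The intercept, the coefficients and the feature values are hypothetical and are not taken from the paper's results. The helper `odds_multiplier` shows the usual reading of a coefficient: raising a predictor by an amount t multiplies the odds of belonging to the subject area by exp(coefficient × t).

```python
import math


def predict_probability(intercept, coefficients, features):
    """P(Y = 1 | X) for a binary logistic model; absent terms count as 0."""
    linear = intercept + sum(
        beta * features.get(term, 0.0) for term, beta in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


def odds_multiplier(coefficient, delta):
    """Factor by which the odds change when a predictor rises by `delta`."""
    return math.exp(coefficient * delta)


def odds(probability):
    """Convert a probability into odds."""
    return probability / (1.0 - probability)


# Hypothetical fitted values, for illustration only.
intercept = -0.5
coefficients = {"patient": 2.0, "protein": -1.5}

p_low = predict_probability(intercept, coefficients, {"patient": 0.3})
p_high = predict_probability(intercept, coefficients, {"patient": 0.4})

print(round(p_low, 3), round(p_high, 3))
print(odds(p_high) / odds(p_low), odds_multiplier(2.0, 0.1))
```

The last line prints the same factor twice, once from the predicted probabilities and once from the coefficient directly.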

Results
An overview of the scientific documents utilized in this study is presented in Table 1. A total of 16,866 documents written by 66,504 distinct authors and published in 2548 journals were relied on in this study, of which 7422 (44.0%) have at least one citation in the Scopus database, providing a total of 100,683 citations. The cited documents received on average 13.57 citations each, while the average number of authors per document was 3.94. The biggest proportion of these documents were articles (41.5%) and letters (26.5%). A much smaller proportion of them were reviews (10.2%), editorials (10.1%) and notes (9.4%). Finally, there was a negligible share of other documents (2.4%) such as short surveys, conference papers, errata and data papers. The presented characteristics of these scientific documents on COVID-19 research are largely in line with previous research [32,33].
The documents were classified in three hierarchically arranged groups: (1) subject area categories; (2) subject-area classifications; and (3) fields. The distribution of documents according to these groups is presented in Table 2. Nearly two-thirds of documents are in the area of the health sciences (65.2%), with medicine (91.0%) being the most exposed, whereby the dominant focus is on infectious diseases (10.2%) and general medicine (9.7%). This is in harmony with earlier bibliometric studies which show that COVID-19 research is the main domain of the health-related sciences [31][32][33][34][35]. A much smaller number of documents is in the area of the life sciences (19.0%). Nevertheless, biochemistry, genetics and molecular biology (35.3%), as well as immunology and microbiology (31.4%) are identified as the most relevant subject-area classifications, while virology (11.6%) and immunology (10.2%) are recognized as the most important research fields within the life sciences. The smallest share of documents is found in the physical sciences (7.5%).
These are focused on environmental science (31.4%) and engineering (15.4%), with the research field of pollution (10.7%) being the most exposed. Finally, a relatively small share of documents is found in the area of the social sciences and humanities (8.3%). Still, the social sciences (44.2%) and psychology (24.6%) are recognized as the most relevant subject-area classifications, while sociology and political science (9.2%) is identified as the most important research field within the social sciences and humanities. The aforementioned gives support for the claims of a lack of knowledge in less-explored areas, including the life, physical and social sciences [33]. Therefore, it is no surprise that many calls have been made for more extensive COVID-19 research in less-explored scientific disciplines. Table 3 presents the most relevant (top 20) journals in COVID-19 research by number of documents. They contain almost one-fifth (17.6%) of total documents and cover a significant share (41.3%) of total citations. Regarding different scientific disciplines or subject areas (classifications), the most relevant journals mainly operate in the area of the health sciences (medicine), covering the following research fields: infectious diseases, general medicine, microbiology (medical), psychiatry and mental health, public health, environmental and occupational health, critical care and intensive care medicine, dermatology, endocrinology, diabetes and metabolism, epidemiology as well as internal medicine. Further, a smaller share of the most relevant journals operate in the area of the life sciences (immunology and microbiology as well as neuroscience), with a focus on biological psychiatry and virology. 
Some of these journals also publish on the physical sciences (environmental science, mathematics, physics and astronomy), focusing on the following research fields: applied mathematics, environmental chemistry, environmental engineering, general mathematics, general physics and astronomy, health, toxicology and mutagenesis, pollution, statistical and nonlinear physics, and waste management and disposal.
Finally, there is only one journal, which operates in the area of the social sciences (psychology), covering the research field of general psychology. There is also one journal classified as multidisciplinary. Most of these journals rank in the first quartile (Q1) and have a relatively high source-normalized impact per paper (SNIP), which is consistent with the existing research [31,35]. Further, most of these journals are from the UK, the Netherlands, and the USA. Similar findings are also made in previous COVID-19 bibliometric studies [33,34]. However, all of the current bibliometric studies overlook the large overlap that exists among scientific disciplines, leading to biased results and thus a lack of comprehensive understanding of COVID-19 research across different scientific disciplines [33].

Bibliometric Analysis across Different Subject-Area Categories
According to the Scopus classification, the documents may be classified in four different subject areas: health sciences, life sciences, physical sciences, and social sciences and humanities. However, these subject areas strongly intersect, meaning that an individual document can be classified in several subject areas at one time. Therefore, to address the comprehensiveness of COVID-19 research, Figure 2 shows a Venn diagram of the presented subject areas and all possible sets that can be made from them. This also enables the so-called pure sciences to be determined by covering only those documents that exclusively belong to just one subject area (without intersecting with other subject areas). According to the number of documents obtained on 1 July 2020 (the number of documents obtained on 1 June 2020 is presented in parentheses), health sciences contain a total of 14,187 (8896) documents, of which 10,394 (6575) documents are identified as in the area of the pure health sciences. Further, life sciences encompass a total of 4143 (2549) documents, of which 928 (599) documents are in the area of the pure life sciences. Moreover, physical sciences include a total of 1625 (878) documents, of which 568 (314) documents belong to pure physical sciences. Lastly, the social sciences and humanities cover a total of 1812 (977) documents, of which 771 (323) are in the area of the pure social sciences and humanities. A comparison of different subject areas reveals that health sciences are the most relevant in COVID-19 research, while the second-most relevant subject area is life sciences. Moreover, physical sciences and social sciences and humanities seem to be the least popular thus far, as found by previous research [33].
However, considering growth in the number of documents in June 2020, the social sciences and humanities seem to be the fastest-growing subject area, as the total number of documents in this subject area rose by 85.5% and in the pure social sciences and humanities even by 138.7%. This is consistent with expectations and with recent COVID-19 bibliometric studies on economics (see Mahi et al. [29]) and business and management (see Verma and Gustafsson [30]). Namely, the first immediate response to the COVID-19 pandemic has been to protect public health, while addressing the real socio-economic consequences may be expected to come later. This path is also revealed by the recent scientific literature on COVID-19 published in the first half of 2020 and a review of the latest COVID-19 publications from 2020, indicating that a shift is underway from health to other relevant scientific disciplines [19]. Moreover, some documents (273) are considered to be multidisciplinary, making it impossible to include them in the further bibliometric analysis across different subject-area categories.
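The counting logic behind the total and pure figures, and the growth rates derived from them, can be sketched as follows. The document-to-area mapping is invented for illustration and the function names are hypothetical; only the four document counts passed to `growth_percent` come from the text above.

```python
from collections import Counter


def count_total_and_pure(doc_areas):
    """Count documents per subject area, in total and in exactly one area."""
    totals, pure = Counter(), Counter()
    for areas in doc_areas.values():
        totals.update(areas)          # every area the document belongs to
        if len(areas) == 1:
            pure.update(areas)        # document belongs to this area only
    return totals, pure


def growth_percent(old, new):
    """Relative growth from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Invented toy mapping from document to its set of subject areas.
doc_areas = {
    "doc1": {"health"},
    "doc2": {"health", "life"},
    "doc3": {"social"},
    "doc4": {"health", "social"},
}
totals, pure = count_total_and_pure(doc_areas)
print(dict(totals), dict(pure))

# Counts reported for the social sciences and humanities (1 June vs. 1 July 2020).
print(round(growth_percent(977, 1812), 1))   # total documents  -> 85.5
print(round(growth_percent(323, 771), 1))    # pure documents   -> 138.7
```

The same two calls applied to the other subject areas give lower growth rates for total documents, for example 59.5% for the health sciences (8896 to 14,187).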
Finally, additional insight into openness of journals reveals that COVID-19 research is very open as 81.3% of total documents are published in open access journals. The highest openness of COVID-19 research is observed for health sciences (82.9% in general and 82.0% for pure science) and life sciences (85.9% in general and 86.7% for pure science), while lower openness is identified for physical sciences (67.9% in general and 50.0% for pure science) and social sciences and humanities (73.7% in general and 70.1% for pure science). In addition, the most relevant (top three) journals and authors (by number of citations) overlapping between at least three subject areas (excluding multidisciplinary subset) are identified. The highest overlap is identified for the following journals:  Figure 3 presents the most relevant countries of COVID-19 research by subject area. It shows the top five countries, providing the largest number of documents by a corresponding author. The most relevant country is the USA, significantly dominating in all scientific disciplines, except the physical sciences where it ranks in second place. In addition to the USA, which significantly outperformed other countries, China and Italy also dominate in COVID-19 research since they are among the top three countries in all scientific disciplines, except in the social sciences where Italy is replaced by India. These findings are consistent with existing bibliometric studies (which do not consider scientific disciplines separately) that state that the USA and China are world leaders in COVID-19 research [31][32][33][34][35]. Figure 4 shows the most relevant institutions by the number of documents in COVID-19 research across subject areas. Due to the strong overlap among individual scientific disciplines, to some extent they may share the same most relevant institutions. 
The most involved institution is the Huazhong University of Science and Technology, providing a significantly higher number of documents in health sciences (n = 1380) and life sciences (n = 420). The Zhongnan Hospital of Wuhan University and Icahn School of Medicine at Mount Sinai also play important roles in these two scientific disciplines. Moreover, Fudan University dominates in the physical sciences (n = 68), while also providing a considerable number of publications in the life sciences (n = 155). Finally, the California Department of Public Health as well as Public Health-Seattle and King County are the two most relevant institutions in the social sciences and humanities, also with an important role in the physical sciences. The findings are to some extent comparable with existing bibliometric studies on COVID-19 research [33,35].
This finding is inconsistent with existing bibliometric studies, presumably due to the different criteria applied [33].
Figure 7 presents the keyword co-occurrence network for: (a) health sciences, (b) life sciences, (c) physical sciences, and (d) social sciences and humanities separately. To ensure a greater distinction between individual subject areas, only pure sciences (without intersecting with other sciences) are considered in the bibliometric analysis. Moreover, the bibliometric analysis is conducted on the 100 most frequent (author and index) keywords, after excluding the keywords used in the search query, eliminating stop words, and consolidating keywords describing the same phenomenon.
The bibliometric analysis (keyword co-occurrence) reveals the research hotspots by subject area. For the health sciences, three clusters are identified, addressing the following topics: (1) pandemics; (2) risk factors and symptoms; and (3) mortality. Accordingly, the health sciences deal predominantly with health-related issues associated with the COVID-19 pandemic. Next, in the life sciences, four clusters are found, dealing with: (1) pandemics; (2) virology; (3) immunology; and (4) drug efficiency. The focus of the life sciences seems to be oriented more to knowledge about the spread of the virus and ways to efficiently prevent the disease with appropriate drugs. This corresponds with findings of other recent bibliometric studies on COVID-19 research, predominantly emphasizing health-related issues [33,34]. In addition, the results for less-explored subject areas are as follows. Regarding the physical sciences, three clusters are recognized, related to: (1) pandemics; (2) China and disease transmission; and (3) air pollution. The physical sciences focus on knowledge relating to how fast the COVID-19 pandemic is spreading and environmental-related issues. Finally, in the social sciences and humanities, six clusters are identified, addressing the following topics: (1) pandemics; (2) epidemics; (3) viral disease and China; (4) respiratory disease; (5) social distancing; and (6) mental health. A detailed synopsis of the research hotspots, including the top 10 keywords, related to COVID-19 in an individual scientific discipline is presented in Table A1 in Appendix A.
Moreover, in order to predict a document's subject area based on its abstract, a text-mining-based classification was used. For this purpose, binary logistic regression was selected as a prediction model. Accordingly, four different binary logistic models were tested, one for each individual subject area. Based on the results of fitting the model to the data, binary logistic regression also provides information on which words are most characteristic for a particular subject area (which discriminate the most between two subject areas). This approach requires documents with a full abstract, with 8347 documents meeting this criterion. To extract new features for classification, the search for the top 100 characteristic words resulted in 99 unigrams (single words) and 1 bigram (a sequence of two words). These features are further used as predictors (independent variables) in binary logistic models.
The results of the text-mining-based classification (see Table A2 in Appendix A) reveal the following. The goodness-of-fit statistics for all of the estimated binary logistic models are shown to be adequate, as suggested by the pseudo-$R^2$ value, ranging from a minimum of 0.146 (health sciences) to a maximum of 0.403 (social sciences and humanities), and very low values of the Log-Likelihood Ratio (LLR) p-value (<0.001) [51]. In addition, evaluation measures of models (area under receiver-operating-characteristic curve (AUC), classification accuracy (CA), precision and recall) suggest very good discrimination (ability to classify documents belonging to an individual subject area and documents belonging to other remaining subject areas) [52]. Table 4 presents a summary of the results of the text-mining-based classification of COVID-19 documents across subject areas. It shows the most discriminant words (having a significant and positive regression coefficient) for predicting a corresponding subject area based on the binary logistic regression. For the health sciences, the top three most characteristic words are "patient", "health" and "healthcare".
The regression coefficient for "patient" suggests that if the tf-idf of the word "patient" in a document increases by the amount of t, the probability of this document belonging to the health sciences increases by exp. (4775). The same interpretation also holds for all of the regression coefficients. Regarding other scientific areas, the top three most characteristic words are "protein", "human" and "vaccine" for life sciences, "factor", "lockdown" and "area" for physical sciences, and "crisis", "pandemic" and "mental" for social sciences and humanities.
According to the presented results, some interesting relationships between different bibliometric aspects can be identified. Namely, Table 3 shows the most relevant journals in COVID-19 research and consequently the most productive subject areas by number of documents (particularly health sciences), while Table 2 supplements these findings by describing additional lagging subject areas (especially physical sciences and social sciences and humanities) [33]. Finally, Table 4 complements these two tables by highlighting the most discriminant words regardless of subject area dominance and relevance of the journal. Since COVID-19 research is obviously a relatively new field, with the science still evolving, it is important to understand the issue from different perspectives. Nevertheless, the field of COVID-19 research will certainly continue to develop in the future, presumably making a shift from health to other relevant scientific disciplines [19].
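The coefficient interpretation above can be illustrated with a minimal Python sketch. It assumes the classic tf-idf weighting (term frequency times the logarithm of the inverse document frequency), which may differ from the exact variant used in the study, and it relies on the standard property of binary logistic regression that a coefficient acts multiplicatively on the odds: raising a feature by t multiplies the odds by exp(βt). The toy corpus, the coefficient value and the helper names `tf_idf` and `shifted_probability` are hypothetical and serve only as an illustration.

```python
import math


def tf_idf(term, document, corpus):
    """Classic tf-idf of `term` in one tokenized document (assumed weighting)."""
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0


def shifted_probability(p_before, beta, t):
    """Class probability after a feature rises by t, given coefficient beta.

    In logistic regression: odds_after = odds_before * exp(beta * t).
    """
    odds_after = p_before / (1.0 - p_before) * math.exp(beta * t)
    return odds_after / (1.0 + odds_after)


# Hypothetical tokenized abstracts, not the study's data.
corpus = [
    ["patient", "health", "patient", "care"],
    ["protein", "vaccine", "human"],
    ["lockdown", "area", "factor"],
    ["crisis", "pandemic", "mental", "health"],
]
weight = tf_idf("patient", corpus[0], corpus)

# Hypothetical coefficient and baseline probability, chosen for illustration.
p_new = shifted_probability(p_before=0.5, beta=2.0, t=0.1)
```

Because the effect is multiplicative on the odds, the resulting change in probability depends on the baseline probability of the document, which is why the sketch takes `p_before` as an explicit argument.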

Bibliometric Analysis across Different Subject-Area Classifications and Fields
To examine the relationships between different subject-area classifications within COVID-19 research, a cluster analysis based on the Jaccard distance (JD) (the Jaccard index subtracted from one), which measures dissimilarity, is performed (see Figure 8). The Jaccard distance ranges from zero to one, with zero suggesting perfect overlap and one indicating no overlap [43]. Based on the results, the following clusters may be identified. The first, relatively pronounced cluster is engineering, bringing together: computer science, energy, materials science, chemistry, chemical engineering and engineering. The strong connection between these subject-area classifications is further confirmed by the relatively low Jaccard distance. This is reflected especially between engineering and chemical engineering (JD = 0.69), meaning that 31% (1 - 0.69) of COVID-19 related documents belonging to either engineering or chemical engineering belong to both subject-area classifications at the same time. One of the strongest overlaps in this cluster (23%) is also found for chemical engineering and chemistry. The second and most pronounced cluster concerns mathematics and physics, as suggested by the lowest Jaccard distance, found between mathematics and physics and astronomy (JD = 0.58), meaning there is a 42% overlap between these two subject-area classifications.
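The relation between the Jaccard distance and the reported overlap can be made concrete with a short Python sketch. The document identifiers below are hypothetical and are constructed only so that the two sets reproduce the engineering and chemical engineering example (31 shared documents out of 100 in the union); the function name `jaccard_distance` and the convention for two empty sets are assumptions made for illustration.

```python
def jaccard_distance(docs_a, docs_b):
    """Jaccard distance: one minus |intersection| / |union| of two document sets."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical document IDs: 31 documents are shared, 100 appear in the union.
engineering = set(range(0, 65))
chemical_engineering = set(range(34, 100))

jd = jaccard_distance(engineering, chemical_engineering)
overlap = 1.0 - jd  # share of the union assigned to both classifications
```

With these sets the distance is approximately 0.69 and the overlap approximately 0.31, matching the reading of JD given in the text.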
Furthermore, according to the results, the other subject-area classifications overlap very little (their Jaccard distances are equal to or very close to one), making it difficult to identify meaningful or homogeneous clusters. Nevertheless, some further potential or emerging clusters can be identified. Accordingly, the third cluster is the humanities and psychology, grouping the individual subject-area classifications of the arts and humanities and psychology with a 16% overlap. The fourth cluster is business, management and economics, covering business, management and accounting; economics, econometrics and finance; and social sciences, where the most connected subject-area classifications are social sciences and economics, econometrics and finance with an 11% overlap, then social sciences and business, management and accounting with a 9% overlap. The fifth cluster is about decision and earth sciences, grouping the individual subject-area classifications of decision sciences and earth and planetary sciences with an 11% overlap. Finally, the sixth cluster concerns health and the environment, covering neuroscience; biochemistry, genetics and molecular biology; immunology and microbiology; medicine; pharmacology, toxicology and pharmaceutics; health professions; veterinary; agricultural and biological sciences; environmental science; nursing; and dentistry. The biggest overlap in this cluster is identified between medicine and immunology and microbiology (9%), followed by immunology and microbiology and biochemistry, genetics and molecular biology (8%).
Regarding the overlap of COVID-19 research among different subject-area classifications outside of the identified clusters, the strongest connections are identified between environmental science and energy, between physics and astronomy and materials science, and between environmental science and social sciences (8%). This is followed by the overlap between the social sciences and psychology (7%) as well as the connections between the agricultural and biological sciences and mathematics, and between decision sciences and business, management and accounting (6%). These results provide additional evidence on COVID-19 research collaboration occurring within and between different subject-area classifications [22]. Figure 9 presents the field co-occurrence network for the: (a) health sciences, (b) life sciences, (c) physical sciences, and (d) social sciences and humanities separately. To ensure a greater distinction between individual subject areas, only pure sciences (without intersecting with other sciences) are considered in the bibliometric analysis.
Moreover, the bibliometric analysis is conducted on the 297 research fields distributed among these four main subject areas. The bibliometric analysis (field co-occurrence) reveals different clusters related to COVID-19 within an individual subject area. For the health sciences, nine clusters are identified, namely: (1) internal medicine; (2) radiology and hematology; (3) dermatology and neurology; (4) cardiology, pulmonary and anesthesiology; (5) surgery; (6) pharmacology; (7) epidemiology; (8) sports medicine and rehabilitation; and (9) public health. Next, in the life sciences, seven clusters are found, addressing: (1) pharmacology and genetics; (2) biotechnology and toxicology; (3) biochemistry and pharmacology; (4) microbiology and ecology; (5) molecular biology and biochemistry; (6) immunology, neuroscience and endocrine systems; and (7) virology and microbiology. Regarding the physical sciences, four clusters are recognized, related to: (1) electrical/electronic and mechanical engineering; (2) general computer science and engineering; (3) mathematics and physics; and (4) environment and pollution. Finally, in the social sciences and humanities, eight clusters are identified, addressing the following topics: (1) business, management and economics; (2) health, philosophy and psychology; (3) education and applied psychology; (4) geography and tourism; (5) humanities and anthropology; (6) sociology and economics; (7) social and clinical psychology; and (8) law and safety. A detailed synopsis of the clusters, including the top five fields, related to COVID-19 in an individual scientific discipline is presented in Table A3 in Appendix A.
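The co-occurrence networks discussed in this section rest on counting how often two items (keywords or research fields) are attached to the same document. The following plain-Python sketch illustrates only this counting step, restricted to the most frequent items and with excluded terms removed; it is not the procedure of the network software used in the study, which additionally normalizes link strengths and assigns clusters. The function name `cooccurrence_counts`, its parameters and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, excluded=frozenset(), top_n=100):
    """Count how often pairs of the top_n most frequent items share a document.

    `excluded` is expected in lowercase (e.g., search-query terms, stop words).
    """
    # Lowercase items, drop duplicates within a document, remove excluded terms.
    cleaned = [{item.lower() for item in doc} - set(excluded) for doc in documents]
    # Document frequency of each remaining item.
    frequency = Counter(item for doc in cleaned for item in doc)
    kept = {item for item, _ in frequency.most_common(top_n)}
    pairs = Counter()
    for doc in cleaned:
        # Sorting gives each undirected pair a single canonical key.
        for a, b in combinations(sorted(doc & kept), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical keyword lists, not the study's data.
docs = [
    ["COVID-19", "pandemic", "mortality"],
    ["covid-19", "pandemic", "risk factors"],
    ["pandemic", "mortality", "air pollution"],
]
edges = cooccurrence_counts(docs, excluded={"covid-19"}, top_n=2)
```

In this toy example only "pandemic" and "mortality" survive the top-2 cut, and they co-occur in two documents, so the result holds a single edge with weight 2. Consolidating synonymous keywords, as described for the keyword analysis, would have to happen before the counting step.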

Discussion and Conclusions
The outbreak of COVID-19 is a typical public health emergency where the high infection rate poses a huge threat to not only global public health but economic and social development. In order to be able to solve such emergencies, it is vital to fully understand the problem, its implications for different areas, and the solutions that may be effective and efficient in addressing potential devastating consequences. Therefore, scientific knowledge on COVID-19 is essential because it leads to answers to real-life questions. However, the extent of the COVID-19 pandemic calls for in-depth knowledge so as to allow numerous issues in different areas to be identified. It is hence not surprising that COVID-19 research has seen such an unprecedented rise since the pandemic started [36,53]. The COVID-19 pandemic has led to the generation of a large amount of scientific publications, which might engender possible problems with the velocity and availability of information and scientific collaboration, particularly in the early stages of the pandemic [54]. The current state of COVID-19 research therefore needs a comprehensive analysis to help guide the agenda for further research, especially from the perspective of cooperation among different scientific disciplines at varying stages of pandemic prevention and control, by applying innovative scientific approaches [55][56][57].
Accordingly, this paper provides extensive bibliometric analysis of COVID-19 research across the science and social science research landscape by relying on a wide variety of bibliometric approaches, including descriptive analysis, network analysis, cluster analysis based on the Jaccard distance and text mining based on binary logistic regression. The results generally show that a total of 21,400 documents related to COVID-19 research were published in the Scopus database in the first half of 2020. Interestingly, the number of documents had risen by 58.8% in June since May 2020, suggesting an exponential interest in COVID-19 research. The database suitable for the review process includes a total of 16,866 documents, written by 66,504 different authors and published in 2548 different journals, that together provide a total of 100,683 citations. The biggest share of the documents were articles (41.5%) and letters (26.5%), which agrees with previous bibliometric studies [23,32]. Moreover, the distribution of the COVID-19 related documents according to the Scopus hierarchical classification reveals that nearly two-thirds (65.2%) of them are found in the area of the health sciences, supporting the claims that COVID-19 research is lacking knowledge in less-explored subject areas, including the life, physical and social sciences and the humanities [33]. Furthermore, the most relevant journals in COVID-19 research account for almost one-fifth (17.6%) of total documents and a significant share (41.3%) of total citations. With regard to different scientific disciplines or subject areas (classifications), the most relevant journals mainly publish in the health sciences (medicine), while other scientific disciplines (life sciences, physical sciences and social sciences and humanities) remain in the background. 
Most of these journals rank in the first quartile (Q1) and have a relatively high source-normalized impact per paper (SNIP), which is in line with existing research [31,35]. Finally, most of these journals come from the UK, the Netherlands and the USA. Similar findings have also been made in earlier COVID-19 bibliometric studies [33,34].
A more detailed comparison of COVID-19 research between four scientific disciplines shows that subject areas strongly intersect, which calls for an in-depth analysis of individual subject areas separately. The results of bibliometric analysis across different subject-area categories show the following. According to the number of documents, health sciences is the most relevant subject area in COVID-19 research, the second-most relevant subject area is life sciences, while physical sciences and social sciences and humanities seem to be the least popular hitherto. However, during June 2020 the social sciences seem to be the fastest-growing scientific discipline, with the total number of documents in this subject area rising by 85.5% and even by 138.7% in the pure social sciences. A shift from health to other relevant scientific disciplines is observable in the review of the latest CORD-19 publications as well as in recent COVID-19 bibliometric studies on economics (see Mahi et al. [29]) and business and management (see Verma and Gustafsson [30]). Moreover, the results reveal that most published documents on COVID-19 (81.3%) are found in open access journals. The highest openness of COVID-19 research is observed for health sciences and life sciences, while lower openness is identified for physical sciences and social sciences and humanities. In addition, the most relevant journals and authors overlapping between at least three subject areas (excluding the multidisciplinary subset) are identified. As regards journals, the highest overlap is identified for Morbidity and Mortality Weekly Report [31][32][33][34][35]. Furthermore, the results of keyword co-occurrence analysis by main subject areas reveal different research hotspots for individual scientific disciplines, with the common point of pandemics. The health sciences are focused more on health consequences (see Hossain [33] and Lou et al. [34]), the life sciences are more strongly oriented to drug efficiency, the physical sciences are more focused on environmental consequences, whereas the social sciences are more oriented to socio-economic consequences. In addition, the results of text-mining-based classification based on binary logistic regression reveal the most characteristic words for predicting a corresponding area. For the health sciences, the top three most characteristic words are "patient", "health" and "healthcare". As regards other scientific areas, the top three most characteristic words are "protein", "human" and "vaccine" for the life sciences, "factor", "lockdown" and "area" for the physical sciences, and "crisis", "pandemic" and "mental" for the social sciences and humanities.
Further bibliometric analysis on COVID-19 research across different subject-area classifications and fields provides additional in-depth insights. Namely, a cluster analysis based on the Jaccard distance reveals six different clusters: engineering; mathematics and physics; humanities and psychology; business, management and economics; decision and earth sciences; and health and environment. Regarding the overlap of COVID-19 research among different subject-area classifications outside of the identified clusters, the strongest connections are seen between environmental science and energy, between physics and astronomy and materials science, and between environmental science and social sciences. These results provide further evidence about COVID-19 research collaboration occurring within and between different subject-area classifications [22]. The results of field co-occurrence analysis by main subject areas also reveal different research clusters of fields, providing a detailed segmentation of different scientific disciplines.
Several limitations of the present study should be noted. First, the bibliometric analysis is only based on COVID-19 related documents retrieved from the Scopus database and published in journals with available Scopus CiteScore metrics. Although Scopus is considered to be one of the largest abstract and citation databases of peer-reviewed literature, it might not cover the complete collection of COVID-19 research. Therefore, the inclusion of other databases, especially the expanding body of preprints available in the Google Scholar database, could have provided additional insights not available in this study. Second, this study is based on a short time period (first half of 2020). Although this limitation cannot be solved at this stage, a repeated study with a longer period would yield further time-dimensional insights. This would also be beneficial in terms of achieving a higher number of publications in some under-represented disciplines, especially the social sciences and humanities. Another limitation is that only titles, abstracts and keywords in the English language were included in this study, which might cause some publication bias. Future studies should therefore address this issue. Finally, another study limitation is the lack of citation and collaboration networks that could be identified using sophisticated methodological approaches, which is due to the small number of studies and continuously changing citation metrics. Accordingly, future bibliometric studies should address these limitations and further examine the evolution of scientific knowledge about COVID-19 across different scientific disciplines over time.
Notwithstanding the above limitations, the findings of the paper highlight the importance of a comprehensive and in-depth approach that considers different scientific disciplines in COVID-19 research. In order to address the economic, socio-cultural, political, environmental and other (non-medical) consequences of the COVID-19 pandemic, in the near future COVID-19 must appear higher up the research agenda of non-health sciences, particularly the social sciences and humanities. Namely, understanding of the evolution of emerging scientific knowledge on COVID-19 is not only beneficial for the scientific community, but also for evidence-based policymaking with a view to fully addressing the implications of the COVID-19 pandemic.

Table A3. Clusters based on the field co-occurrence network in COVID-19 research across different subject areas (January-June 2020).

Subject Area Clusters Fields
Health Sciences