The Role of Societal Aspects in the Formation of Official COVID-19 Reports: A Data-Driven Analysis

This paper investigates the role of socioeconomic considerations in the formation of official COVID-19 reports. To this end, we employ a dataset that contains 1159 pre-processed indicators from the World Bank Group GovData360 and TCdata360 platforms and an additional 8 COVID-19 variables generated based on reports from 138 countries. During the analysis, a rank-correlation-based complex method is used to identify the time- and space-varying relations between pandemic variables and the main topics of World Bank Group platforms. The results not only draw attention to the importance of factors such as air traffic, tourism, and corruption in report formation but also support further discipline-specific research by mapping and monitoring a wide range of such relationships. To this end, a source code written in R language is attached that allows for the customization of the analysis and provides up-to-date results.


Introduction
Research on the COVID-19 pandemic has grown rapidly since the outbreak of the disease; however, despite the enormous media attention on countries' reports, only a few articles address the number of officially reported cases and deaths as a social phenomenon. As many studies have pointed out, there is a significant discrepancy between the officially confirmed data and recently published estimates (see, e.g., [1,2]). However, what do these data reflect, beyond the true nature of the virus? Of the few articles dealing with this question, authors [3] examined the protective effect of BCG (Bacillus Calmette-Guérin) vaccine on COVID-19 infections and the death toll while using indicators such as the Human Development Index (HDI), per-capita GDP, and urban population percentage as additional control variables. Moreover, in line with [4], they applied the Corruption Perception Index (CPI) as a proxy for the reliability of reported COVID-19 data. Other authors [5] found that measures of globalization, related to the migrant stock and air travel, were positively associated with both total cases and deaths. Results [6] suggest that more equitable access to healthcare does indeed increase testing rates and lower the death rate. Authors in [7] showed that more democratic political institutions experienced deaths on a larger per-capita scale and sooner than did less democratic countries, and based on [8], the population size and government health expenditure are strongly related to COVID-19 cases.
In contrast to these (mostly) discipline-specific studies, our goal is to map, analyze, and monitor a wide range of such relationships in time and space by applying a datadriven approach. To provide a deeper understanding of the role of societal aspects in the formulation of COVID-19 reports, we employ a rank-correlation-based complex method and a dataset that contains 1159 pre-processed indicators from the World Bank Group GovData360 and TCdata360 platforms and an additional 8 COVID-19 variables. The results not only draw attention to the importance of factors such as air traffic, tourism, and corruption in report formation but also support decision makers and discipline-specific research by providing a source code written in R language (in R Notebook) that allows for the customization of the analysis and provides up-to-date results.
The paper is organized as follows. Section 2 introduces the data and methodology used during the calculations. Section 3 presents and discusses the results. Section 4 contains the measurement error analysis of the results. Finally, Section 5 provides the conclusions and proposes future research directions.

Joint Dataset of GovData360, TCdata360, and COVID-19 Reports
This paper follows the steps of [9] when creating a linked database of governance, trade, and competitiveness indicators with COVID-19 reports. (To derive the up-to-date dataset, we use the author's R source code, which is publicly available at [10]). Former indicators were obtained from the GovData360 and TCdata360 platforms using the data360r (version: 1.0.8) R package [11]. From these platforms, only annual indicators from 2015 and later were collected, and their missing values were replaced with previous annual values in descending order by year until 2015. During pre-processing, indicators (columns) where the ratio of missing values exceeded 50% were filtered out. Then, the same filtration was applied above 25% in the case of countries (rows). Finally, highly correlated variables and variables with near-zero variances were removed, and the standardized form of the retained 1159 indicators was connected with 8 COVID-19 variables, generated on the basis of the official reports of 138 countries [12]. Note that auxiliary indicators measuring the number of data sources and standard error were also filtered out, and variables with nearzero variances were eliminated using the default settings of nearZeroVar function contained by the caret (version: 6.0-85) R package [13]. The presented data were compiled on 22 July 2020. Table 1 shows the description of the variables in the structure of the final dataset. 5% * c x l TC 6% * c * Average missing value ratio. a: [12], the population data are from [14]. b: Own categories based on the country data of [12]. c: [11].

Community-Based Model Reduction
Our goal is to map the time-and space-varying relationship between COVID-19 (furthermore dependent) variables (Y := {y 1 , .., y M }) and indicators from GovData360 and TCdata360 platforms (furthermore independent variables, X := {x 1 , .., x N }). To obtain an easily interpretable, comprehensive picture from these connections, similar GovData360 and TCdata360 indicators are grouped and characterized by latent variables. The applied steps are as follows.
First, standardized independent variables that have higher absolute (Spearman) rank correlation than an α parameter with at least one dependent variable are selected and denoted as X ⊆ X . Formally: Then, the rank correlation matrix of the selected variables is used as an adjacency matrix A, in which absolute rank correlation values below a β parameter are substituted by 0. Formally: Note that the adjacency matrix A defines a network, where the vertices are the selected variables (V = X), edges are indicated by the nonzero values (e ij ∈ E ⇔ [A] ij = a ij > 0), and their weight is the absolute rank correlation between the selected variables (w : E → R + , w(i, j) = a ij ). (Note that the same strategy was applied by, e.g., [15,16], to visualize variable similarity.) To group similar variables, our goal is to separate this network into groups of vertices that have fewer connections between them than inside the communities. In the literature, this task [17] is referred to as modularity-based community analysis (see, e.g., [18]) or simply community detection (see, e.g., [19]). Although the proposed method may seem more complicated than traditional model reduction methods, they cannot be used because of the large number of variables and relatively few observations. In addition, the visualization of the (correlation) network facilitates control over community formation (especially if N is large). This benefit is realized by using the Louvain community detection method [20] with an associated filtrating procedure that gains heterogeneity between the groups by eliminating weakly connected group members.
After Louvain community detection, we obtain C := {c 1 , .., c n } communities on G(V, E), which specifies {G c 1 , .., G c n } = G partitions of network G. As a next step, each community is represented by a single composite (so-called latent) variable (x c i ) obtained by the weighted linear combination of member variables: where is the eigenvector centrality of node j, and λ > 0 is a constant. Louvain modularity and eigenvector centrality were calculated using the cluster_louvain and eigen_centrality functions of igraph (version: 1.2.4.2) R package, respectively [21]. Note that the use of eigenvector centrality as a weight ensures that deeper embedded variables (within the given community) play a greater role in the formation of the latent variable. (Also note that the use of standardized independent variables results in standardized latent variables).
To increase the homogeneity of communities, we calculate the absolute (Spearman) rank correlation of each variable within the community i with the related latent variablex c i , and variables that have weaker absolute rank correlation than a γ parameter are removed from their communities. Finally, the steps of this paragraph (from community detection to filtration) are repeated until no more variables can be eliminated. Note that while the proposed algorithm finds strongly interrelated indicators, correlations between modules still can exist. Thus, completely independent communities are not guaranteed. Although the use of factor analysis with orthogonal rotation may result in independent communities, as there are more observations than variables in the studied dataset, the use of this method is not recommended. At the end of the process, we rank communities (characterized by latent variables) by their absolute rank correlation with dependent variables. Then, we select the top C ≤ n interpretable communities and investigate their relationship with the dependent variables through their absolute rank correlation coefficients. To examine the regional differences in addition to the study of time-varying relationships, these correlations are identified as well on the subset of different regions (see variable r * in Table 1). The calculation steps are summarized in Figure 1.

Results and Discussion
Following the calculation order presented above, we first illustrate and interpret the results of community detection to identify the most important topics reflected in official COVID-19 reports. Then, we investigate the time-and space-varying relations of these communities with different pandemic variables.

Topics Most Related to COVID-19 Reports
When setting α, β, and γ parameters, our goal was to group the widest possible range of important GovData360 and TCdata360 indicators without obtaining communities that are difficult to interpret. To accomplish this goal, we set the parameters as α = 0.535, β = 0.828, and γ = 0.770, which resulted in a network containing 319 indicators (vertices) and 1669 edges, representing the strong correlations among the indicators. Figure 2 illustrates five communities (C = 5) detected within this network and helps their interpretation with word clouds generated on the basis of the names of member indicators. Word clouds were constructed after text cleaning and pre-processing by using the wordcloud (version: 2.6) R package [22]. Since pre-processing was based on a frequency list of terms contained by the names of the indicators, overlapping terms may occur in the composed word clouds. For example, the term "infrastructure" is included in several indicator names such as Electricity and telephony infrastructure and Quality of air transport infrastructure, even though they relate to different aspects of infrastructure.
1 Figure 2. Detected communities. The five communities of correlated GovData360 and TCdata360 indicators. Indicators are denoted by vertices and the edges representing the strong correlations among them. The interpretation of communities is supported by word clouds generated on the basis of the names of member indicators. Note that the applied method may not create purely separable communities, so the same words can occur in different word clouds. This can also occur if the same word appears in the names of significantly different indicators.
As Figure 2 shows, the topics most related to official COVID-19 reports are (1) tourism and trade, (2) infrastructure and digitalization, (3) business and ICT, (4) regulation and corruption, and (5) protectionism. From these, tourism and trade associates with the flow of people and goods, as reflected by the most frequent terms such as tourism, travel, merchandise, and imports. The contribution of the database also confirms this finding since most indicators of this community are part of the World Travel & Tourism Council (27%), United Nations Conference on Trade and Development Statistics (19%), and World Integrated Trade Solution (15%) datasets. The second community describes the infrastructure, especially in the field of digitalization, including variables such as ICT access, public services, and secure internet servers/million pop. Most of these variables are derived from the World Economic Forum Global Competitiveness Index (34%), Global Innovation Index (24%), and World Development Indicators (17%) databases. The third community, so-called business and ICT, is adjacent to infrastructure and digitalization. As its name suggests, it is in connection with information and communication technology (ICT); however, this community focuses more on business aspects such as innovation, efficiency, and competitiveness. The group regulation and corruption includes variables such as regulatory quality, political environment, and corruption. The sources of most of these variables are the World Justice Project-Rule of Law (31%), Global Innovation Index (25%), and Global State of Democracy (19%). Finally, the fifth community is labeled protectionism because all of its variables are related to tariffs. The variables for each community are detailed in the Supplementary Materials.

Relations with COVID-19 Reports
To visualize the absolute rank correlation between the COVID-19 variables and the communities characterized by latent variables, radar charts are employed. In Figure 3, these relations are classified into three groups. The first focuses on the time elapsed since the first registered data, while the other two relate to the officially reported cases and deaths per capita aggregated by using different time windows. As Figure 3 shows, indicators measuring the appearance of the virus are strongly correlated with tourism-and trade-related activities. Taking a closer look at the standalone variables within this community, GCI 4.0: Air transport and outbound travel and tourism expenditure have the strongest (Spearman) rank correlation coefficients with the days passed since the first case was reported (0.789 and 0.777, respectively). Moreover, this COVID-19 indicator has a strong connection with variables of international trade as well, such as Merchandise: Trade matrix by product groups, imports (0.755), Index Of Export Market Penetration (0.736), and Number of export partners (0.707) from the same community. In light of these close relationships, it may be surprising to find that this community has a relatively weak connection with the reported number of cases and deaths; however, increased controls at airports and the rapid closure of borders could be reflected in this result.
In contrast to tourism and trade, these COVID-19 indicators are closely linked to the other four communities, especially to infrastructure and digitalization. From this community, variables related to digital development such as fixed broadband subscriptions (per 100 people) and online creativity show the strongest positive (Spearman) rank correlation with the number of deaths per capita (0.659 and 0.653, using the 60-day time window), which suggests that a significantly higher death toll has been reported by more developed countries. It is also reflected by the positive correlation of this COVID-19 variable with A. Health indicator calculated from healthy life expectancy (0.582) as well as by its strong negative relationship with GCI 4.0: Exposure to unsafe drinking water (−0.676).
On the basis of these results, while the data suggesting the appearance of the virus seem to be reliable and relatively easy to explain, reports on cases and deaths appear highly distorted. On the one hand, this distortion may be a consequence of the poor health infrastructure that makes measurement difficult, but on the other hand, political interests could also be tied to underreporting. Since the regulation and corruption community's regulatory quality and freedom from corruption score indicators have a strong positive correlation with the reported number of cases (0.560 and 0.533, respectively) and deaths (0.565 and 0.523, respectively, using the 60-day time window), the reports of countries with higher levels of corruption seem much less authentic. Furthermore, detected communities contain strikingly many indicators related to the development of the information society, which counteracts disinformation.
To support discipline-specific research, we detailed the correlations of each member variable with different COVID-19 indicators in the Appendix A. These correlations can provide a deeper understanding of phenomenons mapped by using latent variables. For instance, the strongest correlations were found between the spread of the virus and the latent variable of the tourism and trade community. On the basis of Table A1, the strength of these relationships mainly is due to the indicators such as GCI 4.0: Air transport, International tourism and number of arrivals, and Outbound Travel & Tourism Expenditure, or more generally, due to the number of inbound and outbound travels. Similar to latent variables (see Figure 3), these standalone indicators typically show increasing correlations with time window expansion; however, this change over time can vary significantly from region to region.

Regional Differences
To examine how the results presented in the previous subsection differ from region to region, countries are divided into four groups by using the region ID variable (denoted as r * in Table 1). These groups are Europe, Asia, Americas, Africa, and Oceania; however, the last group was omitted from the investigation due to its small sample size (two countries). The regional differences in the relations of COVID-19 and latent variables are presented in Figure 4.
On the basis of Figure 4, we can conclude that the impact of tourism and trade on the spread of the virus is significant regardless of region; however, the variable measuring the appearance of the first case shows the highest (Spearman) rank correlation with this community in the Americas and Europe. In these two regions, variables such as GCI 4.0: Air transport (0.803 and 0.791, respectively) and government spending on travel and tourism service (0.881 and 0.715, respectively) have one of the highest correlations with days elapsed since the first case. Moreover, in the Americas and Europe, this community, and especially its tourism-and air-transport-related indicators, shows an increasingly close relationship with the number of registered deaths per capita as the time window expands. Accordingly, regulations on foreign travel restrictions and airport controls are particularly important in these regions.
Next to the Americas and Europe, in Asia, variables measuring the spread of the virus are also strongly tied to the tourism and trade community, but these variables have a stronger rank correlation with the data related to first death. Unlike other regions, reports from Asian countries are mostly related to infrastructure and digitalization and protectionism communities; however, even these relations appear weak in comparison with the relationships detected in other regions. To obtain stronger ties, it may be worthwhile to map the topics that contain the most important variables separately for this region. Finally, reported data both in Africa and Europe have remarkably close connections with the regulation and corruption community, especially with indicators such as political environment and freedom from corruption score. While in Africa these variables are typically related to reported case numbers (0.568 and 0.551), in Europe they show a stronger correlation with deaths (0.495 and 0.583, using the 60-day time window, respectively), which suggests that the reports of these regions are less credible. Note that the correlations of standalone variables calculated on different regional subsamples are contained in the Supplementary Materials.

Measurement Error Analysis
As it was discussed in Section 3.2, while the data suggesting the appearance of the virus seem to be reliable and relatively easy to explain, reports on cases and deaths appear highly distorted. Since these measurement errors can affect the community detection outcomes through the distorted rank correlation coefficients, in this section, we conduct a simulation to analyze the validity of our results under the presence of measurement errors.
During the simulation, by using the multiplicative measurement error model, we added random measurement errors to the dependent variables and conducted all the calculation steps described by Figure 1. The applied measurement error model is as follows: where the reported and the masked dependent variables are denoted by y i and y * i , and ε i is an independent random variable following a normal distribution with a mean 1 and standard deviation σ ε . During the simulation, y * i was estimated by using different σ ε values, then the resulted communities were examined. In order to characterize the structure of the communities we investigate the number of vertices and edges to the correlation network of the variables, number of communities found, and number of included variables in each community (see Table 2). The calculation parameters used are as follows: α = 0.535, β = 0.828 and γ = 0.770. * Original case, where y i = y * i .
Although number of extracted communities does not change significantly when σ ε ≥ 0.1, both the structure of the correlation network and the sizes of the communities start to vary. On the basis of this additional calculation, communities presented in Section 3.1 are stable in the event of a small or moderate measurement error.

Conclusions and Future Work
Although some of the recent studies have already investigated the relationship of COVID-19 data with different socioeconomic indicators, the role of societal considerations in the formation of official COVID-19 data is not yet clear. In contrast to these studies, our goal was to map, analyze, and monitor a wide range of such relationships in time and space by applying a data-driven approach. To this end, we employed a rank-correlation-based complex method and a dataset that contains 1159 pre-processed indicators from the World Bank Group GovData360 and TCdata360 platforms and an additional 8 COVID-19 variables generated on the basis of the officially reported number of cases and deaths.
From our results, the topics most related to official COVID-19 reports are tourism and trade, infrastructure and digitalization, business and ICT, regulation and corruption, and protectionism. By examining these topics and the variables they compress, we found that tourism-and air-transport-related variables are key factors in the spread of the virus, especially in the Americas and Europe. In these two regions, the variables of the tourism and trade community show close connections with the reported death toll as well, which also emphasizes the importance of regulations on foreign travel restrictions and airport controls. In addition, the number of reported cases and deaths seems unreliable since developed countries generally reported more cases and deaths than developing countries. In line with the results, the two possible reasons for underreporting may be the poor health infrastructure that makes measurement difficult and the political will that is opposed to exploring and presenting the real epidemiological situation. Accordingly, we experienced the closest relationship between the level of corruption and reported data in Europe and Africa.
Using the proposed analysis, further interesting regional and temporal patterns can be identified, as the data will be updated over time. To support this research, we attach an R Notebook file (see Supplementary Materials) that not only updates the dataset but is also able to conduct all the analysis steps, including variable filtering and the compilation of figures. As a further advantage, this source code can be easily customized and allows researchers to apply arbitrary time frames during the analysis. Finally, in the Appendix A, we provide all the relationships identified during the analysis to support discipline-specific investigations.
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1. File B.1: Metadata. The metadata of GovData360 and TCdata360 indicators used. File B.2: Regional correlations. Standalone correlations in the regional dataset. File B.3: Source of COVID-19 data. The source of the COVID-19 dataset collected by [13]. According to the place of their publication, these sources are websites of ministries of health (43%), websites of public health institutes (9%), websites from other national authorities (6%), WHO websites, and WHO situation reports (2%), and official dashboards (10%). Besides, ECDPC screens social media accounts maintained by national authorities, for example Twitter, Facebook, YouTube, or Telegram accounts run by ministries of health (28%) and other official sources (e.g., official media outlets) (2%). File B.4: Data generation and analysis. Datasets were generated and analyzed with R Notebook, which can be used to update datasets and customize the analyses.

Conflicts of Interest:
The authors declare no conflict of interest.