Article

Research on the Development and Application of the GDELT Event Database

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Data 2025, 10(10), 158; https://doi.org/10.3390/data10100158
Submission received: 3 July 2025 / Revised: 25 September 2025 / Accepted: 28 September 2025 / Published: 1 October 2025

Abstract

This study investigates the development and application of the GDELT (Global Database of Events, Language, and Tone) news database. Through experiments, we conducted a quantitative statistical analysis of the GDELT event database to evaluate its practical characteristics. The results indicate that although the database achieves comprehensive coverage across all countries and regions and includes most major global media outlets, the accuracy rate of its key fields is only approximately 55%, with data redundancy as high as 20%. Based on these findings, while the GDELT data demonstrate good coverage and data integrity, data correction and deduplication are recommended before their use in research contexts and industrial applications. A survey of the existing literature then reveals that current studies using GDELT primarily focus on event-related metrics, such as event quantity, tone, and GoldsteinScale, for application in international relations analysis, crisis event prediction, policy effectiveness testing, and public opinion impact analysis. Nevertheless, news constitutes a fundamental channel of information dissemination in media networks, and the propagation of news events through these networks represents a critical area of study for information recommendation, public opinion guidance, and crisis intervention. Existing research has employed the Event, GKG, and Mentions tables to construct cross-national news flow network models. However, the informational correlations across different data table fields have not been fully leveraged in preliminary data selection, leading to substantial computational overhead. To advance research in this field, this study employs linked queries on the Event and Mentions tables within GDELT. Using social network analysis, we construct a media co-occurrence network of event reports, through which core hubs and associative relationships within the event dissemination network are identified.

1. Introduction

Events refer to the interactions between entities within specific temporal and spatial contexts, such as cooperation among individuals and communication between countries. To gain deeper insight into the patterns of human society, scholars worldwide have devoted efforts to collecting and organizing data on various societal events. Early event databases, such as COPDAB (Conflict and Peace Data Bank), WEIS (World Event Interaction Survey), GSR (Gold Standard Reports), and SPEED (The Social, Political and Economic Event Database Project), were primarily constructed through manual curation. This process required experienced practitioners to collect the relevant news reports from major newspapers and then organize and code them into event data. With the development of global internet technology, the volume of event data has grown explosively. The sheer scale of information made it infeasible to rely solely on manual processing within limited timeframes. Consequently, researchers began to incorporate automated text processing technology to enable machine-assisted encoding and organization of events. This shift led to the development of event databases such as KEDS (Kansas Event Data System), Phoenix, ICEWS, and GDELT. Among these, GDELT is the most widely used, especially in international relations analysis [1,2], event detection and prediction [3,4], and analysis of the development of social events [5,6].

2. Overview of GDELT and Exploration of Practical Properties

The GDELT database is an open-source event database project co-founded by Yahoo! Inc. (New York, United States) and Georgetown University, with Google Inc. (California, United States) responsible for data processing and storage [7]. The database fetches the latest article information from major news sources around the world every 15 min and encompasses a vast volume of event records dating from 1979 to the present. Since its initial release in 2012, GDELT has undergone two major versions. The second version (updated in February 2015) expanded the original structure by introducing the Mentions and GKG (Global Knowledge Graph) tables [8]. However, the core of this database is still the Event table, which captures key information such as event participants and event types (Table 1). The Mentions table mainly records information about the indexing of the event within source documents, while the GKG table mainly stores information about the context of the article involved in the event.
To further evaluate the data quality of the GDELT database, we carried out statistical experiments on five of its attributes: event volume, geographic coverage, media coverage, accuracy, and redundancy. The results of these analyses provide a reference for subsequent research.

2.1. Data Volume

We counted the annual data volume of the Event table of the GDELT database over the ten-year period from 2015 to 2024 (Figure 1). The annual data volume consistently reaches tens of millions of records, with an average of about 63 million records per year. This scale of data is sufficient to support big data-driven research applications.

2.2. Geographical Coverage

To explore the geographical coverage of the event database, we examined the presence of all 245 country and region codes from the global coding standard within GDELT. The results demonstrate that the GDELT database achieves 100% global coverage. We then examined the geographical distribution of the data by counting the number of events in each region. The top 10 countries and regions by volume of event data are the United States, the United Kingdom, Russia, India, China, Israel, Nigeria, France, Canada, and Australia, which together account for 51.7% of events (Table 2). These results show that the GDELT database contains more data about the United States and European countries, followed by emerging powers such as China and India, indicating that the Western world still dominates the global media.

2.3. Media Coverage

To measure the coverage of GDELT’s global news media, we selected the top 500 news websites from the Alexa global online media traffic rankings as a reference set. We then matched the main domain names recorded in the database against this set and calculated how many of these websites are included. The results show that the GDELT database has 94.4% coverage of mainstream media. We also examined how events are distributed across media outlets. The top 10 media outlets in terms of data volume are MSN, Reuters, Love Radio, Daily Mail, Yahoo, Times of India, PanAfrican.com, Houston Chronicle, Washington Times, and San Francisco Chronicle (Table 3). These results show that GDELT covers most of the mainstream media around the world and that, in terms of the distribution across specific media, outlets originating from the United States and other English-speaking countries are the most numerous.
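The coverage calculation described above can be sketched as a set intersection between a reference list of domains and the domains seen in the database. The following is a minimal illustration only: the function names, the sample domains, and the simple rule of stripping a leading `www.` are assumptions made for this sketch, not the matching procedure used in the study.

```python
from urllib.parse import urlparse


def main_domain(url: str) -> str:
    """Return the lower-cased host of a URL, without port or leading 'www.'."""
    host = urlparse(url).netloc.lower()
    host = host.split(":")[0]  # drop an explicit port, if present
    return host[4:] if host.startswith("www.") else host


def coverage(reference_domains, source_urls):
    """Share of reference domains that appear among the source URLs."""
    reference = {d.lower() for d in reference_domains}
    if not reference:
        return 0.0, set()
    seen = {main_domain(u) for u in source_urls}
    matched = reference & seen
    return len(matched) / len(reference), matched


# Hypothetical reference set and source URLs, for illustration only
reference = ["reuters.com", "msn.com", "example.org", "bbc.co.uk"]
urls = [
    "https://www.reuters.com/world/x",
    "http://msn.com/en-us/news",
    "https://www.bbc.co.uk/news/1",
    "https://other.net/",
]
share, matched = coverage(reference, urls)
print(f"coverage = {share:.1%}")
```

URLs without a scheme would need extra handling, since `urlparse` only fills `netloc` when a scheme or leading `//` is present.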

2.4. Data Accuracy

The event data in GDELT are generated through automated machine extraction and coding, which may introduce errors. To measure the accuracy of the event data, we randomly sampled event records from the GDELT database and manually compared the participant and event type fields with the original text of each record to evaluate the correctness of field extraction and coding. According to the simple random sampling sample size calculation in Equation (1), proposed in Reference [9], and considering the required confidence level and maximum allowable error, the minimum required sample size is 385. Therefore, three sets of non-duplicate event records with directly accessible source links were randomly selected from GDELT data during 2021, resulting in a total of 1200 records. Three researchers familiar with the GDELT database each independently reviewed the original text of 400 event records, located the relevant sentence segments, and evaluated the accuracy of the field annotations based on their understanding of the text. A field was marked as incorrect when the evaluator deemed its content inconsistent with, or inappropriate to, the semantics of the original text; a field whose content was merely incomplete was still recorded as correct. Statistics were compiled separately for source articles in English and in other languages. The results show that the accuracy rate is about 56% for English articles and about 53% for non-English articles (Table 4). Even for the relatively more accurate English-derived data, appropriate preprocessing is recommended before conducting research that requires high data precision. It should be noted that, as the data were extracted from events in 2021, the accuracy could change over time.
$$
n = \frac{1}{\dfrac{1}{N} + \dfrac{d^{2}}{u_{\alpha/2}^{2} S^{2}}} = \frac{N u_{\alpha/2}^{2} S^{2}}{N d^{2} + u_{\alpha/2}^{2} S^{2}} \tag{1}
$$
In the formula, $n$ is the minimum sample size, $N$ is the population size, $u_{\alpha/2}$ is the standard normal distribution value corresponding to the confidence level, $S^{2}$ is the estimated value of the population variance, and $d$ is the maximum allowable error.
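As a worked check of the sample size formula, the short sketch below evaluates it numerically. The inputs are assumptions chosen for illustration, since the text does not state them: a 95% confidence level (value 1.96), a maximum allowable error of 0.05, the worst-case variance of a proportion (0.25), and a population of 63 million records. Under these assumed inputs the result rounds up to the minimum of 385 reported above.

```python
import math


def min_sample_size(N: float, d: float, u: float, S2: float) -> float:
    """Minimum simple-random-sample size from the closed form of the formula."""
    return (N * u**2 * S2) / (N * d**2 + u**2 * S2)


def min_sample_size_alt(N: float, d: float, u: float, S2: float) -> float:
    """Same quantity written in the reciprocal form of the formula."""
    return 1.0 / (1.0 / N + d**2 / (u**2 * S2))


# Assumed inputs (not stated in the text): 95% confidence, 5% error,
# worst-case variance p(1 - p) = 0.25, and a hypothetical population size.
n = min_sample_size(N=63_000_000, d=0.05, u=1.96, S2=0.25)
print(math.ceil(n))  # -> 385
```

Because the population is so large, the finite-population term has almost no effect here; the result is essentially the ratio of 1.96² × 0.25 to 0.05².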
In the experiment evaluating the accuracy of GDELT fields, we also compiled several typical information extraction errors and summarized their causes, as detailed in Table 5.

2.5. Data Redundancy

Different media outlets often report on the same event, and a single outlet may publish multiple updates about it. Because GDELT obtains event records from many data sources, redundant data exist in the event database. Although the database performs some deduplication when event data are imported, a non-negligible amount of redundancy remains in practice. We therefore designed an experiment to measure data redundancy. Measuring redundancy accurately would require comparing the original information of all relevant events. Because of the large amount of data, we simplified the criterion for identifying duplicates: events with the same participants, type, and location that occur within 24 h of each other are regarded as duplicated data. Given the sheer volume of news events originating from the United States and the wide range of media sources covering them, sampling US-based events provides a large dataset with a balanced distribution that can reflect the data redundancy within GDELT. In the experiment, events from the United States in 2021 were selected, and the matching records from 48 days (the 1st, 7th, 14th, and 21st of each month) were extracted to statistically analyze and infer the overall redundancy of the database. The redundancy index of the event database was defined as follows:
$$
R_d = \frac{n_i}{N_i}
$$
In the formula, $N_i$ represents the total number of deduplicated event records on day $i$, $n_i$ represents the number of duplicate records on day $i$, and $R_d$ represents the proportion of duplicate events collected in the event database. The sampling statistics are shown in Table 6.
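The simplified duplicate criterion (same participants, type, and location within 24 h) can be sketched as follows. This is an illustrative reading rather than the procedure used in the experiment: the `Event` record, its field names, and the choice to compare each record against the most recent retained record with the same key are assumptions. The ratio follows the definitions given above, dividing the number of duplicates by the number of records retained after deduplication.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical, simplified record; real GDELT rows carry many more fields
    actor1: str
    actor2: str
    event_code: str
    location: str
    time: datetime


def split_duplicates(events, window=timedelta(hours=24)):
    """Split events into retained records and duplicates of a retained record."""
    last_kept = {}  # key -> time of the most recent retained record
    kept, dups = [], []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.actor1, ev.actor2, ev.event_code, ev.location)
        prev = last_kept.get(key)
        if prev is not None and ev.time - prev <= window:
            dups.append(ev)
        else:
            kept.append(ev)
            last_kept[key] = ev.time
    return kept, dups


def redundancy(events) -> float:
    """Duplicates divided by deduplicated records, per the stated definitions."""
    kept, dups = split_duplicates(events)
    return len(dups) / len(kept) if kept else 0.0


t0 = datetime(2021, 1, 1)
sample = [
    Event("USA", "CHN", "010", "US", t0),
    Event("USA", "CHN", "010", "US", t0 + timedelta(hours=3)),   # duplicate
    Event("USA", "CHN", "010", "US", t0 + timedelta(hours=30)),  # outside 24 h
    Event("USA", "GBR", "010", "US", t0),                        # other actor
    Event("USA", "CHN", "010", "CA", t0),                        # other place
]
print(redundancy(sample))  # -> 0.25
```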

3. Typical Application Status

In recent years, the application of automated coding techniques has enabled research based on massive event datasets. Through systematic analysis, comparison, and evaluation of the ICEWS and GDELT open-source event repositories, Li et al. found that GDELT offers advantages such as broad coverage and rapid data updates [10]. Consequently, it has been widely applied in research fields including crisis event monitoring, economic indicator forecasting, and international relations assessment. Although the GDELT database suffers from limited accuracy in key fields (about 55%) and high data redundancy (up to 20%), recent research [11] has proposed a two-layer deduplication method to address duplication within GDELT. This approach improves the quality of raw data, thereby enhancing its effectiveness and reliability, a conclusion consistent with our own redundancy analysis.

3.1. In Terms of Crisis Event Monitoring

By mining and analyzing refugee-related news after the death of Alan Kurdi, a two-year-old Syrian boy, Jayanetti et al. [12] depicted how media attention built up, peaked, and subsided before and after the event, and used geospatial information to track hotspots prone to hostility toward refugees, which provided a methodology for the predictive monitoring of crisis events. In a related study, Qiu Lin et al. [13] used GDELT as the data source, calculated global and local conflict indices from indicators such as the number of events, their impact, and the attention they received, and developed a distance-based time series conflict detection method, which provides support for conflict early warning.

3.2. In International Relations Research

Chi Zhipei et al. [14] used the GDELT event database to assess Sino-US relations between 1993 and 2016. They compared the results with concurrent measurements from the “China and Great Powers Relationship Score Table” published by Tsinghua University, demonstrating the feasibility of using GDELT for quantitative analysis of international relations. Li Bing et al. [15] adopted an ordered clustering method to divide the geopolitical relations between China and Southeast Asian countries into time phases. Using social network analysis, they identified central actors within the network and, combining mathematical statistics with data analysis, studied the temporal and spatial evolution of cooperation and conflict between these countries.

3.3. In Terms of Policy Effect Testing

Following the “South China Sea ruling case”, Lin Qiaoqiao et al. [16] used quantitative analysis of GDELT data to depict the role of China’s military foreign policy in eliminating the negative impacts. This analysis avoided the information lag and narrow sampling of traditional public opinion analysis methods, and provided a timely and effective reference for policy evaluation and adjustment. Based on GDELT data, Wang Jinbo et al. [17] employed multiple, multi-period differencing methods, taking the institutional, cultural, and cognitive distances between Southeast Asian countries and China as the explanatory variables, to explore the impact of the “Belt & Road” initiative on China’s image.

3.4. In Terms of Public Opinion Impact Analysis

Le Viet Hoang et al. [18] explained the impact of news sentiment related to the Russian-Ukrainian conflict from October 2021 to June 2022 on the airline and defense industry markets, as well as on the direction of the airline stock market. In addition, scholars have also used GDELT to analyze the impact of media reports on attracting foreign investment [19], outbound investment [20], inbound tourism [21], and cross-border e-commerce [22]. These studies extend the application of GDELT data analysis into exploratory research in the economic field.
A survey of current applications reveals that scholars have widely used the GDELT database to carry out research on specific problems in different fields. Many of the analytical models constructed in these studies rely on metrics such as the volume of event reports over time, the proportion of cooperation/conflict, the degree of tone, and GoldsteinScale statistics to assess national influence, international relations, political and economic trends, etc. Beyond these specific event-driven applications of mathematical statistics, a growing body of research has begun to use GDELT to analyze the structural characteristics of the global media ecosystem itself. For instance, studies [23,24,25] independently employed the Event, GKG, and Mentions tables to construct cross-national news flow network models and identify “super-spreader” media organizations, which represents an expanded application of GDELT datasets. However, such research must first address how to match media outlets with their respective countries. Existing methods typically rely solely on domain name queries using the SOURCEURL, SourceCommonName, and MentionSourceName fields to determine the country of origin for event coverage. This approach has at least three shortcomings. First, since GDELT’s media sources encompass diverse news outlets across nearly all regions worldwide, the domain matching process becomes labor-intensive if isolated reports are not excluded beforehand [26]. Second, GDELT data are collected at 15-min intervals, and existing studies constructing directed networks generally treat all participating media within the first 15 min as initial disseminators, neglecting the mutual information transmission among them, which introduces errors. Third, since most researchers use GDELT for social science research, the heavy reliance of existing methods on mathematical models presents challenges for some researchers.
Therefore, this paper explores event and mention linkage queries, proposing a novel application of co-occurrence network analysis methods to examine media dissemination characteristics. First, events with reposting/citation relationships are filtered using the NumSources field in the Event table to reduce the number of media outlets and source countries requiring subsequent matching. Then, by linking the Event and Mentions tables, media outlets with reposting/citation relationships are identified to construct a media co-occurrence network, avoiding the error of treating all media outlets within the first 15 min as initial disseminators. Finally, core media outlets within the network are identified through multi-event co-occurrence analysis, with Gephi software (version 0.10.1) providing a more straightforward and intuitive implementation approach.

4. Media Co-Occurrence Network Analysis

Our media co-occurrence network analysis is grounded in an analytical framework that integrates information diffusion theory [27] and social network analysis [28]. In this framework, media organizations are treated as nodes in a network, and edges represent information flows between them. The identification of central nodes (hubs) helps reveal which organizations wield the most influence in shaping narratives around specific events or topics. Media co-occurrence refers to different media carrying the same report on the same event at a certain time because of a reprinting or citation relationship; when similar events are reported, specific media act as central nodes (hubs) of communication.

4.1. Constructing Co-Occurrence Network

In this study, we take media as nodes and the citation or reprint relationships between them as edges to construct a “node-edge” network topology. This structure allows for the visualization and analysis of information propagation pathways among media outlets. Specifically, if different media report very similar stories (same content, word count, and emotional value) about an event within 24 h, it is assumed that there is a reprinting or quoting relationship between them. Each such instance is defined as one co-occurrence, from which an undirected weighted co-occurrence network is constructed:
$$
G = (V, L, E, d, w)
$$
where $E$ denotes the set of source media and all reprinting media that reported on an event, $L$ denotes the set of edges formed by the co-occurrence relationship between pairs of these media, and $V$ denotes all the events included in a certain type of report in the study interval. An undirected weighted co-occurrence network has two important feature parameters, $d$ and $w$.

4.1.1. Node Centrality (d)

Centrality characterizes the number of connections a node has to other nodes within the network. There is no distinction between outgoing and incoming degrees in an undirected network, and the value of the degree increases when two nodes co-occur. Node centrality serves as an indicator of its importance within the network: higher centrality values correspond to greater node significance, reflecting the broader influence of the corresponding media outlet.
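$$
d_i = \sum_{j \in E} \delta_{ji}, \qquad
\delta_{ji} =
\begin{cases}
1, & i \text{ and } j \text{ are directly connected by an edge} \\
0, & i \text{ and } j \text{ are not directly connected by an edge}
\end{cases}
$$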

4.1.2. Edge Weight (w)

Edge weight indicates the total number of connections between two specific nodes, i.e., the number of edges between them. A higher edge weight usually indicates a stronger association between the two nodes, reflecting a closer relationship between the corresponding media.
$$
w_L = \sum_{L \in V} \delta_L
$$
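The two parameters can be illustrated with a small sketch that builds an undirected weighted co-occurrence network from per-event media lists. The function names and the toy input are hypothetical. Here the degree of a node is read as its number of distinct neighbours and the weight of an edge as the number of events in which the two media co-occur; this is one interpretation of the definitions above, not necessarily the exact computation performed in Gephi.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(events):
    """Count co-occurrences per unordered media pair.

    events: mapping of event id -> iterable of media names that
    carried matching reports of that event.
    """
    weights = Counter()
    for media in events.values():
        # Each unordered pair of distinct media in one event adds weight 1
        for a, b in combinations(sorted(set(media)), 2):
            weights[(a, b)] += 1
    return weights


def degrees(weights):
    """Number of distinct neighbours of each node in the undirected network."""
    neighbours = {}
    for a, b in weights:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {node: len(adj) for node, adj in neighbours.items()}


# Toy input with hypothetical media names
toy_events = {1: ["A", "B", "C"], 2: ["A", "B"], 3: ["B", "D"]}
w = build_cooccurrence(toy_events)
d = degrees(w)
print(w[("A", "B")], d["B"])  # -> 2 3
```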

4.2. Data Screening Processing

In the GDELT database, no field directly describes the reprint or citation relationship between media outlets, nor has previous literature addressed this specific linkage. Therefore, after examining the relationships between the fields of the different GDELT tables, we integrated data from both the Event and Mentions tables for analysis. Since the GlobalEventID field in the Event table corresponds to the GLOBALEVENTID field in the Mentions table, GlobalEventID can serve as a foreign key linking the two tables when different media report on the same event (i.e., the value of the NumSources field in the Event table is >1). In the resulting linked query on the Mentions table, if different media (MentionSourceName) report on the same event with the same number of words, sentence sequence, and sentiment value (MentionDocLen, SentenceID, MentionDocTone), we infer a reprinting or quoting relationship between the reports of those media on the event (Table 7). From the GDELT data returned by this query, the node data and weighted edge data are generated based on the co-occurrence relationship to construct the co-occurrence network of media reports about a certain type of event, which can be imported into the Gephi software for visualization and analysis.
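A linked query of this kind can be sketched with an in-memory SQLite database. The table names `events` and `mentions`, the reduced column sets, and the sample rows are hypothetical stand-ins for the GDELT Event and Mentions tables; only the field names come from the text. The self-join pairs media whose mentions of the same multi-source event agree exactly on MentionDocLen, SentenceID, and MentionDocTone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (GlobalEventID INTEGER PRIMARY KEY, NumSources INTEGER);
CREATE TABLE mentions (
    GLOBALEVENTID INTEGER,
    MentionSourceName TEXT,
    MentionDocLen INTEGER,
    SentenceID INTEGER,
    MentionDocTone REAL
);
""")
# Hypothetical sample rows
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 3), (2, 1)])
conn.executemany("INSERT INTO mentions VALUES (?, ?, ?, ?, ?)", [
    (1, "a.com", 500, 2, -1.5),
    (1, "b.com", 500, 2, -1.5),   # same length, sentence, tone as a.com
    (1, "c.com", 820, 5, 0.4),    # different article
    (2, "d.com", 300, 1, 2.0),    # single-source event, filtered out
])

QUERY = """
SELECT m1.GLOBALEVENTID, m1.MentionSourceName, m2.MentionSourceName
FROM events AS e
JOIN mentions AS m1 ON m1.GLOBALEVENTID = e.GlobalEventID
JOIN mentions AS m2 ON m2.GLOBALEVENTID = m1.GLOBALEVENTID
    AND m2.MentionDocLen = m1.MentionDocLen
    AND m2.SentenceID = m1.SentenceID
    AND m2.MentionDocTone = m1.MentionDocTone
    AND m1.MentionSourceName < m2.MentionSourceName  -- each pair once
WHERE e.NumSources > 1
ORDER BY 1, 2, 3
"""
pairs = conn.execute(QUERY).fetchall()
print(pairs)  # -> [(1, 'a.com', 'b.com')]
```

Filtering on NumSources before the join keeps single-source events out of the more expensive self-join, which is the cost saving the text attributes to this field.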

4.3. Research Example

Below is a case study of the co-occurrence of Chinese business-related reports in Indian mainstream media. As two of the world’s largest emerging economies, China and India maintain competitive relationships in areas such as trade, technology, and geopolitics. In recent years, India has benchmarked itself against China’s economic progress, resulting in substantial media attention on Chinese business developments. Employing the method proposed in this paper, we study the dissemination patterns formed by Indian mainstream media (top 20 outlets in terms of influence) in their coverage of Chinese business from 2020 to 2024.
(1) Use an SQL statement to filter events (GlobalEventID) in the Event table where Indian mainstream media and other global outlets jointly reported on Chinese business developments, specifically: Actor1CountryCode = ‘CHN’, Actor1TypeCode = ‘BUS’, NumSources > 1, EventDate BETWEEN ‘2020-01-01’ AND ‘2024-12-31’, and SOURCEURL belongs to Indian mainstream media (as detailed in Table 8).
(2) Perform a linked query using GlobalEventID from the Event table and GLOBALEVENTID from the Mentions table as foreign keys. If different media outlets (MentionSourceName) report on the same event (GLOBALEVENTID) with similar word counts (MentionDocLen), sentence sequences (SentenceID), and sentiment values (MentionDocTone), we infer a reposting or quoting relationship between the reports of the aforementioned media on the event.
(3) Designate the filtered media (MentionSourceName) as co-occurring outlets for the event (GLOBALEVENTID). Since GDELT collects data in 15-min intervals, the first reporting media cannot be precisely identified. Therefore, undirected edges with a weight of 1 are formed between each pair of co-occurring media.
(4) Merge all related events into an undirected co-occurrence network. Multiple edges between media are weighted. Core media outlets are identified by calculating node centrality using Gephi software and the network is thus visualized.
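Steps (3) and (4) above can be sketched as merging per-event media pairs into weighted undirected edges and writing them out as an edge list. The function names and sample pairs are hypothetical, and the column headers (`Source`, `Target`, `Type`, `Weight`) are assumed to match what Gephi's spreadsheet import expects; this should be checked against the Gephi version in use.

```python
import csv
import io
from collections import Counter


def merge_pairs(pairs):
    """Merge (event_id, media_a, media_b) rows into weighted undirected edges."""
    weights = Counter()
    for _event_id, a, b in pairs:
        # Sort the endpoints so (a, b) and (b, a) count as the same edge
        weights[tuple(sorted((a, b)))] += 1
    return weights


def edges_to_csv(weights) -> str:
    """Serialize the edges as CSV text with an assumed Gephi-style header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for (a, b), w in sorted(weights.items()):
        writer.writerow([a, b, "Undirected", w])
    return buf.getvalue()


# Hypothetical output of the linked query: one row per co-occurring pair
sample_pairs = [
    (1, "b.com", "a.com"),
    (2, "a.com", "b.com"),
    (2, "a.com", "c.com"),
]
edge_weights = merge_pairs(sample_pairs)
print(edges_to_csv(edge_weights))
```

Each pair contributes a weight of 1 per event, so an edge weight of 2 means the two outlets carried matching reports on two separate events.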
As shown in Figure 2 and Table 9, the co-occurrence network of media outlets during the selected time period contains 227 media nodes connected by 4365 undirected edges. First, The Times of India and Business Standard, the largest Indian comprehensive and business media outlets, form the absolute core of the communication network. They exhibit extensive connections outward to international media such as Reuters, MSN, and The New York Times, and inward to domestic core media or major business media such as News.18, NDTV, India Today, Financial Express, and Moneycontrol, serving as a news dissemination hub. Second, the edge weight between Business Standard and Reuters is the highest, indicating the closest relationship between them. They serve as the core of connection and information dissemination regarding Chinese business topics between Indian media and global media. Third, Indian media have not been found to reprint or cite reports from Chinese official media, suggesting that there is a certain degree of mistrust between the two countries’ public opinion environments.

5. Limitations and Future Work

While this study provides a systematic assessment of the GDELT database and proposes a novel method for constructing media co-occurrence networks, several limitations should be acknowledged, which also pave the way for future research.
Firstly, while our evaluation of GDELT’s data quality offers important insights into accuracy and redundancy, it is subject to certain constraints. The accuracy audit was conducted on a sample from a single year (2021) and may not fully capture temporal variations in GDELT’s machine coding performance. Future work should expand the validation to a multi-year longitudinal analysis and include a wider range of low-resource languages (e.g., Chinese, Arabic) to more comprehensively assess systemic biases and extraction errors. Furthermore, although our method for inferring media reprinting relationships is innovative, the current heuristic (relying on exact matches in word count and tone) might miss semantically similar reports that are paraphrased. Future methodologies could incorporate natural language processing (NLP) techniques, such as semantic similarity models and quote tracing, to capture more nuanced dissemination pathways.
Secondly, although the media co-occurrence network analysis in this study represents a considerable advance over traditional event-counting approaches, its current scope remains confined to the ecosystem of news media. As noted by Ruan et al. [29,30,31], information diffusion in the modern era is a cross-platform phenomenon, spanning from traditional news to social media (e.g., Twitter, Reddit). A critical and promising future direction is to integrate GDELT’s media network with data from social platforms. This would enable researchers to track how events propagate from news sites to social media and vice versa, uncovering a more complete picture of the global information ecosystem.
Lastly, the Western-centric and English-language bias inherent in GDELT, as identified in our coverage analysis, has profound implications. It reflects deeper theoretical issues of agenda-setting and media gatekeeping in global communications. Researchers must be cautious of this bias when drawing conclusions about international relations or public opinion in underrepresented regions. Conversely, this limitation also presents valuable research opportunities to develop bias-aware models or to combine GDELT with regional datasets to create a more balanced and comprehensive view.
In conclusion, despite these limitations, this study underscores the need for rigorous preprocessing of GDELT data and demonstrates the considerable potential of mining inter-table relationships to uncover the hidden structures of media dissemination. The proposed method provides a foundational framework for developing more advanced, cross-platform, and semantically aware models of information diffusion.

Author Contributions

Conceptualization, D.H. and Z.F.; methodology, D.H.; validation, Z.F.; formal analysis, D.H.; investigation and data curation, D.H. and Z.F.; writing—original draft preparation, D.H.; writing—review and editing, Y.P.; funding acquisition and project administration, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Murali, R.; Patnaik, S.; Cranefield, S. Mining International Political Norms from the GDELT Database. In Proceedings of the Coordination, Organizations, Institutions, Norms, and Ethics for Governance of Multi-Agent Systems XIII (COIN/COINE), London, UK, 9 May 2020; Aler Tubella, A., Cranefield, S., Frantz, C., Meneguzzi, F., Vasconcelos, W., Eds.; Springer: Cham, Switzerland, 2021; pp. 39–58.
  2. Wang, Y.; Tao, Y. The effect of fluctuations in bilateral relations on trade: Evidence from China and ASEAN countries. Humanit. Soc. Sci. Commun. 2024, 11, 32.
  3. Owuor, I.; Hochmair, H.H.; Cvetojevic, S. Tracking hurricane dorian in GDELT and twitter. Agil. GISci. Ser. 2020, 1, 19.
  4. Voukelatou, V.; Pappalardo, L.; Miliou, I.; Gabrielli, L.; Giannotti, F. Estimating countries’ peace index through the lens of the world news as monitored by GDELT. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; IEEE: New York, NY, USA, 2020; pp. 216–225.
  5. Deng, S.; Rangwala, H.; Ning, Y. Dynamic Knowledge Graph Based Multi-Event Forecasting. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, San Diego, CA, USA, 6–10 July 2020; pp. 1585–1595.
  6. Zhao, L.; Gao, Y.; Ye, J.; Chen, F.; Ye, Y.; Lu, C.-T.; Ramakrishnan, N. Spatio-Temporal Event Forecasting Using Incremental Multi-Source Feature Learning. ACM Trans. Knowl. Discov. Data 2021, 16, 1–28.
  7. Leetaru, K.; Schrodt, P.A. GDELT: Global Data on Events, Location and Tone, 1979–2012. ISA Annu. Conv. 2013, 2, 1–49.
  8. The GDELT Project. GDELT 2.0: Our Global World in Realtime [EB/OL]. 2015. Available online: https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/ (accessed on 3 September 2025).
  9. Yuan, J.; Li, K. A Comparative Study on Sample Size Calculation Methods. Stat. Decis. Mak. 2013, 1, 22–25.
  10. Li, Z.; Gao, Z.; Zhou, Y.; Ma, Z.; Hu, Z.; Shi, J. Comparative Study and Evaluation of Open-Source Incident Repositories. J. China Electron. Technol. Acad. 2022, 17, 129–133.
  11. Deepti, J.; Regina, W.; Dalton, H.; Ratcliff, S.; Samal, A.; Soh, L.K. Deduplication of the media-based event databases. J. Comput. Soc. Sci. 2025, 8, 76.
  12. Jayanetti, H.R.; Frydenlund, E.; Weigle, M.C. Exploring Xenophobic Events Through GDELT Data Analysis. arXiv 2023, arXiv:2305.01708.
  13. Qiu, L.; Qin, K.; Luo, P.; Yao, B.; Zhu, Z. Quantitative expression of conflict intensity and detection of conflict events based on GDELT news data. J. Geo Inf. Sci. 2021, 23, 1956–1970.
  14. Chi, Z.; Hou, N. A quantitative study of big data and bilateral relations: Taking GDELT and China-US relations as an example. Int. Political Sci. 2019, 4, 67–88.
  15. Li, B.; Peng, F. Evolution of geopolitical relations between China and Southeast Asian countries based on GDELT database. World Geogr. Res. 2021, 30, 1127–1139.
  16. Lin, Q.; Wang, H.; Luan, W. Research on International Impact of “South China Sea Arbitration Case” Based on GDELT Database. China Soft Sci. 2021, 9, 25–33.
  17. Wang, J. “Belt and Road” and Southeast Asian Countries’ Image Perception of China: An Empirical Study Based on GDELT Big Data. Nanyang Stud. 2023, 73–89.
  18. Hoang, V.L.; Jörg, H.M.V.; Stéphane, G.; Goutte, S.; Liu, F. News-based sentiment: Can it explain market performance before and after the Russia-Ukraine. J. Risk Financ. 2023, 24, 72–88.
  19. Cheng, Y.; Cheng, D.; Lu, J. The impact of international public opinion on the introduction of foreign investment in China: An empirical study based on GDELT news big data. World Econ. Res. 2021, 19–33+135.
  20. Jin, Y.; Chen, T.-T. Does negative host country public opinion affect China’s outward foreign direct investment? A test based on GDELT big data. Int. Econ. Coop. 2023, 2, 52–65+93.
  21. Cheng, Y.; Cheng, D.; Li, J. The impact of international public opinion on China’s inbound tourism trade: An empirical study based on GDELT news big database. Soc. Sci. Res. 2022, 2, 113–125.
  22. Xie, Y. Research on the Impact of Bilateral Relations on China’s Cross-border E-Commerce Exports. Master’s Thesis, Zhejiang University, Hangzhou, China, 2023.
  23. Liang, T.; Qin, K.; Ruan, J.; Yu, X.; Zhou, Y.; Liu, D.; Xin, L. Research on Measurement and Community Detection of Geographic Multiple Flow Based on Multi-layer Network Methods. J. Geo Inf. Sci. 2024, 26, 1843–1857.
  24. Mo, X.; Shen, H.; Yu, D. Analysis of Global News Flow Patterns Based on Complex Networks. J. Southwest Univ. (Nat. Sci. Ed.) 2020, 42, 15–24.
  25. Shayan, A.; Niccolò, D.M.; Michele, A.; Etta, G.; Cinelli, M.; Quattrociocchi, W. The drivers of global news spreading patterns. Sci. Rep. 2024, 14, 1519.
  26. Gong, W.; Zhu, M.; Zhang, S.; Luo, J. Media Hegemony, Cultural Circle and the Global Dissemination of the Orientalist Discourse: Taking Public Opinion on China in GDELT as an Example. Sociol. Stud. 2019, 34, 138–164+245.
  27. Rogers, E.M. Diffusion of Innovations; Free Press of Glencoe: New York, NY, USA, 1962.
  28. Wasserman, S.; Faust, K. Social Network Analysis: Methods and Applications; Cambridge University Press: Cambridge, UK, 1994.
  29. Ruan, T.; Qin, L. Public perception of electric vehicles on Reddit over the past decade. Commun. Transp. Res. 2022, 2, 100070.
  30. Ruan, T.; Qin, L. Public perception of electric vehicles on Reddit and Twitter: A cross-platform analysis. Transp. Res. Interdiscip. Perspect. 2023, 21, 100872.
  31. Ruan, T.; Kong, Q.; Zhang, Y.; McBride, S.K.; Lv, Q. An analysis of Twitter responses to the 2019 Ridgecrest earthquake sequence. In Proceedings of the 2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Exeter, UK, 17–19 December 2020; IEEE Computer Society: New York, NY, USA, 2020.
Figure 1. Annual distribution of GDELT data volume in the last ten years.
Figure 2. Indian mainstream media’s dissemination network for commercial reports on China.
Table 1. Description of data field attributes in the Event table of the GDELT database.

| Field Location | Attribute Category |
|---|---|
| 1–5 | EventID and Date Attributes |
| 6–25 | Actor Attributes |
| 26–35 | Event Action Attributes |
| 36–59 | Event Geography |
| 60–62 | Data Management Fields |
Table 2. Top 10 regions in terms of the number of events.

| Ranking | Region | Share of events (proportion) |
|---|---|---|
| 1 | United States | 0.264 |
| 2 | United Kingdom | 0.047 |
| 3 | Russia | 0.039 |
| 4 | India | 0.035 |
| 5 | China | 0.030 |
| 6 | Israel | 0.025 |
| 7 | Nigeria | 0.020 |
| 8 | France | 0.020 |
| 9 | Canada | 0.019 |
| 10 | Australia | 0.018 |
| Total | - | 0.517 |
Table 3. Top 10 media outlets in terms of the number of events.

| Ranking | Media | Country | Share of events (proportion) |
|---|---|---|---|
| 1 | MSN | United States | 0.031 |
| 2 | Reuters | United Kingdom | 0.009 |
| 3 | Love Radio | United States | 0.009 |
| 4 | Daily Mail | United Kingdom | 0.005 |
| 5 | Yahoo | United States | 0.004 |
| 6 | The Times of India | India | 0.004 |
| 7 | Pan-African Network | South Africa | 0.004 |
| 8 | Houston Chronicle | United States | 0.003 |
| 9 | Washington Times | United States | 0.003 |
| 10 | San Francisco Chronicle | United States | 0.003 |
| Total | - | - | 0.079 |
Table 4. Experimental results of field accuracy assessment.

| Event Type (Non-English) | Event Type (English) | Participants (Non-English) | Participants (English) |
|---|---|---|---|
| 0.532 | 0.562 | 0.533 | 0.554 |
Table 5. Typical extraction error cases and root cause analysis in GDELT.

| No. | Original Text | Extraction Results (Actor1/Event Code/Actor2) | Root Cause |
|---|---|---|---|
| 1 | The idea is that the president will meet the governors and analyze the security situations in the various states of the federation before meeting the members of the House of Representatives next Thursday. | House of Representatives/Express intent to meet or negotiate/President | Actors reversed |
| 2 | The San Francisco-based law firm recently filed 85 lawsuits against Uber, mostly in San Francisco County Superior Court, with 321 cases pending, and filed more than 20 lawsuits against Lyft, with 517 cases pending, lawyers told KPIX. | San Francisco/Bring lawsuit against/Lawyer | Locations are randomly selected for the Actors |
| 3 | The Bahrain Shura Council has condemned broadcasts by the Qatari Al Jazeera channel, saying that it lacked credibility and that it has been using an approach to sow chaos, terrorism and violence. | Bahrain/Criticize or denounce/Qatari | Modifier extraction for Actors |
| 4 | Trump arrived in Cleveland aboard Air Force One ahead of the debate. | Air Force/Make a visit/Cleveland | Misunderstanding Proper Nouns |
| 5 | But wait, President Trump is not conceding, according to his lawyer Rudely Guillani, former New York City mayor. | President/Yield, not specified below/Lawyer | Misinterpretation of the original text’s meaning |
Table 6. Statistical results of data redundancy.

| Total Number of Event Records | Number of Duplicate Event Records | Redundancy |
|---|---|---|
| 6,640,748 | 1,367,141 | 0.206 |
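
The redundancy value in Table 6 is consistent with the ratio of duplicate records to total records. The short Python check below reproduces it from the two reported counts; the function name `redundancy_ratio` is illustrative and does not come from the article, and the sketch says nothing about how duplicates were identified.

```python
def redundancy_ratio(total_records: int, duplicate_records: int) -> float:
    """Return the share of records that are duplicates."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return duplicate_records / total_records


# Counts as reported in Table 6.
ratio = redundancy_ratio(6_640_748, 1_367_141)
print(round(ratio, 3))
```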
Table 7. Meaning of some fields in GDELT.

| Fields | Meaning | Source Table |
|---|---|---|
| GlobalEventID | Global event unique identifier | Event |
| GLOBALEVENTID | Global event unique identifier | Mentions |
| ActorCountryCode | Country code | Event |
| ActorTypeCode | Entity type of the event participant | Event |
| NumSources | Total number of sources for the same event | Event |
| MentionSourceName | Main domain name of the information source media | Mentions |
| SentenceID | Position of the event mention in the text | Mentions |
| MentionDocLen | Total number of characters in the document | Mentions |
| MentionDocTone | Tone value for the entire article | Mentions |
Table 8. Selection of mainstream media in India.

| No. | Media | No. | Media |
|---|---|---|---|
| 1 | The Times of India | 11 | Aaj Tak |
| 2 | India Today | 12 | ABP News |
| 3 | Hindustan Times | 13 | Dainik Bhaskar |
| 4 | The Indian Express | 14 | Dainik Jagran |
| 5 | NDTV | 15 | Mathrubhumi |
| 6 | BBC News India | 16 | Economic Times |
| 7 | Reuters India | 17 | Business Standard |
| 8 | The Wire | 18 | Moneycontrol |
| 9 | Mint | 19 | The Print |
| 10 | Firstpost | 20 | News18 |
Table 9. Top 10 edges by weight.

| Ranking | Edge | Weight |
|---|---|---|
| 1 | Business Standard – The Times of India | 13 |
| 2 | Business Standard – Reuters | 10 |
| 3 | Business Standard – Financial Express | 8 |
| 4 | Reuters – Channel NewsAsia | 8 |
| 5 | Reuters – MSN | 7 |
| 6 | Reuters – The Times of India | 7 |
| 7 | The Times of India – NDTV | 6 |
| 8 | Business Standard – DNA India | 6 |
| 9 | Business Standard – Moneycontrol | 6 |
| 10 | Reuters – Yahoo | 6 |
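
Edge weights of the kind listed in Table 9 can be illustrated with a minimal Python sketch. It assumes, as one plausible reading of a media co-occurrence network, that the weight of an edge is the number of events reported by both outlets; the article's exact weighting rule may differ. The input pairs use the Mentions fields `GLOBALEVENTID` and `MentionSourceName` from Table 7, while the function name and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_weights(mentions):
    """Count, for each pair of outlets, the events that both reported.

    `mentions` is an iterable of (GLOBALEVENTID, MentionSourceName) pairs.
    """
    outlets_by_event = defaultdict(set)
    for event_id, source in mentions:
        outlets_by_event[event_id].add(source)

    weights = Counter()
    for outlets in outlets_by_event.values():
        # Sorting gives every undirected edge one canonical key.
        for a, b in combinations(sorted(outlets), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy rows, not taken from GDELT.
toy_mentions = [
    (1, "business-standard.com"), (1, "indiatimes.com"),
    (2, "business-standard.com"), (2, "indiatimes.com"), (2, "reuters.com"),
    (3, "reuters.com"), (3, "indiatimes.com"),
    (3, "reuters.com"),  # a repeated mention by the same outlet counts once
]
weights = cooccurrence_weights(toy_mentions)
for edge, weight in weights.most_common():
    print(edge, weight)
```

Using a set per event keeps repeated mentions by one outlet from inflating a weight; counting raw mention pairs instead would be a different, equally defensible design.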
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hong, D.; Fu, Z.; Zhang, X.; Pan, Y. Research on the Development and Application of the GDELT Event Database. Data 2025, 10, 158. https://doi.org/10.3390/data10100158