1. Introduction
As Ragnar Frisch explained in the first editorial note of Econometrica (1933), ‘econometrics is the unification of statistics, economic theory, and mathematics’. A modern rephrasing could be ‘the application of statistical and mathematical methods to economic data, guided by economic theory’. Each of the three perspectives is necessary but, on its own, insufficient for understanding quantitative relationships in modern economic life. Econometrics aims to give empirical content to economic relationships. The three key ingredients are economic theory, economic data, and statistical methods. Neither ‘theory without measurement’ nor ‘measurement without theory’ is sufficient for explaining economic phenomena. It is, as Frisch emphasized, their union that is the key to success in the future development of econometrics [1].
Defining R precisely is not straightforward. R is both a software environment for quantitative analysis and a programming language oriented toward data science. It evolved from the S language, originally developed at Bell Laboratories for quantitative analysis. Perhaps the most appropriate description is that R is a programming language and environment focused on the processing of data and their statistical analysis, with extensive graphical capabilities. Important features of the R environment are its flexibility and the possibility of constantly expanding its applications, which is most evident in the large number of additional packages available in the official repository.
Previous systematic reviews of R packages have focused on fields unrelated to the present study. One review focused on the application of exploratory factor analysis (EFA), examining the implementation and methodological aspects of this statistical technique within R [2]. Another review pertained to sports statistics and sports analytics, evaluating R packages designed for analyzing and interpreting data in the context of sports performance and related metrics [3].
Despite R’s increasing adoption in econometrics, systematic evaluation of econometrics-related R packages remains largely unexplored. This paper addresses that gap by offering a structured overview of 207 R packages relevant to economic and econometric analysis. (A list of the packages with information is provided as electronic Supplementary Material.) The objective is to classify, summarize, and analyze these packages with respect to their statistical features, metadata, and publication profiles. Moreover, we assess trends in package development, identify methodological concentrations (e.g., Bayesian vs. frequentist), and explore regional contributions to the software ecosystem. To assess development trends, documentation, and publication patterns, we utilize text mining techniques. This overview is, to the best of our knowledge, the first attempt to delineate the packages dedicated to the field of economics in general. By linking the methodological scope to documentation and dissemination practices, the study provides new empirical insights into the structure and reproducibility of econometric software, highlighting both the strengths and challenges of R’s role in modern econometric research.
The remainder of this paper is structured as follows: Section 2 outlines the methodology used to identify and select relevant R packages from CRAN and the Econometrics Task View. Section 3 presents the results of the statistical analyses, including descriptive summaries and inferential relationships. Its subsections investigate author profiles, update behaviors, documentation practices, and citation networks. Section 3.11, specifically, highlights key findings from a text mining analysis of the package descriptions. Finally, Section 4 concludes with a summary of this paper’s contributions and potential directions for future work.
2. Methods
To systematically identify econometrics-related R packages, a structured and reproducible search strategy was employed. The goal was to compile a comprehensive set of packages that are actively used in econometric analysis, ensuring relevance across multiple subfields such as time series analysis, panel data modeling, causal inference, regression methods, Bayesian econometrics, and data collection through web scraping.
The search began with a CRAN keyword query using ‘econom’ to identify packages explicitly linked to econometrics. This approach ensured the inclusion of packages whose titles, descriptions, or metadata reference econometric applications. Next, the scope was expanded to packages that contribute to econometric modeling, statistical inference, and computational methods widely used in the discipline, namely those covering time series, panel data, causal inference, regression, Bayesian econometrics, or web scraping. This step yielded 64 packages.
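The keyword stage described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the procedure actually used for the study: the package records are hypothetical, and the assumption that the second stage was a plain substring match on topic terms is ours (the text does not say how the expanded search was carried out).

```python
# Hypothetical package records; names and descriptions are illustrative only.
PACKAGES = [
    {"name": "pkgA", "title": "Econometric Tools",
     "description": "Estimators for applied econometrics."},
    {"name": "pkgB", "title": "Plot Helpers",
     "description": "Utilities for colour palettes."},
    {"name": "pkgC", "title": "Panel Models",
     "description": "Fixed and random effects estimators for panel data."},
]

# Stage 1 keyword and stage 2 topic terms, taken from the text above.
KEYWORD = "econom"
TOPICS = ["time series", "panel data", "causal inference",
          "regression", "bayesian", "web scraping"]


def matches(record, terms):
    """Return True if any term occurs (case-insensitively) in the title or description."""
    text = f"{record['title']} {record['description']}".lower()
    return any(term in text for term in terms)


def search(records):
    """Return names matched by the keyword, then names matched by topic terms only."""
    by_keyword = [r["name"] for r in records if matches(r, [KEYWORD])]
    by_topic = [r["name"] for r in records
                if r["name"] not in by_keyword and matches(r, TOPICS)]
    return by_keyword, by_topic


by_keyword, by_topic = search(PACKAGES)
print(by_keyword, by_topic)
```

A substring such as ‘econom’ matches ‘econometric’, ‘econometrics’, and ‘economic’ alike, which is presumably why a truncated keyword was chosen.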
To broaden the selection, the CRAN Task View on Econometrics was reviewed. All packages listed in this Task View were included in the dataset, except for one package that was deemed irrelevant to econometrics. The Task View provides a curated collection of tools recommended by domain experts, ensuring that the dataset covered key methodologies and widely used econometric techniques. This step added a further 145 packages.
The search was conducted between 1 and 31 January 2025. A total of 209 packages were identified; however, 2 packages were excluded from the analysis. The first, lotterybr, was omitted as it contains data related to Brazilian lottery games, focusing on the exploration of their dynamics and outcomes. The second, xts, was excluded because its primary function is to facilitate the uniform handling of R’s various time-based data classes by extending the zoo package. This extension preserves native format information, enables user-level customization and extension, and enhances cross-class interoperability. Following these exclusions, a total of 207 packages remained for further analysis. This process is visually depicted in Figure 1, illustrating the systematic approach adopted for this overview.
Once the relevant packages were identified, detailed metadata were recorded for each package to facilitate subsequent analysis. The extracted metadata included package name, first release year, creator country, number of authors, number of updates, availability of vignettes, presence of built-in datasets, and references in journal and book publications. Additionally, to assess the broader impact and usage of each package, reverse dependencies (imports, suggests, enhances) were documented, along with whether the package was specifically designed for Bayesian analysis or web scraping. Lastly, the gender of the package creator was noted where possible, allowing for demographic insights into the development community.
By implementing this structured search strategy, a total of 207 R packages were identified and systematically analyzed. This methodology ensured that the selection was both comprehensive and relevant, providing a solid foundation for evaluation of the role of R in contemporary econometric research.
3. Statistical Analysis
3.1. Main Characteristics
Table 1 summarizes the main characteristics of the collected dataset. In terms of time period, most packages were produced between 2009 and 2018 (46%), with fewer being produced in recent years (2019–2024) (28%) and even fewer between 1999 and 2008 (26%). Most creators are based in Europe (56%) and North America (27%), with less representation from other continents.
Most packages were created by small teams; 56% had one or two authors, while only 4% had more than ten. Update frequency followed a similar trend: 54% of packages received 1–10 updates, while 9% underwent extensive revisions (51–216 updates). In terms of content and documentation, 68% of the packages contained datasets, while 46% included vignettes, indicating a moderate level of user support and reproducibility. Regarding publication types, about 45% were linked to journal publications, while only 19% were associated with books; dissemination through journals is thus considerably more common than dissemination through books.
Most packages (93%) had reverse dependencies, i.e., they were imported, suggested, or enhanced by other packages. Furthermore, 94% of the creators in the dataset were male. Only 7% of the packages were ‘dataset-only’; the same percentage used Bayesian methods, and 8% involved web scraping. Finally, 73% of packages were categorized under the CRAN Task View for Econometrics, reflecting a strong focus on econometric applications.
3.2. Journal Publication and Author Group
Of the 94 published packages, 46 (48.9%) were created by teams of three to ten authors, compared to only 36 (31.9%) of the 113 unpublished packages, as indicated in
Table 2. Packages created by smaller groups of one or two authors were more likely to remain unpublished (73, 64.6%) than to be published (44, 46.8%). Only 8 packages, evenly divided between published and unpublished, were created by large teams of 11–26 authors.
The χ² test of independence was used to assess the relationship between author group size and publication status. The test indicated a significant association (p-value = 0.035) between team size and the likelihood of journal publication. Notably, mid-sized collaborations (3–10 authors) were more likely to achieve academic publication than smaller or larger groups.
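As a rough check, the test can be recomputed from the counts quoted above for Table 2. The following is a minimal sketch in Python (rather than R, purely for illustration), assuming that the reported p-value comes from an uncorrected Pearson χ² test on the 3 × 2 table; the function names are ours. With two degrees of freedom, the upper-tail probability has the closed form exp(−x/2), so no statistical library is needed.

```python
import math

# Counts quoted in the text for Table 2:
# rows = author group (1-2, 3-10, 11-26), columns = (published, unpublished).
TABLE_2 = [
    [44, 73],
    [46, 36],
    [4, 4],
]


def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for Pearson's chi-squared test of independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof2(stat):
    """Upper-tail probability of a chi-squared variable with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2)


stat, dof = pearson_chi2(TABLE_2)
p = p_value_dof2(stat)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p:.3f}")
```

Under these assumptions the p-value rounds to 0.035, in agreement with the value reported above. Note that the expected counts in the 11–26 author row are below 5, so the χ² approximation should be read with some caution for that row.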
This suggests that group size reflects not only the scope of collaboration but also the professional networks and institutional support that facilitate publication in peer-reviewed journals. In the context of computational econometrics, collaborative teams may benefit from the division of labor in package development, increased methodological rigor, and improved documentation—factors that align with academic standards emphasizing transparency, replicability, and the collective vetting of analytical tools [
4,
5]. Thus, author team composition may serve as an indirect indicator of both the scholarly orientation and the publication potential of econometrics-related
R packages.
3.3. Vignettes and Journal Publication
An analysis of vignette inclusion and journal publication revealed notable differences among the econometrics
R packages. As shown in
Table 3, of the 94 packages with a peer-reviewed journal publication, 59 (62.8%) included a vignette. By contrast, only 37 of the 113 unpublished packages (32.7%) included a vignette. This suggests that vignettes—extended documentation or tutorials that improve reproducibility, usability, and adoption—are more common in packages with academic support [
6].
The χ² test revealed a significant association between publication status and vignette inclusion (p-value < 0.001). This suggests that academic standards, which emphasize transparency and reproducibility in quantitative research, as well as sound software engineering practices, encourage the creation of vignettes [
4,
5].
In econometrics, vignettes signal methodological robustness, as reproducibility of empirical models and procedures is essential for both validation and extension. Packages with vignettes are therefore more likely to be published, because they align with the evolving norms in computational social sciences, which value openness and replicability [
7].
3.4. Reverse Imports/Suggests/Enhances and Journal Publication
Of the 94 published packages, 90 (95.7%) had at least one reverse dependency, while only 4 (4.3%) did not (Table 4). Similarly, 103 of the 113 unpublished packages (91.2%) had at least one reverse dependency, while 10 (8.8%) had none. Although published packages showed a slightly higher proportion of reverse dependencies, the difference was small.
No significant association was found between being referenced by other packages and journal publication (
p-value = 0.300). This indicates that network-based popularity or reuse, measured through reverse dependencies, does not predict whether an econometrics
R package is published in an academic journal. This may suggest that scholarly publication and community participation follow different logics; one is academically focused on peer review, and the other is practically driven by utility and reuse within the
R ecosystem [
8]. Although reverse dependencies often indicate a package’s importance within a software ecosystem, their lack of association with publication suggests differing incentives and audiences in academic and developer communities.
3.5. Reverse Imports/Suggests/Enhances and Updates
The ecosystem of econometrics-related packages shows a notable pattern of interconnectedness: packages that are imported, suggested, or enhanced by other packages tend to undergo more frequent updates. As shown in Table 5, of the 193 packages with at least one reverse dependency, 100 (51.8%) had 1–10 updates, 75 (38.9%) had 11–50 updates, and 18 (9.3%) had 51–216 updates. In contrast, among the 14 packages without reverse dependencies, 12 (85.7%) had only 1–10 updates, and none had more than 50.
A significant association was found (p-value = 0.045), indicating that update behavior is related to a package’s degree of integration within the R ecosystem. This suggests that packages with reverse dependencies are updated more frequently, likely because they face higher user demand, greater visibility, or pressure to maintain backward compatibility. In econometric software, where reliable computation is essential, closely connected packages often serve as infrastructure for other tools. The tendency to update these packages more quickly aligns with software engineering best practices and reflects community dynamics where release schedules are shaped by user feedback and inter-package dependencies [
4,
5,
9].
3.6. Continent of Creator and Year of Creation
Temporal and geographic patterns reveal how contributions to econometrics
R packages have evolved over time. According to
Table 6, the first packages (1999–2008) were primarily developed in Europe (29) and North America (19), with minimal contributions from other continents. Between 2009 and 2018, European contributions rose sharply (55), and geographic diversity expanded, with notable input from Asia (7), Oceania (4), and South America (8). In 2019–2024, Europe continued to dominate (33), while contributions from other continents remained comparatively low.
No significant association was found between release period and geographic origin (p-value = 0.322). Thus, although Europe has remained the dominant contributor, the relative proportions across regions have not changed significantly over time. The raw counts nonetheless point to a gradual globalization of open-source development, especially in more technical disciplines such as econometrics. While Europe and North America remain the centers of statistical software production, increasing contributions from Asia and South America suggest broader participation in the R ecosystem. As open-source development lowers barriers to participation, the geographic diversity of software development may increase further with the general move toward democratization in scientific computing.
3.7. Year of Creation and Updates
Analysis of the relationship between release period and update frequency revealed a notable pattern in software maintenance over time. As shown in
Table 7, packages released between 1999 and 2008 received more updates, with 36 (58.1%) in the moderate range (11–50) and 14 (22.6%) in the high range (51–216). By contrast, most packages from 2009–2018 (n = 64, 66.7%) had only 1–10 updates, with fewer showing moderate (28, 29.2%) or high (4, 4.2%) numbers of revisions. Recent packages (2019–2024) showed an even stronger tendency toward infrequent updates: 44 of 57 (77.2%) had only minimal maintenance, and none had more than 50 updates.
There is a strong association between package age and maintenance intensity (p-value < 0.001). Older packages tend to receive more consistent long-term updates, likely due to established user bases, integration into workflows, and sustained feedback. This trend reflects the general principles of scientific software lifecycles, where long-term maintenance, community effort, and adaptability determine longevity and impact [
10,
11]. By contrast, newer packages may be in early adoption phases or designed for niche applications, and thus show limited post-release activity. This raises questions about the long-term sustainability of newer software projects and the institutional incentives for maintaining research code [
12,
13].
3.8. Continent of Creator and Bayesian Analysis
The relationship between the inclusion of Bayesian analysis features and the geographic location of package authors was examined, and only slight regional differences were found. According to
Table 8, Bayesian techniques appeared in only 15 packages across all regions. In North America, 3 of 55 packages (5.5%) included Bayesian methods, compared to 10 of 117 in Europe (8.5%). Asia, Oceania, and South America contributed at most one package each with Bayesian features.
The regional disparities were not statistically significant (p-value = 0.845), indicating no relationship between the inclusion of Bayesian features and the creator’s region. This suggests that, while Bayesian analysis remains marginal in econometric R software, it is not concentrated in any particular region.
The slow adoption of Bayesian methods in econometrics may reflect a persistent disciplinary bias toward the frequentist approach or perceptions of Bayesian inference as complex and computationally demanding [
14,
15]. However, as computational resources expand and Bayesian methods become more common in empirical studies, future research could revisit this issue to track regional adoption trends.
3.9. Dataset Availability and Journal Publication
A contingency table was used to assess whether dataset-only econometrics R packages, i.e., those without analytical functions or methodological implementations, were more or less likely to be published in academic journals. As shown in
Table 9, of the 14 dataset-only packages, only 2 (14.3%) were published, while 12 (85.7%) were not. In contrast, of the 193 packages with analytical functions or models, 92 (47.7%) had associated journal publications.
A statistically significant association was found between dataset-only status and publication (
p-value = 0.032). These results indicate that dataset-only packages are significantly less likely to be published in peer-reviewed journals. This reflects academic norms in econometrics, where methodological innovation, algorithmic development, or model implementation is typically required for publication [
16,
17]. While dataset packages may serve important pedagogical or empirical roles, they often do not meet the innovation threshold required by scholarly journals. This aligns with broader trends in computational social science, where publication incentives prioritize methodological contributions over data curation, despite the recognized value of open data [
18].
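The 2 × 2 comparisons in Sections 3.3, 3.4, and 3.9 can be re-derived from the counts quoted in the text. The sketch below is in Python (rather than R, purely for illustration) and assumes a χ² test with Yates’ continuity correction, which is the default for 2 × 2 tables in R’s chisq.test; whether the correction was actually applied in the study is an assumption on our part. Cell counts not stated explicitly (e.g., 35 published packages without a vignette) are obtained by subtraction from the stated totals.

```python
import math

# Counts quoted in the text; complements are derived from the stated totals.
TABLES = {
    # rows: published / unpublished; columns: vignette / no vignette (Table 3)
    "vignette": [[59, 35], [37, 76]],
    # rows: published / unpublished; columns: reverse dependency / none (Table 4)
    "reverse_dependency": [[90, 4], [103, 10]],
    # rows: dataset-only / other; columns: published / unpublished (Table 9)
    "dataset_only": [[2, 12], [92, 101]],
}


def yates_chi2_p(table):
    """Return (statistic, p-value) for a 2x2 table with Yates' continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # Shrink |ad - bc| by n/2, but never below zero.
    adjusted = max(abs(a * d - b * c) - n / 2, 0.0)
    stat = n * adjusted ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


results = {name: yates_chi2_p(table) for name, table in TABLES.items()}
for name, (stat, p) in results.items():
    print(f"{name}: chi2 = {stat:.3f}, p = {p:.5f}")
```

Under this assumption the dataset-only comparison gives a p-value that rounds to 0.032 and the reverse-dependency comparison gives approximately 0.30, both close to the reported values, while the vignette comparison falls well below 0.001, in line with the significant association reported in Section 3.3.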
3.10. Growth of Packages
Figure 2 presents two cumulative line plots that together illustrate the temporal and geographic evolution of econometrics-related packages from 2000 to 2025. These visualizations provide insight into both the overall growth of the field and the global distribution of its development. The pattern they show mirrors broader trends in international scientific collaboration and knowledge production, where once-dominant regions are increasingly joined by emerging contributors across the Global South [
19,
20].
Figure 2a presents the cumulative number of packages created over time. The horizontal axis shows the year of creation and the vertical axis the cumulative count. The curve is approximately exponential: growth began in the early 2000s, accelerated after 2010, and continued steeply into the 2020s. This indicates that the ecosystem of econometric tools in R is maturing and rapidly expanding, and that academic and applied econometrics increasingly depend on R as a key computational platform [
21,
22]. The sharp growth demonstrates how
R has become an essential platform for the distribution of econometric tools within the open-source community.
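A cumulative curve of this kind is obtained by counting packages per first-release year and taking a running total. The following minimal Python sketch (used here purely for illustration) shows the computation on hypothetical years; the values are not the study’s data, and the function name is ours.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical first-release years, for illustration only (not the study's data).
first_release_years = [2001, 2003, 2003, 2010, 2012, 2012, 2012, 2019, 2021, 2021]


def cumulative_counts(years):
    """Return (year, cumulative number of packages) pairs in chronological order."""
    per_year = Counter(years)
    ordered = sorted(per_year)
    totals = accumulate(per_year[year] for year in ordered)
    return list(zip(ordered, totals))


series = cumulative_counts(first_release_years)
for year, total in series:
    print(year, total)
```

Splitting the input by continent before calling the function would give one such series per region, as in the disaggregated panel.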
Figure 2b disaggregates the cumulative count by continent—Asia, Europe, North America, Oceania, and South America—revealing important regional dynamics in the development of econometrics packages. North America and Europe dominate the landscape, contributing the largest number of packages and showing strong growth, particularly in the post-2010 period. Asia demonstrates a growing presence, with a steady rise in contributions pointing to expanding research and development capacity in econometrics. Although Oceania and South America contributed fewer packages overall, both regions show steady growth, reflecting the global diffusion of econometric expertise and
R development.
Together, the two panels suggest that econometric practice is becoming globalized through open-source development. This development is led by a few dominant regions, but is becoming increasingly inclusive over time. The steep overall trajectory in the left panel is primarily driven by North America and Europe, as shown in the right panel, but the emergence of contributions from Asia and other regions signifies a broadening base of participation.
This combination implies two key outcomes:
Democratization of Econometric Tools: The open-source nature of
R enables researchers and practitioners worldwide to contribute and access sophisticated econometric methods, reducing traditional barriers associated with proprietary software [
23,
24].
Decentralization of Innovation: While historical hubs remain influential, the growth in contributions from underrepresented regions reflects a shift toward a more decentralized model of methodological innovation and software dissemination in the econometrics community [
19].
Thus,
Figure 2 illustrates not only the growth of
R packages but also the changing geography of knowledge production in quantitative economics, enabled by the open-source ethos and collaborative structures of the
R ecosystem.
3.11. Text Analysis of Descriptions
Textual analysis of the descriptions of these 207
R packages was performed. The descriptions were pre-processed: text was transformed to lowercase, and punctuation, stop words, other non-informative words, hyperlinks, numbers, and references were removed. The words were then lemmatized using the TreeTagger tool [
25,
26] via the package [
27]. The purpose of lemmatization is to reduce words to their base or dictionary form, producing valid words that facilitate the organization and analysis of the text. The lemmatization in this version of TreeTagger is widely used and relies on a model trained with decision trees. Wordclouds of both the most frequent words and the most frequent phrases were then produced [
28]. The extraction of the most frequent phrases was achieved utilizing the phm
R package [
29]. The
phm method takes a corpus of text, splits it into blocks at punctuation marks, and extracts and sorts the unique n-grams of words. The user can specify words with which a phrase must not start or end (such as stop words), phrases to be excluded, and a minimum frequency threshold, and overlapping phrases are avoided. The output of the phm tool is thus a list of the unique most frequent phrases.
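The general idea of block splitting and n-gram counting can be sketched as follows. This is a simplified Python illustration under our own assumptions, not the phm implementation: it omits lemmatization and the handling of overlapping phrases, the stop-word list is deliberately tiny, and the example descriptions are invented.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real analyses use much longer lists.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def split_blocks(text):
    """Lowercase the text, drop URLs and digits, and split it into word blocks at punctuation."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"\d+", " ", text)
    return [block.split() for block in re.split(r"[^\w\s]+", text) if block.strip()]


def frequent_phrases(texts, n=2, min_freq=2):
    """Count n-grams that stay inside one block and neither start nor end with a stop word."""
    counts = Counter()
    for text in texts:
        for words in split_blocks(text):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    # Keep only phrases that reach the minimum frequency threshold.
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_freq]


# Invented descriptions, for illustration only.
descriptions = [
    "Estimation of panel data models, including fixed effect and random effect estimators.",
    "Tools for panel data: fixed effect regression with time series diagnostics.",
    "Time series models for the analysis of impulse response functions.",
]
phrases = frequent_phrases(descriptions)
print(phrases)
```

On these three invented descriptions, the only bigrams that reach the threshold of two occurrences are ‘panel data’, ‘fixed effect’, and ‘time series’. Phrases that would straddle a punctuation mark are never formed, because n-grams are built within blocks only.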
The wordclouds produced from the above procedures are illustrated in
Figure 3, and the frequencies of the most frequent 20 words and phrases are shown in
Table 10 and
Table 11, respectively. According to
Figure 3a and
Table 10, the main purposes of the packages found at the word level are estimation and fitting (estimate, likelihood, fit), regression models (linear, probit, logit), inference and testing (test, statistic, inference), and data management (data, variable, testing). Regarding
Figure 3b and
Table 11, the main categories at phrase level are time series (time series, impulse response), classical estimation (maximum likelihood estimation, least square, two step), linear and generalized regression (linear regression, generalized linear, beta regression, quantile regression), panel analysis (panel data, fixed effect, random effect), causal inference and policy evaluation (instrumental variable, synthetic control, treatment effect), and system/simultaneous equations (discrete choice, multinomial probit).
4. Conclusions
This overview provides a comprehensive analysis of 207 econometrics-related R packages, offering new insights into the structure, development patterns, and dissemination of econometric tools within the R ecosystem. By integrating descriptive statistics, inferential analyses, and text mining techniques, this study contributes to a clearer understanding of the current state of computational econometrics in R.
The results show that most packages are developed by small-to-mid-sized teams in Europe and North America, with minimal but growing contributions from Asia and South America. Mid-sized teams (3–10 authors) have the highest likelihood of journal publication. Bayesian methods and dataset-only packages remain underrepresented, and newer packages receive fewer updates, raising sustainability concerns.
Several key findings have emerged. First, the majority of packages are created by small-to-medium-sized teams, with those developed by mid-sized groups being more likely to appear in academic journals. Second, packages with thorough documentation, especially those including vignettes, are significantly more likely to be published, underscoring the role of reproducibility and usability in scholarly dissemination. Third, although the presence of reverse dependencies is high across packages, this network connectivity does not predict publication, suggesting divergent academic and practical incentives. Fourth, while most packages originate from Europe and North America, there is a growing, though still modest, contribution from other regions, indicating an ongoing globalization of econometric software development.
The text mining analysis further revealed that R packages cover a broad range of econometric techniques, including time series, panel data, causal inference, and estimation procedures, confirming the versatility and methodological richness of the R ecosystem. However, Bayesian methods and dataset-only packages remain underrepresented, and recent packages tend to be updated less frequently, raising questions about sustainability and support for newer tools. Overall, the findings imply that, while R has become central to econometric research, its ecosystem reflects both strong regional concentration and methodological biases, with broader participation and long-term maintenance emerging as key challenges for future development.
Future research can build on this work in several ways: First, a deeper functional benchmarking of package performance beyond metadata and description analysis would enhance practical decision making for users. Second, more attention could be given to the role of community dynamics and user feedback in shaping the development of packages. Finally, extending the overview to include GitHub-hosted packages and those not yet available on CRAN would provide a fuller picture of innovation and experimentation in computational econometrics.