Therefore, in order to avoid the drawbacks of the classic research workflow, such as the time spent learning new software or the high probability of making mistakes, and at the same time to achieve clear, open and reproducible research, we need to work with a programming language. At this point, we must remark that almost any programming language can accomplish any given task; the real question is what it costs. Some languages are extremely fast and use the processing power of the computer efficiently. Others are well suited to building mobile phone or other user-friendly applications, but in the academic and research community the fundamental questions are how much time the researcher needs to become familiar with each tool and whether this time investment is scientifically profitable. Hence, in research we need a highly readable programming language with a clear and simple syntax, one that does not require professional skills or a degree in computer science to use. To be clear, the research community needs a programming language for non-programmers. R has all the attributes needed to be the primary tool for biometeorological research because it is a general-purpose data analysis language that can be learned easily and quickly without any previous coding experience.
3.1. R’s Main Characteristics
Briefly, we can describe R as a modern, functional programming language that allows for the rapid development of ideas, together with object-oriented features for rigorous software development [55
]. Moreover, it is particularly important that R is multiplatform, free, and open-source software. This allows every research group to use it without limitations, independently of funding resources or the operating systems the researchers use, while the open-source character guarantees the quick development of new tools. R mainly consists of the language itself plus the run-time graphics environment and the system debugger. There are many implementations of integrated development environments (IDEs), such as RStudio®
with R kernel or R Tools for Visual Studio®
, for easy and unobstructed software or code script development. The language was introduced by Ihaka and Gentleman [56
] in 1996 as a combination of two previous computer languages, S and Scheme [57
]. The R community is highly active and talented; hence, there is a vast number of free tutorials, forums, training datasets and documentation resources.
Additionally, the R community develops packages (the specialised libraries) at an almost exponential rate (Figure 3). Most of them are available from the Comprehensive R Archive Network (CRAN), an online network of mirror sites carrying identical material: the R distribution(s), contributed extensions, documentation for R, and binaries. The total number of packages is far higher than 12,500 (in 2019) because additional repositories, such as Bioconductor and GitHub, contain hundreds more R packages. Very importantly, the R user is not obliged to create “environments”, to check version compatibility between the core of the language and the packages, or to load dependent packages separately. This is a priceless characteristic for scientists without a programming background.
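As a minimal illustration of this low-friction workflow, installing and loading a CRAN package takes one line each (the package names here are only examples):

```r
# Install a package from CRAN once; dependencies are resolved automatically
install.packages("dplyr")

# Load the package at the start of each session
library(dplyr)

# Packages hosted elsewhere, e.g. on GitHub, can be installed with helper
# packages such as "remotes" ("user/package" is a placeholder, not a real repo)
# remotes::install_github("user/package")
```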
At this point, we must mention the disadvantages of R. First, R works only in the physical memory (RAM) of the computer. This is an obvious restriction on our programming ambitions, but new generations of personal computers (PCs) come by default with enough RAM to handle even Big Data, and there are also many techniques and specialised packages for memory-efficient data management. Another drawback of R is its speed, since it does not use multiple processor cores by default. Nevertheless, there are R packages that enable parallel computing. The last disadvantage of the R language is that it cannot produce executable files. That means we cannot send a compiled application to our colleagues, but we can send the code and run it on their computer. This weakness of R is probably a hidden strength because it reinforces research transparency and openness.
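For instance, the base “parallel” package (shipped with R itself) distributes work across the available cores; the squaring operation below is only a toy computation to show the pattern:

```r
library(parallel)

# Detect the available CPU cores and start a cluster, leaving one core free
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)

# Apply a function over a sequence in parallel
results <- parLapply(cl, 1:100, function(x) x^2)

# Always release the workers when finished
stopCluster(cl)
```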
In the following sections, there is a selection of packages covering the needs of common biometeorological research, accompanied by tables with the names of the packages and short descriptions (the name of each package in the tables is an active link to the source of the package).
3.2. Data Acquisition with R
The first task of the research group is to acquire the data for the analysis. In atmospheric sciences such as biometeorology, there is a wide variety of data sources. As already mentioned, a biometeorologist may collect field data from various types of loggers or from scanned questionnaires. Moreover, for the same project, the research group may be obliged to use data from a hospital registry, from the meteorological service and so on. The result is a variety of formats, both in terms of files with different extensions and in terms of files with the same extension but a different structure. We can separate data acquisition into input/output from local sources and from web database sources (Table 1). With the term local, we mean data distributed via any means (e.g., USB stick, email, cloud services), and with the term web sources, we mean distribution via a structured web service which provides an API (application programming interface) as a contact point. From the plethora of R’s data input packages, we suggest “readr” [58
] as a solution for a fast and friendly way to read rectangular data in csv, tsv and fwf format. As Peng [59
] points out, “readr functions such as read_csv (reading csv files) optimize dramatically the reading speed of R”.
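A minimal sketch of reading a csv file with “readr”; the file name and column names are hypothetical:

```r
library(readr)

# read_csv() guesses column types automatically and is much faster than
# base read.csv(); "station_data.csv" is a hypothetical file
obs <- read_csv("station_data.csv")

# Column types can also be declared explicitly for full control
obs <- read_csv(
  "station_data.csv",
  col_types = cols(
    date        = col_date(format = "%Y-%m-%d"),
    temperature = col_double()
  )
)
```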
Another very useful package for reading and writing the widely used spreadsheets files is “xlsx” [60
]. It is an easy-to-use package that gives us the ability to read the content of an xlsx file and its separate worksheets. It is very popular among R users because it enables an uninterrupted data flow from the non-R-literate scientific community to R users and vice versa. For the same purpose, we can utilise the “foreign” [61
] and “haven” [62
] packages, because they are made for easy input and output of data in SPSS, SAS, Minitab and Stata native formats. These packages can read and import data created by the above widely used statistical software and export datasets to a readable format, enhancing interoperability. Moreover, the “feather” package [63] is made to read and write feather files, a lightweight binary columnar data store designed for maximum speed. With this package, we can handle even heavy Big Data files quickly and reliably.
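A short sketch of the feather round trip; the file name and toy data frame are illustrative:

```r
library(feather)

# Write a small data frame to the fast binary feather format
df <- data.frame(temp = c(21.3, 22.1), rh = c(45, 52))
write_feather(df, "observations.feather")

# Reading it back is equally fast and preserves the column types
df2 <- read_feather("observations.feather")
```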
Apart from the aforementioned packages, base R can easily and accurately import every type of data file, provided the user describes the structure and characteristics of its contents. The packages are, first of all, a shortcut that lets us bypass all this detailed coding while also achieving higher processing speed.
As already mentioned, in biometeorology researchers use data from big databases which offer their content via web services and APIs, such as those of NASA, COPERNICUS and others. The R community has already created the related packages for an easy and uninterrupted connection with these databases. In Table 1
, users can find a selection of the most popular packages for data retrieval in the atmospheric content web databases. The “rnoaa” package [64
] is dedicated to the NOAA (National Oceanic and Atmospheric Administration) data sources, from current weather data to sea ice and historical records. With almost a single line of R code, the researcher can define the type of data, the location (if needed), the time period and other details, and download the data. The “nasapower” package [65
] is specialised in NASA-POWER dataset acquisition with enhanced functionality. A very useful package is “Copernicus” [66
], which is specialised in Global Land Vegetation Products, such as the NDVI, the LAI and other indices. Moreover, an example of a dataset widely used across a broad range of research is the MODIS satellite products. The “MODIS” package [67
] provides automated access to global online data archives.
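The “almost a single line” style of retrieval described above for “rnoaa” can be sketched with the package’s ncdc() function; the station identifier and dates are illustrative, and an NOAA API token must be requested in advance:

```r
library(rnoaa)

# Daily summaries (GHCND dataset) for one station and one month;
# the station id and dates are examples, and "YOUR_NOAA_TOKEN" must be
# replaced by a token obtained from NOAA
out <- ncdc(
  datasetid = "GHCND",
  stationid = "GHCND:USW00094728",
  startdate = "2019-07-01",
  enddate   = "2019-07-31",
  token     = "YOUR_NOAA_TOKEN"
)

# The retrieved records arrive as a ready-to-use data frame
out$data
```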
Another valuable package for biometeorological research is the “rWBclimate” [68
] that can give access to the World Bank’s climate circulation models data. A vast amount of useful data derived from EUMETSAT is available with the usage of the “cmsaf” package [69
] in the widely used format of NetCDF. Finally, the “NASAaccess” package [70
] with more than 50 integrated functions gives access to gridded ASCII files containing climate and weather information. Of course, there are a few dozen more packages made for acquiring data from web databases such as Landsat, ESA and so on. All the above data can also be accessed using a web browser or FTP client software, but using R packages makes the process automatic, quick and accurate, in quite an easy way, with two to three lines of code.
3.3. Data Handling with R
As a data-analysis-oriented language, R is highly effective at handling and managing data. Researchers are literally “educated” on how to handle their data, and on what can be done with them, by the software they use. The limitations of the software are the limitations of the research practice and, at the end of the day, the limitations of the scientist’s ideas. The R language preserves absolute freedom in data handling and manipulation. If there is one central argument for choosing the coding practice instead of data analysis or statistical software, it is this freedom and the consequent power in managing data.
The most popular R packages for data handling are “data.table” [71
] and “dplyr” [72
]. The first is one level above base R and provides all the necessary functions (tools) to subset, rename, summarise, merge, group, bind and, of course, perform any calculation we need between columns or rows of a data table or between separate data tables. On the other hand, “dplyr” has introduced six “verbs” (i.e., functions) that do all the work the “data.table” functions do, but with a different syntax. The latest version (1.0) of “dplyr” improved its speed, and with its neat syntax it is the flagship of data handling in the R world. As presented in Table 2
, the next widely used data handling package is “reshape2” [73
]. The main purpose of this package is the reshaping of the data format from the long to the wide structure and vice versa. As Wickham [74
] mentioned, this process is usually tedious and frustrating. The “reshape2” package gives us two functions to do all this important work. The same basic modifications of data table structures can also be made with the “pivot_longer” and “pivot_wider” functions of the “tidyr” package. The package “lubridate” [75
] is focused on the date and time parts of our datasets. It contains some particularly useful functions for handling and parsing such data. In addition, it gives new capabilities on time zones or time-series data.
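A few examples of “lubridate” parsing and time-zone handling; the date values are arbitrary:

```r
library(lubridate)

# Parse date-time strings written in different orders
t1 <- ymd_hms("2020-07-15 14:30:00", tz = "UTC")
t2 <- dmy("15/07/2020")

# Extract components of a date-time
month(t1)               # the month as a number
wday(t1, label = TRUE)  # the day of the week as a label

# Express the same instant in another time zone
with_tz(t1, "Europe/Athens")
```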
The introduced packages are a subset of the large number of R packages for data handling (e.g., stringr, tidyr). However, those four packages are enough for easy, fast and accurate data handling. It is worth mentioning that while writing code we can use every package we need, and all of them are designed to work with the others.
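As an illustration of the “dplyr” verbs mentioned above, chained together with the pipe operator; the data frame of hourly observations is a toy example:

```r
library(dplyr)

# A toy data frame of observations from two stations
obs <- data.frame(
  station = c("A", "A", "B", "B"),
  temp    = c(31.2, 29.8, 27.5, 28.1),
  rh      = c(40, 46, 55, 51)
)

# Filter rows, add a derived column, then summarise by group
obs %>%
  filter(temp > 28) %>%
  mutate(temp_f = temp * 9 / 5 + 32) %>%
  group_by(station) %>%
  summarise(mean_temp = mean(temp), mean_rh = mean(rh))
```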
3.4. Biometeorological Data Analysis with R
As anticipated, the R community has already created specialised packages for biometeorological research. The “comf” package [76
] is made to easily calculate some of the most widely used human thermal comfort indices such as PMV (Predicted Mean Vote) and PPD (Predicted Percentage Dissatisfied) [77
] or to estimate synthetic parameters such as the MRT (Mean Radiant Temperature). In total, more than 20 indices and related biometeorological (or bioclimatological) parameters are incorporated in this package. A related package is “ClimInd” [78
] which gives us the functions to calculate more than 100 climatic and bioclimatic indices (Table 3
). The variety of the indices is great, covering a spectrum from temperature-based to tourism indices.
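A minimal sketch of such an index calculation with the “comf” package; the input values are illustrative and the exact function name, argument names and defaults should be checked against the package documentation:

```r
library(comf)

# PMV for illustrative indoor conditions: air temperature ta and mean radiant
# temperature tr (degrees C), air velocity vel (m/s), relative humidity rh (%),
# clothing insulation clo (clo) and metabolic rate met (met); argument names
# are assumed from the package documentation
calcPMV(ta = 25, tr = 25, vel = 0.1, rh = 50, clo = 0.5, met = 1.1)
```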
Another very promising R package for human biometeorological index calculations is “rBiometeo” [79
]. The predefined functions of this package vary from human energy balance indices such as PMV to the classic thermohygrometric index (THI). In order to feed the biometeorological indices with input data we can use the “climate” R package [80
], which is specialised in the automation of meteorological data downloading. It is well connected with OGIMET, NOAA and other publicly available databases. Moreover, the “RNCEP” package [81
] contains functions to temporally aggregate data, producing user-defined variables, and to visualise these data on a map, encouraging the exploration of relationships between biological systems and atmospheric conditions.
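A hedged sketch of automated downloading with the “climate” package; the function and its arguments are assumed from the package documentation, and the WMO station identifier and dates are illustrative:

```r
library(climate)

# Daily synoptic observations from OGIMET for one station and period;
# the station id and the date range are examples only
df <- meteo_ogimet(
  interval = "daily",
  date = c("2020-06-01", "2020-06-30"),
  station = 16716
)
head(df)
```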
In biometeorological research, the conversion between metric systems and units is a common practice. The “weathermetrics” package [82
] provides ready-to-use functions to facilitate all the possible unit and metric conversions. The catalogue of biometeorologically related R packages is exceedingly long because, in research practice, we also use functions from packages oriented towards, or made for, other scientific disciplines. Along with the functions of the above packages, we can use advanced mathematical tools such as Generalized Additive Models from the “mgcv” package [83
] or propensity score analysis with the “MatchIt” package [84
]. Moreover, advanced R users can create and publish their biometeorological R package, or they can define and apply their own functions.
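A short illustration of “weathermetrics” conversions; the heat index call assumes the argument names documented by the package:

```r
library(weathermetrics)

# Temperature conversions between metric and imperial units
celsius.to.fahrenheit(35)   # 95
fahrenheit.to.celsius(95)   # 35

# Heat index from air temperature and relative humidity
heat.index(t = 95, rh = 60, temperature.metric = "fahrenheit")
```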
3.5. Results Dissemination with R
Probably one of R’s most powerful attributes is the communication of research results to the broad public. It is well known that with R we can easily create beautiful and neat graphs, and we can report research results in the form of a web page (HTML), slides, or any type and format of document.
A very quick and effective R package for plotting is the “lattice” [85
]. It can plot univariate and multivariate data graphs with almost a single line of code. It is famous for its trellis graphs, which display the distribution of a variable, or the relationship between variables, separately for each level of one or more other variables [86
]. On the other hand, the “ggplot2” package [87
] is the flagship of R graphics. The “gg” stands for “grammar of graphics”, and Hadley Wickham [88
], the author of this package, introduced a new way of working with data plots. The main concept is that the graph consists of five separate layers: the “mapping”, which contains the set of aesthetics; the “data”, which contains the presented dataset; the “geom”, which describes the type of the data illustration (e.g., line, dot, boxplot); the “stat”, which describes the statistical transformation we may want to apply to the data; and the “position”, which adjusts the overlapping method of the objects. Not all of the above layers are mandatory for every graph we create. The justification for using ggplot2 for data graphs is a long catalogue of clear advantages, such as automatic and easy legends and colours (with palettes), many sensible defaults, easy faceting, flexibility in changing coordinate systems from Cartesian to logarithmic, and so on. Finally, the usage of the ggplot2 package gives us unrestricted choices regarding the graphs’ output format, resolution and size. All in all, it is an absolutely professional tool for scientific graphs. The next R package (Table 4
) for results dissemination is “plotly” [89
], which was initially created for the Python programming language. This unique tool creates interactive, publication-quality graphs. In addition, its ability to present 3D plots in a very effective way, along with the production of animated plots, ranks “plotly” among the essential tools for results dissemination.
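The layered logic described above for “ggplot2” can be sketched as follows; “obs” is a hypothetical data frame with date, temp and station columns:

```r
library(ggplot2)

# Layers in practice: data, mapping (aesthetics), geoms and a stat layer;
# "obs" is a hypothetical data frame of daily temperatures per station
ggplot(data = obs, mapping = aes(x = date, y = temp, colour = station)) +
  geom_line() +
  geom_smooth(method = "loess") +   # a statistical transformation layer
  facet_wrap(~ station) +           # easy faceting by station
  labs(x = "Date", y = "Temperature (°C)")

# Export with full control over format, size and resolution
ggsave("temperature.png", width = 16, height = 10, units = "cm", dpi = 300)
```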
Since the communication of research results is not only a matter of graphs, scientists need tools for reporting and sharing their findings. There is a group of R packages which can embed functional R code inside a classic digital document or an HTML page via automatic compilation. Hence, the “rmarkdown” package [90
] helps us to create dynamic analysis documents that combine code with rendered output. With this package, biometeorological results can be combined with the data and the related code into a polished document, which can be the state of the art in terms of reproducibility. The “rmarkdown” package produces output in doc, pdf and HTML formats, using the markdown [91] syntax and free text, enriched with code chunks.
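A minimal sketch of an rmarkdown (.Rmd) source document combining free text with an R code chunk; the file name and column name are hypothetical, and the document would be compiled with rmarkdown::render():

````markdown
---
title: "Heat-wave analysis"
output: html_document
---

Daily temperatures were read and summarised with the code below.

```{r summary}
obs <- readr::read_csv("station_data.csv")
summary(obs$temperature)
```
````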
Nowadays, it is common for research groups to communicate their findings in blogs. Hence, the “blogdown” R package [92
] is an ideal solution because it is dedicated to creating web pages with “rmarkdown”. This package, in collaboration with a static site generator, can create websites (blogs) within minutes. In addition, the “bookdown” R package [93
] facilitates the writing of books and long-form articles and reports with the rmarkdown syntax. With this excellent package, authors can easily create printer-ready books and e-books with automated styling and structure. Of course, there is a group of predefined functions that create a table of contents, indices and other parts of a book.
The last R package which is very useful for the communication and dissemination of biometeorological research results is “shiny” [94
]. With this tool, scientists can create interactive web applications with R code. In this way, the scientific community can share results or data through an integrated graphical environment that makes them explorable by the public. Shiny web applications can contain maps and interactive graphs (e.g., bar plots, box plots and pie charts). Apart from the visual parts of the web applications, text and code can coexist. Finally, the user of a shiny-made application can produce outputs in doc and pdf formats, or can download the created images or the data results in csv and other formats.
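A minimal, self-contained sketch of a shiny application; the simulated temperatures merely stand in for real biometeorological data:

```r
library(shiny)

# The user interface: a slider controlling the histogram's bin count
ui <- fluidPage(
  titlePanel("Temperature explorer"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("histPlot")
)

# The server logic: re-draw the plot whenever the slider changes
server <- function(input, output) {
  output$histPlot <- renderPlot({
    # Simulated data standing in for a real biometeorological dataset
    temps <- rnorm(500, mean = 25, sd = 4)
    hist(temps, breaks = input$bins,
         xlab = "Temperature (°C)", main = "Simulated daily temperatures")
  })
}

# Launch the interactive application in the browser
shinyApp(ui = ui, server = server)
```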