Introduction to Reproducible Geospatial Analysis and Figures in R: A Tutorial Article

: The present article is intended to serve an educational purpose for data scientists and students who already have experience with the R language and which to start using it for geospatial analysis and map creation. The basic concepts of raster data, vector data, CRS and datum are first presented along with a basic workflow to conduct reproducible geospatial research in R. Examples of important types of maps (scatter, bubble, choropleth, hexbin and faceted) created from open-source environmental data are illustrated and their practical implementation in R is discussed. Through these examples, essential manipulations on geospatial vector data are demonstrated (reading , transforming CRS, creating geometries from scratch, buffer zones around existing geometries and intersections between geometries).


Introduction
The R language [1] is a popular open-source tool for data analysis and generation of figures for scientific communication [2,3].The open and collaborative nature of R allows its wide and varied user community to continuously extend its capabilities through "packages" hosted in repositories (the main of which is the Comprehensive R Archive Network (CRAN), https://cran.r-project.org(accessed on 20 March 2024).
Through the years, literate programming capabilities (generation of documents from code script) have been added to R through several packages (namely knitr [4][5][6], RMarkdown [7][8][9], bookdown [10,11]) and, more recently, through the multi-language and multiengine Quarto software (https://quarto.org/(accessed on 20 March 2024)).Literate programming has raised an ever-growing interest in the scientific community for several years for two reasons.First, and most importantly, it drastically improves transparency and computational reproducibility of scientific research, which has been identified as a critical point undermining the reliability of the scientific literature reliability in many fields and disciplines, including Earth and environmental sciences [12,13].An increasing number of journals require that the raw data and scripts for data analysis be submitted along with manuscripts, and some of them already accept submission in Rmarkdown format [14].The second reason is that it allows scientists to conveniently combine data analysis, figure generation and manuscript preparation (with automatic management of editing aspects such as formatting, bibliography, cross references. . . ) in one single and convenient tool.Plentiful resources can be relied on to learn about using R for literate programming (e.g., [7,8,10,15,16]).
The present article aims to present the basics of geospatial analysis, geospatial data manipulations and map types with example application using the R language.This tutorial article focuses on the package sf [20] for geospatial data manipulation and on the package ggplot2 [25] for generating maps.The full RMarkdown script and the datasets used to prepare this manuscript available in Supplementary Materials.

Geospatial Data Types and Formats
Vector and raster data are two fundamental types of data structures used in GIS [32].Vector data represent discrete entities (features) as points, lines, and polygons encoded through the coordinates of their vertices [32].Vector data types are suitable for discrete measurements and complex geometries such as lines (e.g., rivers) and polygons (e.g., countries) [32].Raster data represent spatial information as a grid of cells or pixels, where each cell holds a value representing a specific attribute or phenomenon [32].They can be used to represent continuous data such as land cover acquired through teledetection (satellite and drone imagery).
Geospatial data can be stored in various file formats.Table 1 lists some commonly encountered GIS formats.Vector data formats (such as shapefile, geopackage and geoJSON) can be read in R using sf :: st_read [20] whereas raster data such as geoTIFF can be read using raster :: raster [37].In addition to these specialized formats, GIS data are also commonly found in .csvand .xlsxformats.These file formats can be read using utilis :: read .csv [1] and openxlsx :: read .xlsx [38], respectively, and then converted to objects of class sf using sf :: st_as_sf [20], specifying the columns to use as coordinates and the CRS.Further details on GIS file formats can be found in [39].

Coordinate Reference System (CRS)
A coordinate reference system (CRS) is a framework used to define the positions of objects on the Earth's surface.It consists of a coordinate system, which defines how coordinates are represented, and a datum, which specifies the reference point, orientation, and scale of the coordinate system [32,40].
There are two types of CRS: geographic and projected.Geographic CRSs represent the Earth's surface as an ellipsoidal surface (specified in the datum) and refer to positions and distances as angles [32,40].Projected CRSs represent Earth's surface projections of geographic CRS on a virtual cone, cylinder or plane at a given location in space, with a given orientation, etc.The projection can then be represented on a flat surface (e.g., paper or screen) by "unfolding" the cylinder or cone [32,40].Projected CRSs represent positions and distances as Cartesian coordinates on that flat surface.Although more practical for representation and some computation (as distance units can be expressed in metres or feet), the projection process always induces distortion [32,40].
Coordinate reference systems can be identified in different ways.The most currently common ways are the EPSG (European Petroleum Survey Group) identifier and the more complete WKT2 ("well-known text") defined by the Open Geospatial Consortium [33] and transposed in ISO 19162 [41].Table 2 lists some common CRS.Most often, countries use a specific CRS for maximum accuracy.In R, the CRS of an object of class sf can be consulted using sf :: st_crs [20], and the CRS of an object can be modified using sf :: st_transform [20].

Scatter Map
A scatter map is the simplest type of map used to represent data points on a map, usually with a color proportional to a given variable.In this first example, a scatter map representing the soil organic carbon (SOC) content measured across Germany (data from Poeplau et al. [49]) is prepared.
The data are first loaded from tidyr :: xlsx files using openxlsx :: read .xlsx [38] and stored as objects of class df.The mean SOC is calculated for each sampling location using dplyr :: group_by and dplyr :: summarise [50].The measurement and sampling location data are merged using base :: merge [1].The data are converted to an object of class sf using sf :: st_as_sf [20], specifying the columns to use as coordinates and the CRS.
The background map of the European Nomenclature of Territorial Units for Statistics (NUTS) is loaded using giscoR :: gisco_get_nuts [44].This function allows us to download the map with three available CRS: EPSG 4326 (default), EPSG 4258, EPSG 3035 and EPSG 3857.Since the ESPG of the data is EPSG 25832, the CRS of either the data or the map must be changed to match the other.In the present example, the CRS of the map is changed because EPSG 25832 is more accurate for the region of interest.A dummy boolean variable is created to indicate whether the polygon belongs to Germany or not.This variable is used later to distinguish Germany from other countries on the maps.#Read map of EU with regions and transform to same CRS europe_sf<-gisco_get_nuts(nuts_level = "2") |> st_transform(crs="EPSG:25832") #Mark polygons belongign to Germany europe_sf$Germany<-europe_sf$CNTR_CODE=="DE" When generating plots using ggplot2 [25], the default color scale fits the whole range of the data for the represented variable.In frequent cases (e.g., presence of outliers or highly skewed distribution), most data may fall within a narrow range of the color scale, making it rather uninformative.In the present example, this is overcome by scaling the minimum and maximum values of the scale to specific quantiles using the limits argument of viridis :: scale_color_viridis [51] and by adding the argument oob= scales :: squish [52] (which replaces out of bounds (oob) values with the closest limit).The viridis color scales have the advantage of remaining readable with color-blindness [51].The axis range of the map is defined by supplying values from sf :: st_bbox [20] (which calculates a bounding box of its input) to ggplot2 :: coord_sf [25] to automatically "zoom" on the area of interest.The functions ggplot2 :: scale_fill_manual [25] and ggplot2 :: guides [25] are used to assign a distinct color to Germany and hide the legend for the associated dummy variable, respectively.The resulting map is shown in Figure 1.

Choropleth Maps
A choropleth map is a type of map where arbitrary areas (usually administrative) are colored accordingly with a given variable (usually an aggregate statistic).Choropleth maps are most useful when the data are meaningful to the geographic segmentation (e.g., economic and demographic data such as per capita income, gross domestic product. . .).
In the example below, a choropleth map is used to represent the soil organic carbon (SOC) content measured across Germany.First, each point is associated with the NUTS in which it is located using sf :: st_intersection [20].The mean for each NUTS is calculated and joined with the corresponding geometries.The geometries must be dropped from the means with sf :: st_drop_geometry [20] since sp :: merge can only join an object of class sf with an object of class df.The resulting choropleth map is shown in Figure 3.  Choropleth maps are also commonly used to represent categorical variables as well.In the following example, the color scale is discretized by setting the argument discrete = TRUE in viridis :: scale_fill_viridis [51].The result is shown in Figure 4.

Hexbin Map
Hexbins are the two-dimensional analogs of histograms with hexagon-shaped cells.Hexbin maps are somewhat in-between scatter maps and choropleth maps: they can be used to represent aggregate data over space, but unlike choropleth maps, the segmentation is regular and not related to a meaningful characteristic such as a national border.The relevance of the information communicated depends heavily on the granularity (i.e., cell size).Hexbins should only be used for relatively small areas (e.g., country or smaller), as the wider the area represented on a map, the more distorted the cells will be in the real space when applied to a planar projection.
Square counterparts also exist ("fishnets" maps).The process of generating a fishnet map can be assimilated into the conversion of vector data to raster data ("rasterization") [40].
In the following example, the mean soil organic carbon (SOC) content measured across Germany is represented in a hexbin map.To achieve this, a hexagon grid covering the data is generated using sf :: st_make_grid [20] with the argument square = FALSE (this step is also called tesselation) and formatted into an object of class sf using sf :: st_sf [20].The desired cell size is specified with the argument cellsize.The value for this argument must be supplied with a unit compatible with the unit specified in the CRS, which can be specified with units :: as_units [54].It follows that it is not straightforward to set a cell size area when the unit of the CRS is degrees (which is the case for the widespread EPSG 4326).
Since ggplot2 :: coords_sf [25] displays an area slightly larger than the one specified using ggplot2 :: coord_sf [25], the tesselation is performed on an area expanded by a factor 0.25 compared to the area of interest covered by the data.The result is shown in Figure 5.

#Define interest of area
SOC_bbox<-st_bbox( SOC_sf, crs=paste("EPSG",st_crs(SOC_sf)$epsg,sep=":")) #Calculate width and height of interest area width <-SOC_bbox [3] -SOC_bbox [1] height <-SOC_bbox [4] -SOC_bbox [2] #Define a larger area for tesselation expansion_factor<-0.In the previous example, the grid was generated in such a way that hexagons have the same size on the projected map.Because of the projection, the actual area of the hexagons is not exactly identical.The broader the geographic area covered, the higher the deviation between cells.
Alternatively, it is possible to generate a grid with cells of actual regular size/area using the package dggridR.A dggr object representing the desired grid is obtained with dggridR :: dgconstruct [55].The resolution parameter is selected from 0 to 30 (each resolution corresponds to a given division of Earth's surface; the corresponding number of cells and area can be consulted by running dggridR :: dginfo [55]).Various grids are available.In the present example, the default grid ("ISEA3H") is used.
The function dggridR :: dgrectgrid is then used to generate the grid as an object of class sf based on the dggr object and coordinates of the bounding box covering the area of interest in degree longitude/latitude.The following steps are identical to the first hexbin map example.The result is shown in Figure 6.

Bubble Map
A bubble map is very similar to a scatter map, except that the sizes of the dots (or another shape) are proportional to a variable (instead or in addition to color).
In the following example, a bubble map is used to illustrate the mean concentration of particulate matter with an aerodynamic diameter smaller than 10 µm (PM10) across the United Kingdom in the period 2016-2019 (source data: [56,57]).Again, the data are read from tabular format (in this case, .csv)and converted to an object of class sf using sf :: st_as_sf [20].Since the CRS is not specified, the default WGS84 (EPSG 4326) is assumed.To produce a more accurate map, the CRS is changed to the British National Grid (EPSG 27700), a more specific CRS for this region.The CRS of the map is also changed to match the CRS of the data.
For this example, the location data were deliberately converted to a simple feature before merging with the measurement data.When merging an object of class sf with an object of class df using sp :: merge [58], the first must be supplied to the x argument and the second to the y argument.

#Append calculated means to corresponding geometry in sf object
A single combined legend for multiple aesthetics (size and color) can be obtained by using ggplot2 :: guides and supplying ggplot2 :: guide_legend [25] to arguments of each aesthetic.The labels in the legend for both aesthetics (specified using ggplot2 :: labs [25]) must also be identical.The map obtained is displayed in Figure 7.

Faceted Map
In a faceted map, data are subsetted into groups and a map is generated for each group.The generated maps are typically displayed in a grid layout.Faceted maps are a common way of representing geospatial data at different points in time or for other categorial variables.
The following example shows how to produce a faceted map of the maximum daily PM10 concentration per year in the UK in the period 2016-2019.First, the year is extracted from the timestamps of each measurement with base :: format [1].Then, means are calculated for each station and each year.
The faceted map is generated in the same way as a bubble map, except that the function ggplot2 :: facet_wrap [25] is added.The result is shown in Figure 8.

Conclusions
The present tutorial article briefly presented the concepts of vector data, raster data and coordinate reference systems.The implementation of basic vector geospatial data representation and operations (reading, creating geometries from scratch, buffers from existing geometries, intersecting geometries. . . ) using the R language was demonstrated.Finally, the main types of maps and their generation and fine tuning through R were described.The present tutorial can be used as a teaching resource for data scientists and students beginners in geospatial analysis.The full RMarkdown script and the datasets used to prepare this manuscript available in Supplementary Materials.

Figure 1 .
Figure 1.Scatter map of mean soil organic carbon (SOC) in Germany measured in the period 2011-2018.

Figure 2 .
Figure 2. Watercourses with a long-term average discharge higher than 10 m³/s in central Europe according to their flow.

Figure 3 .
Figure 3. Choropleth map of mean soil organic carbon (SOC) in Germany measured in the period 2011-2018 (continuous variable).

Figure 5 .
Figure 5. Hexbin map of mean soil organic carbon (SOC) in Germany measured in the period 2011-2018 (tesselation performed on a planar projection).

Figure 6 .
Figure 6.Hexbin map of mean soil organic carbon (SOC) in Germany measured in the period 2011-2018 (tesselation performed before planar projection).
© EuroGeographics for the administrative boundaries

Figure 7 .
Figure 7. Bubble map representing the mean PM10 concentration across the UK in the period 2016-2019.

Figure 8 .
Figure 8. Maximum daily PM10 concentration per year in the UK in the period 2016-2019.

Table 1 .
Examples of commonly used GIS data formats.

Table 2 .
Examples of commonly used CRS.
Technical documentation on this function is available in the vignette at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf(accessed on 20 March 2024) 2 Technical documentation on this function is available in the vignette at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf(accessed on 20 March 2024)