1. Introduction
Data and data analysis lie at the heart of science. Sharing scientific data is essential to allow scientists to build on each other’s work. However, such sharing will only occur if the data are trusted. Before using a dataset, a scientist would typically want to know the origin of the data, who was responsible, and how the raw data were manipulated to produce the published dataset. Data provenance describes this history of data and is useful to instil confidence in the data and to encourage its reuse [1].
Moving from raw data to published data typically follows a data analysis workflow: cleaning the data to remove questionable values, extracting the subset of the data relevant to the current analysis, performing the necessary statistical analysis, and publishing the results in the form of output data, plots, and/or papers describing the results.
There are several different types of provenance that help to describe this process. Static metadata capture details about data collection: where the data were collected, when, by whom, with what instruments, etc. Workflow provenance identifies the software used to process the data and how the data were passed from one tool to another as they were analyzed. Fine-grained execution provenance captures further detail internal to the software, recording precisely how each value was computed.
Besides improving the trustworthiness of data, provenance has the potential to help solve some important problems [2]. If a scientist discovers that a sensor has failed, for example, he or she might use provenance to determine what data were computed from that sensor so that they can be discarded. Conversely, if a scientist observes surprising results in output data, he or she might trace back to the inputs and discover in the process that a sensor has failed.
Provenance can also be used to help one scientist understand or reproduce another scientist’s work. For example, by examining provenance, a scientist could determine what bounds a second scientist had used to identify outliers in the data. If the first scientist felt those bounds were inappropriate, the analysis could be rerun with different bounds to see if that had an effect on the analysis results.
Fine-grained provenance can also support the debugging of scripts under development by capturing intermediate values. Using provenance, the scientist might trace backwards through the execution of the script to determine how a value was calculated and find the error without the need to set breakpoints or insert print statements and re-run the code, as is more commonly done. Fine-grained provenance also avoids the problems of nondeterminism caused by the use of random numbers in a simulation, or in concurrency, as the fine-grained provenance contains a precise history of the script execution that led to the error.
The focus of this paper is on a tool called RDataTracker and the value of the fine-grained provenance collected by RDataTracker for scripts written in R (https://www.r-project.org/), a language widely used by scientists for data analysis and visualization. In this paper we describe how we collect such provenance without directly modifying the R interpreter, in order to make provenance capture available to scientists using the standard R interpreter. In previous papers [3,4], we introduced RDataTracker and DDG Explorer, a visualization tool that works with the provenance collected by RDataTracker. The version of RDataTracker described in those papers required the scientist to modify the script extensively to include commands identifying the provenance to collect. This paper describes a new implementation of RDataTracker that requires no annotation by the scientist and instead relies on introspection to identify the provenance to collect.
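To give a flavor of this approach before the detailed description in Section 4, the following is a minimal sketch, and not RDataTracker’s actual implementation, of how statement-level provenance can be gathered with standard R introspection: parse the script, evaluate one top-level statement at a time, and record the variables each statement reads and writes.

# Minimal sketch (not RDataTracker's implementation): record statement-level
# provenance by parsing a script and evaluating one top-level expression at a time.
collect_provenance <- function(script.path) {
  exprs <- parse(file = script.path)       # top-level expressions of the script
  env <- new.env(parent = globalenv())     # environment in which the script runs
  prov <- list()
  for (e in exprs) {
    before <- ls(env)
    eval(e, envir = env)                   # execute one top-level statement
    after <- ls(env)
    prov[[length(prov) + 1]] <- list(
      statement = paste(deparse(e), collapse = " "),  # the code that was executed
      vars.used = intersect(all.vars(e), before),     # variables read (approximate)
      vars.set  = setdiff(after, before)              # new variables created
    )
  }
  prov
}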
In Section 2, we describe related work. In Section 3, we describe the provenance collected by RDataTracker. In Section 4, we describe the implementation of RDataTracker. In Section 5, we evaluate the performance of RDataTracker. Our future plans are described in Section 6. Appendix A provides information about how to download and run RDataTracker.
2. Related Work
Capturing data provenance is a common feature of workflow tools, including Kepler [5], Taverna [6], and VisTrails [7], among others. The provenance captured by such tools describes the interaction among the tools used in a workflow and how the data flow between tools. While some scientists use workflow tools, they have not gained wide acceptance. In our experience, many scientists perform simpler data analyses that can be captured by a single script, such as an R or Python script, or by using tools such as R Markdown [8] or Jupyter notebooks (https://jupyter.org/).
Several systems allow the collection of workflow-like provenance without requiring the user to use workflow tools. Acuña et al. [9] describe a system that captures file-level provenance from ad hoc workflows written in Python. Burrito [10] collects provenance as a scientist works with multiple tools by inserting plug-ins that monitor shell commands, file activity, and GUI activity. This history is made available through a variety of tools that allow the scientist to review and annotate past activity. ProvDB [11] and Ground [12] support storing provenance that is collected by external tools and then ingested into a database to support querying and analysis. YesWorkflow [13] constructs a workflow representation from a script based on stylized comments introduced by the scientist; McPhillips et al. report that these annotations can sometimes then be used to collect workflow-style provenance. These systems focus on gathering provenance at the level of the workflow. In contrast, our work focuses on fine-grained provenance collected at the statement level.
Thirty years ago, Becker and Chambers [14] described a system for collecting provenance for top-level statements in the S language (a precursor of R). Though their approach is no longer viable (all objects in that version of S were stored in the file system), their paper foresees many of the applications and challenges of provenance today.
There are several other tools that collect provenance for R. recordr [15] is an R package that collects provenance concerning the libraries loaded and files read and written. It does this by overriding specific functions (such as read.csv) to record the provenance information prior to calling the predefined functions. The provenance recorded is thus at the level of files.
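The following is an illustrative sketch of this function-overriding technique in general, and not recordr’s actual code: a wrapper with the same name shadows the built-in function, records the file being read, and then delegates to the original.

# Illustrative sketch of function overriding (not recordr's code): shadow
# read.csv with a wrapper that logs the file before delegating to the original.
read.csv <- function(file, ...) {
  message("provenance: reading file ", file)   # record file-level provenance
  utils::read.csv(file, ...)                   # call the predefined function
}
d <- read.csv("observations.csv")   # hypothetical file; logged, then read as usual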
rctrack [16] is an R package that collects provenance with a focus on reproducibility. The provenance it collects consists of an archive containing input data files, scripts, output files, random numbers, and session information, including details of the platform used as well as the version of R and the loaded libraries. Having this level of detail is very valuable for reproducing results. The approach used in rctrack is to use R’s trace function to inject code into the functions that read and write files, generate random numbers, and make calls to external software (such as R’s system function). While including more information than recordr, this provenance is still primarily at the level of files and intended to support reproducibility. In contrast, the provenance collected by RDataTracker is finer grained and also serves the purpose of debugging.
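As an illustration of the general technique, and not of rctrack’s actual code, base R’s trace function can inject a logging expression that runs on entry to a file-writing function:

# Illustrative use of base R's trace() (not rctrack's code): run a logging
# expression each time write.table is entered.
trace(write.table,
      tracer = quote(message("provenance: writing file ", file)),
      print = FALSE)
write.table(data.frame(x = 1:3), file = "out.csv", sep = ",")  # tracer fires first
untrace(write.table)                                           # remove the instrumentation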
Several other approaches to collecting fine-grained provenance involve modifying a language compiler or interpreter to perform the work. Michaelides et al. [17] modify the StatJR interpreter for the Blockly language to capture provenance used to provide reproducibility. Tariq et al. [18] modify the LLVM compiler framework to collect provenance at the entry and exit of functions. IncPy [19,20] uses a modified Python interpreter to cache function arguments and return values to avoid recomputation when the same function is called again with the same parameters. In CXXR [21], the read-eval-print loop inside the R interpreter is modified to collect provenance. The provenance collected is at the level of the global environment and is thus less detailed than the provenance collected by RDataTracker. Also, the provenance collected by CXXR is not made persistent and thus is only available for use within the current R session. These approaches require scientists to use non-standard implementations of language tools, which makes it harder to stay current as languages evolve and to get scientists to adopt these tools. In contrast, RDataTracker collects provenance both at the global level and within functions and saves the resulting provenance so that it can be loaded into other tools that support analysis, visualization, and querying [4,22].
noWorkflow [23,24,25] is the most similar to RDataTracker in implementation. noWorkflow collects fine-grained provenance from Python scripts using a combination of tracing function calls and program slicing. Like RDataTracker, noWorkflow works with the standard Python interpreter and relies on techniques that examine the runtime state of the script to collect provenance. The provenance collected by noWorkflow and RDataTracker is at a similar granularity, although the techniques used are different.
5. Evaluation
When RDataTracker collects detailed provenance, we expect a significant slowdown in execution, since each line of the original script results in provenance being saved. Each line results in the creation of a procedural node, most lines produce at least one output data node and use at least one input data node, and the edges associated with these nodes must be created. Furthermore, saving copies of input and output data files and plots, as well as intermediate data (which can include large data frames), is time-consuming.
In this section, we present the results of timing script execution with and without RDataTracker collecting provenance. Our goal is to measure two things: the slowdown caused by saving more detail in the DDG, and the slowdown caused by creating snapshots of intermediate values.
To measure these effects, we used slightly modified versions of scripts used at Harvard Forest. One script, the Met script, processes data from the Fisher Meteorological Station. The second script, the Hydro script, processes data from six Hydrological Stations in the Prospect Hill Tract. Both of these scripts create tables and graphs for the Harvard Forest web page (http://harvardforest.fas.harvard.edu/real-time-data-graphs). For both scripts, the timing tests were run on saved input files, rather than real-time data, and the output was saved to a collection of files, rather than moved to the website. This avoided any timing variations that might be due to network speed or different input data values.
Table 4 provides further detail on the scripts. In the table, a top-level statement is a statement that is not inside a function. Note that a control construct at the top level counts as a single statement. The number of top-level statements will match the number of procedural nodes created in an execution captured at detail level 0, as described below.
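For example, the small illustrative fragment below (not taken from the Met or Hydro scripts; the file and column names are hypothetical) contains three top-level statements: two assignments and one for loop, whose body is not counted separately.

raw   <- read.csv("met.csv")                  # top-level statement 1
clean <- raw                                  # top-level statement 2
for (i in seq_len(nrow(clean))) {             # top-level statement 3 (entire loop)
  if (clean$temp[i] < -50) clean$temp[i] <- NA
}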
Both scripts were run with various combinations of detail level and snapshot size (a hypothetical invocation illustrating these settings appears after the two lists below). The levels of detail were as follows:
No DDG: The original script with no provenance collected.
Detail level 0: Collect provenance only for top-level statements.
Detail level 1: Collect provenance internal to functions and internal to loops but only for a single loop iteration.
Detail level 2: Collect provenance internal to functions and internal to loops for up to 10 iterations of each loop.
Detail level 3: Collect provenance internal to functions and internal to loops for all loop iterations.
The snapshot sizes were as follows:
No DDG: The original script with no provenance collected.
No snapshots: Collect provenance but save no snapshots.
10K snapshots: Save snapshots up to 10K bytes per snapshot.
100K snapshots: Save snapshots up to 100K bytes per snapshot.
Max snapshots: Save snapshots of any size.
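For concreteness, a run combining detail level 2 with 10K snapshots might be launched as in the sketch below; the function and argument names shown here are illustrative placeholders rather than RDataTracker’s documented interface, which is described in Appendix A.

library(RDataTracker)

# Hypothetical invocation; the argument names below are placeholders used for
# illustration, not necessarily RDataTracker's actual API (see Appendix A).
ddg.run("met_script.R",
        max.loops = 10,            # record up to 10 iterations per loop (detail level 2)
        max.snapshot.size = 10)    # cap each snapshot at 10 kilobytes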
The tests were performed on a MacBook Pro with 4 cores, a 2.3 GHz Intel Core i7 processor, and 16 GB of RAM. Each test was repeated five times, and the average value was used to produce the graphs.
Figure 4 shows the results of running the timing analysis as the level of detail and the size of snapshots saved were varied. As expected, performance slowed down as the level of detail captured in the provenance increased. At detail level 3 with no snapshots (not shown), the elapsed time for the Met script was 678 seconds, while the Hydro script did not complete after several hours, representing an unacceptable slowdown. The long computation times and large DDG sizes for level 3 (see Table 5 below) were caused by loops with large numbers of iterations (both scripts) and embedded calls to user-defined functions (Hydro script).
Figure 4 also shows that performance slowed down as the size of the snapshots increased. It is interesting to note that, at least for these scripts, there is little difference in execution time between saving no snapshots and saving snapshots that are each limited to 10K in size. Even small, incomplete snapshots might be helpful for finding certain problems with code that manipulates data frames. For example, a programmer could verify that the equation used to compute the values of a new column in a data frame is generally correct by examining just a few rows of the data frame.
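For instance, in the illustrative fragment below (not taken from the evaluation scripts), even the first few rows of a snapshot are enough to confirm that a temperature conversion was applied as intended.

# Hypothetical data-frame update of the kind a small partial snapshot can check.
met <- data.frame(temp.c = c(12.5, 13.1, -2.0))
met$temp.f <- met$temp.c * 9/5 + 32   # intended Celsius-to-Fahrenheit conversion
head(met, 3)                          # a 10K snapshot of 'met' would include these rows
#   temp.c temp.f
# 1   12.5  54.50
# 2   13.1  55.58
# 3   -2.0  28.40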
Figure 5 shows the amount of disk space used to store the provenance data recorded at different detail levels and snapshot sizes. As expected, if a large data frame is repeatedly modified, and RDataTracker is saving large snapshots, the amount of disk space used to store the snapshots can grow considerably. As with the runtime, we see little change in the total disk space used when going from no snapshots to snapshots limited to 10K.
Table 5 shows the numbers of nodes and edges in the provenance graph for various levels of detail. As expected, these numbers increased significantly as the level of detail increased. In particular, as these two scripts use for loops in several places to iterate over the rows of a data frame, we see a large jump as we go from detail level 1 (saving one loop iteration), to detail level 2 (saving 10 loop iterations), to detail level 3 (saving all loop iterations). Consistent with this, the graph at detail level 2 is somewhat less than 10 times the size of the graph at detail level 1, while detail level 3 for the Met script shows a much larger jump because one of its loops iterates 2880 times.
In general, performance slows down proportionally to the amount of data collected. We would also expect different performance characteristics depending upon the nature of the script. Slowdown can be caused by loops, when we are collecting provenance internal to loops; functions, when significant amounts of the computation are done inside user-defined functions and we are collecting provenance internal to functions; and intermediate data, when the calculations are working with large data frames or other data structures and making frequent updates to them.
It is not possible to draw broad conclusions about the performance of RDataTracker from the results of running RDataTracker on just these two scripts. Scripts written by other developers and in other domains may have a programming style that results in either better or worse performance. While our expectation is that the ways in which top-level statements, functional programming, and loops are used will have an impact on RDataTracker’s performance, we have not yet gathered the scripts needed to do this larger evaluation and draw meaningful conclusions.
In addition, new R packages are constantly being developed, some of which may have a significant impact on how scientists write R code and on what types of provenance should be collected. For example, the pipe operator provided in the magrittr (https://github.com/tidyverse/magrittr) package allows the output of one expression to be piped directly into another expression without the need to introduce intervening variables [27]. It may be desirable to collect provenance for each expression involved in a pipe, rather than just at the completion of the entire statement.
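The short example below, written for illustration rather than taken from the evaluation scripts, shows why this matters: without the pipe each intermediate value is bound to a variable and therefore appears in statement-level provenance, whereas with the pipe the same computation is a single top-level statement.

library(magrittr)

raw <- data.frame(x = c(1, NA, 3, 4, 5))   # small example input

# Without the pipe: each intermediate result is a variable visible to
# statement-level provenance.
cleaned <- na.omit(raw)
top     <- head(cleaned, 3)
result1 <- summary(top)

# With the pipe: one top-level statement; the intermediate values are never
# bound to variables, so only the final result appears in the provenance.
result2 <- raw %>% na.omit() %>% head(3) %>% summary()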