d2ix : A Model Input-Data Management and Analysis Tool for MESSAGE ix

: Bottom-up integrated assessment models, like MESSAGE ix , depend on the description of the capabilities and limitations of technological, economical and ecological parameters, and their development over long-time horizons. Even small models of a few nodes, technologies and model years require input-data sets involving several hundred thousand data points. Such data sets quickly become incomprehensible, which makes error detection, collaborative working and the interpretation of results challenging, especially for non-self-created models. In response to the resulting need for manageable, comprehensible, and traceable representation of input-data, we developed a Python-based spreadsheet interface ( d2ix ) that enables presentation and editing of model input-data in a concise form. By increasing accessibility and transparency of the model input-data, d2ix reduces barriers to entry for new modellers and simpliﬁes collaborative working. This paper describes the methodology and introduces the open-source Python-package d2ix . The package is available under the Apache License, Version 2.0 on GitHub.


Introduction
The software package described in the following -d2ix-is freely available under the Apache License, Version 2.0 on GitHub under: https://github.com/tum-ewk/d2ix.

Input-Data-Handling-The Underrated Modelling Challenge
Technology-based integrated assessment models, such as MESSAGE ix (formerly known as MESSAGE) have a long history in energy and environmental systems modelling [1,2]. Despite having been developed in times of relatively low computing power, over the last forty years these models have grown in line with multiplying computing capacity, expanding models in dimensions such as coverage and detail [3]. Until the 1990s, the models focused on the energy-system only [4]; however, today's energy-engineering-economic-environment optimising models are designed to describe the full extent of energy-system dynamics, including effects such as polluting greenhouse gas emissions, economic development, land and water use and health implications [5,6]. At the same time, rising computing power allows not only increasing coverage but also magnifies the level of detail represented in models, such as the number of model years, nodes, technologies and technology parameters.
While in line with this structural change, big data not only presents a challenge in terms of energy-systems modelling: here too, the amount of input-data has skyrocketed [7]. Today, one technology in MESSAGE ix is described by forty parameters, of which fourteen are defined not only by the installation year of the technology but also the age of the technology. Thus, in even very simple input-data sets (e.g., describing one node over ten model years), each technology is defined by approximately one thousand input parameters, each again defined by up to twelve sets. Therefore, even a small model of one node over ten years has an input-data set per technology of more than twelve-thousand data points, not including the input-data for describing the ecology or economy (Figure 1).

d2ix-Combining Benefits of Non-Binary and Binary Data Formats
Currently, most models handle input-data using vast spreadsheets (e.g., MS Excel), csv (comma-separated value) or plain text-files to organise, pre-process and document the model-data. While on the one hand, binary ('higher') formats, such as spreadsheets, provide support with data-handling, (un)intentional changes made to the input files are not trackable and are difficult to retrace. On the other hand, non-binary ('lower') but trackable formats such as csv and text-files lack visual clarity and data-handling support. To ensure transparency in data-handling and reproducibility of model results, the modelling-platform (ixmp), supplies MESSAGE ix users with tools for (i) database communication for version-controlled data management, (ii) a Python/R interface for efficient input-data and results processing and (iii) a web-browser based tool for drag and drop results visualisation [8]. The newly developed 'data to MESSAGE ix ' (d2ix) package adds to this functionality by providing the user with a visually comprehensible overview of the input-data by reducing the dimensions of the input-data set, thereby reducing the number of data points to be handled by the user (Figure 2).

User interface
Web-browser based interface for postprocessing and data visualization Python/R API  This model input-data-handling approach, as such, is novel as it is the first to combine reduced form MS Excel spreadsheet data and lucidly change-tackable .yaml files for input-data documentation. By following the FAIR principles of scientific data-handling and analysis, d2ix makes data findable, accessible, interoperable and reusable, and thus facilitates collaborative working and can therefore support the energy-modelling community [9]. By enabling new users to quickly become acquainted with existing models, and by simplifying the generation of new scenarios, d2ix reduces the barriers to entry into energy-and climate-policy modelling. Furthermore, the synoptic organisation of the input-data set can reduce the risk of errors prone to happen when organising big data sets ( Figure 3). Such errors can have detrimental effects such as the data and coding mistakes causing the infamous Reinhart-Roghoff spreadsheet error [10]. Lastly, the interface will be equipped with a unit test that can inspect the model for commodity 'dead ends' and overly restrictive bounds, a feature that can prevent infeasibilities, undesired exceedingly restricted scenarios and the misinterpretation of results. Overall, d2ix is a well-suited data-handling tool for large energy-system models such as MESSAGE ix . The easy change-trackable framework for transparent model input-data preparation is the first of its kind to be introduced as a standardised model-creation workflow.

Related Work
While, in the light of good scientific practice, model transparency and reproducibility have received wide academic attention, the focus has remained on how to deal with and how to publish raw-data and model code [11]. However, the important link between the two much noted components-the raw-data and the model-the input-data-handling, has so far not been dealt with scientifically [12,13]. In contrast, the major strategies of input-data-handling which established themselves as go-to solutions in energy-system modelling have never been subject to publication but rather research-institution internal, customised, single-user solutions. Thus, most models now provide different data-handling strategies. Four mayor types can be identified among the most commonly used input-data-handling methods. They are: • Type I-Reduced text-file structures: The input-data of such models is handled in several long or one single, even longer, structured text-file. Such text-file-based input-data systems used to find application with most energy-system models. Due to their long history as well as their suitability for synoptic change-tracking, some modellers still rely on Type I input-data-handling strategies (e.g., Calliope [14]). • Type II-Full parameter text-files: The input-data handled by this type is organised in a multitude of text-files. Each file contains one parameter in full dimension and shape as required by the database. Despite the low lucidity and the difficulty in tracking any (un-)intentional changes made to the input-data, Type II data-handling schemes are commonly used. Especially community-friendly, open-source models such as PyPSA [15] and oemof [16] in particular appreciate the high flexibility of the input-data-handler in combination with the low requirements regarding the programming skills of the modeller. • Type III-Reduced parameter spreadsheets: Here the input-data is organised in MS Excel spreadsheets in reduced dimension. While this dimension reduction increases the lucidity of the input-data, it can at the same time limit flexibility. However, in order to lower barriers to entry for new modellers, several open-source models such as urbs [17] and ficus [18] rely on Type III input-data structures. • Type IV-Code-based input-data: Code-based input-data can be either hard-coded or predefined and processed in functions. Thus, the input-data is documented and stored together with the code. Such transparent data-handling types allow for the full documentation of code and input-data within one workflow. However, extensive amounts of hard-coded data, can become overwhelming for any new user, just like the text-file-based data. Nevertheless, several renowned models such as MESSAGE ix [8], Temoa [19] and OSEMOSYS [20] provide interfaces for hard-coded model input-data. Table 1 summarises and lists the strengths and shortcomings of those four strategies and compares them to the newly developed d2ix workflow. It shows that by filling the gaps in documentation, standardisation and transparency, frameworks such as d2ix can help improve energy-system modelling by combining the strength of binary and non-binary input-data storage and handling formats.

Methodology
The Python-package we have created, d2ix, supports the user in creating new MESSAGE ix models as well as adapting and analysing existing input-data sets and scenarios. The support consists of four main tasks: first, d2ix supports the user in organising the input-data for MESSAGE ix . For this task, we created an abstracted data model, summarising the reduced model input-data in two spreadsheet files. Secondly, d2ix functions as a standardised interface between the spreadsheets and the MESSAGE ix Python API. Third, d2ix documents the pre-processed model input-data in yaml text-files. This allows systematic and visual change-tracking of the spreadsheets-based scenario-data using automated change-tracking services such as Git. Lastly, several unit tests implemented in d2ix will allow an automated structured inspection of input-data sets to identify commodity 'dead ends' and overly restrictive constraints.

Class Structure and Definition
The d2ix package supports researchers who want to create a MESSAGE ix model, either from scratch or by modifying existing models ( Figure 4). This support is supplied by the means of four different classes which handle the data input. In the following, the classes are described in their functionality and structure.  The MessageInterface class acts as the interface between d2ix and the ix modelling-platform (ixmp). To communicate with the ix modelling-platform, MessageInterface applies the MESSAGE ix classes ixmp.Platform and message_ix.Scenario. While the Platform instance contains the connection to the database, the Scenario class predefines the format and indexation of the model in-and output-data (parameters, sets and variables) required for running the MESSAGE ix model. The database with which MessageInterface establishes a communication with is defined in the run-config file provided in the config folder (..\d2ix\config\run_config.yaml.template). The unique identification of the established Scenario instance is defined by the user input (Section 4.2), as is the logger setting of the d2ix module.

DBInterface-Data-Handling in d2ix
The DBInterface class enables data-handling in the d2ix package. The DBInterface class holds the model input-data in the form of a dictionary containing all model sets and parameters which can be accessed and modified before being transferred to the database via the MessageInterface class. The central tasks of this class are (i) to hand over the final input-data created in d2ix to the database, (ii) to write the final input-data into text-files for transparency and change-tracking, and (iii) to collect the model results from the database after a model run. Furthermore, the DBInterface class will check whether the units used in the input-data are already stored in the database and will add them if they are not.

Model-Data Transformation from Reduced Spreadsheet to Database Format
The Model class constitutes the core of the d2ix package. Its main task is the pre-processing of the input-data from the reduced d2ix spreadsheet format to the expanded final input-data format required by MESSAGE ix . Apart from creating all required sets and parameters, the Model class automatically adds one slack technology for each demand provided in the input-data set, in order to prevent the model from running into infeasibilities during calibration, and to simplify debugging. After each successful scenario-run in MESSAGE ix , the Model class reformats the results from database tables into time-series elements optimised for post-processing, applying the TimeSeries class from the ixmp package [8].

ModifyModel-From Database to Spreadsheet and Back
The ModifyModel class is used to enable the analysis and modification of existing MESSAGE ix models, i.e., models readily available in the database. To do so, the ModifyModel class has two main functions: (a) ModifyModel allows users to choose a specific MESSAGE ix scenario-run, which is then, first collected from the database, secondly, written to an excel sheet and lastly, made accessible to the user as a Python dictionary. The data can then be analysed and modified either in spreadsheet or through scientific computing (e.g., Python). In the second function (b) the modified data can be returned to the database as a new scenario containing the changes applied by the user.

Testing and User Experience
In accordance with best-practice collaborative programming [21], we set up a Continuous Integration implementation, with CircleCI and Docker each executing several tasks. Additionally two linters, thus static code analysis segments, are configured for basic code quality checks to ensure long term code maintainability. The coding style is tested with Flake8 and MyPy, the static types in Python.
Furthermore the API functionality is tested in a defined environment inside a Docker container using the d2ix tutorial and some basic examples.
We tested the functionality of d2ix together with various beta users. In a first step, the data transfer from the spreadsheet to the ixmp platform and the git-tracked text-files was evaluated, thus proving the data-model functionality of d2ix. In a second step, we created three models of different sizes, in order to analyse and improve the runtime performance. The model descriptions and runtime performance are documented in Table 2. Finally, we tested the tool's intuitiveness with users without programming skills. By having such a user without programming experience recreating an existing MESSAGE ix model we succeeded in proving the data-model functionally as well as the coherence of the API. As a test model to recreate, we used the standalone country model of South Africa, which is available under the GNU General Public License, Version 3 on GitHub (https://github.com/tumewk/message_ix_south_africa) [22]. Two further MESSAGE ix country models are currently being developed for energy-research purposes.

Installation
To start using the open source Python-package d2ix, you must to ensure that your environment is equipped with the requirements as described in the README instructions found alongside the d2ix repository (https://github.com/tum-ewk/d2ix).

Running d2ix-Creating a Model from Scratch
The core functionality of the d2ix tool is to create a model from scratch. The bases for model creation are two reduced spreadsheets (Figure 3). In this example, we create a new MESSAGE ix scenario-in this case the replica of the 'Westeros' tutorial from the MESSAGE ix repository-using the d2ix MS Excel templates. The required parameters, configurations and files with the corresponding path are shown in Listing 1. The code creating the scenario is shown in Listing 2 and is explained below.
Furthermore, an introductory tutorial is provided in the d2ix repository under tutorial.ipynb.

Creating a Model Instance
The Model class provides the functionality to create a model from scratch. The class instance is specified by thirteen parameters which are described in Table 3. Furthermore, the code to create a new instance is provided in Listing 2. Table 3. Parameters used for creating a model instance.

Parameter Description
run_config the path to the run_config.yaml file located in the config folder 1 The file contains the specifications of the database type.
base_xls the path to the model_data file, located in the input folder 2 MS Excel file consists of seven input sheets that contain all necessary information for the model creation. Thus, the demand, the units, the technologies, and the nodes are defined and mapped in model_data. The structure of the file must to remain unchanged, as the input-data expansion done by d2ix depends on the current structure.
manual_input_parameter all possible parameters, such as economic parameters and ecological constraints, can be added to the model using the manual_input_data file It provides the option of adding parameters manually by adding a sheet by the name of the parameter which contains the data in the format required by MESSAGE ix . Thus, parameters in the manual_input_data are not manipulated by d2ix, but simply scanned for new set elements before being passed on to the database. historical_range_year defines the the temporal resolution from historical data model_range_year defines the the model temporal resolution verbose defines the logger level in order to facilitate easy debugging yaml_export allows the export of yaml files from the model The model parameters and sets can be written to structured text-files before being added to the database in order to facilitate change-tracking (e.g., Git) despite the input-data being provided in xlsx format. This can be turned off during model calibration in order to increase speed.
creation with d2ix as well as the model input-data paths and the model name are defined in Listing 1. The newly created instance is named 'baseline' and spans over a time horizon from the year 690 to the year of 720. The first model year is defined as the year 700. The resulting model-year vector is equal to [690, 700, 710, 720], wherein 690 is a historical year, thus, not considered in the optimisation. By setting verbose to true, the log-level is set to debug mode which allows for more information to pass from the creation process to the user. Setting the yaml export parameter to true permits the creation of git-trackable yaml files of the input-data. It is recommended to only set it to false during calibration, as this shortens the model creation runtime, though it disables the git-trackability of the input-data set.
Listing 2: Creating a new MESSAGE ix scenario using the d2ix spreadsheet templates. 1 from d2ix import Model 2 3 # Create a Model instance from the data provided in base_ & m a n u a l _ p a r a m e t e r _ x l s 4 d2ix_model = Model ( run_config = CONFIG , base_xls = BASE_XLS , 5 m a n u a l _ p a r a m e t e r _ x l s = MANUAL_PARAMETER_XLS , model = MODEL , scen = ' baseline ' , 6 historic al _ da t a = True , f i r s t _ h i s t o r i c a l _ y e a r =690 , f i r s t _ m o d e l _ y e a r =700 , 7 last_mod el _ ye a r =720 , h i s t o r i c a l _ r a n g e _ y e a r =10 , m o d e l _ r a n g e _ y e a r =10 , 8 verbose = True , yaml_export = True ) 9 10 # write data from ' model ' dictionary to the database and solve 11 scenario = d2ix_model . model2db () 12 scenario . solve ( model = ' MESSAGE ') 13 d2ix_model . close_db ()

Transferring a Scenario from d2ix to the Database-model2db()
When the input-data is ready, it can be passed to the database using the model2db function, which returns an instance of the messageix.Scenario class (Listing 2, line 11).

Solving a Scenario
Using the solve function (from the messageix.Scenario class), the database model is dropped to a structured input-gdx file, which is passed on via a solve command to the mathematical model formulation of MESSAGE ix . After the successful model run, an output-gdx file is created containing all input and output-data. This file content is automatically passed on to and stored in the database. Further details on the solve() function can be found in the MESSAGE ix documentation [23]. Sample results of the baseline scenario from the Westeros example are shown in Figure 5.

Running d2ix-Modifying Existing Models
The d2ix package can also be used to modify existing models. The code required for retrieving, modifying and returning input-data sets to the database is shown in Listing 4, and is explained below.

Creating a ModifyModel Instance
The ModifyModel class provides the functionality of collecting models from the database, writing them into a structured spreadsheet file for user modification and returning the modified model to the database. The parameters specifying the ModifyModel instance not introduced in Section 4.2.1 (run_config, model, scen and verbose) are described in Table 4. The scen2xls() function (Listing 4, line 8) searches the database for the scenario defined by model and scenario name, and in the mod_model. If a scenario with the defined model and scenario name is available, all parameters and sets from the most recent (default) version of the scenario will be written to the spreadsheet. If a version is specified, this version instance of the scenario will be copied.

From the Excel File to ModifyModel Instance-xls2model()
The xls2model() (Listing 4, line 10) function reads the spreadsheet file specified in the mod_model instance and stores the data as a structured dictionary in the instance. The data is then available to modify, analyses and visualise using the Python functionality.

Post-Processing a MESSAGEix Scenario
The ixmp package supplies tools for standardised reporting of reference data and results. These tools are documented and described in [8] as well as in the online documentation [24].

Conclusions
In d2ix, we built a package that supports users in creating, modifying, and analysing MESSAGE ix scenarios. The main benefits of using d2ix for scenario creation are threefold. (i) The synoptic input-data supports the transparency and reproducibility of even large models and can thus reduce errors. It further encourages collaborative modelling attempts by making it easier to understand and review model parameters and assumptions implemented by other researchers. (ii) By reducing the dimensions of the input-data, the researchers can easily handle the data using two MS Excel sheets. Hence, d2ix reduces barriers to access by reducing input-data complexity and allowing scenario creation without programming knowledge. (iii) d2ix permits the combination of the benefits of 'higher' (easy and synoptic data-handling) and 'lower' (change-trackability) data formats. To put it succinctly: by providing a synoptic and easy input-data-handling workflow d2ix can support the efforts of the open data movement within the MESSAGE ix modeller community and can serve as an example for data-handling frameworks built for other model types.
However, simplification of input-data does reduce the flexibility of the model, e.g., currently a maximum of two outputs is supplied for each technology. However, this can be bypassed by either adapting the model parameter 'output' using the get_ and set_ parameter functionality, or by adapting the input spreadsheet and the underlying code to supply as many outputs as required. An expansion of d2ix to increased flexibility could be subject of future work; however, the decision on the specific balance between flexibility and simplicity requires practical experience which still remains to be collected.