Information
  • Article
  • Open Access

15 November 2025

Sustaining CyberWater-VisTrails: A Case Study in Software Upgrades and Reengineering

1 Department of Computer Science, Luddy School of Informatics, Computing, and Engineering, Indiana University, Indianapolis, IN 46202, USA
2 Department of Civil Engineering, Swanson School of Engineering, University of Pittsburgh, Pittsburgh, PA 15260, USA
* Authors to whom correspondence should be addressed.
Information 2025, 16(11), 988; https://doi.org/10.3390/info16110988
(registering DOI)
This article belongs to the Special Issue Optimization and Methodology in Software Engineering, 2nd Edition

Abstract

This study focuses on the process of updating and upgrading a large-scale legacy software system to ensure its compatibility with modern computing environments. The evolution and maintenance of legacy software pose significant challenges in software engineering, especially given the rapid advancements in technology, computing platforms, and dependent libraries. These challenges become even more pronounced when new systems are built upon existing open-source software, which may become outdated due to discontinued maintenance or lack of community support. In this work, we examine the problem from a sustainable computing perspective through the case study of the CyberWater project—an innovative cyberinfrastructure framework designed to support open data access and open model integration in water science and engineering. CyberWater is built on top of VisTrails, an open-source scientific workflow system. VisTrails has not been actively maintained since 2017, requiring an upgrade to ensure CyberWater’s continued functionality, compatibility, and long-term sustainability. This paper presents our work on upgrading VisTrails, including the complete upgrade process, tools developed and utilized, testing strategies, and the final outcomes. We also share key experiences and lessons learned, with a focus on the sustainability challenges and considerations that arise when maintaining and evolving large-scale open-source software systems in scientific computing environments.

1. Introduction

CyberWater is a U.S. National Science Foundation project that provides an open data and open model framework designed to facilitate the incremental and seamless integration of diverse datasets and models across multiple water science and engineering disciplines. CyberWater is built upon VisTrails, an open-source scientific workflow and provenance management system that supports data exploration, visualization, and reproducible research. VisTrails 1.0 was released in October 2007, and the system was officially supported for a decade. During that time, it was downloaded thousands of times and widely used to automate repetitive tasks such as simulations, data analysis, and data visualization. One of VisTrails’ key strengths was its extensibility, allowing developers to fundamentally adapt the software to their specific models and research needs. CyberWater built on this foundation, leveraging VisTrails’ extensibility to implement its own capabilities for open data and open modeling. However, VisTrails’ official development and maintenance ceased in 2017, leaving projects that depend on it, like CyberWater, responsible for maintaining VisTrails itself. This lack of upstream maintenance became increasingly problematic. In January 2020, Python 2, VisTrails’ underlying programming language, reached its official end-of-life and no longer received updates or security patches. Additionally, PyQt4, the Qt Graphical User Interface (GUI) toolkit used by VisTrails, has also reached end-of-life, making it difficult to find compatible dependencies.
In addition to addressing the legacy issues of VisTrails, upgrading from Python 2 to Python 3 enables CyberWater to leverage modern scientific packages like “Xarray”, “NumPy” and “Pandas”. These packages significantly enhance data processing capacity across various CyberWater modules. For example, “Xarray”’s advanced functionality for subsetting datasets by temporal or spatial coordinates used in CyberWater’s data agents depends heavily on “Pandas”. However, since Pandas has discontinued support for Python 2, these subsetting features would not function properly in the older environment, creating substantial roadblocks for users. Consequently, modules relying on “Xarray” could encounter runtime errors or lose critical functionality. More broadly, Python 2 lacks key features and performance optimizations that are essential for handling the large, multi-dimensional datasets commonly used in Earth science applications.
Another important benefit of the Python 3 migration is compatibility with “fsspec”, a powerful package that provides a unified interface for interacting with various file systems, including cloud storage, Hypertext Transfer Protocol (HTTP) servers, and local directories. However, “fsspec” is not compatible with Python 2, as it relies on modern asynchronous I/O capabilities and streamlined process management introduced in Python 3. For example, both “Xarray” and “fsspec” are utilized in CyberWater’s CMIPAgent to automatically access and retrieve data from the Coupled Model Intercomparison Project (CMIP), enabling seamless integration of distributed datasets for Earth science research and applications.
Geospatial data analysis is also an important component of geoscience research. CyberWater integrates the Geographic Resources Analysis Support System (GRASS) geographic information system (GIS), which, through the Python upgrade, has been upgraded from version 7.2 to 8.3, introducing several significant advancements. GRASS GIS 7.2 had limited or no native support for Network Common Data Form (NetCDF) files, complicating work with multidimensional climate and geospatial datasets. In contrast, GRASS GIS 8.3 offers seamless import and export of NetCDF files. It also supports direct data import from URLs, something that 7.2 could not reliably do, which streamlines data workflows. Furthermore, certain GRASS GIS–compatible open-source modules, such as “r.slope.stability”, require GRASS GIS version 7.8 or later for full Python 3 compatibility. These improvements in GRASS GIS 8.3 enhance CyberWater’s ability to handle large geoscience data and expand its applicability across diverse geoscience studies.
Overall, upgrading VisTrails from Python 2 to Python 3 improves performance through faster data processing and optimized dependencies, improves stability through updated dependencies, and provides key new features that would never be available with the Python 2 version.
As a result of these compounded issues and potential advantages, it became clear that VisTrails had become increasingly unsustainable. The CyberWater project saw security vulnerabilities emerge, logical errors appear more frequently, and installation become increasingly difficult due to the dwindling availability of Python 2 libraries and dependencies. To address these issues and take advantage of the benefits of upgrading the framework, we launched an effort to modernize VisTrails and ensure its long-term viability in the context of CyberWater.
The modernization project followed a three-phase approach.
  • Upgrading the VisTrails source code from Python 2 to Python 3.
  • Updating VisTrails’ external dependencies to Python 3-compatible libraries.
  • Updating CyberWater’s own source code to be fully compatible with Python 3.
Most existing work focuses on categorizing software modernization but offers few examples or guidelines on when to apply certain specific techniques. This paper focuses on the sustainable computing perspective of this modernization effort, detailing the technical, organizational, and sustainability challenges involved in maintaining and evolving large-scale scientific software that lacks upstream support. Specifically, the following sections outline the technical upgrade process, the tools and methods used, the testing strategies employed, and the outcomes achieved. We also share key lessons learned regarding sustainable software evolution, with particular attention to the challenges posed by maintaining large-scale open-source software systems in the domain of scientific cyberinfrastructure. The goal is not only to showcase the modernization of the CyberWater-VisTrails project within the context of sustainable computing but also to provide strategies for determining when to apply specific modernization methods.

CyberWater Background

From a software perspective, CyberWater extends VisTrails by introducing two new packages: msm (Meta Scientific Modeling) and AgentTools. In VisTrails, a package is a set of “Python classes—called modules—stored in one or more files”. These modules define visualization and simulation code that users can incorporate into VisTrails workflows, which are sequences of interconnected modules forming an execution pipeline. Both the msm and AgentTools packages offer numerous modules users can utilize in their workflows, as showcased in Section 5.
The msm framework is divided into three sets of modules. The first set consists of data agents, which are CyberWater modules that download data from various online sources. The second set consists of model agents, which take workflow inputs, convert them into the file format required by the user’s model, and then execute the model. Once execution is complete, the model agent imports the results back into the workflow and makes them available as outputs for subsequent visualization or further processing. The third set of modules forms the system core.
The AgentTools framework includes the Generic Model Agent Toolkit, the Static Parameter Agent Toolkit, the High-Performance Computing (HPC) Launch Agent, and integration engines. The Generic Model Agent Toolkit enables users to create their own model agents via the VisTrails GUI workflow management system without writing code. The Static Parameter Agent Toolkit provides modules to help users prepare parameter files for their models. The HPC Launch Agent supports the execution of computationally intensive jobs on remote high-performance computing platforms on demand. Finally, the integration engines allow CyberWater to interoperate with external systems such as MATLAB 2022b and GRASS GIS 8.3. The overall structure of the CyberWater framework is shown in Figure 1.
Figure 1. Structure of the CyberWater-VisTrails system illustrating how the CyberWater framework interacts with VisTrails and how its components are organized. The arrows within the figure are color-coded for readability.
The remainder of the paper is organized as follows. Section 2 overviews related works. Section 3 presents our work and experience on the maintenance and upgrading of CyberWater and VisTrails, including used packages, in a systematic way. Section 4 reports our testing and validation of the software upgrades. In Section 5, we provide two real-world CyberWater application cases to illustrate the CyberWater system and its upgrading. Finally, Discussion and Conclusions are given in Section 6 and Section 7, respectively.

3. Transition and Upgrading of Language, Software and Library

CyberWater was developed for both Python 2 and Python 3 but began facing maintenance issues after support for VisTrails ceased. Initially, the CyberWater team manually updated dependencies to keep VisTrails functional, but this approach became increasingly difficult over time as more dependencies became outdated. As a result, the functionality and security of the entire software were at risk, prompting the need for a larger upgrade process to resolve these issues and mitigate future risks.
The systematic upgrade of the CyberWater-VisTrails system was divided into three main components: upgrading Python, updating the associated Python libraries for VisTrails, and then transitioning CyberWater’s source code to Python 3. These three components had undergone the most significant changes over the five-year period when VisTrails ceased maintenance and CyberWater started its own upgrade project. These three components also formed the backbone of the CyberWater-VisTrails system. While other libraries were updated as well, their modifications were relatively minor in comparison. Table 1 summarizes the major version changes, including the starting version, its release date, the updated version, and the updated version’s release date. These changes highlight the necessity of the updates, reflecting the lack of maintenance in the years between the conclusion of the VisTrails project and the CyberWater team assuming responsibility for its upkeep. A full overview of upgrade issues is provided in Table A1 in Appendix A.
Table 1. Framework Updates in VisTrails Maintenance.
At the time of the upgrade, the release date of each framework version was a key factor in deciding which version to adopt, especially for the Python upgrade. Python was the first component updated, and the latest version was chosen due to security concerns, as Python 2.7 no longer receives updates. Subsequent framework choices balanced recency against compatibility with Python 3.9.10. For example, PyQt4 supports only Python 2.X, while PyQt5 supports Python version 3.7 and above. Visualization Toolkit (VTK) was upgraded to its latest version after extensive integration testing, as it supports all Python 3 versions. Finally, GRASS GIS 7.2, which relied on Python 2.7, was also upgraded to the most recent version to ensure compatibility with the overall CyberWater-VisTrails application. Although security improvements were a secondary motivation for these upgrades, compatibility with Python 3.9.10 was the primary driver for all frameworks except Python itself.
Python, PyQt, VTK, and GRASS GIS were the major framework upgrades, but CyberWater-VisTrails also uses 135 Python packages, in addition to PyQt and VTK, that were updated as part of this process. Those package upgrades did not require extensive changes across the CyberWater-VisTrails source code, so only the four frameworks shown in Table 1 are covered here. In total, 2102 files within the CyberWater-VisTrails project were updated, with about 594,299 lines of code added and 424,648 lines deleted. For reference, CyberWater-VisTrails consists of 944,214 lines of code across 2248 files, so about 93.51% of its files and about 62.94% of its source code were affected by this upgrade. These line counts are difficult to interpret precisely: a one-line change made during source translation is recorded as both an addition and a deletion, so the counts neither isolate pure updates nor give a full picture of the effort involved. Even so, the number of files and lines changed shows how widespread the upgrade was.
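The two percentages can be reproduced from the counts quoted above. The short check below is only an arithmetic sketch; it assumes, based on the numbers matching, that the 62.94% figure is the ratio of added lines to total lines.

```python
# Counts quoted in the text.
files_changed = 2102
files_total = 2248
lines_added = 594_299
lines_total = 944_214


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of files touched by the upgrade.
files_share = percent(files_changed, files_total)

# Share of source lines, assuming "affected" means added lines over total lines.
added_share = percent(lines_added, lines_total)

print(files_share, added_share)
```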
The first portion of the software to be upgraded was the VisTrails source code; CyberWater itself makes up only about 441,482 lines of code, or about 46.76% of the entire CyberWater-VisTrails software. The VisTrails upgrade consisted of translating the source code from Python 2 to Python 3, using scripts and documentation to move from PyQt4 to PyQt5, and making various smaller changes for the 136 Python packages that were also updated, such as VTK and the VTK VisTrails package. The second step addressed the CyberWater source code and followed very similar steps: translating the source code from Python 2 to Python 3, using scripts and documentation to move from PyQt4 to PyQt5, and updating the usage of various smaller Python packages within the source code. No static analysis tools were used in this upgrade process. Because only minor testing was performed during the upgrade, a more thorough functional testing phase was conducted after the whole conversion, as described in Section 4.
At the start of this upgrade, little support was available for updating the VisTrails project. Although the CyberWater team had access to publicly available user documentation and the VisTrails GitHub repository, neither source had been updated since 2017. As a result, the transition process relied heavily on CyberWater’s own development resources, including its internal development documentation and the project’s private GitHub repository. While CyberWater is an open-source project, the source code for the current beta version has not been officially released. Prior to this upgrade, the CyberWater support framework was limited to maintaining the CyberWater source code (specifically, the AgentTools and msm VisTrails packages). Through this upgrade process, the CyberWater support framework was expanded to include the VisTrails source code as well, enabling better overall maintenance and preventing the software from falling behind on future updates.

3.1. Upgrading from Python 2 to Python 3

The upgrade from Python 2 to Python 3 was carried out in two stages: automated upgrades using linters and official Python migration scripts, followed by manual inspections to resolve issues not handled by the automated tools.

3.1.1. Automation Tools

The transition from Python 2 to Python 3 primarily involved source translation. Since Python 3 retains the same fundamental programming structure, the upgrade process was relatively straightforward. Python provides an automated translation tool, 2to3.py, which converts Python 2 code into Python 3 syntax. The 2to3.py script ships with Python 3 installations up to version 3.12, and its usage is documented within Python’s official developer documentation (https://docs.python.org/3.12/library/2to3.html (accessed on 27 October 2025)). Many language-level changes were handled automatically, such as converting print statements into calls to the print() function with the required parentheses. However, one of the more challenging issues encountered involved string representation changes.

3.1.2. Manual Inspection for Python

After running 2to3.py, a manual inspection was performed to address common upgrade issues that were not automatically resolved by the tool. In Python 2, the str type was used to represent both text and binary data. In Python 3, these were split into str (immutable Unicode sequences) and bytes (immutable binary sequences). This change led to issues where code that previously relied on implicit type compatibility now required explicit conversions. VisTrails relied in many places on bytes and str being the same Python type. Figure 2 shows one example, in the object creation method of the FileLocator class.
Figure 2. Source code for the FileLocator class showcasing its creation method.
The example shown in Figure 2 is the definition of the FileLocator class that was created for VisTrails. Its creation method determines which specific derived FileLocator class to provide based on the file extension of the given filename. One key challenge arose here: filename variables that were treated as str in Python 2 are interpreted as bytes in Python 3. Although bytes provides lower() and endswith(), these methods operate only on bytes, so calls that compared a bytes filename against a str extension raised a TypeError at run time. The resolution involved converting bytes to str at the appropriate points, either directly or by modifying the upstream library that provided the filenames.
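The kind of conversion described here can be sketched as follows. This is not the VisTrails FileLocator code from Figure 2; the helper names to_text and locator_kind, the extensions checked, and the returned labels are hypothetical and serve only to illustrate decoding bytes before using str operations.

```python
def to_text(name, encoding="utf-8"):
    """Return name as str, decoding it first if it arrived as bytes."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def locator_kind(filename):
    """Choose a locator label from the file extension (hypothetical helper)."""
    text = to_text(filename).lower()
    if text.endswith(".vt"):
        return "zipped"
    if text.endswith(".xml"):
        return "xml"
    return "unknown"


# Without the conversion, Python 3 refuses to mix bytes and str.
try:
    b"workflow.VT".endswith(".vt")
    mixing_error = None
except TypeError as exc:
    mixing_error = type(exc).__name__

print(locator_kind(b"workflow.VT"), mixing_error)
```

Decoding once at the boundary, where the filename first enters the code, keeps the rest of the logic working on str only.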
These types of modifications were systematically identified and incorporated into the upgrade process. Once the Python 3 migration was complete, attention shifted to upgrading the libraries, which had also become outdated over the five-year period.

3.2. Upgrading from PyQt4 to PyQt5

The PyQt upgrade required the most extensive effort throughout the entire process. The VisTrails application relies on PyQt as its GUI framework, and the transition from PyQt4 to PyQt5 introduced substantial changes that fundamentally affected how the application operates. The upgrade process consisted of two major components: (1) the use of custom automation tools developed by the CyberWater project team, and (2) manual code inspections to resolve more complex issues that could not be automated.

3.2.1. Custom Automation Tools

Updating the PyQt framework required more extensive modifications, as no automated tools like 2to3.py existed for this transition. The primary approach involved referencing official documentation, identifying deprecated methods and classes, and making necessary changes manually. Given time constraints, the goal was to maintain the existing software structure while modernizing the codebase.
One of the most significant changes involved the restructuring of PyQt’s class hierarchy and module organization. In PyQt4, most widget-based objects were found in the QtGui module, whereas in PyQt5, these were moved to QtWidgets. This meant that import statements such as
  • from PyQt4 import QtGui, QtCore
had to be updated to
  • from PyQt5 import QtGui, QtWidgets, QtCore
To automate repetitive replacements, a simple script was created to update import statements and object initializations. The script went through the project line by line and, wherever a line referenced a PyQt class that now lives in QtWidgets, changed the reference from “QtGui.<class_name>” to “QtWidgets.<class_name>”. However, not all changes could be handled this way.
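A line-by-line rewrite of this kind might look like the sketch below. It is not the CyberWater team's script: the function name upgrade_line is hypothetical, and WIDGET_CLASSES is a small assumed subset of the classes that moved to QtWidgets, where a real run would need the full list from the PyQt5 documentation.

```python
import re

# Assumed subset of classes that moved from QtGui (PyQt4) to QtWidgets (PyQt5).
WIDGET_CLASSES = {"QWidget", "QDialog", "QPushButton", "QLabel", "QMainWindow"}

QTGUI_REF = re.compile(r"\bQtGui\.(\w+)")
IMPORT_PREFIX = "from PyQt4 import "


def upgrade_line(line):
    """Rewrite one source line (without its newline) from PyQt4 to PyQt5."""
    stripped = line.strip()
    if stripped.startswith(IMPORT_PREFIX):
        names = [n.strip() for n in stripped[len(IMPORT_PREFIX):].split(",")]
        # Widgets now live in QtWidgets, so import it alongside QtGui.
        if "QtGui" in names and "QtWidgets" not in names:
            names.insert(names.index("QtGui") + 1, "QtWidgets")
        indent = line[: len(line) - len(line.lstrip())]
        return indent + "from PyQt5 import " + ", ".join(names)

    def swap(match):
        cls = match.group(1)
        module = "QtWidgets" if cls in WIDGET_CLASSES else "QtGui"
        return module + "." + cls

    return QTGUI_REF.sub(swap, line)


source = [
    "from PyQt4 import QtGui, QtCore",
    "button = QtGui.QPushButton('OK')",
    "font = QtGui.QFont()",
]
upgraded = [upgrade_line(line) for line in source]
print("\n".join(upgraded))
```

Classes that stayed in QtGui, such as QFont, are left untouched, which is why the lookup set matters more than the regular expression.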

3.2.2. Manual Inspection for PyQt

Another significant modification involved PyQt’s signal-slot mechanism, which shifted entirely to the new-style application programming interface (API). Qt’s signals and slots provide a simple framework for implementing the observer software design pattern. According to PyQt’s reference guide, a signal is owned by a GUI widget and can emit event information to other widgets within the application. A signal delivers this information by being connected to another widget’s slot, which is a function executed whenever it receives an emitted signal. In PyQt4, signals and slots could easily be created on the spot with what is called the old-style API. Figure 3 shows the old style, connecting the button’s signal clicked() to the slot onClicked().
Figure 3. Old style of connecting the button’s signal clicked() to the slot onClicked() by using QObject’s method connect() and providing the instance of QButton, a reference to the signal clicked(), and the slot onClicked() as parameters.
In this style, QObject’s public method connect() was used to connect an instance of a QButton to this object’s onClicked() slot through the button’s signal clicked(). The problem with the old style was that it did not follow Python conventions, since it relied on macros and non-initialized objects. PyQt4 also offered a new style of signals, of type pyqtSignal, starting with PyQt 4.5, whereas PyQt5 requires this new-style signal-slot connection, as shown in Figure 4.
Figure 4. New style of connecting the button’s signal clicked() to the slot onClicked() by directly calling the connect() method of clicked and providing the slot onClicked() as a parameter. The new style is a more direct, clear, and Python-style way of connecting Qt signals and slots.
The new style makes clear which signal, method, and slot belong to which object. Starting from the left, there is button, an instance of QButton. Next, the button’s signal clicked() is accessed. All signals have a method called connect(), which takes the slot to connect to as its parameter. Finally, the slot onClicked() is passed as that parameter, connecting it to the button’s signal clicked().
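The shape of the new-style call can be illustrated without Qt at all. The sketch below uses a tiny pure-Python stand-in for a signal (the Signal and Button classes are hypothetical and are not PyQt classes) so that the observer pattern behind button.clicked.connect(...) can be run anywhere; the old-style PyQt4 form appears only as a comment.

```python
class Signal:
    """Minimal stand-in for a Qt signal: stores slots and calls them on emit."""

    def __init__(self):
        self._slots = []

    def connect(self, slot):
        """Register a callable to be run whenever the signal is emitted."""
        self._slots.append(slot)

    def emit(self, *args):
        """Notify every connected slot, in connection order."""
        for slot in self._slots:
            slot(*args)


class Button:
    """Stand-in for a button widget that owns a 'clicked' signal."""

    def __init__(self):
        self.clicked = Signal()

    def click(self):
        self.clicked.emit()


received = []


def on_clicked():
    received.append("clicked")


button = Button()

# Old-style PyQt4 form, shown for comparison only:
#   QtCore.QObject.connect(button, QtCore.SIGNAL("clicked()"), on_clicked)
# New-style form: the signal is an attribute of its owner.
button.clicked.connect(on_clicked)

button.click()
print(received)
```

Because the signal is an ordinary attribute, a misspelled signal name fails immediately with an AttributeError instead of silently connecting nothing, which is one practical advantage of the new style.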
While many of these updates were straightforward replacements, some required careful inspection. VisTrails had custom signal implementations that were sometimes difficult to distinguish from built-in PyQt signals. Thus, a systematic verification process was needed to ensure correctness. Figure 3 and Figure 4 can be used as an example of this process. The first step was to update the line shown in Figure 3 to the style shown in Figure 4. Afterwards, PyQt’s documentation was checked to see whether the object button had a defined signal clicked. If so, no further changes were required. If not, the button’s class would need to be updated with a new attribute clicked of the pyqtSignal type. In this example, button is of type QButton, and QButton does have a defined clicked signal, so no update was required. The process was further complicated because the CyberWater team did not have VisTrails development documentation, as VisTrails expected outside developers only to extend the software, not to maintain it.
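The first step of this verification process, rewriting an old-style connection into the new style, lends itself to a text substitution for the simplest cases. The pattern below is a hypothetical sketch rather than the tooling used in the upgrade; it assumes the sender and slot are plain dotted names and the signal string has the form "name(args)". The follow-up check against the PyQt documentation remains a manual step.

```python
import re

# Matches QObject.connect(sender, SIGNAL("name(args)"), slot) in its simplest form.
OLD_STYLE = re.compile(
    r"""(?:QtCore\.)?QObject\.connect\(\s*
        (?P<sender>[\w.]+)\s*,\s*
        (?:QtCore\.)?SIGNAL\(\s*["'](?P<signal>\w+)\([^"']*\)["']\s*\)\s*,\s*
        (?P<slot>[\w.]+)\s*\)""",
    re.VERBOSE,
)


def to_new_style(line):
    """Rewrite a simple old-style connect call as sender.signal.connect(slot)."""
    return OLD_STYLE.sub(r"\g<sender>.\g<signal>.connect(\g<slot>)", line)


old = 'QtCore.QObject.connect(button, QtCore.SIGNAL("clicked()"), self.onClicked)'
new = to_new_style(old)
print(new)
```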

3.3. Upgrading from VTK 5 to VTK 9

Unlike Python and PyQt, the VTK upgrade presented unique challenges because VisTrails did not use VTK directly in its core source code but instead through a VisTrails package that acted as a wrapper for the VTK Python package. The VTK VisTrails wrapper is a set of Python classes that encapsulate the functionality of VTK code. This VTK VisTrails wrapper acted as an interface for end-users to execute VTK code. This wrapper, originally written for VTK 5.10, became outdated when upgraded to VTK 9.2.
The primary issue was that the wrapper imported and initialized all available VTK classes, including abstract classes that were never meant to be instantiated. Since abstract classes cannot be initialized, the upgrade required selectively disabling them. This was achieved either by disabling entire root classes to automatically disable their derived classes or by manually disabling specific leaf classes that could not be initialized.
Classes are disabled in the VisTrails VTK package by adding them to a few hard-coded lists. When the VTK package is first initialized, the wrapper filters the classes to be imported through these lists. The wrapper started with a small number of classes disabled because of abstraction or types unsupported in Python; over time these lists grew and now take up hundreds of lines within the wrapper code itself.
With every update, the process starts by checking whether a class itself can be instantiated. If it cannot, it is added to the list. All of its subclasses are then checked as well, since a parent class may be abstract while its child classes are not. The main problem lies in the maintenance itself: for every VTK update, the wrapper must be tested to make sure that no new classes need to be disabled. The amount of testing depends on how large the update is, and the process could be improved to ease further maintenance.
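The check described above (try to instantiate a class, record it if that fails, then descend into its subclasses) could be automated along the following lines. This is a sketch under stated assumptions, not the VisTrails wrapper: it uses Python's abc module to build stand-in classes instead of real VTK classes, and the names Algorithm, Source, SphereSource, can_instantiate, and find_disabled are all hypothetical.

```python
from abc import ABC, abstractmethod


class Algorithm(ABC):
    """Stand-in for an abstract root class."""

    @abstractmethod
    def update(self):
        ...


class Source(Algorithm):
    """Still abstract: update() is not implemented here."""


class SphereSource(Source):
    """Concrete leaf class that can be instantiated."""

    def update(self):
        return "sphere"


def can_instantiate(cls):
    """Return True if cls() succeeds, False if the class refuses to be created."""
    try:
        cls()
    except (TypeError, NotImplementedError):
        return False
    return True


def find_disabled(root):
    """Walk root and its descendants; return names of non-instantiable classes."""
    disabled, stack, seen = [], [root], set()
    while stack:
        cls = stack.pop()
        if cls in seen:
            continue
        seen.add(cls)
        if not can_instantiate(cls):
            disabled.append(cls.__name__)
        # A parent may be abstract while its children are not, so always descend.
        stack.extend(cls.__subclasses__())
    return sorted(disabled)


print(find_disabled(Algorithm))
```

Generating the list at start-up in this way, rather than hard-coding it, is one possible direction for reducing the per-release maintenance described above.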
While some classes could be adapted using enumeration-based workarounds, many abstract classes had to be outright disabled. This approach, while effective, was not ideal for long-term maintainability. Future efforts should focus on reengineering the automation processes within the VTK wrapper to reduce the time and effort required for subsequent upgrades.

3.4. CyberWater Upgrade Process

With the VisTrails upgrades completed, the next phase focused on updating CyberWater. Since CyberWater was originally designed and developed with Python 3 in mind, most of its components were already compatible with the upgraded VisTrails system. Consequently, the primary goal of this phase was to ensure that all relevant updates made in VisTrails were consistently applied to CyberWater. The CyberWater upgrade involved manual updates across its packages with no automated tools used. The process consisted of two main inspection stages: one for the CyberWater source code itself and another for CyberWater’s integration with GRASS GIS.

3.4.1. Manual Inspection: CyberWater

Although CyberWater was mostly developed with Python 3 in mind, many small updates were needed to fully upgrade the system. The components that required updating overlapped with changes already made across VisTrails. For example, some print() statements within CyberWater needed the now-enforced parentheses. CyberWater’s use of PyQt is small, but still large enough to require changes to import statements and to the usage of the signal-slot mechanism. Because these changes had already been encountered during the update of the VisTrails system, they were resolved by applying the solutions developed there.

3.4.2. Manual Inspection: CyberWater Integration with GRASS GIS

A notable update within the CyberWater source code was the integration of GRASS GIS with CyberWater. GRASS GIS is a “computational engine for raster, vector, and geospatial processing”. CyberWater supports workflows that utilize GRASS GIS as the engine for their models. To do this, the user adds to their workflow a module named GISEngine, which is defined within the CyberWater source code. The purpose of this module is to initialize GRASS GIS and set up the connection between CyberWater and GRASS GIS.
Originally, CyberWater utilized GRASS GIS 7.2, which used Python 2.7 as its interpreter. The update to GISEngine was straightforward: the module needed to update its paths, change the Python interpreter it was using, and change the way GRASS GIS was initialized. The noteworthy change was in the initialization of GRASS GIS. A string parameter that GRASS GIS 7.2 required in its setup function, representing the starting output from GRASS GIS, was no longer needed in GRASS GIS 8.3. The other changes mainly involved updating operating system environment variables that GRASS GIS uses on start-up. Examples of changed environment variables include the variable that specifies which batch file to start GRASS GIS with and the variable that defines the location of the GRASS GIS Python 3.9 installation.
All these changes were made within grass_GIS.py, which defines the GrassGIS class responsible for initializing the GRASS GIS session. Originally, the constructor of GrassGIS used ‘grass72.bat’ as the GRASS GIS launch script; for GRASS GIS 8.3 this became ‘grass83.bat’. The environment variable PYTHONHOME also required updating from ‘Python27’ to ‘Python39’. Finally, the GRASS GIS setup script initialization function was updated from ‘gsetup.init(gisbase, gisdb, location, mapset)’ to ‘gsetup.init(gisdb, location, mapset)’, where gisbase represents the path to the GRASS GIS installation, and gisdb, location, and mapset specify the GRASS GIS database path, location name, and mapset name, respectively, to be used in the session.
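The three version-dependent settings named above can be gathered in one place, which makes the difference between the two GRASS GIS versions easy to see. The sketch below is illustrative only: grass_settings is a hypothetical helper, not part of grass_GIS.py, and it merely records the values quoted in the text without importing or calling GRASS GIS.

```python
def grass_settings(version):
    """Return the version-dependent GRASS GIS settings quoted in the text."""
    if version == "7.2":
        return {
            "launch_script": "grass72.bat",
            "PYTHONHOME": "Python27",
            # Argument names passed to gsetup.init() for this version.
            "init_args": ("gisbase", "gisdb", "location", "mapset"),
        }
    if version == "8.3":
        return {
            "launch_script": "grass83.bat",
            "PYTHONHOME": "Python39",
            # gisbase is no longer passed to gsetup.init().
            "init_args": ("gisdb", "location", "mapset"),
        }
    raise ValueError("unsupported GRASS GIS version: " + version)


old, new = grass_settings("7.2"), grass_settings("8.3")
changed = sorted(key for key in new if new[key] != old[key])
print(changed)
```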
The upgrade process successfully modernized the VisTrails system, bringing it in line with current Python standards and ensuring compatibility with contemporary libraries. The insights gained from this upgrade will inform future maintenance efforts, particularly in automating complex transitions such as the VTK wrapper update.

4. Testing and Evaluation

Following the completion of all upgrades, a rigorous testing phase was conducted. Testing the VisTrails upgrade primarily relied on use-case scenarios rather than predefined test cases, as PyQt-based GUI functionality posed challenges for automated testing initially. One major challenge in testing came from the enforcement of a single encapsulating QApplication to structure the entire VisTrails application. This framework made it difficult to isolate functionalities for test cases. Python debuggers such as “pdb” also posed problems, as the application would often fail to initialize within a debugging environment. As a result, only simple unit tests could initially be conducted, and more sophisticated testing had to rely on pre-made workflows and use-case scenarios. Over time, however, debuggers like “pdb” would become more usable as the upgrade process progressed, allowing them to be integrated into the testing workflow.
The primary goal of testing was to identify any functional errors in the software following the upgrade. To achieve this, various example workflows created for the CyberWater-VisTrails beta version were utilized as test workflows. The Python 2 beta version of CyberWater-VisTrails is publicly available on the CyberWaterBeta GitHub repository (https://github.com/cyberwaterproject/CyberWaterBeta (accessed on 27 October 2025)). The Python 3 version of CyberWater-VisTrails will be publicly available by the end of 2025. The testing workflows are publicly available on the CyberWater HydroShare resource (https://www.hydroshare.org/resource/608d220c4f774aaaa7f81426011f6e8f/ (accessed on 21 July 2025)). Since they were originally developed for the Python 2-based beta version, they provided a valuable basis for comparison, highlighting potential issues users might encounter during workflow execution. While VisTrails includes built-in testing functions, these functions primarily evaluate core functionality rather than GUI behavior. Future maintenance efforts should establish comprehensive test cases to systematically validate new upgrades.
The testing environment varied since the testing phase was conducted by multiple developers using different PCs. Individual developers worked on separate machines to simulate a user’s experience with CyberWater-VisTrails. Because Windows is the primary operating system targeted for CyberWater, more extensive testing was performed on Windows 10 and Windows 11, while other operating systems like Ubuntu received less coverage.
Testing also targeted two hardware constraints, 6 GB of RAM and a 2.1 GHz processor, to evaluate whether CyberWater-VisTrails’ minimum hardware requirements needed revision. Performance was assessed against these hardware and operating system conditions, although, as stated earlier, the primary focus of the testing phase was functionality, with performance treated as a secondary objective.
The testing phase was organized into four main categories:
  • Basic Initialization—software startup and GUI responsiveness.
  • Core Functionality—workflow creation, updating, and execution in VisTrails.
  • Advanced Functionality– features such as mashups, database storage of VisTrails objects, and exploration tools.
  • Performance Review—workflow execution speed and initialization efficiency.
All four categories revealed bugs and issues that required resolution, but the Core Functionality and Advanced Functionality phases uncovered the most significant cases worth highlighting. In the following subsections, three key issues encountered during testing and their implications for future maintenance will be highlighted. The first issue arose from Python’s method resolution order (MRO) in class hierarchies with multiple inheritance. The second involved a RuntimeError related to PyQt objects being deleted on the C++ side but still accessed later in Python. The third issue occurred when trying to connect to a remote HPC platform that requires two-factor authentication.

4.1. Issue 1: Python MRO and Multiple Inheritance

Python 3 and PyQt5 enforce stricter usage of super() calls in class initializers, particularly in cases of multiple inheritance. In VisTrails, every input port of a VisTrails module inherits from both PyQt widgets and custom-defined base classes. Because every VisTrails module has a set of input ports, this issue affected nearly every module within the CyberWater-VisTrails application. Simply adding any VisTrails module to a workflow reproduces the error message shown in Figure 5.
Figure 5. Example of an error message caused by MRO issues with PyQt5, showing missing parameters because these parameters had already been incorrectly handled by another parent constructor.
In the example shown in Figure 5, a QCheckBox tries to initialize within its inherited class’s constructor, QConfigurationComboBox, but three required positional arguments are missing. The three required arguments are key, field, and callback_f. The arguments are considered missing because they belong to QConfigurationComboBox’s other parent class, QConfigurationWidgetItem. PyQt5 follows Python’s method resolution order and uses super() calls, which results in QConfigurationWidgetItem’s constructor being invoked before QConfigurationComboBox’s own constructor. Since the QComboBox constructor does not receive the required key, field, and callback_f arguments, the initialization of QConfigurationComboBox fails. Figure 6 illustrates how QConfigurationComboBox invokes both QComboBox and QConfigurationWidgetItem constructors.
Figure 6. QConfigurationComboBox’s initializer before the changes for Python MRO, calling both parent classes’ constructors with the relevant parameters given to each parent.
Any time a VisTrails-defined class inherited from both a PyQt class and a VisTrails class, the error of missing required positional arguments would appear. Previously, initializers of classes that inherited from both a PyQt class and a VisTrails class explicitly called each parent constructor. Figure 6 provides an example of this old style.
Figure 7 showcases the class StandardConstantWidget, a base class for keyboard input from which other VisTrails classes inherit. StandardConstantWidget inherits from two classes: QLineEdit and ConstantWidgetBase. QLineEdit is a simple PyQt widget whose main purpose is to receive keyboard input. ConstantWidgetBase is an abstract class from which other ConstantWidget classes inherit. Originally, StandardConstantWidget’s initializer called both QLineEdit’s and ConstantWidgetBase’s initializers explicitly, passing each its own parameters. The upgrade to Python 3 and PyQt5 necessitated replacing these calls with a single super().__init__() invocation that passes all parameters as keyword arguments, because PyQt5 makes super() calls in all of its class constructors. Consequently, when a class inherits from a PyQt5 class, invoking one parent constructor automatically triggers the remaining parent constructors through the MRO. The older approach of explicitly calling each parent class’s constructor separately is therefore no longer viable for these classes: even if one parent’s constructor is invoked directly, all constructors in the hierarchy are still executed in that single call, and those later in the MRO do not receive their arguments. All VisTrails classes inheriting from PyQt5 must therefore use super().__init__() to ensure that both PyQt and VisTrails constructors are initialized with the correct arguments. While functional, this change also altered the order in which parent initializers run, potentially introducing unforeseen issues, especially if inherited classes define methods with the same name. To illustrate, the updated StandardConstantWidget initializer is given in Figure 8.
Figure 7. StandardConstantWidget’s initializer before the changes for Python MRO, calling both parent classes’ constructors with the relevant parameters given to each parent.
Figure 8. StandardConstantWidget’s initializer after the changes for Python MRO, calling super().__init__() once with all parameters provided to the super() call.
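The behavior described in this subsection can be reproduced without PyQt. The sketch below is not the VisTrails source: `FakeQLineEdit` is a hypothetical stand-in for a PyQt5 widget whose constructor makes a cooperative super() call, and the parameter names `param` and `parent` are assumptions chosen for illustration. It contrasts the old explicit style, which fails with a missing-argument error, with the single super().__init__() call using keyword arguments.

```python
class FakeQLineEdit:
    """Stand-in for a PyQt5 widget: takes `parent`, forwards unused keywords."""

    def __init__(self, parent=None, **kwargs):
        super().__init__(**kwargs)  # cooperative call, as PyQt5 classes do
        self.parent = parent


class ConstantWidgetBase:
    """Stand-in for a VisTrails base class with a required argument."""

    def __init__(self, param, **kwargs):
        super().__init__(**kwargs)
        self.param = param


class OldStyleWidget(FakeQLineEdit, ConstantWidgetBase):
    def __init__(self, param, parent=None):
        # Explicit parent calls: the first call already continues along the
        # MRO into ConstantWidgetBase.__init__, which receives no `param`.
        FakeQLineEdit.__init__(self, parent)
        ConstantWidgetBase.__init__(self, param)


class NewStyleWidget(FakeQLineEdit, ConstantWidgetBase):
    def __init__(self, param, parent=None):
        # One cooperative call; keyword arguments travel along the MRO.
        super().__init__(parent=parent, param=param)


try:
    OldStyleWidget(param="42")
    old_style_error = None
except TypeError as exc:
    old_style_error = str(exc)

widget = NewStyleWidget(param="42")
mro_names = [cls.__name__ for cls in NewStyleWidget.__mro__]
```

Here `old_style_error` holds a "missing 1 required positional argument" message analogous to the one in Figure 5, while `NewStyleWidget` initializes both parents correctly.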

4.2. Issue 2: PyQt Object Deallocation and Runtime Errors

Categorized as a core functionality issue, a long-standing problem in VisTrails involved accessing objects that had already been garbage-collected. It affected not only the new Python 3 version of CyberWater-VisTrails but also the original Python 2 version. This problem primarily affected classes inheriting from PyQt. One notable instance occurred when switching between VisTrails workflows containing overlapping Connection IDs. VisTrails attempted to reuse existing Connection objects but failed to update the associated port items, leading to invalid memory access. Figure 9 shows the resulting error in VisTrails’ console output.
Figure 9. Updating connection objects before the changes to update port items causes a “wrapped C/C++ object has been deleted” error.
VisTrails workflows consist of modules and connections between the modules. Each module encapsulates visualization and simulation code that users can incorporate into workflows, while connections serve as the primary mechanism for passing data between modules. Ports act as the connection endpoints. An output port serves as the starting point of a connection, while an input port serves as its endpoint. The fix for invalid memory access with ports involved explicitly refreshing ports during workflow transitions.
To illustrate in more detail, Figure 10 shows the loop that updated the Connection objects.
Figure 10. Updating connection objects before changes to update port items only updated the connecting modules.
VisTrails originally updated these Connection objects only partially. The modules that a Connection object links were updated within the object, but its ports were not. Because of this oversight in the original development of VisTrails, the Connection object later tried to access deleted port items when the scene was updated. To resolve this issue, code was added to update these port items whenever common connections are updated. The update can be seen in Figure 11.
Figure 11. Updating connection objects after changes to update port items with the addition of updating the respective source and destination port items.
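The difference between the partial and the complete update can be illustrated with a small simulation. This is a sketch under stated assumptions, not the VisTrails code: `PortItem`, `ModuleItem`, `ConnectionItem`, and `switch_workflow` are hypothetical names, and the `deleted` flag merely mimics a PyQt wrapper whose underlying C++ object has been destroyed.

```python
class PortItem:
    def __init__(self, name):
        self.name = name
        self.deleted = False  # set when the (simulated) C++ object is destroyed

    def position(self):
        if self.deleted:
            # Mimics the RuntimeError PyQt raises for a destroyed object
            raise RuntimeError("wrapped C/C++ object of type PortItem has been deleted")
        return self.name


class ModuleItem:
    def __init__(self, name):
        self.output_port = PortItem(name + ".out")
        self.input_port = PortItem(name + ".in")


class ConnectionItem:
    def __init__(self, src, dst):
        self.rebind(src, dst, refresh_ports=True)

    def rebind(self, src, dst, refresh_ports):
        self.src_module, self.dst_module = src, dst
        if refresh_ports:  # the added step: also refresh the port items
            self.src_port, self.dst_port = src.output_port, dst.input_port

    def endpoints(self):
        return (self.src_port.position(), self.dst_port.position())


def switch_workflow(connection, refresh_ports):
    # Port items of the previous scene are destroyed when the scene is rebuilt
    connection.src_port.deleted = True
    connection.dst_port.deleted = True
    connection.rebind(ModuleItem("A"), ModuleItem("B"), refresh_ports)
    return connection.endpoints()


try:
    switch_workflow(ConnectionItem(ModuleItem("A"), ModuleItem("B")), refresh_ports=False)
    partial_update_error = None
except RuntimeError as exc:
    partial_update_error = str(exc)

fixed_endpoints = switch_workflow(
    ConnectionItem(ModuleItem("A"), ModuleItem("B")), refresh_ports=True
)
```

With the partial update the reused connection still points at destroyed port items and fails; refreshing the ports alongside the modules removes the stale references.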
Another option would have been to redesign how VisTrails reuses common connections. However, such reuse is already a rare occurrence. For this error to arise, two open workflows must contain identical connection IDs. VisTrails prevents this by assigning IDs through a counter that increments with each new connection. The counter only resets when VisTrails is relaunched. Thus, overlapping IDs can only occur if one workflow is created in one session and another workflow is created in a later session using the exact same sequence of actions (e.g., adding a module, editing it, and creating a connection). When switching between these workflows in a subsequent session, their IDs may overlap. Because this situation is highly unlikely under normal use, the effort required to redesign the ID assignment mechanism was deemed too costly compared to the minimal risk of occurrence.
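The conditions under which IDs overlap can be shown with a toy counter. This is a minimal sketch assuming the counter starts at zero on each launch; the `Session` class and its method are hypothetical and stand in for the VisTrails ID assignment.

```python
class Session:
    """One launch of the application; the ID counter restarts with each launch."""

    def __init__(self):
        self._next_id = 0  # assumed starting value

    def new_connection(self):
        connection_id = self._next_id
        self._next_id += 1
        return connection_id


# Two workflows built within the same session never share connection IDs
session = Session()
workflow_a = [session.new_connection() for _ in range(3)]
workflow_b = [session.new_connection() for _ in range(3)]
same_session_overlap = set(workflow_a) & set(workflow_b)

# The same sequence of actions replayed in two separate sessions does
first_launch, second_launch = Session(), Session()
workflow_c = [first_launch.new_connection() for _ in range(3)]
workflow_d = [second_launch.new_connection() for _ in range(3)]
cross_session_overlap = set(workflow_c) & set(workflow_d)
```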

4.3. Issue 3: Operational Sustainability for Connecting to HPCs with Two-Factor Authentication

The transition of CyberWater-VisTrails to Python 3 revealed areas of the CyberWater codebase that needed updates for reasons beyond Python 3 or library upgrades. The CyberWater HPC module initially assumed that logging into any HPC cluster using Secure Shell (SSH) required only a username and password. However, many HPC clusters have since adopted more secure connection methods, such as two-factor authentication (2FA). To accommodate these changes, the HPC module needed to be updated to support 2FA. To enable 2FA support, a new checkbox was added to the module configuration. Users can check this box to indicate that 2FA is required for their connection.
In the previous implementation, the system attempted to connect using the username and password via the “paramiko” library (see Figure 12), a Python library that implements the SSHv2 protocol. Because “paramiko” is one of the most popular Python packages for SSH support, it was a natural choice for the HPC module. However, if the connection required an additional 2FA passcode, the connection would fail.
Figure 12. One-factor authentication connection method utilizing the “paramiko” Python library.
To resolve this, the system was updated to use the interactive authentication support in the “paramiko” library. This method mimics the terminal login process, where the connection initiates, prompts are received, and responses are sent back. A handler function is passed to manage the interaction. Figure 13 showcases the updated source code that handles interactive authentication.
Figure 13. Updated interactive two-factor authentication sequence utilizing the “paramiko” Python library.
The handler function is invoked when the server sends a prompt (e.g., “Password:”). It checks the prompt for keywords that indicate whether the system needs to provide a password or the 2FA code. Based on the prompt, the appropriate response is sent to the server, as shown in Figure 14.
Figure 14. Handler functions for automatic response to server authentication prompts.
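A prompt handler of the kind described above can be sketched as follows. This is not the CyberWater implementation shown in Figures 13 and 14: the factory name `make_interactive_handler` and the keyword list used to recognize 2FA prompts are assumptions, and real servers may word their prompts differently. The handler follows the callback shape expected by paramiko's `Transport.auth_interactive(username, handler)`, which passes a title, instructions, and a list of `(prompt, echo)` pairs and expects a list of answers in return.

```python
def make_interactive_handler(password, passcode):
    """Build a keyboard-interactive handler that answers password and 2FA prompts."""
    # Assumed keywords; adjust to the prompts issued by the target cluster
    two_factor_keywords = ("passcode", "verification", "duo", "token")

    def handler(title, instructions, prompt_list):
        responses = []
        for prompt, _echo in prompt_list:
            text = prompt.lower()
            if "password" in text:
                responses.append(password)
            elif any(keyword in text for keyword in two_factor_keywords):
                responses.append(passcode)
            else:
                responses.append("")  # unknown prompt: send an empty answer
        return responses

    return handler


handler = make_interactive_handler(password="secret", passcode="123456")
answers = handler("", "", [("Password: ", False), ("Passcode or option (1-3): ", True)])
```

Because the handler is a plain function, it can be exercised against sample prompts before being handed to an SSH transport.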
The 2FA checkbox allows users to choose between using the HPC module in its original form (non-2FA connections) or utilizing the new features for supporting 2FA. These features were tested across various HPC connections with CyberWater-VisTrails workflows. As the checkbox introduces new features, it could raise the possibility of new bugs and design issues, making it essential to include these new features in the testing phase. During the testing phase, a comprehensive validation process was conducted. Fourteen representative workflows were tested, all 91 CyberWater-developed modules were tested, and the majority of VisTrails core functionalities, including the workflow management system, provenance management, explore system, and file import/export system, were systematically verified. In the early stages of testing, many workflows failed. However, these issues were progressively resolved. While the upgrade did not immediately yield a significant improvement in workflow execution performance, the successful transition to Python 3 has laid a modern, robust foundation for future performance optimization efforts.
These issues underscore the importance of not only upgrading software components but also addressing legacy logical errors that persist across versions, as well as operational sustainability for evolving security and access policies for the CyberWater project. Future maintainers must rely on thorough debugging and source code analysis to diagnose and resolve such problems efficiently. Given the limited documentation available for VisTrails, maintainers should prioritize code comments and structured documentation to facilitate future upgrades and troubleshooting.
A survey conducted for the Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information found that, regardless of the intended usage, software engineers ranked source code as the most valuable artifact []. The second most valued artifact was code comments. This finding highlights that class documentation, use case documentation, and other design documents are not always essential for understanding a software system. A developer can often comprehend both the problem and the software itself solely by examining the source code. However, this approach is not the most efficient, as reading well-structured design documents is easier and faster than deciphering the source code that emerges from those documents.

5. CyberWater Application Examples

Two CyberWater workflow examples are provided here to showcase functionalities that are now available with the upgraded Python 3 CyberWater-VisTrails and GRASS GIS 8.3. The first example focuses on the West Branch Susquehanna Watershed, and the second on the Indiantown Run Watershed, both located in Pennsylvania, USA.

5.1. West Branch Susquehanna Watershed Example

In the first example, the workflow includes model coupling simulations using two models with two different datasets for the West Branch Susquehanna (WBS) River Basin, which spans over 17,700 square kilometers. The two coupled models are the Variable Infiltration Capacity (VIC) model [,] version 5.0 (VIC5) and a routing model []. The simulation period covers 15 March 1997 to 15 May 1997. The first simulation uses a complete set of forcing data from the North American Land Data Assimilation System (NLDAS) for the VIC5 model, retrieved automatically via CyberWater’s NLDASAgent module. The NLDASAgent module downloads data from NLDAS; it takes inputs such as username, password, time range, space range, and a list of variables to download forcing data for models like VIC5. The second simulation uses hybrid data retrieved from two different data sources as VIC5’s forcing data. The two data sources are:
  • ERA5 (European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5)
  • NLDAS
The ERA5 atmospheric forcing variables, 2 m air temperature, 10 m U wind component, 10 m V wind component, and surface pressure, were retrieved using the NetCDF4ToDataset module from ERA5’s data links/URLs, while the remaining variables (precipitation, longwave and shortwave radiation, and specific humidity) continue to be sourced from NLDAS. The NetCDF4ToDataset module converts NetCDF4 files into Python dataset classes that other CyberWater modules can utilize. The NetCDF4ToDataset module uses the “fsspec” and “Xarray” Python packages to subset remote NetCDF files based on the specified time range and study area. The use of “fsspec” and “Xarray” functions to subset remote data to be retrieved is critical, because—for example—when using the ERA5 2 m air temperature variable, the full file covering the required period (from the beginning of March 1997 to the end of May 1997) is approximately 4.31 gigabytes. By applying spatial and temporal subsetting through the NetCDF4ToDataset module, the amount of data downloaded is reduced to just 435 KB, significantly optimizing resource usage. This highlights the importance of using the NetCDF4ToDataset module, which is only possible with Python 3, to extract only the necessary data via subsetting.
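To show why subsetting shrinks the download so sharply, the sketch below counts how many grid points of a global 0.25-degree grid fall inside the bounding box used in this example (x from −79.0 to −76.125, y from 40.375 to 42.0) and how many hourly steps fall inside the simulation period. It is a back-of-the-envelope illustration, not the NetCDF4ToDataset implementation: the grid layout (1440 × 721 points on ascending axes starting at −180 and −90) and the helper `index_window` are assumptions, and the sketch does not attempt to reproduce the 435 KB figure, which also depends on data types and file packing.

```python
import math
from datetime import datetime


def index_window(grid_start, step, count, lo, hi):
    """Inclusive (first, last) indices of grid points lying inside [lo, hi]."""
    first = max(0, math.ceil((lo - grid_start) / step))
    last = min(count - 1, math.floor((hi - grid_start) / step))
    return first, last


# Assumed layout of a global 0.25-degree grid (hypothetical, ascending axes)
STEP = 0.25
N_LON, N_LAT = 1440, 721
LON_START, LAT_START = -180.0, -90.0

# Spatial window for the WBS bounding box
lon_window = index_window(LON_START, STEP, N_LON, -79.0, -76.125)
lat_window = index_window(LAT_START, STEP, N_LAT, 40.375, 42.0)
kept_points = (lon_window[1] - lon_window[0] + 1) * (lat_window[1] - lat_window[0] + 1)
spatial_fraction = kept_points / (N_LON * N_LAT)

# Hourly steps in the simulation period versus the March-May file
hours_kept = int((datetime(1997, 5, 15) - datetime(1997, 3, 15)).total_seconds() // 3600) + 1
hours_in_file = int((datetime(1997, 6, 1) - datetime(1997, 3, 1)).total_seconds() // 3600)
```

Under these assumptions only a few dozen of roughly one million grid points are needed per time step, which is consistent in spirit with the reduction from gigabytes to kilobytes reported above.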
The ERA5 data retrieved from the National Center for Atmospheric Research (NCAR) Research Data Archive (https://rda.ucar.edu/ (accessed on 18 July 2025)), with an hourly temporal resolution and an original spatial resolution of 0.25 degrees, are subsetted based on the temporal range and spatial extent of our study area—WBS river basin, allowing only the relevant variables and region to be extracted. These data are then resampled to 0.125 degrees using the Resample module to match the resolution required by the VIC5 model. As illustrated in Figure 15, the complete workflow includes data access, subsetting based on the study area and time range, preparation of the forcing datasets, and coupling the VIC model with the Routing model to generate streamflow. Figure 15a,b highlight the NLDAS and ERA5 data groups used in the workflow, respectively. A comparison of the two simulation outputs is shown in Figure 16, which also includes observed streamflow data from the United States Geological Survey (USGS) at the WBS river basin’s outlet location (USGS Site ID: 01553500), retrieved using the UsgsAgent module. For more information about the data and model agents used in the example, readers are referred to []. The objective of this example is to demonstrate the importance of subsetting large datasets from data sources using upgraded Python 3 through a workflow which accomplishes the tasks of accessing forcing data online, processing and subsetting the data, utilizing the data to execute hydrological model simulations, and evaluating the model simulation results with observations.
Figure 15. Workflow for retrieving forcing datasets and executing VIC5-Routing coupled model simulations using NLDAS and hybrid NLDAS–ERA5 data sources. (a) Extracting NLDAS data using the NLDASAgent module; (b) Retrieving ERA5 data using the NetCDF4ToDataset module with Python 3. The screenshots shown in (a,b) correspond to the “a” and “b” components indicated in the right-hand workflow diagram. The black triangles within each module are used to open a drop-down menu for the module’s options.
Figure 16. Comparison of model-simulated streamflows with two different forcing datasets to the USGS observations at Site ID 01553500 for the West Branch Susquehanna River Basin.

5.2. Walkthrough of West Branch Susquehanna Watershed Example

The first step of the workflow involves defining the temporal and spatial extent of the model using the TimeRange and SpaceRange modules. For this example, the TimeRange’s starting time is “15 March 1997 00:00:00” and the ending time is “15 May 1997 00:00:00”. The SpaceRange module can accept various formats; here, latitude and longitude bounds are specified: x_min = −79.0, x_max = −76.125, y_min = 40.375, y_max = 42.0.
The output of the SpaceRange module defines the case study region and can be passed to any data agents to retrieve data from multiple sources. As shown in Figure 15a,b, the acquisition of data followed different workflows. NLDAS data were acquired using the NLDASAgent module, whereas the ERA5 data were acquired through the NetCDF4ToDataset module. The key distinction is that NLDASAgent specializes in downloading data exclusively from NLDAS, while NetCDF4ToDataset is more general and supports any publicly available NetCDF4 dataset.
For the NLDAS data, a single NLDASAgent module is used to acquire the seven required variables. As shown in Figure 15a, four of these undergo unit conversions to meet the input requirements of the VIC5 model, using either the msmUnitConversion module, which converts one dataset from one unit to another, or the msmDatasetOperation module, which applies mathematical operations to two datasets to generate a new output dataset. These conversions produce a standardized set of seven variables used by the VIC5 models:
  • Pressure [kPa];
  • Radiation Flux Long Wave [W/m2];
  • Radiation Flux Short Wave [W/m2];
  • Temperature [°C];
  • Total Precipitation [mm/h];
  • Water Vapor Pressure [kPa];
  • Wind Speed [m/s].
Figure 17 illustrates the configurations of the NLDASAgent module.
Figure 17. Illustration of the NLDASAgent configurations in the CyberWater workflow in Figure 15a. The black triangles within each module are used to open a drop-down menu for the module’s options.
Using the same spatial and temporal ranges as those retrieved from NLDAS, ERA5 data were downloaded through four separate NetCDF4ToDataset modules, each acquiring a single variable. As with the NLDAS data, these four variables were converted and compiled into a new set of three variables as follows, for input to the VIC5 models.
  • Surface Pressure [kPa];
  • 2-meter temperature [°C];
  • Wind Speed [m/s].
Once the data downloads were configured, the two VIC5 execution modules were set up. Model execution occurs within the VIC5Agent_g CyberWater module, a generated component built on CyberWater’s generic model agent toolkit, which enables execution of any model within the CyberWater framework.
As shown in Figure 15, two VIC5Agent_g instances were implemented: the first (right panel) utilizes data only acquired from NLDAS, while the second (left panel) incorporates both NLDAS and ERA5 inputs. Each VIC5Agent_g requires seven data connections corresponding to input ports.
  • AIR_TEMP: [°C] Air temperature;
  • LWDOWN: [W/m2] Long-wave downwards radiation;
  • SWDOWN: [W/m2] Short-wave downwards radiation;
  • PREC: [mm] Total precipitation per time-step;
  • PRESSURE: [kPa] Atmospheric pressure;
  • VP: [kPa] Water-vapor pressure;
  • WIND: [m/s] Total magnitude of wind speed.
The VIC5 model generates two datasets in this example: baseflow and runoff. After the VIC model has executed, these datasets are passed to the RoutingAgent_g module, which performs hydrologic routing using the Muskingum method. Once routing is completed, a single output is produced: the routed streamflow dataset.
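For readers unfamiliar with the Muskingum method, the sketch below implements its textbook form, in which each outflow is a weighted sum of the current inflow, the previous inflow, and the previous outflow. It is not the RoutingAgent_g code, whose implementation and parameters are not described here; the storage constant `k`, weighting factor `x`, time step `dt`, and the inflow series are hypothetical values chosen only for illustration.

```python
def muskingum_coefficients(k, x, dt):
    """Textbook Muskingum routing coefficients (c0 + c1 + c2 == 1)."""
    denom = k - k * x + 0.5 * dt
    c0 = (0.5 * dt - k * x) / denom
    c1 = (0.5 * dt + k * x) / denom
    c2 = (k - k * x - 0.5 * dt) / denom
    return c0, c1, c2


def route(inflow, k, x, dt):
    """Route an inflow series through one reach; outflow starts equal to inflow."""
    c0, c1, c2 = muskingum_coefficients(k, x, dt)
    outflow = [inflow[0]]
    for previous_in, current_in in zip(inflow, inflow[1:]):
        outflow.append(c0 * current_in + c1 * previous_in + c2 * outflow[-1])
    return outflow


# Hypothetical hydrograph and parameters (k and dt in hours)
inflow = [10.0, 10.0, 30.0, 60.0, 40.0, 20.0, 10.0, 10.0]
routed = route(inflow, k=2.0, x=0.2, dt=1.0)
```

The routed hydrograph peaks later and lower than the inflow, which is the attenuation and delay that channel routing is meant to represent.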
Two separate RoutingAgents are configured for the outputs of the VIC5 model instances, but both generate streamflow outputs that are visualized with the msmShowChart module. This module creates a matplotlib time-series chart based on the average values of the given datasets. The simulated streamflow outputs are compared against observed streamflow data from USGS at the WBS river basin’s outlet, retrieved using the UsgsAgent (Figure 15). For this example, the UsgsAgent requires three configuration values:
  • desired_site_code: 01553500.
  • unit_conversion_factor: 0.0283168466.
  • url: http://waterservices.usgs.gov/nwis/iv/? (accessed on 4 August 2025).
Since the USGS data are already in the format for direct comparison with the model outputs, no additional data handling is required. The output from the UsgsAgent can be directly linked to the msmShowChart module, enabling comparison between observed streamflow and the outputs generated by the two VIC5 model instances (driven by NLDAS and ERA5 inputs). The resulting comparison is illustrated in Figure 16.
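The unit_conversion_factor listed above equals the number of cubic meters in one cubic foot (0.3048³ ≈ 0.0283168466), which suggests that it converts discharge from cubic feet per second to cubic meters per second. That interpretation is an assumption, as are the function name `cfs_to_cms` and the sample values in this minimal sketch.

```python
CFS_TO_CMS = 0.0283168466  # unit_conversion_factor from the UsgsAgent configuration


def cfs_to_cms(values_cfs, factor=CFS_TO_CMS):
    """Convert discharge values from cubic feet per second to cubic meters per second."""
    return [value * factor for value in values_cfs]


# Hypothetical discharge readings in cubic feet per second
converted = cfs_to_cms([1000.0, 2500.0])
cubic_meters_per_cubic_foot = 0.3048 ** 3
```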

5.3. Indiantown Run Watershed Example

In the second example, the CyberWater workflow illustrated in Figure 18a is used to perform two main tasks: geomorphological landform classification using “r.geomorphon,” and streamflow simulation using the Distributed Hydrology Soil Vegetation Model (DHSVM) hydrological model [] driven by the NLDAS meteorological data for the Indiantown Run Watershed. The “r.geomorphon” module, a terrain analysis tool, identifies surface forms based on the shape and arrangement of surrounding terrain features. Meanwhile, the DHSVM is used to simulate streamflow, which is then compared against observations from a USGS station (Site ID: 01572950).
Figure 18. (a) CyberWater workflow for automated geomorphological classification using DEM data and DHSVM simulation; (b) input Digital Elevation Model (DEM) map obtained from the DEMAgent module; (c) geomorphon classification results showing extracted terrain classifications. The black triangles within each module are used to open a drop-down menu for the module’s options.
The “r.geomorphon” module is unavailable in GRASS GIS version 7.2, highlighting the necessity to upgrade to a more recent version, such as GRASS GIS 8.3, to utilize advanced terrain classification capabilities within CyberWater. For this study, the Digital Elevation Model (DEM) data required for terrain analysis and for the DHSVM was directly retrieved online by the DEMAgent module for the Indiantown Run Watershed. The DEMAgent is another data agent like the NLDASAgent. DEMAgent retrieves DEM data online through the Shuttle Radar Topography Mission (SRTM) survey. This watershed covers an area of approximately 14 km2 and is located near Harper Tavern, Pennsylvania. All GIS operations, including the execution of the new function of “r.geomorphon”, were performed through the GISRunModuleAgent, which interfaces with the GISEngine module to initialize and manage the GRASS GIS environment as described in []. Figure 18b presents the original DEM used as input, while Figure 18c displays the resulting landform classification generated by the “r.geomorphon” tool for analysis.
For the geomorphon analysis, the module was configured with the following parameters: search = 9, skip = 0, flat = 0.5, and dist = 0. These settings are optimized for 30 m resolution DEMs to balance sensitivity and generalization. They allow accurate detection of key geomorphic features such as ridges, valleys, and slopes, which are critical for hydrological modeling and watershed characterization. Ridges often define watershed divides, valleys represent areas of flow accumulation, and slopes determine the movement of surface runoff.
Parallel to the terrain classification, the DHSVM was executed to simulate streamflow for the watershed using NLDAS input data. The simulation was conducted for the period of 1 January 2007 to 1 June 2007. Figure 19 presents a comparison between the simulated DHSVM streamflow and the observed USGS streamflow data. This comparison demonstrates the integrated capability of CyberWater to automate both terrain-based analysis and hydrological model evaluation within a unified workflow.
Figure 19. Comparison of simulated DHSVM streamflow and observed USGS streamflow at Site ID 01572950.
The integration of the geomorphon landform classification included in GRASS GIS version 8.3 with the hydrological modeling workflow in CyberWater enhances the user’s ability to further automate environmental analyses along with the modeling simulations. The absence of advanced terrain analysis tools such as “r.geomorphon” in GRASS GIS version 7.2 highlights the necessity of upgrading to GRASS GIS 8.3, and therefore to Python 3, to fully leverage CyberWater’s functionalities.

5.4. Walkthrough of Indiantown Run Watershed Example

The first step of the workflow involves defining the temporal and spatial range of the model using the TimeRange and SpaceRange modules. The starting time for this example is “1 January 2007 00:00:00” and the ending time is “1 June 2007 00:00:00”. The spatial domain is defined as: x_min = −76.75, x_max = −76.5, y_min = 40.375, and y_max = 40.5.
With the temporal and spatial ranges defined for the simulation, the next phase of the workflow is data acquisition and data handling. Three data agents are used in this workflow—the NLDASAgent, UsgsAgent, and DEMAgent modules—each described earlier. While all share the same case study, they acquire different datasets for use with the models within the workflow.
In this example, the NLDASAgent acquires five variables directly from NLDAS. Through msmUnitConversion modules, two of these variables are converted into different units. In addition, two new variables are derived from the original five using the msmDatasetOperation module. These operations yield a total of six variables for use in the models.
  • Radiation Flux Long Wave [W/m2];
  • Radiation Flux Short Wave [W/m2];
  • Temperature [K];
  • Total Precipitation [m/h];
  • Wind Speed [m/s];
  • Relative Humidity [kg/kg].
The configuration of the NLDASAgent modules is shown in Figure 20.
Figure 20. Indiantown Run Watershed workflow showcases the NLDASAgent configurations within the NLDASAgent Data sub-workflow (group) shown in Figure 18a. The black triangles within each module are used to open a drop-down menu for the module’s options.
The NLDAS-acquired data are then used to drive the DHSVM, which is subsequently compared against observed streamflow measurements acquired from USGS. The DHSVMAgent_g module, which executes the DHSVM, requires six specific input datasets for model execution. These datasets are connected to the corresponding NLDAS outputs, as shown in Figure 18a.
  • HUMIDITY: [kg/kg] Specific humidity;
  • LONG WAVE: [W/m2] Long-wave downwards radiation;
  • SHORT WAVE: [W/m2] Short-wave downwards radiation;
  • PREC: [m] Total precipitation per time-step;
  • TEMP: [°C] Air temperature;
  • WIND: [m/s] Total magnitude of wind speed.
Using these inputs, the DHSVMAgent_g simulates streamflow, which is then compared to observed data acquired via the UsgsAgent module. The UsgsAgent is configured to acquire streamflow data using the following parameters:
The msmUnitConversion module converts the DHSVM streamflow output into cubic meters per second (m3/s). The output from UsgsAgent can then be linked to msmShowChart along with the output of DHSVMAgent_g, enabling a comparison between observed data and the DHSVM results. The comparison, as visualized by msmShowChart, is shown in Figure 19.
The second part of the workflow demonstrates CyberWater’s integration with GRASS GIS. Data acquisition is performed by the DEMAgent module, which requires a connection to the GISEngine module. The GISEngine initializes the GRASS GIS environment within Python, allowing subsequent GRASS GIS modules such as DEMAgent and GISRunModuleAgent to execute GRASS GIS scripts. The DEM data acquired by DEMAgent is passed to the first GISRunModuleAgent, named “g.region” in Figure 18a. As the name suggests, this agent executes the GRASS GIS script “g.region.raster”, which generates a raster map of the initialized region.
The output from “g.region” GISRunModuleAgent’s Ready port is then passed to the next GISRunModuleAgent, “r.geomorphon”. The Ready output port acts as a flag, indicating whether the next module should be executed, and is used as a mechanism to organize the execution sequence of GRASS GIS operations in CyberWater. The “r.geomorphon” script identifies surface forms based on terrain shape and arrangement. Its Ready output is subsequently connected to the next GISRunModuleAgent, Save, which executes the GRASS GIS script “r.out.text”. The script converts the raster map layer into a GRASS American Standard Code for Information Interchange (ASCII) text file.
The Ready output port of the “Save” GISRunModuleAgent is connected to the final GISRunModuleAgent, “Category”. The “Category” module executes the GRASS GIS script “r.category”, which manages category values and labels associated with user-specified raster map layers.
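The Ready-port sequencing described above can be sketched in plain Python. This is only an illustration of the idea, not CyberWater code: the GisStep class, its run() method, and run_chain() are hypothetical names, and the lambda actions stand in for the GRASS GIS scripts that the real GISRunModuleAgent modules would execute.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GisStep:
    """Hypothetical stand-in for a GISRunModuleAgent in the workflow."""
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded

    def run(self, ready_in: bool, log: List[str]) -> bool:
        # Skip this step when the upstream Ready flag is not set.
        if not ready_in:
            return False
        log.append(self.name)
        return self.action()


def run_chain(steps: List[GisStep], log: List[str]) -> bool:
    """Run steps in order, passing each Ready flag to the next step."""
    ready = True  # the first step has no upstream dependency
    for step in steps:
        ready = step.run(ready, log)
    return ready


# The four steps named in the text, each assumed to succeed.
executed: List[str] = []
chain = [
    GisStep("g.region", lambda: True),
    GisStep("r.geomorphon", lambda: True),
    GisStep("Save", lambda: True),
    GisStep("Category", lambda: True),
]
finished = run_chain(chain, executed)

# A failing step leaves every downstream step unexecuted.
partial: List[str] = []
broken = [
    GisStep("g.region", lambda: True),
    GisStep("r.geomorphon", lambda: False),
    GisStep("Save", lambda: True),
]
broken_finished = run_chain(broken, partial)
```

Passing a flag rather than data keeps the ordering explicit even when a step writes its results to disk instead of to an output port, which is the situation the next paragraph addresses.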
Since the GISRunModuleAgent operates outside the CyberWater-VisTrails framework, its output must be imported back into the system. This is accomplished with the DirToStaticDataSet module, which creates a single dataset from an operating system folder. Only one map is expected to be loaded as a static map, specifically the file saved by the “Save” GISRunModuleAgent.
The imported dataset is then passed to the msmAnimation module to visualize the output from “r.geomorphon”. The msmAnimation module generates an animated Graphics Interchange Format (GIF) from the dataset produced by DirToStaticDataSet. The resulting visualization is shown in Figure 18c.

6. Discussion

The approach taken for this maintenance upgrade of CyberWater-VisTrails would not be sustainable for future updates. While reengineering may be necessary for major upgrades, a more sustainable strategy would be to conduct maintenance earlier and on a more regular basis. Addressing issues incrementally prevents them from accumulating into large-scale overhauls, reducing the time and effort required for future upgrades. For example, rather than debugging a complex issue involving deprecated C/C++ methods, redefined classes, and other bugs all at once, proactive maintenance might limit the scope to just one of these changes. Regularly checking for software updates during available maintenance periods can enable shorter, more manageable upgrade cycles.
Some parts of VisTrails’ source code, however, require reengineering to minimize the impact of future maintenance. A notable example is the VTK package, whose wrapper is currently fragile and likely to break with future VTK updates. The most critical improvement would be to enhance the mechanism that determines whether a class is abstract. As previously noted, updating the VTK wrapper currently involves adding abstract classes to the lists of disabled classes, modules, and methods. The decision to reengineer ultimately depends on how much a solution affects future maintenance. In the case of the VTK wrapper, a simple fix addresses neither the growing list of disabled classes nor the slow process of determining which classes should not be imported. A more sustainable solution is to reengineer the wrapper so that it dynamically disables classes that cannot be imported because they are abstract. Automating this process would significantly streamline future maintenance.
Reengineering, however, can be costly. As noted in the introduction, reengineering follows a reverse engineering process where an existing system is analyzed, redesigned, and redeveloped to produce a more modern version. This often requires rebuilding the system from the ground up. By contrast, a “fix-as-we-go” approach modernizes only enough to preserve the system’s original functionality. While this method demands less time and effort, it results in short-term fixes that complicate future development.
Maintenance should be streamlined so developers can focus on expanding software rather than repeatedly repairing its foundations. During this upgrade project, several candidates for reengineering emerged. The most significant was the VisTrails VTK wrapper, but challenges such as VisTrails’ reliance on multiple inheritance with PyQt and CyberWater’s rigid integration of GRASS GIS also highlighted the need. One missed opportunity in this upgrade was the limited use of reengineering. Time constraints and the small number of developers available often made the quicker “fix-as-we-go” path more appealing, even though it might compromise long-term maintainability.
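The multiple-inheritance issue mentioned above can be illustrated without PyQt. The sketch below uses plain Python classes as stand-ins: ToolkitWidget plays the role of a PyQt base class whose initializer calls super(), and WorkflowMixin plays the role of a class defined by VisTrails. All three class names and their parameters are hypothetical; the point is only to show why every initializer in the hierarchy has to cooperate through super().

```python
class ToolkitWidget:
    """Hypothetical stand-in for a PyQt base class that calls super().__init__()."""

    def __init__(self, parent=None, **kwargs):
        # Forward unused keyword arguments to the next class in the MRO.
        super().__init__(**kwargs)
        self.parent = parent


class WorkflowMixin:
    """Hypothetical stand-in for a base class defined by VisTrails."""

    def __init__(self, controller=None, **kwargs):
        super().__init__(**kwargs)
        self.controller = controller


class PipelineView(ToolkitWidget, WorkflowMixin):
    """Inherits from both; a single super() call initializes each base once."""

    def __init__(self, parent=None, controller=None):
        super().__init__(parent=parent, controller=controller)


view = PipelineView(parent="main-window", controller="ctrl")
mro_names = [cls.__name__ for cls in PipelineView.__mro__]
```

If PipelineView instead called each base initializer explicitly, the toolkit class's own super() call would still reach WorkflowMixin, so that base would be initialized twice or with the wrong arguments. This is the kind of breakage the Method Resolution Order (MRO) makes easy to introduce during an upgrade.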
The automation proposed above for the VTK wrapper is already partially implemented. When a class is imported, the system first checks whether it appears in a hardcoded obsolete list. If not, VisTrails attempts to instantiate the class via the VTK Python package. If instantiation raises a TypeError or NotImplementedError, the class is identified as abstract; otherwise, it is valid. This logic is illustrated in the is_abstract() method. For every VTK class, each method is mapped to either an input or an output port, depending on its use case. The parameters of these methods are then validated with is_type_allowed(), which checks only whether a type appears in the disabled_classes or disallowed_types lists.
The limits of this approach became clear during the upgrade from VTK 5.10 to VTK 9.3, when various errors surfaced as VisTrails attempted to instantiate vtkWeakPointer objects. The vtkWeakPointer type is not intended to be instantiated from Python; it exists solely for the C++ side of VTK. The current VisTrails process, however, assumes that nearly all types are instantiable. The quick workaround was to disable all methods involving vtkWeakPointer, but this inadvertently removed access to valid methods that could otherwise be used within VisTrails.
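The instantiate-and-catch check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the VisTrails implementation: the example classes are hypothetical stand-ins built with Python's abc module rather than real VTK classes, OBSOLETE_CLASSES is an assumed placeholder for the hardcoded list, and should_disable() and build_disabled_classes() are names introduced here. Only is_abstract() and is_type_allowed() are named in the text, and their real signatures may differ.

```python
from abc import ABC, abstractmethod


class AbstractFilter(ABC):
    """Hypothetical abstract class; instantiating it raises TypeError."""

    @abstractmethod
    def update(self):
        ...


class ConcreteFilter(AbstractFilter):
    """Hypothetical concrete class that can be instantiated."""

    def update(self):
        return "updated"


class CppOnlyHandle:
    """Hypothetical type that cannot be constructed from Python."""

    def __init__(self):
        raise NotImplementedError("not constructible from Python")


OBSOLETE_CLASSES = {"LegacyFilter"}  # assumed placeholder for the hardcoded list


def is_abstract(cls):
    """Probe a class by instantiating it with no arguments."""
    try:
        cls()
    except (TypeError, NotImplementedError):
        return True
    return False


def should_disable(cls):
    """Check the obsolete list first, then fall back to the probe."""
    return cls.__name__ in OBSOLETE_CLASSES or is_abstract(cls)


def build_disabled_classes(candidates):
    """Derive the disabled set at import time instead of maintaining it by hand."""
    return {cls.__name__ for cls in candidates if should_disable(cls)}


def is_type_allowed(type_name, disabled_classes, disallowed_types):
    """A type is allowed only if it appears in neither list."""
    return type_name not in disabled_classes and type_name not in disallowed_types


disabled = build_disabled_classes([AbstractFilter, ConcreteFilter, CppOnlyHandle])
```

One caveat of probing by instantiation is that a concrete class whose constructor requires arguments also raises TypeError, so such a probe can over-disable. A reengineered wrapper would need to separate that case from true abstractness.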
Exploring the VTK wrapper highlighted several challenges and potential development pitfalls, but it also provided valuable lessons for future CyberWater-VisTrails developers and maintainers. The key consideration is whether reengineering the wrapper yields long-term efficiency gains. Currently, each upgrade of the VisTrails VTK wrapper requires expanding the hardcoded lists of disabled classes, which leads to software bloat: the lists are easy to maintain at a basic level, but inefficient and error-prone for rapid iteration. Effective software maintenance relies on striking the right balance between simplicity and flexibility within the project’s constraints.
That said, not every issue requires reengineering. Some challenges can be addressed using other software evolution techniques, such as replacement, translation, or wrapping. While reengineering demands a substantial upfront investment, it can save significant time in the long run. However, for certain bugs, a quick resolution that does not impact long-term maintenance may be preferable. By balancing these strategies, future VisTrails maintenance can become far more efficient and sustainable.

7. Conclusions

The maintenance of CyberWater-VisTrails involved multiple software evolution methods, including source translation for the transition from Python 2 to Python 3 and software replacement for the PyQt upgrade. Looking ahead, maintaining CyberWater-VisTrails will require expanding upon these approaches, particularly through strategic reengineering to streamline future upgrades.
Additionally, the upgrade process itself needs refinement. Many updates followed an ad hoc “fix-as-we-go” approach, which, while functional, proved inefficient. A more structured strategy, similar to the systematic approach taken for the PyQt transition, would improve efficiency by addressing related changes together.
Maintainability discussions in the scientific community often focus on the categorization and the proposed methodologies, but these proposals are not always subjected to thorough testing. In the CyberWater-VisTrails upgrade process, such methodologies were applied in a unique context: the maintainers were tasked with upgrading software they did not originally develop. While the sprint remained the same, the supporting systems required for the software were entirely new.
There was a strong emphasis on limiting reengineering to minimize functional discrepancies. The goal was to preserve the software’s original design as much as possible, thereby reducing the risk of introducing design issues. After the upgrade, however, it became clear that this approach was not entirely feasible. New design choices in libraries, programming languages, and other dependencies inevitably created reengineering opportunities. These opportunities should not be overlooked; at the very least, they should be carefully investigated, since the upgrade process provides the ideal moment to redesign the software when necessary.
While CyberWater-VisTrails is now stable, ensuring its long-term viability requires proactive and continuous maintenance. Regular, well-planned updates will prevent major overhauls and support a sustainable evolution of the system over time.

Author Contributions

Conceptualization, Y.L.; methodology, D.B. and Y.L.; software, D.B., A.S. and A.M.H.; validation, D.B., A.S., S.A., A.M.H. and R.C.; formal analysis, D.B. and Y.L.; investigation, D.B., Y.L. and X.L.; resources, Y.L. and X.L.; writing—original draft preparation, D.B. and Y.L.; writing—review and editing, D.B., A.S., A.M.H., Y.L. and X.L.; visualization, A.S.; supervision, Y.L. and X.L.; project administration, Y.L. and X.L.; funding acquisition, X.L. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the U.S. National Science Foundation under grants OAC-1835817 and OAC-2209835 to Indiana University and grants OAC-1835785 and OAC-2209833 to the University of Pittsburgh.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created in this study. Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
  • GUI: Graphical User Interface
  • HTTP: Hypertext Transfer Protocol
  • CMIP: Coupled Model Intercomparison Project
  • GRASS: Geographic Resources Analysis Support System
  • GIS: Geographical Information System
  • NetCDF: Network Common Data Form
  • msm: Meta-Scientific-Modeling
  • HPC: High Performance Computing
  • VTK: The Visualization Toolkit
  • API: Application Programming Interface
  • MRO: Method Resolution Order
  • SSH: Secure Shell
  • 2FA: Two-Factor Authentication
  • WBS: West Branch Susquehanna
  • VIC: Variable Infiltration Capacity
  • NLDAS: North American Land Data Assimilation System
  • ERA5: ECMWF Reanalysis v5
  • ECMWF: European Centre for Medium-Range Weather Forecasts
  • NCAR: National Center for Atmospheric Research
  • USGS: US Geological Survey
  • DHSVM: Distributed Hydrology Soil Vegetation Model
  • DEM: Digital Elevation Model
  • SRTM: Shuttle Radar Topography Mission
  • ASCII: American Standard Code for Information Interchange
  • GIF: Graphics Interchange Format

Appendix A

Table A1. The list of all updates made to CyberWater-VisTrails during the upgrade.
| Module | Bug Description | Bug Type |
|---|---|---|
| Python | Resolved issue with copying modules/workflows. | bytes/str |
| VTK | Updated VTK’s wrapper to disable further abstract classes by disabling classes, methods, and modules. | VTK wrapper |
| GRASS GIS/PROJ4 | Updated the syntax of datum parameters from lowercase to uppercase. | Syntax |
| Python | Resolved issue with grouping Modules together. | Operator overload |
| Python | Resolved issue with ungrouping a Group object. | Operator overload |
| ElementTree | Replaced the deprecated Element method getchildren() with list(Element). | Deprecated method |
| PyQt | Updated images used within VisTrails that used unsupported ICC profiles and had invalid iCCP chunks. | Images |
| ElementTree | Updated all calls of ElementTree.tostring() to use Unicode encoding. | bytes/str |
| VTK | Suppressed VTK warnings on first startup, as they are needed by the wrapper but not by the user. | VTK wrapper |
| Python/PyQt | Updated all calls of exec_() within the PyQt package to exec(). | Deprecated method |
| PyQt | Updated QPrinter objects so that workflows could be exported. | Restructure of module |
| Python/PyQt | When switching workflows within the pipeline view, some deleted connection objects could be accessed if there were common connections between the workflows. Changed the method that updates common connections so that it updates them fully rather than partially. | Legacy |
| PyQt | Updated QFileDialog.getSaveFileName() calls to use only the first item of the returned tuple, as VisTrails needs only the str that represents the path. | Updated method functionality |
| PyQt | Changes in PyQt 5 caused some visual bugs in the Spreadsheet window. Made the window frameless to disable title bar movement, so that the rubber band functionality could be used. Also updated the spreadsheet tabs event filter to not check whether the mouse was on the title bar. | Visual bug |
| PyQt | Added various pyqtSignals across VisTrails to support the custom signals defined by VisTrails. | Deprecated method |
| PyQt | Updated various objects that are used in search functions in VisTrails to be hashable. | Operator overload |
| PyQt | Updated all import and initialization statements to account for the restructuring of PyQt. | Restructure of module |
| PyQt | Updated all signal, emit, and connect statements to account for the new style of PyQt signals. | Deprecated method |
| Python/PyQt | Any class that inherited from both PyQt and a class defined by VisTrails had to have its initializer updated to use super() calls, as PyQt uses these calls for its own initialization. | Updated method functionality |
| PyQt | Replaced the deprecated method setMargin() with setContentsMargins(), a more flexible method that allows margins to be defined separately instead of as one uniform margin. | Deprecated method |
| PyQt | Updated the calls of enumerations, as they were moved to a separate module in PyQt5. | Restructure of module |
| PyQt | Replaced the deprecated method setMovable() with setSectionsMovable(). | Deprecated method |
| PyQt | Replaced the deprecated method setResizeMode() with setSectionResizeMode(). | Deprecated method |
| All | Various deprecated methods or methods that have been updated due to version changes. | Deprecated method/Updated method functionality |
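The two ElementTree rows of Table A1 can be illustrated with the Python standard library alone. The snippet below is a small sketch rather than VisTrails code: the XML string and variable names are made up for illustration, while the calls themselves are standard xml.etree.ElementTree API.

```python
import xml.etree.ElementTree as ET

# Made-up XML standing in for a serialized workflow.
root = ET.fromstring("<workflow><module name='a'/><module name='b'/></workflow>")

# Element.getchildren() was removed in Python 3.9; list(element) replaces it.
children = list(root)
names = [child.get("name") for child in children]

# By default tostring() returns bytes in Python 3.
as_bytes = ET.tostring(root)

# Passing encoding="unicode" returns a str, avoiding bytes/str mix-ups.
as_text = ET.tostring(root, encoding="unicode")
```

Code written for Python 2 that concatenates the result of tostring() with ordinary strings fails under Python 3 unless the Unicode encoding is requested, which is why this change is classified as a bytes/str issue.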
