marvin: A Platform for Chemoinformatics Software Development

A strategy for a new type of platform for chemoinformatics software development and its first implementation are presented. The basic task of such a platform is to apply sequences of computational methods to high numbers of molecules. The implementation presented is based on four major components: (a) the application manager, responsible for running programs and for data management; (b) executable applications that supply limited pieces of functionality; (c) syntax definitions for data and control files and (d) the runtime library which comprises routines for data handling and user interface. This simple concept is implemented in the software package marvin. Different computational methods are available within marvin, including parts of commercial software packages (e.g. molecular modeling, bioinformatics, statistics, etc.) as well as newly developed and innovative algorithms. The basic layout of marvin is described and a simple example illustrates its application.


Introduction
Recent developments in automation of chemical synthesis and biochemical screening have resulted in a fundamental paradigm shift in the drug discovery process [1]. Today, pharmaceutical companies are searching for new lead structures by screening huge compound libraries, including hundreds of thousands or even millions of chemical compounds against an increasing number of biological targets. This new drug discovery workflow is designated as "new technologies" or NT [2,3]. The NT approach outputs enormous amounts of experimental results, produced by high-throughput robotic systems. On the other hand new and innovative methods of data management and molecular modeling are required in order to handle the corresponding data and to reunite the experimental results with molecular structures and with predicted molecular properties [4,5]. These new computational methods must be able to derive properties such as three-dimensional structures, structural indices, physicochemical properties and other molecular descriptors for huge numbers of compounds in order to increase efficiency of drug discovery [6,7].
As a consequence of this development, the new discipline of chemoinformatics has emerged [8-10]. Chemoinformatics combines the techniques of molecular modeling, computer-aided drug design, data management and data mining, and has become an integral part of the drug discovery process in pharmaceutical companies [11-13]. Major parts of the drug discovery process are supported by the new computational science (see Figure 1).
The new applications of chemoinformatics require rapid development of novel algorithms to solve the problems that arise. Typical tasks addressed by upcoming chemoinformatics software involve huge numbers of screening data and huge numbers of molecular structures. Screening laboratories output more than 100,000 data points every day [14]. The number of compounds in the compound stocks already exceeds the magic number of one million. Combinatorial libraries comprising more than 10,000 compounds are synthesized [15]. Virtual combinatorial libraries already contain more than 10 million compounds. Future computational algorithms must be able to apply molecular modeling methods to these compound libraries in order to find typical patterns in structural space, in pharmacological space or in property space [16]. Chemoinformatics software is set up to compute properties such as physicochemical properties [17], lipophilicity or drug-likeness [18], as well as molecular and structural descriptors [19,20], for huge numbers of molecules. Data-mining techniques are applied to correlate descriptor similarity with chemical or biological similarity.
In the last few years, development in drug discovery was driven mainly by automation techniques; as a consequence, the computational methods must now follow quickly. Major developers of commercial molecular modeling software are addressing the new problems by adding chemoinformatics modules to their software packages. However, commercial solutions are limited to proprietary computational methods and depend on complex data management systems.
In contrast, the concept implemented in the software package marvin [21,22] is very simple and open. It represents a first example of a platform that allows very fast development of innovative chemoinformatics algorithms by combining the functionality of newly developed programs with modules from existing molecular modeling software packages. Beyond this, marvin provides completely automated setup and testing as well as automated application of the new methods to current data sets. Because of its very simple basic concept, the strategy can be implemented very quickly on any computational platform. Because of its open interface, a multitude of modules and computational methods is available at once. Because of its modular concept, only small parts of any new chemoinformatics algorithm must be developed from scratch; most parts of the algorithm can be imported by interfacing software packages that already exist.

Figure 1:
Simplified scheme of the preclinical drug discovery process. Every step in the workflow is accompanied by computational sciences. Bioinformatics is an important tool for target finding and target validation. Molecular modeling methods are used mainly in the lead optimization cycles. Chemoinformatics supports a major part of the process with property calculations, similarity assessment, data management and data mining.

Basic strategy
The basic idea of marvin is to automatically run a sequence of computer programs on a number of compounds. The functionality of the entire sequence is called an algorithm in the context of marvin. The individual programs are referred to as applications. A marvin algorithm is therefore built by combining applications and running them on molecules. marvin itself is only the integration platform that links together all pieces of data and software needed in a chemoinformatics project.

marvin as a black box
Because of the high level of automation, any marvin algorithm can be applied as a black box: chemical structures are used as input and the results of the entire study are presented as output. Usage as a black box is an important requirement for an integrated chemoinformatics software solution, because applying a variety of computational methods, implemented in different software packages on different computational platforms, to millions of compounds would otherwise be a very time-consuming job. High efficiency and fast application of computational methods, however, are among the major preconditions for chemoinformatics software within the strategy of NT drug discovery.
The concept of marvin allows manual setup and flexible optimization of an algorithm. Once it is tested and validated, even a complex algorithm can be run at the touch of a button without any user interaction until the final result is presented.

marvin as a box of bricks
Looking inside the black box, marvin presents itself as a box of bricks (see Figure 2). All applications (the modules of the algorithm) are built on a small number of precisely defined interfaces. These interfaces are data file formats and communication file formats, handled by marvin library functions. All communication between applications is performed via these files. This is why all applications can be designed individually and every application can be used together with any other one. There is no need to interface modules from different software suppliers with each other, as long as integrating the single modules into the integration platform is simple.
Before running an algorithm on this file based platform, the global setup file, that holds all runtime parameters for all applications of the algorithm, must be created. The library functions read this setup file and pass the parameters to the corresponding applications. Every application writes its output to data files which are readable by all other marvin applications. Additional inter-application communication is possible via the communication file, which is how error messages and warnings are passed from one application to another.
The status output (such as warnings and error messages) is written to a common output file that comprises a detailed documentation of the entire study. This status file is generated automatically.

Figure 2:
Visualization of the modular structure of a marvin algorithm (see Section 4 for the example study). Data sets are transferred between applications by so-called maff data files. Different types of applications, such as generic applications (displayed as yellow rectangles), interfaced applications (ovals) or high-level applications (in orange and 3D representation) are used seamlessly in order to include all functionality needed by the marvin algorithm. Blue arrows indicate data flow, magenta arrows indicate control flow.

Information flow and control flow
The inputs of the marvin black box are molecule structures and parameter setup files. The information flow inside the black box is controlled by the application manager (APM, see Figure 3). At start time, the APM analyzes the setup, completes application parameter sets by looking up default values and writes a so-called job file. This job file is a script that runs the current study on the specified host computers. It holds all commands necessary to run applications as well as commands needed for data management and control flow. Additional functionality includes copying files, removing files, compressing and decompressing files, transferring files to remote hosts, partitioning data sets for parallel processing, merging outputs from parallel processing, etc.

Figure 3:
The user can apply marvin as a black box. Only the marvin input file and the study results must be handled manually. The entire marvin study runs automatically and is controlled by the APM, which reads the setup from the input file. Blue arrows indicate data flow, magenta arrows indicate control flow.
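The job-file step can be illustrated with a minimal Python sketch. It assumes a hypothetical setup consisting of a list of application names and a list of molecule names, and it emits a shell script as text. The command-line form `application molecule.maff` and the use of gzip for compression are assumptions made for illustration; they are not taken from marvin's actual job file format.

```python
import shlex


def write_job_script(applications, molecules, compress=True):
    """Return the text of a shell script that runs each application on each molecule.

    Hypothetical sketch: the command syntax is invented for illustration.
    """
    lines = ["#!/bin/sh", "set -e"]
    for app in applications:
        for mol in molecules:
            data_file = mol + ".maff"
            # one command per (application, molecule) pair
            lines.append(f"{shlex.quote(app)} {shlex.quote(data_file)}")
    if compress:
        # data management commands appended after the applications
        for mol in molecules:
            lines.append(f"gzip -f {shlex.quote(mol + '.maff')}")
    return "\n".join(lines) + "\n"


script = write_job_script(["molIn", "atomAuto"], ["aspirin", "caffeine"])
print(script)
```

Generating the script as plain text first keeps the sketch testable without executing any external program.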

marvin modules
The marvin modules are the application manager (APM), applications, file format definitions and the marvin runtime library. The application manager reads basic parameters from the input files and sets up the entire marvin job.
Applications are computer programs, which address small functional parts of an algorithm (such as generation of 3-dimensional structure from molecular topology, calculate one molecular property, etc.).
The parameter input files contain all run-time parameters for all applications and the marvin specific setup.
The marvin run-time library is used to integrate new software into the marvin system. The library functions cover functionality concerning user interface, data input and output, protocol recording and documentation.

Application manager
The minimum requirement of an application manager is to apply a number of applications to a number of molecules. Beyond this, the marvin APM works as a data management and networking module by controlling host-to-host file transfers, batch job submission, inter-application communication, etc. The marvin APM is controlled by keywords given in the input files (see Listing 2 for an example). The functionality of the marvin APM includes:
Multiple run modes (single, all, list): The application manager decides whether an application is to run with all molecules or just once for all molecules or only with a certain class of molecules (e.g. special handling of ionic compounds).
Network support: The APM finds applications in a local area network or the internet and starts them on the remote host. Data files are transferred if necessary, default settings are adapted and output files are concatenated automatically.
Integration of resources: Databases and other local or remote resources are included.
Commercial software packages: Functionality supported by commercial software packages is included. External packages can be run as interfaced applications as an integrated part of any marvin study.
Innovative applications: Novel and innovative applications can be implemented in C/C++ and seamlessly combined with the interfaced applications.
File format conversions: Necessary file-format conversions are recognized and performed automatically.
Runtime parameter organization: Every application is provided with the correct parameter input from the input file.
Checkpointing and restart of studies: Already calculated results are recognized and reused (e.g. in order to restart stopped studies or to rerun a study with only some of the application parameters changed).
Parallelisation: Time-consuming calculations are optimized by starting multiple instances of an application in parallel. This allows "cluster-like" parallelisation on multi-cpu computer servers.
Individual applications need not be parallelized by compiling special versions for specific host computers. Even software modules for which no parallelized versions are available (e.g. commercial software packages) can be run in parallel using this concept.
Batch processing: If processes are submitted to a batch queue, the APM waits for the completion of the job. The marvin algorithm is continued as soon as all needed results are available.
Documentation and error log: The APM generates a detailed documentation for the run of all executed applications. Errors, warnings, status output and cpu-times are reported to the marvin output file.
Optimized data management: Data files are stored in a compressed format and temporary files are removed automatically.
Fail-safe concepts: The APM recognizes technical problems and tries to find work-arounds or notifies the user.
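The "cluster-like" parallelisation listed above (partition the data set, start several instances of an unmodified application, merge the outputs) can be sketched as follows. This is only an illustration under stated assumptions: a Python callable stands in for an external application, threads stand in for separately started processes, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into at most n_parts contiguous chunks of similar size."""
    n_parts = max(1, min(n_parts, len(items)))
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_parallel(worker, items, n_instances):
    """Apply worker to every item using several instances, then merge in order."""
    chunks = partition(items, n_instances)

    def run_chunk(chunk):
        # each instance processes its own part of the data set serially
        return [worker(item) for item in chunk]

    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partial_results = list(pool.map(run_chunk, chunks))
    # merging: concatenate the partial outputs in the original order
    return [result for part in partial_results for result in part]


molecules = ["aspirin", "caffeine", "ibuprofen", "morphine", "taxol"]
print(run_parallel(len, molecules, n_instances=2))
```

Because the worker itself is never changed, the same pattern applies to programs for which no parallel version exists, which is the point made in the text.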

Applications
marvin applications are built up as generic, as high-level or as interfaced applications. Generic applications are programs developed for usage within marvin algorithms by linking the marvin runtime library. The functions and data types of the marvin library can be used to handle all data and parameter input and output (see Section 3.4 for details on the runtime library). Generic applications are most commonly used for external data interfaces, such as reading external file formats, or for implementing innovative computational methods.
In addition, most of the marvin system functions (such as the application manager) are implemented as generic applications.
High-level applications are defined in one of the marvin input files. High-level applications run any other marvin applications with a different set of default parameters (see Listing 1 for an example).
Interfaced applications are external software packages, integrated by using the generic application cmdLine [24] or by implementing a generic marvin interface application.
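The idea of a high-level application (another application run with a different set of default parameters, possibly nested) can be sketched in Python. The application names follow Listing 1; the dictionary layout and the parameter names `hamiltonian` and `host` are hypothetical and are not marvin's input syntax.

```python
def resolve_application(name, definitions, _seen=frozenset()):
    """Follow nested high-level definitions down to the underlying application.

    Returns (base_application, merged_parameters). Parameters defined closer
    to the requested name override those defined further down the chain.
    """
    if name in _seen:
        raise ValueError(f"circular high-level definition: {name}")
    entry = definitions[name]
    base = entry.get("runs")
    if base is None:
        # generic or interfaced application: nothing further to resolve
        return name, dict(entry.get("params", {}))
    resolved, params = resolve_application(base, definitions, _seen | {name})
    params.update(entry.get("params", {}))
    return resolved, params


# Hypothetical definitions modelled on Listing 1
definitions = {
    "marvinMOPAC": {"runs": None, "params": {"host": "localhost"}},
    "mopacAM1": {"runs": "marvinMOPAC", "params": {"hamiltonian": "AM1"}},
    "mopac-on-Sun": {"runs": "mopacAM1", "params": {"host": "bigsun"}},
}

print(resolve_application("mopac-on-Sun", definitions))
```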
Example applications are given in Table 1. Applications of all three types can be combined freely to build up an algorithm.
Communication between the different applications and between applications and the APM is based on text files. The formats of these files are fixed and implemented in the marvin runtime library. The most important marvin files are input files (parameter setup), data files, phone files (communication between applications) and output files.

Input files
All run-time and control parameters of marvin are given in three hierarchically organized input files. The structure of all input files follows the same syntax. The file local.defaults holds the host dependent settings for the local computer, such as paths to local applications, the local marvin installation, etc. (see Listing 3 for an example).
The file application.defaults includes all default parameters for applications. When executing an application on a remote host, the application manager copies application.defaults to the remote system to guarantee usage of the same settings for the entire study. High-level applications are defined in this file preferably (see Listing 4).
The file myname.marvIn (= marvin Input) holds the setup for the current marvin study to be run. The setup section of this file contains the sequence of applications and the list of molecules of a study. All settings can be redefined in the marvIn file (see Listing 2).
All marvin input files are plain text files. A section is defined for each application. The beginning and end of an application section are marked by the keys %%application and %%end of application. Every application reads parameters from its own section only. The application manager makes use of the section %%setup. Within the sections, run-time parameters are identified by their name (e.g. number of data points:). Several values of different data types can be given after the colon. A parameter definition can span one or more lines. Different types of parameters are given in Table 2. Text outside a parameter definition and text outside a section are ignored by the parameter read functions of the marvin run-time library and thus can be used for user comments.

Data files
All marvin applications output molecular data into a standard data file for every molecule (molecule.maff = marvin file format). Maff files are text files that store molecule data and history information in a readable form. Every data file therefore includes a brief documentation listing the applications used to generate the data. Listing 6 shows an example maff data file from the example algorithm described in Section 4. For optimized data management, marvin allows automatic compression and decompression of data files. Maff files may contain different types of information assigned to molecules, such as molecular topology (i.e. structural formula), three-dimensional structure, additional properties of the molecule or the atoms, tables of high-dimensional data (e.g. potentials, surfaces, etc.) and comments.
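The section syntax of the input files can be illustrated with a small Python parser. It is a sketch under assumptions: sections are taken to open with `%%name` and close with `%%end of name`, only single-line parameters of the form `name: values` are handled (the rule for multi-line definitions is not specified here), and the sample section contents are invented.

```python
def parse_marvin_input(text):
    """Parse sections and one-line parameters; everything else is a comment."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("%%end of "):
            current = None
        elif line.startswith("%%"):
            current = line[2:].strip()
            sections.setdefault(current, {})
        elif current is not None and ":" in line:
            name, _, rest = line.partition(":")
            # values are kept as strings; type conversion is left to the caller
            sections[current][name.strip()] = rest.split()
        # any other text is ignored, so it can serve as a user comment
    return sections


sample = """\
Free text outside a section is ignored.
%%setup
logfile size: medium
%%end of setup
%%susi
number of data points: 41
%%end of susi
"""

print(parse_marvin_input(sample))
```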

Communication file
The marvin communication file (study.phone) allows the applications of a study to communicate with each other (e.g. an application can exclude some molecules from the data set at run-time, see Listing 9 for an example).
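File-based communication of this kind can be sketched in Python. The actual layout of study.phone is not reproduced here; the tab-separated line format `sender, keyword, argument` and the keyword `exclude` are assumptions chosen only to show how one application could remove a molecule from the data set seen by later applications.

```python
import os
import tempfile


def post_message(phone_path, sender, keyword, argument):
    """Append one message line to the communication file (hypothetical format)."""
    with open(phone_path, "a", encoding="utf-8") as handle:
        handle.write(f"{sender}\t{keyword}\t{argument}\n")


def excluded_molecules(phone_path):
    """Collect the molecules that earlier applications asked to exclude."""
    excluded = set()
    if not os.path.exists(phone_path):
        return excluded
    with open(phone_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3 and parts[1] == "exclude":
                excluded.add(parts[2])
    return excluded


with tempfile.TemporaryDirectory() as workdir:
    phone = os.path.join(workdir, "study.phone")
    post_message(phone, "molIn", "exclude", "broken_structure_17")
    remaining = [m for m in ["aspirin", "broken_structure_17"]
                 if m not in excluded_molecules(phone)]
    print(remaining)
```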

Output file
The marvin output file (study.log) comprises all status information from all applications, such as warnings, error messages, computation times, etc. The thoroughness of this information is adjusted by setting the logfile size: parameter in the %%setup section of the marvin input file (possible values are small, medium, big or debug, see Listings 7 and 8).
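How a detail-level setting can filter status messages is sketched below in Python. The ordering small < medium < big < debug, and the rule that a message is written when its level does not exceed the global setting, are assumptions; the function names are hypothetical stand-ins and not the marvin library API.

```python
LEVELS = ["small", "medium", "big", "debug"]


def should_log(message_level, global_level):
    """True if a message of message_level belongs in a log of global_level."""
    return LEVELS.index(message_level) <= LEVELS.index(global_level)


def log_message(message_level, text, global_level, sink):
    """Append text to sink (a list standing in for study.log) if it passes the filter."""
    if should_log(message_level, global_level):
        sink.append(text)


log = []
log_message("small", "susi finished", "medium", log)
log_message("debug", "The value of Errorlevel is -1", "medium", log)
print(log)
```

Under these assumptions a debug-level message is written only when the global setting is debug, while a small-level message appears at every setting.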

marvin run-time library
The marvin run-time library is a compilation of functions and data types that help software developers in implementing novel marvin applications. The library functions are designed to address problems of data handling, file handling, user interface and marvin system management.
Data structures: marvin library functions are used to access molecular data stored in predefined data structures. The data show the same modular structure as the data files (see Figure 4). Different nesting levels of variables are used to address different levels of detail. At the highest hierarchical level, entire sets of information about one molecule are handled as one object. The interface also gives handles to more fine-grained information, such as molecular structure, molecular properties or even properties of single atoms.
Functions for data handling: The data-handling library includes functions for writing and reading data from maff data files into predefined data structures and functions for managing these data. marvin data structures include molecular data (e.g. atom coordinates, atom properties, topologies, molecular properties, etc) and high-dimensional vectors (e.g. potential fields, description vectors, etc). The functions allow accessing data in different ways, such as addressing values by index or by coordinates (see Table 3 for example functions).

User interface functions:
The user interface library includes functions to read run-time parameters from one of the input files. The parameters must be specified by section name, parameter name and parameter element number in the parameter list. The routines search all marvin input files for the requested parameter hierarchically: if the parameter is not found in the setup of the current job (i.e. the marvIn file), the file application.defaults and, if necessary, the file local.defaults are searched. This way parameters are set to default settings automatically (see Table 4 for example functions).
marvin system management: The marvin system management library includes functions for setting and accessing information in the marvin environment (see Table 5 for example functions).
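The hierarchical parameter search described for the user interface functions can be sketched in Python. The file names follow the text; the nested-dictionary representation, the zero-based element index and the parameter name `maximum distance` are assumptions made for illustration.

```python
def read_parameter(section, name, element, sources):
    """Return (value, source_label) from the first source that defines the parameter.

    sources is an ordered list of (label, parsed_file) pairs, searched from the
    most specific file to the most general one.
    """
    for label, parsed in sources:
        values = parsed.get(section, {}).get(name)
        if values is not None and element < len(values):
            return values[element], label
    raise KeyError(f"parameter '{name}' not found in section '{section}'")


sources = [
    ("myname.marvIn", {"setup": {"logfile size": ["debug"]}}),
    ("application.defaults", {"setup": {"logfile size": ["small"]},
                              "susi": {"maximum distance": ["5.0"]}}),
    ("local.defaults", {}),
]

print(read_parameter("setup", "logfile size", 0, sources))
print(read_parameter("susi", "maximum distance", 0, sources))
```

The first call is answered by the study file, the second falls back to the application defaults because the study file does not define the parameter.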

Function                  Description
ReadData(Name)            reads a molecular data set from maff file Name
MarvinWriteData(Name)     writes a molecular data set into maff file Name
MarvinGridPoint(Index)    returns the value referenced by index vector Index
MarvinGridInterp(Coors)   returns the value of an n-dimensional vector at the point given by the coordinates Coors by interpolating the grid
MarvinGridNumber(Coors)   returns a pointer to the data point next to the point defined by the coordinates Coors in an n-dimensional vector
MarvinAtomProperty(Num)   returns the properties of atom number Num
…                         …

MarvinLog( size, string)
writes a message into the status output file if size matches the global setting of the current output detail level (one of small, medium, big, debug). E.g. the command MarvinLog( "d", "The value of Errorlevel is -1") will write the message into the log file only if the output detail level is set to debug.

MarvinError( Num, Mssg)
The function reports the error message corresponding to the error number Num. The correct error message is assigned to Num and the string Mssg is printed as an additional explanation.

MarvinPercent( Actual, Total)
Reports the progress of a computation in percent.

... ...
Transmits a message to another application.

Example problem
The marvin platform allows building up complete chemoinformatics algorithms by simply combining generic, high-level and interfaced applications. In the following example, a marvin algorithm is set up to address the problem of mining a chemical database for compounds which are similar to a set of well known drug molecules.
As a first step in the algorithm, different molecular properties are calculated for all molecules in the reference set and in the database. These calculated properties are called molecular descriptors in this context. Next, pairs of descriptors are compared and the similarities are computed as Euclidean distances between the descriptor pairs (i.e. similarity is not assessed in the space of chemical structures but in descriptor space).
Obviously, the choice of descriptors is crucial: different descriptors will result in different similarities of the molecules. Therefore, the algorithm used for the virtual screening of the database must allow high flexibility in the choice of computational methods for descriptor calculations.

Applications
Several high-level, interfaced and generic marvin applications are needed to perform all calculations of the example study. Application setup and a detailed listing of parameter definition are given in the listings.
Applications used within this algorithm are molIn (generic), PropertyHbonds (high-level), marvinLogP (interfaced), atomAutoCharge (high-level), atomAutoHbond (high-level), susi (generic) and printSusi (high-level). In the following, these marvin applications are described briefly. The description is focused on technical aspects; the theoretical background of the implemented methods is discussed in more detail in [22].
molIn reads the topology of all molecules and writes an initial maff file for each molecule. molIn is a generic application that generates the marvin-specific molecule format. It accepts more than 50 different input formats by accessing the external program babel [25]. Maff files written by molIn hold the topology information of the molecules only. Subsequent applications add data fields to the files, and after the marvin run has finished, every file contains a variety of molecular information such as descriptors, properties and the results of filtering operations.
PropertyHbonds is a high-level application that runs the generic application atomProperty with the parameters needed to scan the molecules for hydrogen bond acceptors and hydrogen bond donors. All hydrogen bond acceptors are marked by a -1 and all donors by a +1 flag. The markers are stored as extended atom properties in the maff files.
marvinLogP is an interfaced application that uses the clogP program [26] to calculate octanol/water partition coefficients (logP values). The clogP values are stored in table-type data sets that consist of one row and one column.
atomAutoCharge is a high-level application that uses the generic application atomAuto to compute autocorrelation coefficients based on the atom charges.
atomAuto derives spatial autocorrelation coefficients [22,27,28] from three-dimensional molecular structures. Autocorrelation coefficients describe properties of atom pairs for given distances (e.g. the coefficient A_CH(10,11) indicates patterns of charges at distances between 10 Å and 11 Å). Autocorrelation coefficients are widely used in automated comparison of molecules, because they can be compared without superimposing the molecules [29].
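A spatial autocorrelation vector of this kind can be sketched in Python. The sketch uses one common form, in which each coefficient sums the products of an atom property over all atom pairs whose distance falls into a given interval; whether atomAuto uses exactly this definition is not stated here (see [22,27,28]), and the bin width, number of bins and sample data are assumptions.

```python
import math
from itertools import combinations


def autocorrelation(coords, props, bin_width=1.0, n_bins=12):
    """Sum props[i] * props[j] over atom pairs, binned by interatomic distance.

    coeffs[k] collects pairs with k * bin_width <= distance < (k + 1) * bin_width.
    """
    coeffs = [0.0] * n_bins
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        k = int(math.dist(a, b) // bin_width)
        if k < n_bins:
            coeffs[k] += props[i] * props[j]
    return coeffs


# Three hypothetical atoms on a line, with invented partial charges
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (4.5, 0.0, 0.0)]
charges = [0.5, -0.5, 1.0]
print(autocorrelation(coords, charges, bin_width=1.0, n_bins=5))
```

The result depends only on interatomic distances, so it is unchanged by rotating or translating the molecule, which is why no superposition is needed before comparing two such vectors.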
atomAutoHbond is the corresponding high-level application that uses the same generic application atomAuto to compute autocorrelation coefficients based on the extended atom property hydrogen bond, previously calculated by PropertyHbonds.
susi is a generic application that computes the similarity of one molecule to a reference set. First, Euclidean distances between the sample descriptors and all reference descriptors are calculated. Non-linear scaling of the Euclidean distances leads to similarity scores in the range between 0.0 and 1.0. A similarity score of 1.0 denotes exact similarity (i.e. a small distance between the descriptors). A similarity score of 0.0 indicates no similarity (i.e. the distance between the descriptors is higher than the predefined maximum distance).
All similarity scores for the sample molecule and the reference set are summed up to give the so-called susi (sum of similarity scores). Large values of the susi characterize molecules that are similar to the reference set [22].
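The distance-to-score step and the summation can be sketched in Python. The quadratic scaling function used here is an assumption chosen only because it maps a distance of zero to 1.0 and any distance at or beyond the maximum to 0.0; the scaling actually used by susi is described in [22]. The function names and sample descriptors are hypothetical.

```python
import math


def similarity_score(sample, reference, max_distance):
    """Map the Euclidean distance between two descriptor vectors to [0.0, 1.0]."""
    distance = math.dist(sample, reference)
    if distance >= max_distance:
        return 0.0
    # assumed non-linear scaling: 1.0 at distance 0, falling to 0.0 at max_distance
    return (1.0 - distance / max_distance) ** 2


def sum_of_similarity_scores(sample, reference_set, max_distance):
    """Sum the similarity scores of one sample against every reference descriptor."""
    return sum(similarity_score(sample, ref, max_distance) for ref in reference_set)


references = [(0.0, 0.0), (3.0, 4.0), (0.0, 2.5)]
print(sum_of_similarity_scores((0.0, 0.0), references, max_distance=5.0))
```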
In the example study, the description vector for each molecule is built up from two sets of autocorrelation coefficients and the computed cLogP value (41 values in total, see Listing 6).
printSusi is a high-level application that parameterizes the interfaced application GnuPlot in order to give a graphical output of the study result.
The application GnuPlot uses the external GnuPlot [30] program to plot data to a printer or file. Most of the GnuPlot parameters are accessible from within the marvIn setup. The GnuPlot application provides suitable default settings for plotting marvin data sets of different types and is used mainly to output study results. Default plot styles, defined as high-level applications, hide most of the GnuPlot parameters.

Run modes
The applications molIn, PropertyHbonds, marvinLogP, atomAutoCharge and atomAutoHbond are configured to run with all molecules so that all molecule data files contain the descriptors needed for the comparison. The applications susi and printSusi are called only once to calculate the sum of similarity scores for all molecules in the database. susi writes molecule names and the corresponding susi values to the status output file. Compounds with high scores can be selected from the list for synthesis.
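How an application manager might expand run modes into individual program calls is sketched below in Python. The mapping of the keywords is an assumption: `all` is taken to mean one call per molecule, `single` one call for the whole data set, and `list` one call per molecule of a selected subset. The function name and data layout are hypothetical.

```python
def expand_run_mode(application, mode, molecules, selected=()):
    """Return a list of (application, molecules_for_this_call) pairs."""
    if mode == "all":
        # one call per molecule
        return [(application, [mol]) for mol in molecules]
    if mode == "single":
        # one call that sees the whole data set
        return [(application, list(molecules))]
    if mode == "list":
        # one call per molecule, restricted to a chosen class of molecules
        chosen = set(selected)
        return [(application, [mol]) for mol in molecules if mol in chosen]
    raise ValueError(f"unknown run mode: {mode}")


molecules = ["aspirin", "caffeine", "sodium_acetate"]
calls = (expand_run_mode("atomAutoCharge", "all", molecules)
         + expand_run_mode("susi", "single", molecules))
for call in calls:
    print(call)
```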

Listings
Application setup and a detailed listing of parameter definitions are given in Listing 2. Run-time parameters for marvin are defined in the section %%setup, parameters for the applications in the subsequent sections. Comments are included for better readability. The syntax for parameter definitions is the same in all input files and for generic, high-level or interfaced applications.
Listings 3 and 4 display parts of the files local.defaults and application.defaults, used in the example study.
Example application manager output is shown in Listing 5. This file is a Unix shell script and runs all applications and helper programs of the study.
An example data file is given in Listing 6. The header of every data file includes a history section with messages from all programs that worked on the molecule or the data. This history section is generated automatically every time a data set is written by the MarvinWriteData() functions that are part of the marvin run-time library.
Listing 7 and Listing 8 show clippings from the status output file, printed with the log file detail setting small and medium. The settings big and debug generate a more detailed status output.

Optimization of the algorithm
The success of any similarity assessment of molecule databases depends on the descriptors used. This is because every similarity or diversity assessment is based on descriptors instead of molecular structures. Strictly speaking, it is not the similarity of molecules but the similarity of descriptors that is calculated and displayed. Therefore a platform for chemoinformatics algorithm development must allow very high flexibility in parameter setup and in the usage of descriptor calculation programs [31]. The platform must be able to rerun a study with minimum expenditure of time in order to optimize the parameter setup for descriptor calculation and comparison.
In a study that runs under the marvin platform all parameters for every single application can be modified easily and any application can be replaced by another one. It is possible to rerun the entire study with modified setup. The job setup and all parameter settings are documented automatically in the marvin status output file for every single run. The studies run without further user interaction so that optimization of the algorithm by variation of parameters and methods is only limited by cpu time available.

Summary and outlook
The modular chemoinformatics platform marvin allows flexible setup of algorithms such as the similarity screening outlined in the example. Multiple runs of the same study are possible with different parameterization or with different methods. Interactive work on the algorithm is reduced to a minimum. All runs are documented automatically. Data management tasks, such as removal, copying or compression and decompression of files, as well as network file transfers, are handled and controlled by marvin. All applications, the algorithm and the marvin components are controlled by the same input file using a uniform and easy-to-use syntax.
Most of the requirements of a platform for chemoinformatics algorithm development are met by a simple modular system like marvin. The combination of the basic components application manager, runtime library and interface to external software packages has proven to be flexible and strong enough to work as a chemoinformatics software platform. Chemoinformatics algorithms can be developed and optimized easily. The basic strategy of marvin is very simple and can be implemented quickly.
However, the layout of marvin shows major disadvantages, mainly in handling huge data sets: The application manager executes all applications internal and external to marvin in a simple way from the unix command line. All data is stored in compressed plain text files. Both characteristics cause a limitation of the number of molecules examined in a study. No considerable problems occur with data sets between 1000 and 10 000 molecules. Applied to a higher number of molecules the script files become too big and the number of data files too high.
Therefore the platform is appropriate for development of new applications and algorithms. Further developments are necessary in order to apply the new algorithms to data sets of more than 10 000 molecules.
Tomorrow's chemoinformatics platforms should be implemented in a different way, for example using completely object-oriented concepts, CORBA interfaces for communication tasks and object-relational databases for the storage of huge data sets. But this first implementation demonstrates the proof of principle: a simple concept is able (or necessary) to meet the needs of a future chemoinformatics platform.

Listings
The listings illustrate the way parameters are set in a marvin algorithm. All listings are clippings from the control files and data files of the example study.

Listing 1
Example definitions for nested high-level applications (file: application.defaults): The high-level application "mopacAM1" runs the external program mopac [23] with the AM1 hamiltonian. The second high-level application "mopac-on-Sun" runs mopac with the same parameters on a remote host named bigsun. Both high-level applications refer to the interfaced application marvinMOPAC.

Listing 8
Clippings from the marvin status output file of the example study recorded with log mode medium. Log mode medium and log mode big are used for validating an algorithm or a single application. All parameters read by the marvin library functions are echoed to the log file.