Open Source Repository and Online Calculator of Prediction Models for Diagnosis and Prognosis in Oncology

(1) Background: The main aim was to develop a prototype application that would serve as an open-source repository for a curated subset of predictive and prognostic models in oncology, and provide a user-friendly interface that allows the included models to be calculated online. The focus of the application is on providing physicians and health professionals with patient-specific information regarding treatment plans, survival rates, and side effects for different expected treatments. (2) Methods: The primary models used were those developed by our research group in the past. This selection was complemented with a number of models addressing the same cancer types but focusing on other outcomes, selected through a literature search in the PubMed and Medline databases. All selected models were publicly available and had been validated at TRIPOD (Transparent Reporting of studies on prediction models for Individual Prognosis Or Diagnosis) type 3 or 2b. (3) Results: The open-source repository currently incorporates 18 models from different research groups, evaluated on datasets from different countries. Model types include logistic regression, Cox regression, and recursive partitioning analysis (decision trees). (4) Conclusions: An application was developed to enable physicians to complement their clinical judgment with user-friendly, patient-specific predictions using models that have received internal/external validation. Additionally, the platform enables researchers to display their work, enhancing the use and exposure of their models.


Introduction
One of the most challenging areas of modern medicine is oncology [1]. The decision-making process for the best treatment plan is now more difficult than ever because of the inherent heterogeneity of cancer types, patients, and the ever-growing range of available treatments [2]. To choose the "optimal" course of treatment, clinicians must consider evidence from clinical trials, ongoing research, their own professional expertise, national guidelines, and the values of the patient [3]. Predictive modeling is becoming a key knowledge-based tool in the healthcare field [4]. The popularity of predictive modeling arises from advancements in a variety of areas, including the availability of health data from electronic health records and databases; a better understanding of causal or statistical predictors of health, disease processes, and multifactorial models of poor health; and advances in nonlinear computer models based on artificial intelligence or neural networks [5]. These new computer-based modeling techniques are gaining credibility in therapeutic settings [6]. However, how so-called machine intelligence will evolve, and thus how current, relatively sophisticated predictive models will evolve in response to advances in technology, is difficult to predict [7]. What is known is that, despite fears about the black-box nature of certain AI models, the introduction of various nomograms implementing predictive models in oncology, with their simplicity and ease of use, is helping to bridge the gap between academic/research use and clinical adoption [8][9][10].
There was a need for a public repository of predictive models, and a thorough search was conducted to identify existing applications. As the search did not yield useful results, it became our goal to develop a web-based application that would compile and archive publicly available, validated models [11]. Doctors who use such an application can quickly assess the benefits and drawbacks of the selected models that passed the quality controls. They can securely rely on the application to calculate the output of such models by providing the inputs in a dedicated wizard, rather than designing their own implementation or searching for an appropriate one on their own.
The advantages for medical researchers include enhancing the exposure of their model, which should encourage usage and citations, and assisting them in generalizing their models by enabling the models to be evaluated by research groups other than the ones that generated them (TRIPOD type 4) [12]. This paper describes the foundational work that led to the creation of this application prototype.

Materials and Methods
The National Center for Biotechnology Information's (NCBI) PubMed database and Medline papers from December 2016 to March 2022 were reviewed. The following search terms were used to find pertinent publications: "validated prognostic models in oncology", "novel diagnostic tools for cancer with external validation", "lung, prostate, head and neck predictive models", and "machine-learning & deep learning models in oncology". Additionally, all the included models were validated (TRIPOD type 3 or 2b) and originated from open-source repositories or peer-reviewed articles. Since this is not a systematic review, no more than five models for each disease (brain, head and neck, lung, prostate, esophagus, rectum, and endometrium) were selected and translated into a user interface. Figure 1 illustrates the steps taken from the literature search to the online publication of the results. To assess the category of the models from the studies, every paper was tested for its compliance with the TRIPOD classification, as shown in Figure 2.
The process for extracting the coefficients from a nomogram and implementing them in a web user interface is described in the following paragraphs.

Obtaining Model Coefficients from a Nomogram
A large number of predictive models with respect to treatment outcomes (cure rate, cancer recurrence, survival, toxicity) are available in the literature. Regression models are frequently published as nomograms in order to make them easier for medical specialists to read and interpret, often without disclosing the model coefficients. A straightforward technique was utilized to extract the coefficients from nomograms in order to publish the models uniformly on the application. The papers in which nomograms are published will usually report the type of model (e.g., logistic regression); otherwise, this can be determined by examining the part of the nomogram where the total score is mapped to a probability. In the nomogram, numbers of points are assigned to different inputs; for example, female might have more points than male. These points are summed over all the inputs, resulting in a nomogram score. This score is then related to a probability of an outcome. The relationship between the total score and the probability is described by the equation for the model. To obtain an equation describing the nomogram, the total score of the nomogram and the probability are read by digitizing the nomogram, and a linear fit is made using the function describing the model, e.g., Equation (1):

logit(p) = ln(p/(1 - p)) = a + b * S, (1)

where p is the probability, a is the intercept, b is the scaling factor, and S is the total nomogram score. Once the relationship between the model parameters and the total score is determined, the coefficients can be calculated. In order to be as precise as possible when reading the nomograms, any graph digitization software can be used. An example from one of the utilized models is used to illustrate this approach, and it is represented in Figure 3 below [3].

In this instance, a logistic regression model is used to determine the link between the coefficients and the likelihood; as a result, the association between the nomogram score (total points) and the standard logistic distribution of the probability is linear, from which the slope and intercept can be determined ( Figure 4).
Next, the relationship between the parameters and the nomogram score can be extracted by reading the nomogram, and the coefficient can be obtained by multiplying by the scaling factor (0.307), as shown in Table 1.
The probability p can now be obtained through Equations (2) and (3):

p = e^y / (1 + e^y), (2)

y = a + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4, (3)

where x1 to x4 are 0 or 1 and b1 to b4 are the corresponding coefficients (see Table 1). It is possible to check how well this equation describes the nomogram by filling some values into the equation and reading the nomogram to test whether the answers are the same. For the example above, one could enter a patient with squamous cell carcinoma, differentiation grade 2, male gender, and a tumor with T-stage 2. Totaling the points (100, 0, 0, and 85) results in a nomogram score of 185, which corresponds to a probability of just over 45% when reading the nomogram. Filling this into the equations mentioned above (x1 = 1, x2 = 0, x3 = 0, x4 = 1) results in 0.465.

Figure 3. Nomogram for predicting pathologically complete response after neoadjuvant chemoradiotherapy for oesophageal cancer [14].
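The fitting step described above can be sketched in a few lines of Python. The (score, probability) pairs below are illustrative values for demonstration, not data digitized from the cited nomogram.

```python
import numpy as np

# Illustrative (total score, probability) pairs read off a digitized nomogram;
# these numbers are made up for demonstration, not taken from the cited model.
scores = np.array([100.0, 150.0, 200.0, 250.0])
probs = np.array([0.20, 0.35, 0.53, 0.70])

# For a logistic model, logit(p) is linear in the total nomogram score,
# so an ordinary least-squares fit recovers scaling factor b and intercept a.
logits = np.log(probs / (1.0 - probs))
b, a = np.polyfit(scores, logits, 1)

def probability(total_score):
    """Map a total nomogram score back to a predicted probability."""
    return 1.0 / (1.0 + np.exp(-(a + b * total_score)))
```

The fitted `probability` function can then be checked against points read directly off the nomogram, as done in the worked example above.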

The exact performance of the equation is often influenced by rounding factors or reading errors.

Application Development Process
In order to automate the self-import process of data models and to create a central model repository, a web-based application has been created that can be accessed by researchers from around the world, as shown in Figure 5.
The development of the web application was done using Python 3.6 and the Django v3.1 web framework. One of the main reasons for choosing Python as the main programming language for this application is that it allows easier integration of existing research that is based on Python. Furthermore, Django is a highly capable web framework that provides a rich feature set for easy implementation and management of a web app, such as built-in support for a database and an admin console.

For the back-end of the application, a PostgreSQL database was used to store the details related to each model, such as the author of the model, the intended end users, and how the model was developed. In addition, the parameters of each model (i.e., age, volume, sphere diameter, type of treatment, stage) are stored and used to dynamically generate the UI in order to facilitate the testing of the model. For the development of the front end of the web application, technologies such as JavaScript, Bootstrap 3, HTML5, and CSS3 were used alongside the Django web framework, as shown in Figure 6. The supplementary materials contain comprehensive information about the database (Figure S1).
AI4Cancer is intended for the open-source publishing of prediction models developed to predict outcomes for cancer patients.
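A minimal, plain-Python sketch of this parameter-driven design is shown below; the record layout, field names, and coefficient values are illustrative assumptions, not the application's actual schema.

```python
import math

# Hypothetical stored model record: each parameter carries the label shown in
# the dynamically generated form and the coefficient applied when selected.
model_spec = {
    "name": "Example logistic model",
    "intercept": -3.0,
    "parameters": [
        {"label": "T-stage 2", "coefficient": 2.6},
        {"label": "Squamous histology", "coefficient": 3.1},
    ],
}

def form_fields(spec):
    """Return the input fields the UI would generate for this model."""
    return [p["label"] for p in spec["parameters"]]

def predict(spec, selected):
    """Logistic prediction from the set of selected (binary) inputs."""
    z = spec["intercept"] + sum(
        p["coefficient"] for p in spec["parameters"] if p["label"] in selected
    )
    return 1.0 / (1.0 + math.exp(-z))
```

In the Django application itself, the same idea maps each stored parameter row to both a form field and a term in the model equation.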

Results
The developed open-source application is accessible online at https://ai4cancer-ai.herokuapp.com/models-browser# (accessed on 1 April 2022). As shown in Figure 7, it serves as a repository for published AI prediction models that address all aspects of various cancer types and stages, including diagnosis, prognosis (patient therapy, risk stratification), and follow-up (treatment result, consequences).

Every model that is displayed has an explanation of the technique and the clinical datasets that were used to create and validate the model. The limitations of each model are explicitly mentioned. The available models address lung cancer, rectal cancer, head and neck cancer, and brain metastases, and they are developed to be used only by physicians, not patients, due to the complexity of cancer treatment decision-making. Several different types of models are used, including linear regression, tumor control probability (TCP), linear-quadratic (LQ), Kaplan-Meier, linear-quadratic biologically effective dose (LQ-BED), Cox, and logistic regression [15,16]. As shown in Table 2, 18 models have now been released and deployed.
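As an illustration of one of the simpler model types listed, the standard linear-quadratic biologically effective dose, BED = n * d * (1 + d/(alpha/beta)), can be computed directly; the fractionation values below are a generic textbook example, not taken from any hosted model.

```python
def bed(n_fractions, dose_per_fraction_gy, alpha_beta_gy):
    """Linear-quadratic biologically effective dose: n*d*(1 + d/(alpha/beta))."""
    return n_fractions * dose_per_fraction_gy * (
        1.0 + dose_per_fraction_gy / alpha_beta_gy
    )

# Conventional fractionation: 30 fractions of 2 Gy with alpha/beta = 10 Gy
print(bed(30, 2.0, 10.0))  # -> 72.0
```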
For each online model, doctors can find: (a) the intended use (predicted outcome) of the model, (b) to which patients it applies, (c) the information and parameters that the doctor must enter, and (d) how the model was developed. Doctors can access the website, select a suitable model, and enter the required information to produce the prediction, such as the likelihood of survival. The website's predictive models follow the same calculations as the models outlined in the academic works on which they are based.
The primary output of this research is a widely applicable application that comprises validated models for various stages, odds of survival, symptoms, and outcomes. The collection of these predictive models will help doctors make decisions.

Discussion
This application can be considered a "model zoo" for academics and medical professionals who are well-versed in the medical ramifications of various cancer forms. None of the models are intended to replace clinical judgment; rather, they are all intended to advance research. The open-source application is not meant for unassisted use by laypeople (e.g., patients). It is important to emphasize that this manuscript and application are currently only prototypes. We do not claim to include all models that satisfy the selection criteria. Similarly, the absence of a model from the platform at this time should not be viewed as a cause for concern or as a rejection of its validity.
For heightened patient privacy, models that require medical (DICOM) images are not incorporated, as the pipeline does not currently have the ability to check that such uploaded data have been properly anonymized. Such machine learning models will be incorporated by working together with outside organizations (such as hospitals) that are in charge of preserving the privacy of such data. In future iterations of the website, it will also be possible to categorize models according to the outcome they model, as well as divide models according to the kind of input data they need. This will result in a more structured, semantic-based organization. Once models are divided based on the outcome being modeled, dedicated website pages will compare models created to predict a specific outcome.
The models have not been validated for individual use, but only in clinical cohort research. These models should not be used by patients directly and should only be used by doctors who are knowledgeable about the complexity of cancer types. The main intention of these models is to inform doctors; they should not be used as standalone decision support.
Researchers should view this work as an invitation to use this application, with the ability to hide the code from the user while still providing full functionality. Researchers' models will be integrated with the aid of the application itself. This will generate synergies that will inevitably speed up oncology-related AI research. Additionally, it will prevent models from going unnoticed and underutilized, which frequently occurs when numerous publications on the same general topic are released in quick succession.
There are certain limitations to the technique employed to extract regression model coefficients from a nomogram. One limitation is that the resolution of the published model significantly impacts accuracy. Another limitation is that even though the model coefficients may be retrieved, a nomogram cannot be used to determine the standard error for the parameter coefficients [48]. However, the technique may be employed with any nomogram, making it a useful tool.
As more researchers are encouraged to submit their work via this application, the number of curated models will hopefully increase, and the application can continue to serve the public. This will be supported by an increase in staff via the Optima Grant, who will perform monthly checks for new models. The three prerequisites are (1) open-access papers with a clear, implementable model description; (2) papers with a minimum of 10 citations; (3) models of TRIPOD type 3 or 4.

Conclusions
The repository accessible at https://ai4cancer-ai.herokuapp.com (accessed on 1 April 2022), at its current prototype level, contains 18 validated machine learning models to assist doctors in making decisions on patient care for a variety of cancer types. Other researchers can make use of the described technique for deriving regression coefficients from nomograms. Accordingly, research teams are encouraged to disseminate their models globally.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedicines10112679/s1, Figure S1: ER diagram of the database of the AI4Cancer web application.