1. Motivation and Significance
Radiotherapy (RT) is the use of high-energy radiation, commonly X-rays or electrons, for the treatment of cancer. It forms a major component of the treatment plan for up to 50% of cancer patients [1]. In order to deliver RT safely, patient imaging is required, usually in the form of computed tomography (CT) or magnetic resonance (MR) scans. This imaging data is utilised to delineate the tumour and surrounding normal tissue structures. During the process of RT planning, a personalised treatment plan is created by arranging RT beams around the patient, so that a sufficient RT dose is delivered to the tumour, whilst minimising the dose to the surrounding healthy normal tissue. Therefore, the process of RT planning and treatment delivery generates a large amount of data that can subsequently be examined and utilised to improve patient care.
Radiotherapy datasets come in multiple formats and are held in multiple locations [2]. Patient clinical data, including patient and tumour variables, prescribed treatments, treatment toxicity and outcome data, are available in structured and unstructured formats in electronic health records (EHR). Imaging data are available at multiple time points from the patient management process (e.g., diagnostic imaging, RT planning imaging, follow-up imaging) and from multiple imaging modalities (e.g., CT), which may be stored at multiple locations and may have different patient identifiers.
Imaging data are typically available in Digital Imaging and Communications in Medicine (DICOM) format. RT planning data contain information on patient anatomy (including delineation of tumour and normal tissue), dosimetry (such as prescribed dose, dose to tumour and surrounding tissues) and RT beam set-up. Additional data are available from RT treatment delivery, such as daily imaging data (e.g., cone beam CT, orthogonal kilovoltage imaging), which are used to confirm correct patient positioning and from which patient dosimetry can be determined. Radiotherapy treatment planning and delivery data are generally stored in a different location to other data, in proprietary formats that require software specifically designed to export the data to DICOM format. Picture archiving and communication systems (PACS) are used to store DICOM data; one example is Orthanc, which is widely used by the research community [3].
With the large amount of data collected during routine patient care, there are great possibilities to use this “real-world data” to directly improve patient care. Artificial intelligence techniques can be applied to large datasets of “real-world data” to determine associations and support clinical decision making in a way that has not been possible through clinical trials. Analysis of oncology imaging data may lead to improved detection of cancers, identification of cancer phenotypes and modelling of treatment response [4]. RT data have been utilised in various data mining and machine learning research [5,6].
Collecting and processing RT data, which usually takes 80% of the time required to conduct data analyses [7], is a prerequisite for data mining and modelling. Datasets held in Orthanc therefore need to be processed and manipulated before they can be used in data mining and machine learning tasks. This paper reports tools developed to extract, transform, and load data from an open-source Orthanc PACS server to enable data mining and analyses across patients treated with radiotherapy.
2. Tools Description
Figure 1 presents an overview of the developed architecture. The Orthanc server has a RESTful API that enables communicating with the server and working with patient data. A REST API is a web interface that facilitates the interaction with data through web services. Extract, transform and load (ETL) tools were implemented in this work for mapping a patient cohort to a format that can be used in data mining and machine learning applications.
2.1. Orthanc Server Technical Overview
The Orthanc server is a lightweight open-source PACS used to index and store DICOM data and was originally introduced in 2012 [3]. The Orthanc server is used across the research community as it is free, open source and can hold a high number of images. By default, an SQLite database is used to index all the patient data added to the server. A new identifier is given to any patient record added to the server. The Orthanc server supports various plugins developed by the research community to extend its usefulness. These include plugins to support the use of PostgreSQL databases and to collect data from The Cancer Imaging Archive (TCIA). The Orthanc tools are well documented by their developers, which enables the adoption of such tools in various research projects. An important aspect that enriches the Orthanc server is its REST API, which enables handling files in the server using hypertext transfer protocol (HTTP) requests.
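As a minimal sketch of how the REST interface can be used, the snippet below walks from a patient to its studies and series and reports the modality of each series. The resource paths `/patients/{id}`, `/studies/{id}` and `/series/{id}` belong to Orthanc's REST API; the server address, the helper names `get_json` and `list_patient_series`, and the absence of authentication are assumptions made for illustration and are not taken from PDCP.

```python
import json
from urllib.request import urlopen

ORTHANC_URL = "http://localhost:8042"  # assumption: a local, unauthenticated server


def get_json(path, base_url=ORTHANC_URL):
    """GET one REST resource from the server and decode its JSON body."""
    with urlopen(f"{base_url}{path}") as response:
        return json.loads(response.read().decode("utf-8"))


def list_patient_series(orthanc_id, fetch=get_json):
    """Walk patient -> studies -> series; return (study, series, modality) rows."""
    rows = []
    patient = fetch(f"/patients/{orthanc_id}")
    for study_id in patient["Studies"]:
        study = fetch(f"/studies/{study_id}")
        for series_id in study["Series"]:
            series = fetch(f"/series/{series_id}")
            modality = series["MainDicomTags"].get("Modality", "")
            rows.append((study_id, series_id, modality))
    return rows
```

Passing the `fetch` function as a parameter keeps the traversal logic testable without a running server.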
2.2. Radiotherapy Data Technical Overview and Associations between Modalities
As shown in Figure 2, a cancer patient might have multiple imaging studies belonging to the same or different treatment sites. Multiple studies can also be observed when repeated RT planning imaging occurs during the patient treatment, typically performed due to changes in patient anatomy. Within each study, patient information is stored as a series of data files, each of which is defined as a modality. The scope of this work was to handle up to four types of modalities: CT, RTSTRUCT, RTPLAN and RTDOSE. The CT consists of multiple transverse section images surrounding the cancer site and is usually used for radiotherapy planning [8]. The RTSTRUCT is a series that contains the organs at risk (OARs) and target volumes (TVs) delineated by the clinicians. The RTPLAN contains details about the treatment plan. The RTDOSE contains the dose grid used in the treatment and other beam setup details. Each study might have multiple associated dose grids. Further details about each modality can be found in a publication by Law and Liu [9].
Within DICOM files, DICOM tags are metadata used to identify details about the DICOM file. These might include patient names, sex, dates and physician name. Each DICOM file will also contain a patient identifier. For each patient radiotherapy study, the DICOM series are associated with each other via specific DICOM tags, as shown in Figure 3. These tags allow the establishment of associations between the required modalities in our work, taking into consideration the existence of many other series such as the radiotherapy image (RTIMAGE) and positron emission tomography (PET).
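The linkage between modalities can be sketched as a chain of unique identifier (UID) look-ups. The example below assumes the usual DICOM-RT reference pattern, in which an RTDOSE refers to an RTPLAN, the RTPLAN refers to an RTSTRUCT, and the RTSTRUCT refers to the CT series it was drawn on. The flattened dictionary keys (`ref_plan_uid`, `ref_struct_uid`, `ref_series_uid`) and the function name are hypothetical simplifications of the nested reference sequences and do not reproduce the tags listed in Figure 3.

```python
def link_modalities(series):
    """Return (ct, rtstruct, rtplan, rtdose) UID chains found in one study."""
    by_modality = {}
    for item in series:
        by_modality.setdefault(item["modality"], []).append(item)

    chains = []
    for dose in by_modality.get("RTDOSE", []):
        for plan in by_modality.get("RTPLAN", []):
            if dose.get("ref_plan_uid") != plan["sop_uid"]:
                continue  # this dose grid was not computed for this plan
            for struct in by_modality.get("RTSTRUCT", []):
                if plan.get("ref_struct_uid") != struct["sop_uid"]:
                    continue  # this plan does not use this structure set
                for ct in by_modality.get("CT", []):
                    if struct.get("ref_series_uid") == ct["series_uid"]:
                        chains.append((ct["series_uid"], struct["sop_uid"],
                                       plan["sop_uid"], dose["sop_uid"]))
    return chains
```

A study that yields exactly one chain is unambiguous; zero chains or several chains indicate a study that needs to be excluded or reviewed.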
2.3. Software Overview
Patient data collection and processing (PDCP) is a module implemented in Python to process patient data into neuroimaging informatics technology initiative (NIFTI) format. Data can be easily extracted from NIFTI files into NumPy arrays, a format widely used by machine learning algorithms and data analysis tools. PDCP extraction tools facilitate data collection and preparation of RT data for dosimetry analyses, auto-contouring, and outcome prediction research applications. RT data can be used in the development of auto-contouring models, where the CT and RTSTRUCT are required. RT data can also be used in dosimetry analyses of volumes in the treatment plan, where the CT, RTSTRUCT, RTPLAN and RTDOSE series are required. Considering this variation in requirements, four classes were implemented (patientImaging, patientImagingCR, patientImagingCRD, patientImagingCRDP), with patientImaging being the parent class inherited by the three other classes. The patientImaging class is the main module and it contains a set of functions used to collect, validate, and track the data preparation process. The scripts have been developed to follow a split-aggregate-combine design pattern, which we defined to partition the tools into separate blocks. This enabled the inclusion of ‘interceptors’ in the data collection process. An interceptor is a set of decisions that can be used to quickly include new conditions to handle the logic in the data preparation task. RT data is diverse, and the decisions taken by clinicians vary across different treatment sites and hospitals. For this reason, methods were introduced to handle various requirements in multiple projects.
The following list summarises the developed tools in PDCP:
- Patient-Imaging-CRDP: a set of methods (class) that inherits from the original class (patientImaging), used to collect and process cancer patients’ data where there is a need to link the four supported modalities (C stands for CT, R for RTSTRUCT, D for RTDOSE and P for RTPLAN). The RTPLAN was needed to handle the linkage between the study modalities, as shown in Figure 3.
- Patient-Imaging-CRD: a set of methods (class) that inherits from the original class, used to collect and process cancer patients’ radiotherapy data where the RTPLAN cannot be obtained. Within this class, it was assumed that the RTDOSE series in the study were used in treatment.
- Patient-Imaging-CR: a set of methods (class) that inherits from the original class, used to collect and process a patient’s data where there is a need to link the CT with the RTSTRUCT only (e.g., tumour or normal tissue segmentation task).
Object-oriented programming and inheritance were used to make the code reusable and to extend certain functions. Several Python packages were utilized to collect the data and to process the images, including requests, pyorthanc, pydicom and pandas.
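The inheritance idea can be illustrated with a short sketch in which a parent class holds the shared checks and each child class only declares which modalities it requires. The class and method names below are deliberately hypothetical and are not PDCP's actual classes or API; they only show the pattern.

```python
class BaseCollector:
    """Hypothetical parent class: shared checking and note keeping."""

    required = ()  # child classes override this with their modalities

    def __init__(self):
        self.notes = []

    def is_usable(self, study_modalities):
        """True when the study holds every modality this collector needs."""
        missing = [m for m in self.required if m not in study_modalities]
        if missing:
            self.notes.append("missing: " + ", ".join(missing))
        return not missing


class CollectorCR(BaseCollector):
    required = ("CT", "RTSTRUCT")


class CollectorCRD(BaseCollector):
    required = ("CT", "RTSTRUCT", "RTDOSE")


class CollectorCRDP(BaseCollector):
    required = ("CT", "RTSTRUCT", "RTDOSE", "RTPLAN")
```

With this layout, a contouring project would use the CR variant while a dosimetry project would use the CRD or CRDP variant, and all three share the same checking code.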
2.4. Functionalities
PDCP allows the user to:
- Query, retrieve and validate patient imaging summaries from an Orthanc PACS based on the selected data collection type.
- Analyse associations in patient studies (linking required modalities).
- Retrieve patient imaging data into a local directory.
- Prepare the records for use in various research questions (dosimetry analyses, contouring and image standardisation).
- Track the data collection process and identify reasons behind excluding certain patient data.
- Enable data mining and machine learning on collected datasets.
2.5. Retrieval of Patient IDs
PDCP enables preparing an RT dataset for data mining and machine learning, with four main modalities: CT, RTSTRUCT, RTPLAN and RTDOSE. Patients stored in the Orthanc research PACS will have two identifiers: their original identifier and a hexadecimal identifier created by the Orthanc server while indexing the patient’s data. To extract patient-related files from the server, the Orthanc identifier is required. For this reason, an initial functionality was implemented to acquire all the patients’ original identifiers and the Orthanc identifiers into a CSV file saved to a local directory. The cohort required can then be updated by selecting rows in the generated file. The developed function utilized threading to send multiple requests to the server.
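A minimal sketch of this step, assuming each patient resource returned by the server carries the original identifier under `MainDicomTags` → `PatientID` (as Orthanc's patient resources do): a thread pool resolves many Orthanc identifiers concurrently and the result is written to a CSV file. The function names, column names and the injected `fetch_patient` callable are assumptions for illustration, not PDCP's interface.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def build_id_table(orthanc_ids, fetch_patient, max_workers=8):
    """Resolve Orthanc identifiers to original patient identifiers in parallel."""

    def lookup(orthanc_id):
        patient = fetch_patient(orthanc_id)
        original = patient["MainDicomTags"].get("PatientID", "")
        return {"orthanc_id": orthanc_id, "patient_id": original}

    # pool.map keeps the input order, so rows line up with orthanc_ids
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup, orthanc_ids))


def write_id_table(rows, path):
    """Save the identifier pairs as a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["orthanc_id", "patient_id"])
        writer.writeheader()
        writer.writerows(rows)
```

Threads suit this task because each look-up spends most of its time waiting on the network rather than computing.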
2.6. Retrieve Patient Imaging Summaries
A functionality was implemented as a part of this work to allow the collection of all the instances related to the patient’s identifiers. An instance is defined as a file stored in the Orthanc server that belongs to a patient series in a patient study. For each instance, more than 80 DICOM tags were extracted and saved as a row before being aggregated into a CSV file representing the patient’s instances belonging to multiple studies and series.
New functions can easily be introduced as ‘interceptors’ to remove any unneeded studies. For example, a head and neck cancer dataset consisting of 298 patients is available for public use and can be collected from The Cancer Imaging Archive (TCIA). Each patient in this dataset has two studies, one of them being a ‘Tomotherapy’ study. Assuming that the dataset has already been saved into an Orthanc server, and the ‘Tomotherapy’ study is not required as a part of the analyses, an interceptor can easily be added to remove any instance that has the keyword ‘Tomotherapy’ in the study name.
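The ‘Tomotherapy’ case can be sketched as a small filter function that is plugged into the collection pipeline. In this illustration each instance summary is a dictionary and the study name is assumed to be stored under the `StudyDescription` key; the function names are hypothetical and do not reproduce PDCP's own interceptor code.

```python
def drop_studies_with_keyword(keyword):
    """Build an interceptor that removes rows whose study name has the keyword."""
    needle = keyword.lower()

    def interceptor(rows):
        return [row for row in rows
                if needle not in row.get("StudyDescription", "").lower()]

    return interceptor


def apply_interceptors(rows, interceptors):
    """Run each interceptor in turn over the instance summaries."""
    for interceptor in interceptors:
        rows = interceptor(rows)
    return rows
```

New conditions are then added by appending another function to the list passed to `apply_interceptors`, without changing the retrieval code.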
2.7. Retrieval of Patient Data
Tools have been developed to retrieve the patients’ data into a local directory. These scripts use HTTP to download files from the Orthanc server through the REST web interface. All the retrieved files are saved in NIFTI format, which facilitates the efficient compression and fast extraction of data into NumPy arrays, which are usually the starting point for machine learning and data mining analyses. The CT slices are compressed together and saved into one file. The RTSTRUCT masks are extracted using a script taken from platipy, a toolkit for processing radiotherapy imaging data. If RTDOSE series are required, the module also extracts the records into NIFTI format after applying the dose grid scaling. Additional tools were provided to facilitate saving the data in pickle and MATLAB files.
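Applying the dose grid scaling amounts to one multiplication: the integer values stored in an RTDOSE file are multiplied by the file's Dose Grid Scaling attribute to obtain dose in the file's dose units. The sketch below shows this with NumPy; the function name is an assumption, and with pydicom the two inputs would typically come from a dataset's `pixel_array` and `DoseGridScaling` attributes.

```python
import numpy as np


def scale_dose_grid(stored_values, dose_grid_scaling):
    """Convert stored integer dose values to physical dose (e.g., Gy)."""
    grid = np.asarray(stored_values, dtype=np.float64)
    return grid * float(dose_grid_scaling)
```

Converting to floating point before multiplying avoids overflow or truncation in the stored integer type.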
2.8. Validation of Patient Data
Tools have been implemented to validate the patient data by verifying the study modalities and their associations. The logic to select the patient’s study is shown below:
- A study is selected if it contains the required modalities selected by the user in the configuration file.
- A study will be discarded if the selected CT series contains many instances.
- A study will be discarded if it has the required RTSTRUCT but the RTSTRUCT does not contain any of the required keywords (e.g., a study whose RTSTRUCT holds only the contour names PATIENT and ISO will not be used).
- A study will be discarded, and flagged for review, if it contains multiple CTs with multiple associations.
- A study will be discarded if it contains a keyword that should not be found in its study name (e.g., a breast cohort is being collected while the study name contains ‘head and neck’ keywords).
- A study will be discarded if there are no associations between modalities.
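Several of these rules can be sketched as one function that returns a decision together with the reason for it, so that exclusions can be logged. The dictionary keys for the study summary and the configuration are hypothetical, and the rule on the number of CT instances is left out because its threshold is not stated here; this is an illustration of the logic rather than PDCP's implementation.

```python
def validate_study(study, config):
    """Return (usable, reason) for one study summary."""
    missing = set(config["required_modalities"]) - set(study["modalities"])
    if missing:
        return False, "missing modalities: " + ", ".join(sorted(missing))

    name = study["name"].lower()
    for word in config["forbidden_name_keywords"]:
        if word.lower() in name:
            return False, f"study name contains '{word}'"

    contours = [c.lower() for c in study["contour_names"]]
    wanted = [k.lower() for k in config["required_contour_keywords"]]
    if wanted and not any(k in c for k in wanted for c in contours):
        return False, "RTSTRUCT has none of the required contour keywords"

    if len(study["associations"]) == 0:
        return False, "no associations between modalities"
    if len(study["associations"]) > 1:
        return False, "multiple associations, manual review needed"
    return True, "ok"
```

Returning the reason alongside the decision is what makes it possible to report why a given patient was excluded.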
Each processed patient id will be accompanied by a JavaScript object notation (JSON) file located in a directory specified by the user. This JSON object will represent the logs (notes) reported while preparing the patient records. The notes are divided into three main types: verification, retrieval and loading notes. The first reports details about the patient data verification process, e.g., if the patient had multiple studies that appear usable, or if the patient had multiple associations between modalities. The second reports processing notes while extracting data from the server, e.g., if the connection to the server was disconnected while retrieving records. The third reports errors related to issues in loading the generated datasets. The JSON object will contain a flag that shows the possibility of using the patient. Further details can be accessed using the following link:
https://australiancancerdatanetwork.github.io/PDCP/ (accessed on 26 February 2022).
2.9. Dosimetry Features Calculation
A dose–volume histogram (DVH) class is used to generate the dosimetry features in a patient’s study. This enables the generation of the dose features related to each structure associated with the radiotherapy plan. Three types of dosimetry features were used, one of which requires the prescribed dose as an input. Further details can be found in Figure 4. In addition, tools have also been developed to visualise the exported dosimetry features.
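As an illustration of how such features can be derived from a dose grid and a structure mask, the sketch below computes a few commonly reported DVH quantities with NumPy. The particular features chosen (mean and maximum dose, D95, V20Gy, and V95% relative to the prescription) are assumptions and may not match the three feature types defined in Figure 4; the function name is hypothetical, and all voxels are assumed to have equal volume.

```python
import numpy as np


def dose_features(dose, mask, prescribed_dose=None):
    """Compute simple DVH features for one structure (dose in Gy)."""
    values = np.asarray(dose, dtype=float)[np.asarray(mask, dtype=bool)]
    if values.size == 0:
        raise ValueError("structure mask is empty")

    features = {
        "Dmean": float(values.mean()),
        "Dmax": float(values.max()),
        # dose received by at least 95% of the volume
        "D95": float(np.percentile(values, 5)),
        # percentage of the volume receiving 20 Gy or more
        "V20Gy": float((values >= 20.0).mean() * 100.0),
    }
    if prescribed_dose is not None:
        # percentage of the volume receiving at least 95% of the prescription
        threshold = 0.95 * prescribed_dose
        features["V95%"] = float((values >= threshold).mean() * 100.0)
    return features
```

Only the last feature depends on the prescription, which is why the prescribed dose is an optional argument.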
2.10. Two Dimensional Images Representing Central Slices
Central slices, which are defined as the CT slices with the highest number of tumour pixels in a volume, have been used in machine learning-based applications [5]. For this reason, tools used to generate the central slices of the patient’s OARs and TV have been included. Scripts have also been implemented to load the patient records into the Python environment once the data processing task is finalised.
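Selecting a central slice reduces to counting the contoured voxels on each slice and taking the slice with the largest count. The sketch below assumes the CT and mask are NumPy arrays of the same shape with slices along the first axis; the function names and the axis convention are assumptions for illustration.

```python
import numpy as np


def central_slice_index(mask):
    """Index of the slice with the most contoured voxels (axis 0 = slices)."""
    counts = np.asarray(mask, dtype=bool).sum(axis=(1, 2))
    if counts.max() == 0:
        raise ValueError("mask has no contoured voxels")
    return int(counts.argmax())


def central_slice(ct, mask):
    """Return the central slice index and the matching 2D CT image."""
    index = central_slice_index(mask)
    return index, np.asarray(ct)[index]
```

When several volumes are to be considered together, their masks can be merged with `np.logical_or` before calling `central_slice`.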
3. Illustrative Example
PDCP facilitates the data collection and processing of radiotherapy data in three main tasks: dosimetry analyses, outcome prediction, and auto-contouring. This section reports a real-world case scenario for collecting and preparing the patients’ records in a head and neck cancer dataset and a lung dataset collected from The Cancer Imaging Archive (TCIA) [10,11]. Examples from the two datasets were loaded into an Orthanc server instance. The head and neck cancer dataset was previously used in outcome prediction tasks [5,6]. The lung dataset was previously used in outcome prediction and auto-contouring tasks [12]. We selected two cohorts from the two datasets, each containing five patients, and prepared the records using PDCP. The two examples with respective scripts can be found on Zenodo [13].
Another dataset that represents the central slices of the combined gross tumour volumes (GTVs) in the head and neck dataset has been created using PDCP and is also available on Zenodo [14]. The central slices were defined as the slices with the highest number of contoured voxels for both the primary and nodal volumes in each patient study. Similar datasets have been used for predicting outcomes in [5]. However, within this dataset, we kept the Hounsfield units (HU) to describe the pixel values in each central slice.
4. Discussion and Limitations
In this work, we described tools to manage various cases in RT data identification and extraction. However, the clinical care decisions that might be taken by the clinicians while treating patients are broad. This may include reimaging and replanning patient RT treatment which might lead to new datasets related to the patient treatment. Therefore, some manual curation of data may be required to select the most appropriate data. At this stage, the developed tools can export the datasets into different folders (quarantine), where manual intervention can be undertaken to select the appropriate dataset. Automating this process can minimize the time required to prepare the datasets. With PDCP tools being shared, conditions can be added by developers to select the targeted study, using interceptors.
Currently, the developed tools cannot handle multiple associations in the dataset. In other words, if the study contains two or more sets of connections between the CT, RTSTRUCT, RTPLAN and RTDOSE series, the patient will be excluded from the cohort. The current tools support four types of modalities. Other modalities such as PET will be included as a part of future work.
5. Conclusions
In this work, we presented patient data collection and processing (PDCP), a set of tools implemented in Python to prepare radiotherapy data stored in an open-source picture archiving and communication system (PACS) known as Orthanc. The implemented tools can be used to query stored data, link patient files together, validate records based on predefined rules, retrieve patient data, and extract and visualise planned data. The developed tools can be accessed using the following link:
https://github.com/AustralianCancerDataNetwork/PDCP (accessed on 26 February 2022).
Author Contributions
Conceptualization, A.H., F.A. and L.H.; methodology, A.H.; software, A.H.; formal analysis, A.H.; writing—original draft preparation, A.H. and F.A.; writing—review and editing, A.H., F.A. and L.H.; visualization, A.H.; supervision, L.H.; funding acquisition, A.H. and L.H. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the South Western Sydney Local Health District (SWSLHD); Illawarra and Shoalhaven Local Health District (ISLHD); Western Sydney Local Health District (WSLHD); Nepean Blue Mountains Local Health District (NBMLHD); Ingham Institute for Applied Medical Research, Liverpool, NSW 2170, Australia.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Barton, M.; Jacob, S.; Shafiq, J.; Wong, K.; Thompson, S.R.; Hanna, T.; Delaney, G. Estimating the demand for radiotherapy from the evidence: A review of changes from 2003 to 2012. Radiother. Oncol. 2014, 112, 140–144.
2. Roelofs, E.; Dekker, A.; Meldolesi, E.; van Stiphout, R.G.P.M.; Valentini, V.; Lambin, P. International data-sharing for radiotherapy research: An open-source based infrastructure for multicentric clinical data mining. Radiother. Oncol. 2014, 110, 370–374.
3. Jodogne, S. The Orthanc ecosystem for medical imaging. J. Digit. Imaging 2018, 31, 341–352.
4. Thwaites, D.; Moses, D.; Haworth, A.; Barton, M.; Holloway, L. Artificial intelligence in medical imaging and radiation oncology: Opportunities and challenges. J. Med. Imaging Radiat. Oncol. 2021, 65, 481–485.
5. Diamant, A.; Chatterjee, A.; Vallières, M.; Shenouda, G.; Seuntjens, J. Deep learning in head & neck cancer outcome prediction. Sci. Rep. 2019, 9, 2764.
6. Vallieres, M.; Kay-Rivest, E.; Perrin, L.J.; Liem, X.; Furstoss, C.; Aerts, H.J.; Khaouam, N.; Nguyen-Tan, P.F.; Wang, C.S.; Sultanem, K.; et al. Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer. Sci. Rep. 2017, 7, 10117.
7. Dasu, T.; Johnson, T. Exploratory Data Mining and Data Cleaning; John Wiley & Sons: Hoboken, NJ, USA, 2003.
8. Battista, J.J.; Rider, W.D.; van Dyk, J. Computed tomography for radiotherapy planning. Int. J. Radiat. Oncol. Biol. Phys. 1980, 6, 99–107.
9. Law, M.Y.; Liu, B. DICOM-RT and its utilization in radiation therapy. Radiographics 2009, 29, 655–667.
10. Vallières, M.; Kay-Rivest, E.; Perrin, L.J.; Liem, X.; Furstoss, C.; Khaouam, N.; Nguyen-Tan, P.F.; Wang, C.S.; Sultanem, K. Data from head-neck-PET-CT. Cancer Imaging Arch. 2017, 10, K9.
11. Aerts, H.J.W.L.; Wee, L.; Rios Velazquez, E.; Leijenaar, R.T.H.; Parmar, C.; Grossmann, P.; Lambin, P. Data from NSCLC-radiomics. Cancer Imaging Arch. 2019, 10, K9.
12. Aerts, H.J.; Velazquez, E.R.; Leijenaar, R.T.; Parmar, C.; Grossmann, P.; Carvalho, S.; Bussink, J.; Monshouwer, R.; Haibe-Kains, B.; Rietveld, D.; et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 2014, 5, 4006.
13. Haidar, A. PDCP Examples (0.0.1). Zenodo 2022.
14. Haidar, A. Head-Neck-PET-CT combined GTVs 2D images. Zenodo 2021.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).