Machine Learning for Medical Image Translation: A Systematic Review

Background: CT scans are often the first and only form of brain imaging performed to inform treatment plans for neurological patients due to their time- and cost-effectiveness. However, MR images give a more detailed picture of tissue structure and characteristics and are more likely to pick up abnormalities and lesions. The purpose of this paper is to review studies which use deep learning methods to generate synthetic medical images of modalities such as MRI and CT. Methods: A literature search was performed in March 2023, and relevant articles were selected and analyzed. The year of publication, dataset size, input modality, synthesized modality, deep learning architecture, motivations, and evaluation methods were analyzed. Results: A total of 103 studies were included in this review, all of which were published since 2017. Of these, 74% of studies investigated MRI to CT synthesis, and the remaining studies investigated CT to MRI, cross-MRI, PET to CT, and MRI to PET synthesis. Additionally, 58% of studies were motivated by synthesizing CT scans from MRI to perform MRI-only radiation therapy. Other motivations included synthesizing scans to aid diagnosis and completing datasets by synthesizing missing scans. Conclusions: Considerably more research has been carried out on MRI to CT synthesis, despite CT to MRI synthesis yielding specific benefits. A limitation on medical image synthesis is that medical datasets, especially paired datasets of different modalities, are lacking in size and availability; it is therefore recommended that a global consortium be developed to obtain and share more datasets. Finally, it is recommended that work be carried out to establish all uses of synthesized medical scans in clinical practice and to discover which evaluation methods are suitable for assessing the synthesized images for these needs.


Introduction
Medical imaging is a routine part of the diagnosis and treatment of a variety of medical conditions. Due to limitations, including the acquisition time of imaging methods and the cost of obtaining medical images, patients may not receive all the imaging modalities that they could benefit from. A possible solution is to use deep learning methods to generate synthetic medical images which estimate these modalities from scans the patient did receive.
For example, the diagnosis of brain disorders is often informed by brain scans obtained from the patient. The purpose of such neuroimaging is to rule out or diagnose a variety of conditions caused by lesions in the central nervous system. The most widely used imaging modalities for this purpose are magnetic resonance imaging (MRI) and computerized tomography (CT). MRI is much more sensitive to conditions such as stroke, offering better contrast of soft tissues and excellent anatomical detail in comparison to CT scans; however, MRI tends to take longer and to be less available and more expensive [1]. MRI is also not appropriate for patients with metal implants or claustrophobia. Due to these limitations, CT scans tend to be the first and often only scan a patient receives. Furthermore, compared to CT scans, MRIs provide a more accurate registration to most commonly used brain atlases. Benefitting from the advantages provided by MRI by synthesizing an MRI from a patient's CT scan would therefore improve the treatment of patients presenting with brain disorders.
Deep learning can be utilized to generate images and can therefore be applied to this problem. A limitation in using deep learning for medical imaging tasks is the availability of large datasets, a distinguishing factor in terms of which types of deep learning frameworks are suitable. Two commonly used frameworks for image synthesis are generative adversarial networks (GANs) and convolutional neural networks (CNNs). A GAN is a framework that consists of two models, a generator and a discriminator, which are trained simultaneously [2]. The generator captures the data distribution of the training data and attempts to generate data which fits within this distribution, whilst the discriminator is presented with one piece of data and estimates whether it was generated by the generator. The generator and discriminator then engage in a two-player game, each trying to become better at its respective task. A CNN is a framework that processes pixel data and is often used to detect objects in images [3]. In a medical context, one of the most widely used CNNs is U-Net, which is most commonly used for segmentation tasks [4].
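The two-player objective just described can be made concrete with the standard binary cross-entropy losses. Below is a minimal, framework-free NumPy sketch using hypothetical discriminator scores (a real GAN would compute these with neural networks); it illustrates the opposing losses only, not a full training loop.

```python
import numpy as np

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy between discriminator outputs and labels."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Hypothetical discriminator scores: estimated probability that a sample is real.
d_real = np.array([0.9, 0.8, 0.95])   # scores on real images
d_fake = np.array([0.1, 0.3, 0.2])    # scores on generated images

# Discriminator objective: label real samples as 1 and generated samples as 0.
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

# Generator objective: fool the discriminator into labelling fakes as 1.
g_loss = bce(d_fake, np.ones_like(d_fake))

print(round(d_loss, 3), round(g_loss, 3))  # here the discriminator is currently winning
```

Training alternates between minimizing `d_loss` over the discriminator's weights and `g_loss` over the generator's weights, which is the two-player game described above.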
A variety of evaluation metrics are used to assess the performance of deep learning models for medical image synthesis. Many of the metrics used to assess the performance of medical image synthesis models are the same as those used in general image synthesis tasks. Metrics assess the difference between two images: the one generated by the model and the ground truth image. Commonly used metrics include the mean error (ME), mean absolute error (MAE), and mean squared error (MSE), which compare pixel intensities.
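As a brief illustration of these pixel-intensity metrics, the following NumPy sketch computes ME, MAE, and MSE for a small hypothetical pair of a synthesized image and its ground truth:

```python
import numpy as np

def mean_error(pred, truth):
    """ME: signed bias between synthesized and true intensities."""
    return float(np.mean(pred - truth))

def mean_absolute_error(pred, truth):
    """MAE: average magnitude of the per-pixel intensity difference."""
    return float(np.mean(np.abs(pred - truth)))

def mean_squared_error(pred, truth):
    """MSE: average squared per-pixel difference (penalizes large errors)."""
    return float(np.mean((pred - truth) ** 2))

truth = np.array([[10., 20.], [30., 40.]])  # hypothetical ground-truth intensities
pred  = np.array([[12., 18.], [33., 40.]])  # hypothetical synthesized intensities

print(mean_error(pred, truth),
      mean_absolute_error(pred, truth),
      mean_squared_error(pred, truth))  # 0.75 1.75 4.25
```

Note that ME can be near zero even for a poor synthesis, since positive and negative errors cancel, which is why MAE and MSE are typically reported alongside it.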
The purpose of this study is to review the work that has been carried out on medical image synthesis. In medical settings, there is a shortage of large datasets suitable for supervised learning, so this review will consider studies which use supervised learning, unsupervised learning, or both.

Search Strategy
This search was completed using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The focus was establishing what work had been carried out in terms of developing machine learning models which can translate medical images into different modalities. Therefore, the machine learning frameworks used and dataset details, including body parts studied and modalities studied, were variables of interest. Articles were included for this review if they conducted original research using machine learning methods to translate medical images from one modality into a different modality. Keywords were developed in three categories, machine learning, image generation, and medical imaging, to address these criteria. The keywords in each category are shown in Table 1.

Results
Figure 1 shows the PRISMA flowchart for this review. A total of 392 articles were identified from PubMed, and 297 articles were identified from ArXiv. A further 15 articles which had already been identified as relevant were included from various sources. After title and abstract screening, 138 papers remained, and after screening of the full text, 99 articles were included, which documented 103 studies (Table 3).

Reduce Radiation: synthesizing scans which would otherwise expose the patient to radiation.
Segmentation: synthesizing scans of a modality which can help segmentation models either in training or in segmenting the scan.

Modalities Synthesized
Figure 2 shows the breakdown of the types of synthesis in the included studies. Most studies (76) investigated MRI to CT synthesis, with the majority of these being motivated by MRI-only radiation therapy. Thirteen studies investigated cross-MRI synthesis, which included T1 to T2 and T2 to FLAIR; often, these studies used a dataset with more than two MRI modalities and performed synthesis between many of the different modalities. All cross-MRI synthesis studies used datasets of the brain. Eleven of the studies investigated CT to MRI synthesis, three studies investigated MRI to PET synthesis, and one study investigated PET to CT synthesis.


Year of Publication
Although no restriction was placed on the year of publication in the literature search, all included papers were published since 2017 (Figure 3). Between 2017 and 2021, the number of papers published appears to grow exponentially, with a drop from 31 studies in 2021 to 24 studies in 2022. Nine studies were from 2023; however, the literature search only included papers until July 2023.

Evaluation
A total of 36 different methods were used to evaluate model performance (Figure 4). MAE (mean absolute error), PSNR (peak signal-to-noise ratio), and SSIM (structural similarity index) were the three most used evaluation metrics. It was common for studies motivated by MRI-only radiation therapy to use dosimetric evaluation; this was present in 27 studies. Dosimetric evaluation compared the radiation dosage plan based on the synthetic CT to the plan the patient received based on the true CT.
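PSNR is computed directly from the MSE relative to the image's dynamic range; below is a minimal NumPy sketch (the 8-bit peak value of 255 is an illustrative assumption, since CT intensities are usually expressed in Hounsfield units and studies choose the range accordingly):

```python
import numpy as np

def psnr(pred, truth, data_range=255.0):
    """Peak signal-to-noise ratio in decibels; higher means closer images."""
    mse = np.mean((pred.astype(float) - truth.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

truth = np.array([[100, 110], [120, 130]])  # hypothetical ground-truth patch
pred  = np.array([[102, 108], [121, 130]])  # hypothetical synthesized patch

print(round(psnr(pred, truth), 2))  # 44.61
```

SSIM, by contrast, compares local luminance, contrast, and structure rather than raw pixel differences; implementations such as `skimage.metrics.structural_similarity` are commonly used rather than hand-rolled code.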

Motivations
There were multiple motivations mentioned across the surveyed studies (Figure 5). The most common motivation was to achieve MRI-only radiation therapy, which was a motivation for 60 studies; these studies all synthesized CTs from MRIs. Fourteen studies were motivated by synthesizing unobtained scans to aid diagnosis. Eight studies were motivated by increasing the size of paired datasets by synthesizing missing modalities.

Deep Learning Used
GANs were the main type of deep learning algorithm used, with 72% of studies incorporating a GAN and 48% of studies incorporating a CNN (Figure 6).

Dataset Sizes
The number of subjects in the datasets had a mean of 91 and a median of 39 (Figure 7). Some of the studies with smaller datasets used the leave-one-out method, where the model is trained on all the data but one instance and then tested on the instance that was left out; this is repeated, leaving each instance out in turn. The mean number of patients in the dataset for cross-MRI synthesis was 274, much larger than the means for MRI to CT (56) and CT to MRI (134).
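The leave-one-out procedure described above can be sketched as follows, using a hypothetical list of four subjects; in a real study, each fold would train and evaluate a synthesis model rather than just enumerate the split:

```python
# Leave-one-out cross-validation: each subject is held out once as the
# test case while the model trains on all remaining subjects.
subjects = ["s1", "s2", "s3", "s4"]  # hypothetical dataset of 4 subjects

folds = []
for i, held_out in enumerate(subjects):
    train_set = subjects[:i] + subjects[i + 1:]  # everyone except the held-out subject
    folds.append((train_set, held_out))

for train_set, held_out in folds:
    print(train_set, "->", held_out)
```

With N subjects this yields N train/test splits, which is why the method suits small datasets: every subject contributes to both training and evaluation, at the cost of training the model N times.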

Discussion
This systematic review analyzed the current state of medical image synthesis using deep learning. The year of publication, type of synthesis, machine learning framework, dataset size, motivation, and evaluation methods used were analyzed.

The most common synthesis was MRI to CT synthesis, and almost every study performing this synthesis was motivated by MRI-only radiation therapy. The benefits of MRI-only radiotherapy are that the patient does not have to be exposed to the radiation of the CT scan, and that time and money are saved. Other motivations included turning datasets of MRIs into paired MRI/sCT datasets and completing datasets by synthesizing missing CTs. Minimal research has been conducted on MRI synthesis from CT scans. Since CTs are often the first or only scans taken for neurological issues, the time advantage and additional information offered by CT-synthesized MRI would be clinically beneficial. MRI gives superior tissue contrast for the diagnosis of several brain diseases and disorders, such as stroke and traumatic brain injury. CT-synthesized MRI could improve the speed and quality of treatment for stroke patients and provide a solution to the cross-modality registration problem in the context of comparing patients' CT scans to MRI brain atlases. Depending on the training dataset, the generation of T1-weighted, T2-weighted, or even FLAIR images from CT could be investigated. These different MR modalities provide complementary information which can be utilized for diagnostic purposes and for registration to different brain atlases. Only eleven papers [18][19][20][21][22][23][24][25][26][27][28] studied MRI synthesis from CT, which demonstrates a knowledge gap in this area.
The lack of paired MRI/CT datasets is a significant problem that inhibits the use of supervised learning for cross-modality synthesis. It is therefore suggested that future studies investigate whether within-modality synthesis models could be used to generate paired datasets. Paired MRI/CT datasets are useful for a variety of applications, including training models for cross-modality synthesis and training models to perform other tasks that require paired data.
In part due to the lack of consensus on which metrics to use for evaluation, there does not appear to be a consensus on the level of accuracy required for synthetic medical images. The quality of the generated images in some publications is an area of particular concern, as some models output blurry images which mask the details of smaller-scale features. A benchmark image quality for models intended for clinical use is much needed. This task will be hampered, however, by differing motivations, since different studies may require different levels of accuracy and image quality. Research that helps provide a consensus or that gives guidance on the best evaluation methods is warranted to improve the progress towards clinically useful synthesized medical images.
There were a range of research motivations across the different studies; however, most papers did not mention more than one of these. The motivations for MRI synthesis from CT were quite different from the motivations for CT synthesis from MRI. A focus for future research should be establishing how different motivations for medical image synthesis affect how the synthesized images should be assessed and evaluated. This would help establish which methods perform best for medical image generation in different contexts. The motivations of the studies strongly affected the methods of evaluation used. A common evaluation method for the CTs generated from MRI for the purpose of MRI-only radiotherapy was dosimetric evaluation, which does not apply to other types of synthesis. Research investigating clinical uses for synthetic medical images would therefore be significant.
The studies reviewed did not provide much insight into how different machine learning frameworks compare for medical image translation. The research has instead been focused on demonstrating that synthesizing medical images with deep learning is feasible. Studies used GANs and CNNs, but no particular focus was put on finding out which of these frameworks is more suited to the problem. Many of the papers used GANs, and a selection of these introduced novel contributions to the GAN model that they implemented to improve image synthesis. A much smaller selection of the papers used CNNs, and most of these did not implement novel features to adapt the models for this type of synthesis. It is recommended that research be carried out on how CNNs can be adapted for this type of synthesis.
GANs are renowned for image generation, and this is presumably why they have been used so often in this area. They are popular for image generation because they produce high-quality images by matching the training distribution. With a dataset of medical images, the distribution statistics will be affected by the percentage of scans with artifacts such as lesions; this leads to the possibility of hallucinating or erasing lesions or other artifacts. Even in the case of supervised models such as Pix2Pix, the models still fit to the distribution of the training data [104]. CNNs, in contrast, fit only to the one-to-one pairings between the paired input data. This means they require much more data than GANs for stable training; however, it ensures the model learns the relationship between the input and output modalities. The papers including CNNs mostly used U-Net and variations of U-Net. Although U-Net is normally used for segmentation, this architecture has proved to work well for image synthesis. A few papers did compare GANs against CNNs; however, no consistent consensus on their relative performance was found.
More studies are required to determine which deep learning architectures and implementations work best for medical image synthesis. To assist the development of this area, it is recommended that future research test and compare different methods of evaluating synthesized medical images, in order to determine the levels of accuracy required for the synthesized images to be clinically useful in different contexts. Finally, it is recommended that the feasibility of a model generating pairs of synthetic CTs and synthetic MRIs be investigated. This has not been done previously and, if feasible, would have helpful implications for using deep learning for synthesis, segmentation, and a variety of other clinical tasks. The lack of available large medical datasets is an ongoing issue; it is therefore recommended that a global consortium be established to collate currently available datasets and coordinate with researchers and medical professionals to encourage ongoing collaboration.

Conclusions
In conclusion, this systematic review has revealed a knowledge gap within the field of medical image synthesis. Specifically, very limited research has been conducted on synthesizing MRIs from CT scans, despite a variety of motivations. Since MRIs give superior tissue contrast and are preferred for the diagnosis of several brain diseases and disorders, synthesis of such data from CTs (which are more commonly obtained) would be clinically beneficial. All studies reviewed on medical image translation have been published since 2017, making this a relatively new area; as such, there is little consensus around methods of assessing and testing the performance of models for this task. We therefore recommend that more research be conducted into MRI synthesis from CT scans. Current advances in deep learning have shown clinical utility for stroke and traumatic brain injury patients, making this approach a promising candidate for solving the cross-modality registration problem. Recommendations were given for the directions of future research in this field, including a related application, not yet discussed in the literature, of using image synthesis techniques to generate pairwise datasets. It was concluded that more research is required to determine which deep learning methods are most effective and accurate in synthesizing medical images for use in a clinical setting.

Figure 1 .
Figure 1. The PRISMA diagram detailing this systematic review.

Figure 2 .
Figure 2. Breakdown of type of synthesis.

Bioengineering 2023
Figure 3 .
Figure 3. Year of publication of the reviewed studies.


Figure 4 .
Figure 4. Methods for evaluating the synthetic images.

Figure 5 .
Figure 5. Stated motivations for medical image synthesis.

Figure 6 .
Figure 6. Deep learning frameworks used for medical image synthesis.

Figure 7 .
Figure 7. Boxplot of the number of patients comprising the dataset (axis limited to exclude extremes). Blue X marks the mean.

Table 2 .
Descriptions of the Synthesis Type and Motivations categories.