# Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks—A Case Study on Genome Gap-Filling


## Abstract


## 1. Introduction

## 2. Data Imputation: A Brief Survey

## 3. Models

#### 3.1. CAE Model

#### 3.2. CNN Model

## 4. Experimental Setup

We consider two tasks. The **gap-filling task** aims at imputing the central nucleotide. A model trained for the **reconstruction task** should instead reconstruct the values already present in the input sequence, that is, the known nucleotides, while simultaneously imputing the missing ones, that is, the nucleotides filling the gaps.

#### 4.1. Sequence Data Coding

#### 4.2. Training, Test, and Validation Partitions

Two instances of each model are trained. The first instance is trained on the sequences in the single-gap training dataset and is tested on the **validation set** composed of all the other unseen (single-gap and multivariate-gap) sequences, regardless of the number of gaps (that is, this instance is tested both on the sequences in the single-gap test set and on all the sequences with more than one gap, which belong to the multivariate train and test sets). Similarly, the second instance of the model is trained on the sequences with many gaps in the multivariate-gap training dataset and is tested on the **validation set** composed of all the other unseen sequences (single-gap train and test sequences plus multivariate-gap test sequences). The two instances are also tested on all the sequences in the biological dataset.
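The partitioning rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the partition names and the helper `validation_partitions` are hypothetical, and the biological dataset, which both instances are also tested on, is kept out of the sketch.

```python
# Hypothetical names for the four synthetic partitions.
PARTITIONS = ("single_train", "single_test", "multi_train", "multi_test")


def validation_partitions(training_partition: str) -> list:
    """Return every synthetic partition the instance did not see during training."""
    if training_partition not in ("single_train", "multi_train"):
        raise ValueError("an instance is trained on one of the two training partitions")
    return [name for name in PARTITIONS if name != training_partition]


print(validation_partitions("single_train"))
print(validation_partitions("multi_train"))
```

The instance trained on single-gap sequences is thus validated on the single-gap test set and on both multivariate-gap partitions, and symmetrically for the other instance.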

## 5. Results

We first report the performance of the CAE models on the **gap-filling task**, which aims at imputing the central nucleotide (Section 5.1); then, we present the models’ performance on the **reconstruction task**, where the entire input sequence is reconstructed and filled (Section 5.2). Subsequently, we report the results obtained by the CNN models in the **gap-filling task** (Section 5.3), and we conclude with the **comparative evaluation** (Section 5.4) between the best CAE and CNN models in the gap-filling task, where we use as a benchmark the gap-filling results achieved by the KNN-imputation method proposed in Troyanskaya et al. [8].

#### 5.1. CAE—Gap-Filling

#### 5.2. CAE—Reconstruction

#### 5.3. CNN—Gap-Filling

#### 5.4. Models Comparison

## 6. Discussions, Conclusions and Future Works

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
AE | Autoencoder |
CAE | Contractive Autoencoder |
CNN | Convolutional Neural Network |
AUROC | Area Under the Receiver Operating Characteristic curve |
AUPRC | Area Under the Precision Recall Curve |
Conv2D | 2D Convolutional Layer |
Conv2DT | 2D Transposed Convolutional Layer |
ReLU | Rectified Linear Unit |
NADAM | Nesterov ADAptive Momentum estimator |
UCSC | University of California Santa Cruz |
EBlock | Encoder Block |
DBlock | Decoder Block |

## Appendix A. Genome Sequence Datasets

#### Appendix A.1. Single-Gaps Sequences

#### Appendix A.2. Synthetic Multivariate Nucleotides Gaps

#### Appendix A.3. Biological Validation Dataset

**Figure A1.** Procedure for generating the biological validation dataset. (**a**) Gap identification and neighborhood selection. (**b**) Identification of the corresponding region on hg38 (local alignment). (**c**) Selection of the right (RN) and left (LN) neighborhoods on the hg19 fragment. (**d**) Semi-global alignment of RN and LN on the hg38 fragment. (**e**) The distance between the aligned regions equals the gap length: the filler sequence is automatically retrieved. (**f**) The distance between the aligned regions differs from the gap length: the filler sequence is not automatically retrieved.
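The decision taken in panels (e) and (f) can be sketched as follows. This is a simplified illustration under stated assumptions: exact substring matching (`str.find`) stands in for the semi-global alignment of the figure, and the function name `retrieve_filler` and the toy sequences are hypothetical.

```python
from typing import Optional


def retrieve_filler(hg38_fragment: str, left: str, right: str, gap_length: int) -> Optional[str]:
    """Return the filler between the two neighborhoods, or None when lengths disagree.

    Assumption: exact matching replaces the semi-global alignment of LN and RN.
    """
    left_start = hg38_fragment.find(left)
    if left_start < 0:
        return None
    left_end = left_start + len(left)
    right_start = hg38_fragment.find(right, left_end)
    if right_start < 0:
        return None
    # Panel (e): the distance equals the gap length, so the filler is accepted.
    if right_start - left_end == gap_length:
        return hg38_fragment[left_end:right_start]
    # Panel (f): the distance differs, so nothing is retrieved automatically.
    return None


fragment = "TTACGGATTCAGGCTA"
print(retrieve_filler(fragment, "ACGG", "AGGC", 4))
print(retrieve_filler(fragment, "ACGG", "AGGC", 3))
```

With the toy fragment, the two neighborhoods are four positions apart, so a gap of length 4 yields the filler `ATTC`, while a gap of length 3 yields `None`.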

## References

1. Osman, M.S.; Abu-Mahfouz, A.M.; Page, P.R. A Survey on Data Imputation Techniques: Water Distribution System as a Use Case. IEEE Access **2018**, 6, 63279–63291.
2. García-Laencina, P.J.; Sancho-Gómez, J.L.; Figueiras-Vidal, A.R. Pattern classification with missing data: A review. Neural Comput. Appl. **2010**, 19, 263–282.
3. Silva-Ramírez, E.L.; Pino-Mejías, R.; López-Coello, M.; de-la Vega, M.D.C. Missing value imputation on missing completely at random data using multilayer perceptrons. Neural Netw. **2011**, 24, 121–129.
4. Jansen, I.; Hens, N.; Molenberghs, G.; Aerts, M.; Verbeke, G.; Kenward, M.G. The nature of sensitivity in monotone missing not at random models. Comput. Stat. Data Anal. **2006**, 50, 830–858.
5. Scheet, P.; Stephens, M. A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. **2006**, 78.
6. Di Mauro, C.; Steri, R.; Pintus, M.A.; Gaspa, G.; Macciotta, N.P.P. Use of partial least squares regression to predict single nucleotide polymorphism marker genotypes when some animals are genotyped with a low-density panel. Animal **2011**, 5, 833–837.
7. Di Mauro, C.; Cellesi, M.; Gaspa, G.; Ajmone-Marsan, P.; Steri, R.; Marras, G.; Macciotta, N. Use of partial least squares regression to impute SNP genotypes in Italian cattle breeds. Genet. Sel. Evol. **2013**, 45, 15.
8. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics **2001**, 17, 520–525.
9. Kalton, G. Compensating for Missing Survey Data; Survey Research Center, Institute for Social Research, The University of Michigan: Ann Arbor, MI, USA, 1983.
10. Owen, A.B.; Perry, P.O. Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann. Appl. Stat. **2009**, 3, 564–594.
11. Hunt, L.; Jorgensen, M. Mixture model clustering for mixed data with missing information. Recent Developments in Mixture Model. Comput. Stat. Data Anal. **2003**, 41, 429–440.
12. Lin, T.I.; Lee, J.C.; Ho, H.J. On fast supervised learning for normal mixture models with missing information. Pattern Recognit. **2006**, 39, 1177–1187.
13. Steele, R.J.; Wang, N.; Raftery, A.E. Inference from multiple imputation for missing data using mixtures of normals. Stat. Methodol. **2010**, 7, 351–365.
14. Stekhoven, D.J.; Bühlmann, P. MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics **2012**, 28, 112–118.
15. Marseguerra, M.; Zoia, A. The AutoAssociative Neural Network in signal analysis: II. Application to on-line monitoring of a simulated BWR component. Ann. Nucl. Energy **2005**, 32, 1207–1223.
16. Marwala, T.; Chakraverty, S. Fault classification in structures with incomplete measured data using autoassociative neural networks and genetic algorithm. Curr. Sci. **2006**, 90, 542–548.
17. Qiao, W.; Gao, Z.; Harley, R.G.; Venayagamoorthy, G.K. Robust neuro-identification of nonlinear plants in electric power systems with missing sensor measurements. Eng. Appl. Artif. Intell. **2008**, 21, 604–618.
18. Miranda, V.; Krstulovic, J.; Keko, H.; Moreira, C.; Pereira, J. Reconstructing missing data in state estimation with autoencoders. IEEE Trans. Power Syst. **2011**, 27, 604–611.
19. Krstulovic, J.; Miranda, V.; Costa, A.J.S.; Pereira, J. Towards an auto-associative topology state estimator. IEEE Trans. Power Syst. **2013**, 28, 3311–3318.
20. Choudhury, S.J.; Pal, N.R. Imputation of missing data with neural networks for classification. Knowl. Based Syst. **2019**, 182, 104838.
21. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
22. Zhuang, Y.; Ke, R.; Wang, Y. An Innovative Method for Traffic Data Imputation based on Convolutional Neural Network. IET Intell. Transp. Syst. **2018**, 13.
23. Yoon, J.; Jordon, J.; van der Schaar, M. GAIN: Missing Data Imputation using Generative Adversarial Nets. In Proceedings of Machine Learning Research, Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; PMLR: Stockholm, Sweden, 2018; Volume 80, pp. 5689–5698.
24. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) **2018**, 51, 1–36.
25. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. **2017**, 42, 60–88.
26. Casiraghi, E.; Huber, V.; Frasca, M.; Cossa, M.; Tozzi, M.; Rivoltini, L.; Leone, B.E.; Villa, A.; Vergani, B. A novel computational method for automatic segmentation, quantification and comparative analysis of immunohistochemically labeled tissue sections. BMC Bioinform. **2018**, 19, 75–91.
27. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv. (CSUR) **2019**, 52, 1–38.
28. Barricelli, B.R.; Casiraghi, E.; Fogli, D. A Survey on Digital Twin: Definitions, Characteristics, Applications, and Design Implications. IEEE Access **2019**, 7, 167653–167671.
29. Barricelli, B.R.; Casiraghi, E.; Gliozzo, J.; Petrini, A.; Valtolina, S. Human Digital Twin for Fitness Management. IEEE Access **2020**, 8, 26637–26664.
30. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. **2020**, 128, 261–318.
31. Chen, Y.; Li, Y.; Narayan, R.; Subramanian, A.; Xie, X. Gene expression inference with deep learning. Bioinformatics **2016**, 32, 1832–1839.
32. Tan, J.; Hammond, J.; Hogan, D.; Greene, C. ADAGE-Based Integration of Publicly Available Pseudomonas aeruginosa Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions. mSystems **2016**, 1.
33. Gupta, A.; Wang, H.; Ganapathiraju, M. Learning structure in gene expression data using deep architectures, with an application to gene clustering. In Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, 9–12 November 2015; pp. 1328–1335.
34. Lin, C.; Jain, S.; Kim, H.; Bar-Joseph, Z. Using neural networks for reducing the dimensions of single-cell RNA-Seq data. Nucleic Acids Res. **2017**, 45, e156.
35. Chen, H.; Chiu, Y.; Zhang, T.; Zhang, S.; Huang, Y.; Chen, Y. GSAE: An autoencoder with embedded gene-set nodes for genomics functional characterization. BMC Syst. Biol. **2018**, 12.
36. Nguyen, N.G.; Tran, V.A.; Ngo, D.L.; Phan, D.; Lumbanraja, F.R.; Faisal, M.R.; Abapihi, B.; Kubo, M.; Satou, K. DNA sequence classification by convolutional neural network. J. Biomed. Sci. Eng. **2016**, 9, 280.
37. Kelley, D.R.; Snoek, J.; Rinn, J.L. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. **2016**, 26, 990–999.
38. Zeng, H.; Edwards, M.D.; Liu, G.; Gifford, D.K. Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics **2016**, 32, i121–i127.
39. Naito, T. Human splice-site prediction with deep neural networks. J. Comput. Biol. **2018**, 25, 954–961.
40. Rubin, D.B.; Schafer, J.L. Efficiently creating multiple imputations for incomplete multivariate normal data. In Proceedings of the Statistical Computing Section of the American Statistical Association; American Statistical Association: Alexandria, VA, USA, 1990; Volume 83, p. 88.
41. Rubin, D.B. Formalizing subjective notions about the effect of nonrespondents in sample surveys. J. Am. Stat. Assoc. **1977**, 72, 538–543.
42. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 81.
43. Zhang, P. Multiple Imputation: Theory and Method. Int. Stat. Rev. **2003**, 71, 581–592.
44. Sovilj, D.; Eirola, E.; Miche, Y.; Björk, K.M.; Nian, R.; Akusok, A.; Lendasse, A. Extreme learning machine for missing data using multiple imputations. Neurocomputing **2016**, 174, 220–231.
45. Mills, H.L.; Heron, J.; Relton, C.; Suderman, M.; Tilling, K. Methods for Dealing With Missing Covariate Data in Epigenome-Wide Association Studies. Am. J. Epidemiol. **2019**, 188, 2021–2030.
46. Buuren, S.V.; Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. **2010**, 1–68.
47. Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; Bengio, Y. Contractive Auto-Encoders: Explicit Invariance during Feature Extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML’11), Bellevue, WA, USA, 28 June–2 July 2011; Omnipress: Madison, WI, USA, 2011; pp. 833–840.
48. Cappelletti, L.; Petrini, A.; Gliozzo, J.; Casiraghi, E.; Schubach, M.; Kircher, M.; Valentini, G. Bayesian optimization improves tissue-specific prediction of active regulatory regions with deep neural networks. In Proceedings of the 8th International Work-Conference on Bioinformatics and Biomedical Engineering (IWWBIO), Granada, Spain, 6–8 May 2020.
49. Genome International Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature **2001**, 409, 860–921.
50. Dozat, T. Incorporating nesterov momentum into adam. In Proceedings of the Workshop Track, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016.
51. Bergstra, J.; Bengio, Y. Random Search for Hyper-parameter Optimization. J. Mach. Learn. Res. **2012**, 13, 281–305.
52. Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 2951–2959.
53. Oba, S.; Sato, M.A.; Takemasa, I.; Monden, M.; Matsubara, K.I.; Ishii, S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics **2003**, 19, 2088–2096.
54. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. **1987**, 2, 37–52.
55. Moon, T.K. The expectation-maximization algorithm. IEEE Signal Process. Mag. **1996**, 13, 47–60.
56. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. **1977**, 39, 1–22.
57. Tresp, V.; Ahmad, S.; Neuneier, R. Training neural networks with deficient data. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 1994; pp. 128–135.
58. Ghahramani, Z.; Jordan, M.I. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 1994; pp. 120–127.
59. Yu, Q.; Miche, Y.; Eirola, E.; Van Heeswijk, M.; Séverin, E.; Lendasse, A. Regularized extreme learning machine for regression with missing data. Neurocomputing **2013**, 102, 45–51.
60. Eirola, E.; Lendasse, A.; Vandewalle, V.; Biernacki, C. Mixture of Gaussians for distance estimation with missing data. Neurocomputing **2014**, 131, 32–42.
61. Akusok, A.; Eirola, E.; Björk, K.M.; Miche, Y.; Johnson, H.; Lendasse, A. Brute-force Missing Data Extreme Learning Machine for Predicting Huntington’s Disease. In Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments, Island of Rhodes, Greece, 21–23 June 2017; pp. 189–192.
62. Li, K.H. Imputation using Markov chains. J. Stat. Comput. Simul. **1988**, 30, 57–79.
63. Schafer, J.L. Analysis of Incomplete Multivariate Data; CRC Press: Cleveland, OH, USA, 1997.
64. Azola, C.; Harrell, F. An Introduction to S-Plus and the Hmisc and Design Libraries. Ph.D. Thesis, University of Virginia School of Medicine, Charlottesville, VA, USA, 2001.
65. Farhangfar, A.; Kurgan, L.A.; Pedrycz, W. A Novel Framework for Imputation of Missing Values in Databases. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. **2007**, 37, 692–709.
66. Jörnsten, R.; Wang, H.Y.; Welsh, W.J.; Ouyang, M. DNA microarray data imputation and significance analysis of differential expression. Bioinformatics **2005**, 21, 4155–4161.
67. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, Budapest, Hungary, 25–29 July 2004; Volume 2, pp. 985–990.
68. Huang, G.B. An insight into extreme learning machines: Random neurons, random features and kernels. Cogn. Comput. **2014**, 6, 376–390.
69. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
70. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), Lille, France, 6–11 July 2015; Volume 37, pp. 448–456.
71. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
72. Springenberg, J.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for Simplicity: The All Convolutional Net. In Proceedings of the ICLR (Workshop Track), San Diego, CA, USA, 7–9 May 2015.
73. Wilcoxon, F.; Katti, S.; Wilcox, R.A. Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Sel. Tables Math. Stat. **1970**, 1, 171–259.
74. Plackett, R.L. Karl Pearson and the chi-squared test. In International Statistical Review/Revue Internationale de Statistique; International Statistical Institute (ISI): The Hague, The Netherlands, 1983; pp. 59–72.
75. Zar, J.H. Significance testing of the Spearman rank correlation coefficient. J. Am. Stat. Assoc. **1972**, 67, 578–580.
76. Chollet, F. Keras. Available online: https://github.com/fchollet/keras (accessed on 9 May 2015).
77. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 265–283.
78. Agarwal, V.; Bell, G.W.; Nam, J.W.; Bartel, D.P. Predicting effective microRNA target sites in mammalian mRNAs. eLife **2015**, 4, e05005.
79. Dudoit, S.; Fridlyand, J.; Speed, T.P. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. **2002**, 97, 77–87.
80. Langfelder, P.; Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. **2008**, 9, 559.
81. Bantscheff, M.; Schirle, M.; Sweetman, G.; Rick, J.; Kuster, B. Quantitative mass spectrometry in proteomics: A critical review. Anal. Bioanal. Chem. **2007**, 389, 1017–1031.
82. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
83. Kalpić, D.; Hlupić, N.; Lovrić, M. Student’s t-Tests. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1559–1563.
84. Logan, J.D.; Wolesensky, W.R. Pure and Applied Mathematics: A Wiley-interscience Series of Texts, Monographs, and Tracts. In Chapter 6: Statistical Inference; Chapter Mathematical Methods in Biology; John Wiley and Sons, Inc.: Oxford, UK, 2009.
85. Eraslan, G.; Avsec, Ž.; Gagneur, J.; Theis, F.J. Deep learning: New computational modelling techniques for genomics. Nat. Rev. Genet. **2019**, 20, 389–403.
86. Jaques, N.; Taylor, S.; Sano, A.; Picard, R. Multimodal autoencoder: A deep learning approach to filling in missing sensor data and enabling better mood prediction. In Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA, 23–26 October 2017; pp. 202–208.
87. Gers, F.A.; Schmidhuber, J.A.; Cummins, F.A. Learning to Forget: Continual Prediction with LSTM. Neural Comput. **2000**, 12, 2451–2471.
88. Di Tucci, L.; Guidi, G.; Notargiacomo, S.; Cerina, L.; Scolari, A.; Santambrogio, M.D. HUGenomics: A support to personalized medicine research. In Proceedings of the 2017 IEEE 3rd International Forum on Research and Technologies for Society and Industry (RTSI), Modena, Italy, 11–13 September 2017; pp. 1–5.

**Figure 3.** Axis-wise Softmax activation and cross-entropy loss. (**a**) Softmax on the 2nd axis of the batch tensor. (**b**) Categorical cross-entropy on the 2nd axis of the batch tensor.
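To make the axis-wise operations of Figure 3 concrete, the following sketch applies a softmax and a categorical cross-entropy independently at every sequence position. It is an illustration in plain Python rather than the Keras/TensorFlow code of the paper, and it assumes that the axis in question is the one holding the four nucleotide scores of each position; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the four nucleotide scores of one position."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def axiswise_softmax(batch):
    """Apply softmax along the nucleotide axis of a (batch, length, 4) nested list."""
    return [[softmax(position) for position in sequence] for sequence in batch]


def categorical_cross_entropy(targets, probabilities, eps=1e-12):
    """Mean cross-entropy over every position of every sequence."""
    losses = []
    for target_seq, prob_seq in zip(targets, probabilities):
        for target, prob in zip(target_seq, prob_seq):
            losses.append(-sum(t * math.log(p + eps) for t, p in zip(target, prob)))
    return sum(losses) / len(losses)


logits = [[[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]]   # one sequence, two positions
targets = [[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]]  # one-hot targets
probs = axiswise_softmax(logits)
loss = categorical_cross_entropy(targets, probs)
print(probs[0][1])  # uniform scores give 0.25 for each nucleotide
print(loss)
```

Because the normalization is done per position, each position yields its own probability distribution over the four nucleotides, and the loss is the average of the per-position cross-entropies.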

**Figure 7.** Frequencies of gaps within the considered nucleotide windows. **Left**: single-gap sequence. **Center**: multivariate-gap sequence. **Right**: biological sequence. The frequency of the synthetic gaps in the multivariate-gap sequence (center) is similar to the frequency of gaps in the biological sequences.

**Figure 8.** Nucleotide encoding. Each known nucleotide is represented by a vector of 4 binary values, where only the value corresponding to the nucleotide is set to 1. A missing nucleotide is represented by a vector where all the values are set to 0.25.
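The encoding of Figure 8 can be written in a few lines. This is a minimal sketch: the column order `ACGT` and the convention that any symbol outside that alphabet (for example `N`) marks a missing nucleotide are assumptions, since the caption does not state them.

```python
NUCLEOTIDES = "ACGT"  # assumed column order; the caption does not state one


def encode(sequence: str) -> list:
    """One-hot encode known nucleotides; any other symbol becomes [0.25] * 4."""
    rows = []
    for symbol in sequence.upper():
        if symbol in NUCLEOTIDES:
            rows.append([1.0 if symbol == n else 0.0 for n in NUCLEOTIDES])
        else:
            rows.append([0.25] * 4)  # missing nucleotide: uniform over A, C, G, T
    return rows


for symbol, row in zip("ACNT", encode("ACNT")):
    print(symbol, row)
```

The uniform vector for a gap sums to 1 like the one-hot vectors, so every row of the encoded window can be read as a probability distribution over the four nucleotides.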

**Figure 9.** Human male karyotype. The test set consists of the randomly selected chromosomes 17 and 18.

**Figure 10.** Gap-filling results achieved by CAE models when trained on (**a**) single-gap and (**b**) multivariate-gap sequences. The gap-modality used for training is highlighted with a yellow background. Bars are clustered according to the weight ($w \in \{1, 2, 10\}$) used during training (“unweighted” means that the weight is equal to 1 for all the nucleotides); for each weight (each group of bars), the three bars show the performance of the models with different input sizes ($l \in \{200, 500, 1000\}$).

**Figure 11.** Bar plots of AUPRC (**top**) and AUROC (**bottom**) scores achieved in the reconstruction task by CAEs, when the models are trained on either the single-gap dataset (**a**) or the multivariate-gap dataset (**b**). The yellow background highlights the dataset used for training. Regardless of the weight, CAE 200 is the best-performing model, followed by CAE 500 and CAE 1000.

**Figure 12.** Results achieved by gap-filling CNNs when trained on (**a**) single-gap datasets and (**b**) multivariate-gap datasets. The gap-modality used for training is highlighted with a yellow background.

**Figure 13.** Comparative evaluation of gap-filling CAEs, gap-filling CNNs, and KNN imputers ($k \in \{5, 1024, 10{,}000\}$), when trained on single-gap sequences (**a**) and multivariate-gap sequences (**b**). The gap-modality used for training is highlighted with a yellow background.
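As a rough intuition for the KNN benchmark, the sketch below imputes the central nucleotide of a window by majority vote among its $k$ nearest reference windows. It is a simplification and not the KNN-imputation method of [8] as used in the comparison: the Hamming distance, the majority vote, and the name `knn_impute_center` are assumptions made for illustration.

```python
from collections import Counter


def knn_impute_center(query: str, references: list, k: int) -> str:
    """Impute the central symbol of `query` by majority vote among the k closest references.

    Assumption: distance is the Hamming distance over all positions except the central one.
    """
    center = len(query) // 2

    def distance(reference: str) -> int:
        return sum(
            1 for i, (a, b) in enumerate(zip(query, reference)) if i != center and a != b
        )

    neighbours = sorted(references, key=distance)[:k]
    votes = Counter(reference[center] for reference in neighbours)
    return votes.most_common(1)[0][0]


references = ["ACGTA", "ACCTA", "ACGTT", "TTGAA", "GGAAC"]
print(knn_impute_center("ACNTA", references, k=3))
```

In this toy case the three closest windows carry `G`, `C` and `G` in the central position, so the vote returns `G`. Larger values of $k$ smooth the vote over more neighbours at the cost of including less similar windows.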

**Table 1.** CAE models summary: one CAE model for each of the three considered window sizes. The models differ slightly, mainly so that the larger windows can be processed without a significant increase in the total number of parameters. Kernel and stride sizes are abbreviated as ‘Ker.’ and ‘Str.’, and encoder and decoder blocks as ‘EBlock’ and ‘DBlock’. The total parameter count refers to the trainable parameters of the models.

| CAE 200 | | | | CAE 500 | | | | CAE 1000 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Layer | Units | Ker. | Str. | Layer | Units | Ker. | Str. | Layer | Units | Ker. | Str. |
| EBlock | 64 | (10, 4) | (2, 1) | EBlock | 64 | (10, 4) | (2, 1) | EBlock | 64 | (20, 4) | (5, 1) |
| EBlock | 32 | (10, 4) | (5, 2) | EBlock | 32 | (10, 4) | (5, 2) | EBlock | 32 | (10, 4) | (5, 2) |
| EBlock | 16 | (10, 2) | (2, 2) | EBlock | 16 | (10, 2) | (2, 2) | EBlock | 16 | (10, 2) | (2, 2) |
| | | | | EBlock | 8 | (10, 1) | (1, 1) | EBlock | 8 | (10, 1) | (1, 1) |
| Flatten | | | | Flatten | | | | Flatten | | | |
| Dense | 100 | | | Dense | 150 | | | Dense | 200 | | |
| Encoder total parameters: | 126,577 | | | Encoder total parameters: | 126,510 | | | Encoder total parameters: | 131,120 | | |
| Dense | 160 | | | Dense | 200 | | | Dense | 160 | | |
| Reshape | (10, 1, 16) | | | Reshape | (25, 1, 8) | | | Reshape | (20, 1, 8) | | |
| | | | | DBlock | 8 | (10, 1) | (1, 1) | DBlock | 8 | (10, 1) | (1, 1) |
| DBlock | 16 | (10, 2) | (2, 2) | DBlock | 16 | (10, 2) | (2, 2) | DBlock | 16 | (10, 2) | (2, 2) |
| DBlock | 32 | (10, 4) | (5, 2) | DBlock | 32 | (10, 4) | (5, 2) | DBlock | 32 | (10, 4) | (5, 2) |
| DBlock | 64 | (10, 4) | (2, 1) | DBlock | 64 | (10, 4) | (2, 1) | DBlock | 64 | (20, 4) | (5, 1) |
| Conv2DT | 1 | (10, 4) | (1, 1) | Conv2DT | 1 | (10, 4) | (1, 1) | Conv2DT | 1 | (20, 4) | (1, 1) |
| Decoder total parameters: | 111,156 | | | Decoder total parameters: | 138,721 | | | Decoder total parameters: | 225,161 | | |
| Total parameters: | 237,733 | | | Total parameters: | 265,231 | | | Total parameters: | 356,281 | | |
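The Reshape targets and encoder sizes of Table 1 can be cross-checked with a little arithmetic. The sketch below is not the authors' code and rests on three assumptions: the 8-unit block with stride (1, 1) belongs to CAE 500 and CAE 1000 only, convolutions use 'same' padding (so each stride divides the size, rounded up), and every block is a convolution with bias followed by batch normalization with a trainable scale and shift.

```python
import math

# (units, kernel, stride) of each encoder block, read from Table 1.
# Assumption: the 8-unit block belongs to CAE 500 and CAE 1000 only.
ENCODERS = {
    200: [(64, (10, 4), (2, 1)), (32, (10, 4), (5, 2)), (16, (10, 2), (2, 2))],
    500: [(64, (10, 4), (2, 1)), (32, (10, 4), (5, 2)), (16, (10, 2), (2, 2)),
          (8, (10, 1), (1, 1))],
    1000: [(64, (20, 4), (5, 1)), (32, (10, 4), (5, 2)), (16, (10, 2), (2, 2)),
           (8, (10, 1), (1, 1))],
}
LATENT = {200: 100, 500: 150, 1000: 200}  # Dense layer closing each encoder


def encoded_shape(window):
    """Feature-map shape before Flatten, assuming 'same' padding (ceil division)."""
    height, width = window, 4
    for _, _, (stride_h, stride_w) in ENCODERS[window]:
        height = math.ceil(height / stride_h)
        width = math.ceil(width / stride_w)
    return (height, width, ENCODERS[window][-1][0])


def encoder_parameters(window):
    """Trainable parameters, assuming each block is Conv2D with bias plus batch norm."""
    total, channels = 0, 1
    for units, (kernel_h, kernel_w), _ in ENCODERS[window]:
        total += kernel_h * kernel_w * channels * units + units  # convolution
        total += 2 * units                                       # batch-norm scale, shift
        channels = units
    flat = math.prod(encoded_shape(window))
    return total + flat * LATENT[window] + LATENT[window]        # closing Dense layer


for window in (200, 500, 1000):
    print(window, encoded_shape(window), encoder_parameters(window))
```

Under these assumptions the sketch reproduces the three Reshape targets, (10, 1, 16), (25, 1, 8) and (20, 1, 8), and the encoder totals listed for CAE 500 and CAE 1000 (126,510 and 131,120). For CAE 200 it yields 111,156, the value Table 1 lists as the decoder total, so the two subtotals of that model may be exchanged in the table; their sum is unaffected.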

**Table 2.** CNN models summary: one CNN model for each of the three considered window sizes. As for the CAEs, the models differ slightly, mainly so that the larger windows can be processed without a significant increase in the total number of parameters. The total parameter count refers to the trainable parameters of the models.

| CNN 200 | | | | CNN 500 | | | | CNN 1000 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Layer | Units | Kernel | Pool | Layer | Units | Kernel | Pool | Layer | Units | Kernel | Pool |
| Conv2D | 32 | (10, 2) | | Conv2D | 32 | (10, 2) | | Conv2D | 32 | (10, 2) | |
| EBlock | 64 | (10, 2) | | EBlock | 64 | (10, 2) | | EBlock | 64 | (10, 2) | |
| Conv2D | 32 | (10, 1) | | Conv2D | 64 | (10, 1) | | Conv2D | 64 | (10, 1) | |
| EBlock | 16 | (10, 1) | | EBlock | 32 | (10, 1) | | EBlock | 32 | (10, 1) | |
| MaxPool2D | | | (4, 1) | MaxPool2D | | | (4, 1) | MaxPool2D | | | (4, 1) |
| | | | | | | | | Conv2D | 32 | (10, 1) | |
| | | | | | | | | EBlock | 64 | (10, 1) | |
| | | | | Conv2D | 32 | (10, 1) | | Conv2D | 32 | (10, 1) | |
| | | | | EBlock | 16 | (10, 1) | | EBlock | 16 | (10, 1) | |
| | | | | MaxPool2D | | | (4, 1) | MaxPool2D | | | (4, 1) |
| Flatten | | | | Flatten | | | | Flatten | | | |
| Dense | 32 | | | Dense | 32 | | | Dense | 32 | | |
| Dense | 32 | | | Dense | 32 | | | Dense | 32 | | |
| Dense | 4 | | | Dense | 4 | | | Dense | 4 | | |
| Total parameters: | 110,708 | | | Total parameters: | 174,356 | | | Total parameters: | 213,492 | | |

**Table 3.** **Datasets summary.** The single-gap and multivariate-gap datasets (for window sizes of 1000 nucleotides) have balanced missing-nucleotide distributions, while the biological-gap dataset, being significantly smaller, is considerably unbalanced among the four nucleotides.

Dataset (total no. of sequences) | Type | Sequences | Adenine | Cytosine | Guanine | Thymine |
---|---|---|---|---|---|---|
multivariate gaps ($\approx 2.86 \times 10^{6}$) | Train | 2,708,590 | 29.58% | 20.38% | 29.64% | 20.40% |
multivariate gaps ($\approx 2.86 \times 10^{6}$) | Test | 150,707 | 28.60% | 21.29% | 28.80% | 21.31% |
single gap ($\approx 2.86 \times 10^{6}$) | Train | 2,708,590 | 29.57% | 20.40% | 29.66% | 20.38% |
single gap ($\approx 2.86 \times 10^{6}$) | Test | 150,707 | 28.60% | 21.33% | 28.84% | 21.22% |
biological gap (19) | Validation | 19 | 44.97% | 18.58% | 19.97% | 16.49% |

**Table 4.**

Weight | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
No weight | 0 | 5 | 4 | 0 | 6 | 3 | 0 | 4 | 4 |
2 | 1 | 6 | 2 | 0 | 7 | 2 | 1 | 6 | 2 |
10 | 5 | 4 | 0 | 5 | 4 | 0 | 5 | 4 | 0 |

**Table 5.**

Model | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
CAE 1000 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |
CAE 200 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
CAE 500 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 2 |

**Table 6.** **Correlations between the gap-filling results achieved by CAEs and the window size of their input.** Both the Pearson and the Spearman correlation coefficients show that the CAEs’ gap-filling performance is positively correlated with the size of the input window; the only coefficient that is not significant is the Pearson coefficient for accuracy (p = 0.28). p-values are computed with the t-test [83] for the Pearson correlation and with the exact permutation distribution test [84] for the Spearman correlation.

Metric | Pearson (p-Value) | Spearman (p-Value) |
---|---|---|
Accuracy | 0.06 (0.28) | 0.22 (≈0) |
AUPRC | 0.58 (≈0) | 0.54 (≈0) |
AUROC | 0.23 (≈0) | 0.15 (0.004) |
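For readers who want to reproduce this kind of analysis, the sketch below computes the Pearson coefficient, the Spearman coefficient (Pearson on average ranks, so ties such as repeated window sizes are handled), and the t statistic on which the Pearson test is based. It is a plain-Python illustration; the scores are toy values that are not taken from the paper, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pearson_t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero; needs |r| < 1."""
    return r * math.sqrt((n - 2) / (1 - r * r))


window_sizes = [200, 200, 500, 500, 1000, 1000]
scores = [0.30, 0.32, 0.35, 0.34, 0.41, 0.40]  # toy values, not results from the paper
r = pearson(window_sizes, scores)
print(r, pearson_t_statistic(r, len(scores)))
print(spearman(window_sizes, scores))
```

Turning the t statistic into a p-value requires the Student t distribution with $n-2$ degrees of freedom, which a statistics library would normally provide.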

**Table 7.** **CAE models mean performance on the gap-filling task.** Although adenine and thymine show a markedly lower accuracy (and a lower AUROC) than cytosine and guanine, they achieve a higher AUPRC.

Nucleotide | Mean accuracy | Mean AUPRC | Mean AUROC |
---|---|---|---|
Adenine | 0.7120 | 0.3937 | 0.6066 |
Cytosine | 0.7914 | 0.3439 | 0.6428 |
Guanine | 0.7979 | 0.3416 | 0.6340 |
Thymine | 0.7050 | 0.3964 | 0.6097 |
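The three metrics can rank nucleotides differently because they measure different things. The sketch below computes them for a single nucleotide in a one-vs-rest setting; this setting, the 0.5 decision threshold, the use of average precision as the AUPRC estimate, and the toy scores are all assumptions made for illustration, not the evaluation code of the paper.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / hits


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    return sum((s >= threshold) == bool(l) for l, s in zip(labels, scores)) / len(labels)


# Toy one-vs-rest problem: 1 marks windows whose missing nucleotide is the target one.
labels = [1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.6, 0.2, 0.3, 0.1, 0.25, 0.4, 0.1, 0.3]
print(accuracy(labels, scores), auroc(labels, scores), average_precision(labels, scores))
```

With three negatives for every positive, a model can reach a high accuracy mostly by rejecting the target nucleotide, while AUPRC only rewards ranking the true positives first; this is one way in which accuracy and AUPRC can disagree across nucleotides.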

**Table 8.** **Win-Tie-Loss table comparing the results of the gap-filling CAE models on each nucleotide.** The table has been obtained using a Wilcoxon signed-rank test at a significance level of $0.05$.

Nucleotide | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
Adenine | 1 | 0 | 2 | 2 | 1 | 0 | 0 | 1 | 2 |
Cytosine | 2 | 0 | 1 | 0 | 1 | 2 | 3 | 0 | 0 |
Guanine | 3 | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 1 |
Thymine | 0 | 0 | 3 | 2 | 1 | 0 | 0 | 1 | 2 |
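A Win-Tie-Loss table of this kind can be built from pairwise tests, as sketched below. The sketch assumes an exact two-sided Wilcoxon signed-rank test with zero differences dropped and a win assigned to the competitor with the larger total score; the paper's software may use a different variant, and the function names and toy scores are hypothetical.

```python
from itertools import combinations


def wilcoxon_signed_rank_p(x, y):
    """Exact two-sided p-value of the Wilcoxon signed-rank test on paired samples."""
    d = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n  # average ranks multiplied by 2, so that they stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks2, d) if v > 0)
    total = sum(ranks2)
    # counts[s]: number of sign assignments whose positive (doubled) rank sum equals s
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    w_min = min(w_plus, total - w_plus)
    return min(1.0, 2 * sum(counts[: w_min + 1]) / 2 ** n)


def win_tie_loss(results, alpha=0.05):
    """Map each name to [wins, ties, losses] over all pairwise comparisons."""
    table = {name: [0, 0, 0] for name in results}
    for a, b in combinations(results, 2):
        if wilcoxon_signed_rank_p(results[a], results[b]) >= alpha:
            table[a][1] += 1
            table[b][1] += 1
            continue
        better, worse = (a, b) if sum(results[a]) > sum(results[b]) else (b, a)
        table[better][0] += 1
        table[worse][2] += 1
    return table


# Toy paired scores (one value per evaluation setting), not results from the paper.
results = {
    "A": [0.90, 0.88, 0.91, 0.87, 0.92, 0.89],
    "B": [0.80, 0.79, 0.83, 0.78, 0.82, 0.81],
    "C": [0.81, 0.78, 0.84, 0.77, 0.83, 0.80],
}
print(win_tie_loss(results))
```

Each competitor takes part in one comparison per opponent, so every row of such a table sums to the number of opponents, as the rows of Table 8 do (three opponents per nucleotide).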

**Table 9.** **Win-Tie-Loss CAE reconstruction with varying weights.** Win-Tie-Loss table comparing the reconstruction results achieved by CAEs with different **weights** on the validation set (biological set excluded).

Weight | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
No weight | 0 | 7 | 2 | 0 | 7 | 2 | 0 | 7 | 2 |
2 | 2 | 7 | 0 | 2 | 7 | 0 | 2 | 7 | 0 |
10 | 1 | 7 | 1 | 1 | 7 | 1 | 1 | 7 | 1 |

**Table 10.**

**Win-Tie-Loss CAE Reconstruction with varying Input size.** Win-Tie-Loss table comparing the reconstruction results achieved by CAEs with different **input sizes** on the validation set (biological set excluded).

Model | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
CAE 1000 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 2 |
CAE 200 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 |
CAE 500 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |

**Table 11.**

**Correlations between the input-window size and the reconstruction results achieved by CAE models.** Both the Pearson and the Spearman coefficients show that the CAEs' reconstruction performance is inversely correlated with the size of the input window.

Metric | Pearson (p-Value) | Spearman (p-Value) |
---|---|---|
Accuracy | −0.73 (≈0) | −0.84 (≈0) |
AUPRC | −0.98 (≈0) | −0.94 (≈0) |
AUROC | −0.99 (≈0) | −0.94 (≈0) |

**Table 12.**

**Correlations between reconstruction results and gap-filling results achieved by CAE models.** For all nucleotides, the CAEs' performance in the gap-filling task is negatively correlated with their performance in the reconstruction task, although the correlation is statistically significant (p < 0.05) only for AUPRC. The two tasks therefore tend to behave adversarially.

Metric | Pearson (p-Value) | Spearman (p-Value) |
---|---|---|
Accuracy | −0.15 (0.22) | −0.21 (0.07) |
AUPRC | −0.59 (≈0) | −0.47 (≈0) |
AUROC | −0.22 (0.06) | −0.11 (0.34) |

**Table 13.**

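**Mean over the validation set (biological set excluded) of the performance achieved by CAE models on the reconstruction task of each nucleotide.** On average, the models achieved good results, with a mean accuracy of approximately 0.85 to 0.90 and a mean AUROC of approximately 0.87 to 0.89 across nucleotides.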

Nucleotide | Mean Accuracy | Mean AUPRC | Mean AUROC |
---|---|---|---|
Adenine | 0.8495 | 0.7784 | 0.8687 |
Cytosine | 0.8993 | 0.7738 | 0.8909 |
Guanine | 0.8930 | 0.7478 | 0.8767 |
Thymine | 0.8598 | 0.7957 | 0.8764 |

**Table 14.**

**Win-Tie-Loss table comparing, for each nucleotide, the CAE reconstruction performance on the validation set (biological set excluded).** On average, Adenine is the nucleotide that loses most often.

Nucleotide | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
Adenine | 0 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 3 |
Cytosine | 3 | 0 | 0 | 1 | 1 | 1 | 3 | 0 | 0 |
Guanine | 2 | 0 | 1 | 0 | 0 | 3 | 1 | 1 | 1 |
Thymine | 1 | 0 | 2 | 3 | 0 | 0 | 1 | 1 | 1 |

**Table 15.**

**Win-Tie-Loss table comparing the CAEs’ performance when trained on multivariate gaps vs. single gaps.** There is no statistically significant difference between the achieved results, with the exception of the AUROC metric for the gap-filling task.

Metric | Test | Reconstruction W | Reconstruction T | Reconstruction L | Reconstruction p-Value | Gap-filling W | Gap-filling T | Gap-filling L | Gap-filling p-Value |
---|---|---|---|---|---|---|---|---|---|
Accuracy | Multivariate gaps vs. single gap | 0 | 1 | 0 | 0.316740 | 0 | 1 | 0 | 0.085191 |
AUPRC | Multivariate gaps vs. single gap | 0 | 1 | 0 | 0.361327 | 0 | 1 | 0 | 0.160683 |
AUROC | Multivariate gaps vs. single gap | 0 | 1 | 0 | 0.342291 | 1 | 0 | 0 | 0.002577 |

**Table 16.**

**Win-Tie-Loss table comparing all CNN models on the gap-filling task.** All the models perform similarly; the only significant difference is in AUROC, where CNN 500 outperforms CNN 200.

Model | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
CNN 1000 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 0 |
CNN 200 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 1 | 1 |
CNN 500 | 0 | 2 | 0 | 0 | 2 | 0 | 1 | 1 | 0 |

**Table 17.**

**Correlations between CNNs’ performance and the window size of the input sequences.** In the gap-filling task, the input-window size of the CNN models shows no significant correlation with their performance.

Metric | Pearson (p-Value) | Spearman (p-Value) |
---|---|---|
Accuracy | 0.03 (0.77) | 0.00 (0.98) |
AUPRC | 0.03 (0.75) | −0.11 (0.21) |
AUROC | 0.05 (0.60) | −0.03 (0.78) |

**Table 18.**

**Win-Tie-Loss table comparing the CNNs’ gap-filling performance when trained on multivariate gaps vs. single gaps.** The models trained on multivariate gaps outperform the models trained on single gaps.

Metric | Test | Gap-filling Win | Gap-filling Tie | Gap-filling Loss | Gap-filling p-Value |
---|---|---|---|---|---|
Accuracy | Multivariate gaps vs. single gap | 1 | 0 | 0 | 0.000178 |
AUPRC | Multivariate gaps vs. single gap | 1 | 0 | 0 | 0.000653 |
AUROC | Multivariate gaps vs. single gap | 1 | 0 | 0 | 0.000923 |

Nucleotide | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
Adenine | 0 | 1 | 2 | 2 | 0 | 1 | 0 | 0 | 3 |
Cytosine | 2 | 1 | 0 | 1 | 0 | 2 | 3 | 0 | 0 |
Guanine | 2 | 1 | 0 | 0 | 0 | 3 | 2 | 0 | 1 |
Thymine | 0 | 1 | 2 | 3 | 0 | 0 | 1 | 0 | 2 |

Nucleotide | Mean Accuracy | Mean AUPRC | Mean AUROC |
---|---|---|---|
Adenine | 0.7230 | 0.5042 | 0.7230 |
Cytosine | 0.8059 | 0.4736 | 0.8059 |
Guanine | 0.7984 | 0.4683 | 0.7984 |
Thymine | 0.7262 | 0.5094 | 0.7262 |

**Table 21.**

**Win-Tie-Loss table comparing the best CAE 1000 model, the CNN 1000 model, and the KNN imputers on the gap-filling task.** The Win-Tie-Loss table has been built by comparing the performance achieved by the models on the validation set (excluding the biological set) through a Wilcoxon signed-rank test (p-value threshold $0.05$). The KNN imputers always lose when compared with either the CAE 1000 or the CNN 1000 model.

Model | Accuracy W | Accuracy T | Accuracy L | AUPRC W | AUPRC T | AUPRC L | AUROC W | AUROC T | AUROC L |
---|---|---|---|---|---|---|---|---|---|
CAE 1000 | 3 | 1 | 0 | 3 | 1 | 0 | 3 | 1 | 0 |
CNN 1000 | 3 | 1 | 0 | 3 | 1 | 0 | 3 | 1 | 0 |
KNN 5 | 0 | 2 | 2 | 0 | 2 | 2 | 0 | 2 | 2 |
KNN 1024 | 0 | 2 | 2 | 0 | 2 | 2 | 0 | 2 | 2 |
KNN 10k | 0 | 2 | 2 | 0 | 2 | 2 | 0 | 2 | 2 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cappelletti, L.; Fontana, T.; Di Donato, G.W.; Di Tucci, L.; Casiraghi, E.; Valentini, G.
Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks—A Case Study on Genome Gap-Filling. *Computers* **2020**, *9*, 37.
https://doi.org/10.3390/computers9020037
