MET Exon 14 Skipping: A Case Study for the Detection of Genetic Variants in Cancer Driver Genes by Deep Learning

Background: Disruption of alternative splicing (AS) is frequently observed in cancer and might represent an important signature for tumor progression and therapy. Exon skipping (ES) is one of the most frequent AS events, and in non-small cell lung cancer (NSCLC) MET exon 14 skipping was shown to be targetable. Methods: We constructed neural networks (NN/CNN) specifically designed to detect MET exon 14 skipping events using RNAseq data. Furthermore, for discovery purposes we also developed a sparsely connected autoencoder to identify uncharacterized MET isoforms. Results: The neural networks had a MET exon 14 skipping detection rate greater than 94% when tested on a manually curated set of 690 TCGA bronchus and lung samples. When globally applied to 2605 TCGA samples, we observed that the majority of false positives were characterized by a blurry coverage of exon 14. Interestingly, they share a common coverage peak in the second intron, and we speculate that this event could be the transcription signature of a LINE1 (Long Interspersed Nuclear Element 1)-MET (Mesenchymal Epithelial Transition receptor tyrosine kinase) fusion. Conclusions: Taken together, our results indicate that neural networks can be an effective tool to provide a quick classification of pathological transcription events, and sparsely connected autoencoders could represent the basis for the development of an effective discovery tool.


Introduction
It is known that in eukaryotes, alternative splicing plays an important role in defining protein diversity and enhancing the complexity of gene expression regulation [1]. In humans, the majority of multi-exon genes are affected by alternative splicing, which generates proteins with different functions in distinct cellular processes [2]. Disruption of alternative splicing (AS) is associated with human diseases [3], and exon skipping (ES) is one of the most frequently observed events [4]. The analysis of alternative splicing advances our understanding of mRNA complexity and its regulation, providing valuable insights into disease etiology and assisting the development of therapeutic interventions for splicing-related diseases [5]. ExonSkipDB (https://ccsm.uth.edu/ExonSkipDB/, accessed on 1 October 2020) [6], a database collecting ES events affecting disease-associated genes, has recently been developed. Within the 8266 ExonSkipDB genes, annotated as genes losing functional features due to in-frame ES events in TCGA (https://portal.gdc.cancer.gov/, accessed on 1 March 2020), 449 are part of the 710 COSMIC census genes [7]. 25 Table S1).
Notably, MET exon 14 skipping is the only ES event with a large number of citations (119 from 2015 to 2021 reported in the PUBMED repository). MET exon 14 skipping is a splicing aberration that results in deletion of the MET juxtamembrane domain, which contains negative regulatory sites of the MET receptor. Thus, exon 14 deletion results in impaired ubiquitin-mediated receptor degradation, decreased turnover and increased downstream signaling [8,9]. MET exon 14 skipping was described in lung adenocarcinoma (3%), lung neoplasms other than adenocarcinomas (2.3%), brain glioma (0.4%), and tumors of unknown primary origin (0.4%) [8]. Furthermore, Champagnac and coworkers [10] observed that genomic alterations affecting MET exon 14 are present in 2.6% of non-small cell lung cancer (NSCLC) patients. MET exon 14 skipping can lead to acquisition of transforming ability and was identified as a potential therapeutic target for NSCLC [11,12]. Many different mutations at the DNA level can cause the aberrant splicing of exon 14, and a search for MET exon 14 skipping at the genomic level alone does not guarantee that the mutated MET transcript is actively expressed. Furthermore, given the relatively small deletion, it remains an open question whether antibodies can be developed with enough specificity against this splice variant [13]. RNA sequencing is today a straightforward approach thanks to the possibility of performing targeted RNAseq on paraffin-embedded samples [14]. However, to efficiently detect MET exon 14 skipping, an effective computational detection algorithm for this specific ES event is also required.
Inspecting the available literature on MET exon 14 skipping (https://pubmed.ncbi.nlm.nih.gov/, accessed on 1 March 2021), we observed that this skipping event is mainly identified using DNA-based amplicon-mediated target enrichment [15] or RNA-based next-generation sequencing target enrichment [16], where the RNA-based method provides a higher detection rate of exon 14 skipping [16]. We did not find any article using deep learning or machine learning methods for the detection of MET exon 14 skipping, although such methods are of particular interest for screening large cohorts of specimens. Notably, we found two articles [17,18] predicting exon skipping events using RNAseq data. In Zhang's paper [17], a convolutional neural network (CNN) is used to classify splice junctions derived from primary RNA-seq data. In Du's paper [18], instead, a Rotation Forest algorithm is used to predict ES events by integrating RNA-Seq data and genome sequence information. Moreover, a CNN was implemented in SpliceRover [19], a generalist tool for splice site prediction. Similarly, in [20] a general tool, namely SpliceAI, was proposed to predict splicing from a pre-mRNA sequence using a CNN.
To the best of our knowledge, no tool has been designed specifically to detect a single exon skipping event such as MET exon 14 skipping. In this manuscript, we investigated different neural network architectures to provide sensitive and rapid detection of MET exon 14 skipping events using RNAseq data. A standard Neural Network (NN), a Convolutional Neural Network (CNN), and a Sparsely Connected Autoencoder (SCA) were thus compared in detail on different datasets. In contrast to models predicting ES events, as well as other generalist CNNs [21,22], which base their predictions on nucleotide sequence information, our models are designed to handle expression data in the form of k-mer counts or coverage.

Neural Network for the Detection of MET Exon 14 Skipping (METΔ14).
To detect MET exon 14 skipping events, an NN made of six layers was built (Figure 1A). As a training set for the NN, we used data from amplified WT MET and exon 14 skipping MET (Table 1). Specifically, we split the MET reads into random non-overlapping subgroups of 1000 reads. Although at 1000 reads of coverage the detection of METΔ14 becomes somewhat blurry (Figure 2D), this threshold allows the generation of a large number of non-overlapping MET (1447) and METΔ14 (846) subsamples, and a large amount of training data is an important element for efficient learning of the NN. Each of the above-mentioned subgroups was converted into k-mers of length 31 and 16. MET expression was represented by the count of each k-mer spanning MET exons, and these data were used to train the NN. We observed that the learning curve at 16 k-mers was slightly better than the one at 31 k-mers (not shown), and thus we ran the following analyses using the 16 k-mers representation of MET. As training sets, we also used k-mer count frequency [23] for the full MET locus, k-mer count frequency for MET exons 13 ÷ 15 and coverage frequency for MET exons 13 ÷ 15.
As test sets, we used subsets of WT and exon 14 skipping MET from cell lines characterized by physiological MET expression. NN performance was investigated using as test sets: (i) random non-overlapping subgroups of 500, 1000 and 5000 reads, converted into 16 k-mer counts, (ii) k-mer count frequency on the full MET locus, (iii) k-mer count frequency on MET exons 13 ÷ 15 and (iv) coverage frequency for MET exons 13 ÷ 15.
The detection efficiency of METΔ14 using 16 k-mers frequency counts showed the best performance at 500 and 1000 reads coverage (Figure 3A,B), whereas at 5000 reads coverage all the different test sets performed in the same way, Figure

Neural Network Validation and Discovery on TCGA Samples
To validate the METΔ14 discovery potential of the above-described NN, we used a set of 690 RNAseq samples from the TCGA bronchus and lung dataset. The 690 samples were manually inspected using the Broad Institute's Integrative Genomics Viewer (IGV) [24], and we detected 17 exon 14 skipping events (2.4%), which is in line with the frequency of exon 14 skipping events reported in the literature [8,10]. On this tumor set, the NN trained with k-mer count frequency predicted 4 samples as METΔ14, but only one of them was among the 17 real exon skipping events (sensitivity 5.88%, specificity 99.5%). The NN trained with MET exons 13 ÷ 15 k-mer count frequency improved the detection of METΔ14 events to 9 out of 17 (sensitivity 52.9%), but this prediction included a massive increase in false positives, 129 samples (specificity 81.3%). The best results were obtained with the NN trained using only the coverage frequency for MET exons 13 ÷ 15, which predicted 18 skipping events, including all 17 true skipping events (sensitivity 100%) and one false positive (specificity 99.8%).
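As an illustration of how such figures can be derived from the counts given above, the following Python sketch computes sensitivity and specificity for the coverage-trained NN. It assumes the usual definitions, with the 690 - 17 = 673 samples without skipping taken as the negatives; the function name and this convention are assumptions made here for illustration, not a description of the code used in the study.

```python
def sensitivity_specificity(tp, fp, n_positive, n_total):
    """Return (sensitivity, specificity) in percent from prediction counts."""
    fn = n_positive - tp              # true events that were missed
    n_negative = n_total - n_positive
    tn = n_negative - fp              # negatives correctly left unflagged
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity

# Coverage-trained NN: 18 predicted events, 17 true positives, 1 false positive.
sens, spec = sensitivity_specificity(tp=17, fp=1, n_positive=17, n_total=690)
print(f"sensitivity {sens:.1f}%, specificity {spec:.2f}%")
```

Under these assumptions the result agrees with the values reported for this configuration up to rounding.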
Using the NN trained with the coverage frequency for MET exons 13 ÷ 15, we extended the METΔ14 discovery to 2605 TCGA tumor tissues; Table 2.

We could detect only one METΔ14 in 280 bladder samples. We then detected a few false METΔ14 in cervix, corpus uteri, heart/mediastinum/pleura, kidney and skin samples (Table 2). The six transcripts detected in cervix (Table 2) were erroneously classified as METΔ14 because they have a blurry coverage on exons 13 ÷ 15 (Figure 4C). However, when the full MET locus is observed, it is clear that these METΔ14 false positives are a completely different type of transcript. A shared characteristic of these transcripts is the high accumulation of reads in the second intron (approx. chr7:116,715,690-116,717,329; Figure 4A), in the 6th exon (Figure 4B), and in the last non-coding MET exon (Figure 4D). The above observation also applies to the other false METΔ14 detected in corpus uteri, heart/mediastinum/pleura, kidney and skin samples (Supplementary Figure S1).
A possible explanation could be that we are observing the transcriptional effect of a LINE1-MET fusion, which was first described a few years ago in triple negative breast cancers [25]. We further investigated this point by searching for LINE1 alignments in the subset of MET reads where only one of the two paired-end reads maps on MET. Indeed, in 10 out of 15 samples characterized by a transcription peak in the MET second intron, we detected LINE1 mapping reads (Supplementary Table S3). From the samples shown in Supplementary Table S3, we extracted the paired reads associated with MET reads, i.e., pairs in which only one read maps to the MET locus. We aligned these reads with BLAST [26] against a LINE1 sequence (chr1:62194249-62212928, hg38) and, indeed, some of these reads map to the LINE1 sequence (Supplementary Table S4). On the basis of the MET read position, we could identify the putative fusion point with MET, which is mainly located in MET intronic regions and in the last non-coding exon. Unfortunately, we cannot pair the TCGA RNAseq samples with genomic data to further validate the presence of a LINE1 insertion on the basis of genome sequencing.

Convolutional Neural Network (CNN) for the Detection of METΔ14
To detect MET exon 14 skipping events, we constructed a CNN made of a 1D convolutional layer, a 1D max pooling layer, a flat fully connected dense layer with 50 nodes and an output layer with one node (Figure 1C). The CNN was challenged with the same training and test sets used for the flat neural network. In this implementation, the convolutional layer included 10 kernels; for more information, see the Material and Method section. Figure 5 reports the METΔ14 detection ability of the CNN for different representations of the MET expression data. The results are organized by type of input data, i.e., whole MET exons k-mer counts (Figure 5A), MET exons 13 ÷ 15 k-mer count frequency (Figure 5B) and MET exons 13 ÷ 15 coverage frequency (Figure 5C). The best ratio between true positives and false positives is shown for all kernels using test samples characterized by 5000 reads coverage. As also seen for the NN (Figure 3), the specificity progressively decreases when the coverage is reduced.
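To make the first two stages of such a CNN concrete, the sketch below applies a single 1D kernel followed by 1D max pooling to a toy coverage vector in plain Python. The kernel values, the pooling size and the toy input are hypothetical and chosen only for illustration; in the trained model the kernels are learned, and the dense and output layers are omitted here.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation, the operation CNN libraries call convolution."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

# Toy profile: flanking exons covered, middle exon (exon 14 analogue) missing.
coverage = [5, 5, 5, 5, 0, 0, 0, 0, 5, 5, 5, 5]
drop_kernel = [1, 1, -1, -1]  # hypothetical kernel that responds to a coverage drop
feature_map = conv1d(coverage, drop_kernel)
pooled = max_pool1d(feature_map, size=3)
print(feature_map)
print(pooled)
```

The pooled values are what a dense layer would receive after flattening; a strong response in the first pooled position marks the drop in coverage at the start of the missing exon.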

Convolutional Neural Network Validation on Bronchus and Lung Samples
We validated the CNN model using the kernel 100, which is one of the best performing kernels independently of the coverage of the test set (Figure 5). The validation was done on the 690 TCGA bronchus and lung samples manually inspected for the presence of METΔ14. On this tumor set, the CNN trained with k-mer counts predicted 10 samples as METΔ14, but only one was a real exon skipping event (sensitivity 5.88%, specificity 97.6%, Supplementary Table S6). The best results were obtained using the CNN trained with MET exons 13 ÷ 15 k-mer count frequency: all 16 samples predicted as METΔ14 belong to the 17 true METΔ14 present in the data set (sensitivity 94.11%, specificity 100%, Supplementary Table S6). Finally, the CNN trained using only the coverage frequency for MET exons 13 ÷ 15 predicted 8 skipping events, all of which belong to the true skipping events (sensitivity 47.05%, specificity 100%, Supplementary Table S6). Since we observed that the NN detected some false positives in cervix tumor tissues (Figure 4), we evaluated whether the CNN was more specific than the NN. The CNN trained with MET exons 13 ÷ 15 k-mer count frequency detects the same false positives detected by the NN (Figure 4).

Sparsely Connected Autoencoders (SCA) to Detect MET Non-Canonical Isoforms
Our group has recently published a paper on the use of SCA for the identification of hidden functional regulatory elements in single cell RNAseq data [27]. We tested this type of autoencoder to see whether we could capture non-canonical isoforms in the TCGA samples used in the previous paragraph. The SCA was designed to take as input the k-mer count frequency or coverage frequency of MET exons. The SCA hidden layer, i.e., the latent space, represents MET exons. Input nodes are connected only to the exon nodes they are associated with (Figure 1B). We trained the SCA with the 2605 TCGA samples and clustered the latent space data using gridFLOW [28]. To estimate the stability of clusters generated using the SCA latent space, we compared thousands of pairs of clusters generated by SCA latent space clustering, as previously described by us [27]. The rationale of this approach is that, if a cluster's organization is conserved, it should be depicted by the multiple comparisons of randomly paired latent space cluster representations [27]. The best results were obtained using normalized [29] MET coverage frequency data (Figure 5A). Unfortunately, the stability of the clusters was very poor (Figure 6A). However, an inspection of a random subset of samples associated with cluster 2 (Figure 6B) suggests that at least cluster 2 is made mainly of transcripts recalling the organization of the MET-LINE1 fusion described in the previous paragraphs.
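The sparse connectivity described above can be pictured as a binary mask applied to an otherwise dense weight matrix. The following plain-Python sketch is a minimal illustration under stated assumptions: the feature-to-exon assignment, the weights and the input values are invented for the example, and the code is not the implementation used in the study. ReLU is used for the latent nodes, as stated for the dense layer in the Methods.

```python
def build_mask(feature_to_exon, n_exons):
    """mask[i][j] is 1 only when input feature i belongs to exon node j."""
    return [[1 if exon == j else 0 for j in range(n_exons)]
            for exon in feature_to_exon]

def encode(x, weights, mask):
    """Latent value of each exon node: ReLU of the masked weighted sum."""
    latent = []
    for j in range(len(mask[0])):
        total = sum(x[i] * weights[i][j] * mask[i][j] for i in range(len(x)))
        latent.append(max(0.0, total))
    return latent

# Hypothetical layout: features 0-1 lie on exon node 0, features 2-4 on exon node 1.
feature_to_exon = [0, 0, 1, 1, 1]
mask = build_mask(feature_to_exon, n_exons=2)
weights = [[1.0, 1.0] for _ in feature_to_exon]  # dense weights before masking
latent = encode([0.2, 0.3, 0.1, 0.1, 0.3], weights, mask)
print(mask)
print(latent)
```

Because masked weights contribute nothing, each latent node summarizes only the signal of its own exon, which is what allows the latent space to be read as a per-exon representation.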

Discussion
We used MET exon 14 skipping as a case study for the detection of genetic variants in cancer driver genes through deep learning. In recent years, a large body of evidence has indicated that MET inhibitors have a good anti-tumor effect in patients with a MET exon 14 skipping mutation, suggesting that MET exon 14 skipping may be a new target for NSCLC patients [30]. Thus, effective tools for the detection of MET exon 14 skipping are needed for a fast identification of patients suitable for MET targeted therapy.
It is notable that all the exon skipping tools we found in the published literature use nucleotide sequence analysis to infer skipping events, and they are only able to predict skipping events in a generalist way. Since we could not find any tool for detecting a single specific skipping event in a gene over a large cohort of specimens, we designed specific neural networks for the identification of MET exon 14 skipping using transcript expression information.
We designed a conventional neural network (NN) made of four fully connected hidden layers and a convolutional neural network (CNN) made of one 1D convolutional layer, one 1D max pooling layer and a fully connected dense layer. Although we performed an automated optimization of the hyperparameters, the prediction efficacy of our CNN and NN comes from the special attention we put on defining the optimal representation of the data for each architecture, i.e., k-mer counts for the CNN and coverage for the NN.
The NN and CNN training was done using the RNAseq data of a lung cancer cell line expressing an amplified form of the wild type MET (WT, EBC-1) and a gastric cancer cell line expressing exon 14 skipped MET (HS746T). The HS746T cell line was selected because, to the best of our knowledge, it is the only cell line displaying amplification of the MET exon 14 skipping isoform. MET gene amplification has been observed in about 2-5% of gastroesophageal cancers and represents an oncogenic driver and therapeutic target [31,32]. MET exon 14 skipping was initially described in NSCLCs (caused by a mutation in the splice donor site in intron 14) and afterwards reported in a variety of tumors, including gastrointestinal cancers, suggesting it as a potential mechanism leading to MET activation [33]. Therefore, HS746T, together with EBC-1, was an invaluable instrument to provide a large amount of data for the NN/CNN training. Validation was done instead using RNAseq data from lung cancer cell lines expressing MET at a physiological level (A549 expressing WT MET and NCI-H596 expressing exon 14 skipped MET).
Since we could not compare our models with respect to pre-existing methods for MET exon 14 skipping, we manually curated a set of TCGA data to provide an objective evaluation of the performance of our tool. Specifically, we manually curated a cohort of WT and exon 14 skipped samples made of the 690 RNAseq samples belonging to the TCGA (https://www.cancer.gov/tcga, accessed on 1 March 2020) bronchus and lung collection (1310 samples) showing a MET coverage of at least 5000 reads. Given the manual curation of this dataset, i.e., each single sample was inspected on IGV browser for the presence of MET exon 14 skipping, it represents a robust instrument to quantify the predictive performance of our neural network models.
Skewed datasets are not uncommon, and MET exon 14 skipping detection is a typical example. Although skewed datasets are tough to handle, our models, i.e., CNN and NN, seem to handle this issue efficiently, since sensitivity greater than 94% and specificity greater than 99% are reached on an extremely skewed data set such as the 690 TCGA bronchus and lung samples, with only 17 MET exon 14 skipping events (2.46%). Notably, the high sensitivity of the CNN is obtained with training based on k-mer counts spanning MET exons 13 to 15. Instead, for the NN the optimal sensitivity was obtained with training based on coverage data encompassing MET exons 13 to 15.
Our analysis, using both CNN and NN, on 2605 TCGA tumors (13 primary sites, Table 2) highlights that MET exon 14 skipping is an event peculiar to lung specimens. In addition, mainly in uterine cancers, we detected a set of MET exon 14 skipping false positives sharing a common feature: an unexpected peak of coverage in MET intron 2. This observation led us to speculate that we were observing a transcriptional signature of a LINE1-MET fusion event [25]. This hypothesis is supported by the identification of MET paired-end reads having one read mapping on MET and the other on the LINE1 sequence. Notably, transcription of the LINE1-MET fusion was observed in advanced stages of cancer [25,34], but very little is still known about the effect of the LINE1-MET chimera in cancer.
At present, we cannot eliminate LINE1-MET false positives, mainly because we do not have enough data to train a model detecting the LINE1-MET fusion that could be run in parallel with the MET exon 14 skipping models. However, we are generating large RNAseq datasets from MCF7, a breast cancer cell line harboring the LINE1-MET fusion [25], to build a specific CNN to be integrated with our MET exon 14 skipping models and improve their specificity.
Having identified more than one artefactual event in MET, we investigated the possibility of discovering such anomalous events by integrating a particular type of deep learning tool, sparsely connected autoencoders [27], with clustering techniques used in multicolor cytometry. Although the current implementation of the SCA tool could be further improved in terms of precision and sensitivity, we were able to detect from TCGA specimens a set of tumors sharing the putative LINE1-MET fusion.
Taken together, our results indicate that neural networks can be an effective tool to provide a quick classification of pathological transcription events. However, from the discovery point of view there is still some work to be done to obtain an effective discovery tool using sparsely connected autoencoders.

Generating the Data for the Neural Network Training and Test Set
We generated RNAseq data from EBC-1 [35], a non-small cell lung cancer (NSCLC) cell line harboring MET amplification, and from Hs746T, a gastric cancer cell line harboring an amplified MET exon 14 skipped isoform (METΔ14) [36]. Furthermore, we performed RNAseq on the human lung adenocarcinoma cell line A549, expressing c-Met [37], and on NCI-H596, derived from an NSCLC and expressing exon 14 skipped MET [38] (Table 1). Both A549 and NCI-H596 express physiological levels of MET.
Total RNA was extracted from cell lines using Trizol reagent (Invitrogen, Carlsbad, CA, USA), following the manufacturer's instructions. Total RNA was quantified using the Qubit 2.0 fluorimetric assay (Thermo Fisher Scientific, Waltham, MA, USA), and sample integrity, based on the RIN (RNA integrity number), was assessed using an RNA ScreenTape assay on TapeStation 4200 (Agilent Technologies, Santa Clara, CA, USA).
Libraries were prepared from 400 ng of total RNA using the RNAseq (total RNA Full length) sequencing service (Next Generation Diagnostics srl), which included rRNA-Globin depletion, library preparation, quality assessment and sequencing on a NovaSeq 6000 sequencing system using a paired-end, 300 cycle strategy (2 × 150) (Illumina Inc., San Diego, CA, USA). Reads were trimmed to remove adapter sequences using skewer (https://github.com/relipmoc/skewer, accessed on 1 March 2021). Reads were mapped using STAR on the ENSEMBL hg38 human genome assembly.

16/31 k-mer Training Set
The training set for the neural networks (NN/CNN) was generated using the cell lines with MET amplification: EBC-1 and Hs746T. EBC-1 and Hs746T reads were organized in randomly selected, non-overlapping subgroups of 1000 reads. This approach generated a large set of samples for training the NN, i.e., 1447 subsets for EBC-1 and 846 for Hs746T. Subsampled reads were associated with MET exons (supplementary Table S1) and converted into 16/31 k-mers using BFcounter [39].
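The subsampling and k-mer conversion described above can be sketched in a few lines of Python. This is only an illustrative stand-in for the actual pipeline, which used BFcounter: the function names, the fixed random seed and the decision to discard leftover reads are assumptions, and the sketch does not collapse reverse-complement k-mers.

```python
import random
from collections import Counter

def split_reads(reads, group_size, seed=0):
    """Shuffle reads and cut them into non-overlapping groups of group_size."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    n_groups = len(shuffled) // group_size  # leftover reads are discarded
    return [shuffled[i * group_size:(i + 1) * group_size]
            for i in range(n_groups)]

def count_kmers(reads, k):
    """Count every substring of length k over all reads of one subgroup."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy example: 10 short reads, subgroups of 4 reads, k = 3.
toy_reads = ["ACGTAC"] * 10
groups = split_reads(toy_reads, group_size=4)
kmer_counts = count_kmers(groups[0], k=3)
print(len(groups), dict(kmer_counts))
```

With real data, group_size would be 1000 and k would be 16 or 31; each subgroup then yields one k-mer count vector used as a training sample.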

16/31 k-mer Test Set
The test set for the neural networks (NN/CNN) was generated using the cell lines with physiological MET expression: A549 and NCI-H596. A549 and NCI-H596 reads were organized in randomly selected, non-overlapping subgroups of 500, 1000 and 5000 reads. Subsampled reads were associated with MET exons (supplementary Table S2) and converted into 16/31 k-mers using BFcounter [39].

Coverage Training and Test Set
The training and test sets for the neural networks (NN/CNN) were generated using the same RNAseq data used for the 16/31 k-mers training and test sets. For the training set, reads were organized in randomly selected, non-overlapping subgroups of 1000 reads; for the test set, reads were organized in subgroups of 500, 1000 and 5000 reads. Subsampled reads were used to calculate the coverage associated with MET exons 13, 14 and 15.
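As an illustration of the coverage representation, the sketch below computes a per-base depth profile over a region from read alignment intervals and normalizes it to frequencies. The text does not spell out how coverage frequency was normalized, so dividing each depth by the total depth is an assumption made here, as are the function names and the 0-based half-open coordinates.

```python
def coverage_profile(read_intervals, region_start, region_end):
    """Per-base read depth over [region_start, region_end), 0-based half-open."""
    depth = [0] * (region_end - region_start)
    for start, end in read_intervals:
        lo = max(start, region_start)   # clip the read to the region
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth

def to_frequency(depth):
    """Divide each depth by the total depth (assumed normalization)."""
    total = sum(depth)
    return [d / total for d in depth] if total else [0.0] * len(depth)

# Toy region of 10 bases covered by three reads; the last read overhangs the region.
intervals = [(0, 4), (2, 6), (8, 12)]
depth = coverage_profile(intervals, region_start=0, region_end=10)
freq = to_frequency(depth)
print(depth)
```

In the setting of the paper, the region would span MET exons 13, 14 and 15, and a skipped exon 14 would appear as a stretch of near-zero values in the middle of the profile.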

TCGA RNAseq Datasets
We registered a TCGA project for the study of MET exon 14 skipping events to obtain access to TCGA raw sequencing data, i.e., RNAseq BAM files. Since the size of the TCGA transcription data exceeds 200 TB, we progressively downloaded the BAM files on the basis of the cancer tissue. Then, from each BAM file, we extracted the reads encompassing the MET locus (chr7:116672196-116798377, hg38 human genome assembly). We kept only samples where the MET locus was covered by at least 5000 reads. To define 5000 reads as the minimal coverage for MET, we inspected the expected coverage for exons 13, 14 and 15 in A549 (WT MET cell line), in NCI-H596 (METΔ14 cell line) and in random subsets of 5000, 1000, 500 and 250 reads from NCI-H596 (Figure 2). We observed that the detection of exon 14 skipping becomes blurry below 5000 reads coverage. Together with the MET linked reads, we also extracted the MET paired reads, where only one of the two reads maps on the MET locus.
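The selection logic described above can be summarized with a small Python sketch operating on already parsed alignments. It is a simplified illustration, not the extraction code used in the study: each mate is reduced to a hypothetical (chromosome, start position) tuple, a read is assigned to the locus by its start position only, and the function names are invented for the example.

```python
MET_LOCUS = ("chr7", 116672196, 116798377)  # hg38 coordinates quoted in the text
MIN_MET_READS = 5000                        # minimal MET coverage quoted in the text

def in_locus(chrom, pos, locus=MET_LOCUS):
    """True when an alignment start falls inside the locus."""
    return chrom == locus[0] and locus[1] <= pos <= locus[2]

def classify_pairs(pairs):
    """Split read pairs into pairs fully on MET and pairs with a single MET mate."""
    both, single = [], []
    for mate1, mate2 in pairs:
        hits = int(in_locus(*mate1)) + int(in_locus(*mate2))
        if hits == 2:
            both.append((mate1, mate2))
        elif hits == 1:
            single.append((mate1, mate2))
    return both, single

def keep_sample(n_met_reads, threshold=MIN_MET_READS):
    """Keep a sample only if enough reads cover the MET locus."""
    return n_met_reads >= threshold

toy_pairs = [
    (("chr7", 116700000), ("chr7", 116700200)),  # both mates on MET
    (("chr7", 116715700), ("chr1", 62200000)),   # one mate on MET, one elsewhere
    (("chr2", 100), ("chr2", 300)),              # unrelated pair
]
both, single = classify_pairs(toy_pairs)
print(len(both), len(single), keep_sample(4999), keep_sample(5000))
```

Pairs in the second group are the ones later searched for LINE1 sequence, since their non-MET mate may originate from a fused element.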

Model Coding and Hyperparameter Selection for NN
We constructed an NN made of 6 layers. The input layer has a variable size depending on the type of input (k-mers or coverage). The 1st and 2nd hidden layers are made of 256 nodes and the 3rd and 4th of 128 nodes, all using ReLU (rectified linear unit) as activation function and 0.1 as dropout rate. The output layer is made of 1 node with a sigmoid activation function. We implemented the models in python (version 3.  3.3). Optimization was done using Adam (Adaptive moment estimation), with the following parameters: lr = 0.01, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e−08, decay = 0.0, loss = 'mean_squared_error'. Hyperparameter optimization was done using Talos (https://github.com/autonomio/talos, accessed on 1 January 2021), an automated tool to define the optimal combination of the hyperparameters. Specifically, Talos takes as input the hyperparameter space to be investigated, evaluates all possible combinations and selects the optimal configuration of the hyperparameters.
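The exhaustive search over a hyperparameter space described above can be illustrated with the standard library alone. The sketch below does not use or reproduce the Talos API; the search space, the scoring function and all names are hypothetical placeholders, with the score standing in for "train the model with this configuration and return its validation performance".

```python
from itertools import product

def grid(space):
    """Yield one dict per combination of the values listed in `space`."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def best_config(space, score):
    """Return the configuration with the highest score."""
    return max(grid(space), key=score)

# Hypothetical search space; the values actually scanned are not given in the text.
space = {"lr": [0.1, 0.01, 0.001], "dropout": [0.0, 0.1, 0.2]}

def toy_score(cfg):
    # Placeholder for training a model and measuring validation performance.
    return -abs(cfg["lr"] - 0.01) - abs(cfg["dropout"] - 0.1)

n_configs = len(list(grid(space)))
best = best_config(space, toy_score)
print(n_configs, best)
```

The number of configurations grows as the product of the number of values per hyperparameter, which is why the space to be investigated is usually kept small.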
The trained NN is implemented in a docker container together with all the tools needed to extract MET reads from fastq data. The NN can be used for the discovery of METΔ14 using conventional RNAseq or MET targeted RNAseq. The tool can be requested from the corresponding author. It is provided free of charge to academia and non-profit organizations for research use only.

Model Coding and Hyperparameter Selection for Sparsely Connected Autoencoders (SCA)
Autoencoder learning is based on an encoder function that projects input data onto a lower dimensional space. Then, a decoder function recovers the input data from the low-dimensional projections, minimizing the reconstruction error. We implemented the models in python (version 3. . Optimization was done using Adam (Adaptive moment estimation) with the following parameters: lr = 0.01, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e−08, decay = 0.0, loss = 'mean_squared_error'. ReLU (rectified linear unit) was used as activation function for the dense layer.

Conclusions
Taken together, our results indicate that neural networks can be an effective tool to provide a quick classification of pathological transcription events, and sparsely connected autoencoders could represent the basis for the development of an effective discovery tool in this field.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/ijms22084217/s1, Figure S1: METΔ14 false positive detected in corpus uteri. (a) WT MET from A549 RNAseq sample (33 million reads), 27,152 reads mapping on MET locus, (b) METΔ14 from NCI-H596 RNAseq sample (27 million reads), 24,850 reads mapping on MET locus, (c-f) False METΔ14 in corpus uteri samples, (g) False METΔ14 in heart/mediastinum/pleura samples, (h-l) False METΔ14 in kidney samples, (m) False METΔ14 in skin samples. Table S1: Set of COSMIC census genes present in the ExonSkipDB. The first column is the set of genes associated with articles describing the presence of a skipping event in that gene linked to cancer. Table S2: samples in TCGA bronchus and lung predicted as METΔ14 using different training configurations. METΔ14 score ≤ 0.1 predicts a skipped event. NN quality score ranges between 1 and 0, where 1 indicates optimal NN performance.

Data Availability Statement:
Examples and instructions for the use of the NN/CNN tools are available at https://github.com/kendomaniac/metObservatory (accessed on 13 April 2021). RNAseq used for training and test data are available at https://github.com/kendomaniac/metObservatory (accessed on 13 April 2021).

Conflicts of Interest:
The authors declare no conflict of interest.