Validation of Splicing Events in Transcriptome Sequencing Data

Genomic alignments of sequenced cellular messenger RNA contain gapped alignments which are interpreted as consequence of intron removal. The resulting gap-sites, genomic locations of alignment gaps, are landmarks representing potential splice-sites. As alignment algorithms report gap-sites with a considerable false discovery rate, validations are required. We describe two quality scores, gap quality score (gqs) and weighted gap information score (wgis), developed for validation of putative splicing events: While gqs solely relies on alignment data wgis additionally considers information from the genomic sequence. FASTQ files obtained from 54 human dermal fibroblast samples were aligned against the human genome (GRCh38) using TopHat and STAR aligner. Statistical properties of gap-sites validated by gqs and wgis were evaluated by their sequence similarity to known exon-intron borders. Within the 54 samples, TopHat identifies 1,000,380 and STAR reports 6,487,577 gap-sites. Due to the lack of strand information, however, the percentage of identified GT-AG gap-sites is rather low. While gap-sites from TopHat contain ≈89% GT-AG, gap-sites from STAR only contain ≈42% GT-AG dinucleotide pairs in merged data from 54 fibroblast samples. Validation with gqs yields 156,251 gap-sites from TopHat alignments and 166,294 from STAR alignments. Validation with wgis yields 770,327 gap-sites from TopHat alignments and 1,065,596 from STAR alignments. Both alignment algorithms, TopHat and STAR, report gap-sites with considerable false discovery rate, which can drastically be reduced by validation with gqs and wgis.


Introduction
Analysis of transcriptome sequencing data focuses on differential expression of genes, as well as alternative splicing. Genomic alignments of transcriptome sequencing data contain alignment gaps. Alignment gaps are landmarks indicating potential splice-sites. Thus, due to the complexity of eukaryotic transcriptomes, transcript reconstruction from sequenced mRNA encompasses considerable ambiguities. Low specificity of reported alignment gaps may seriously compromise validity of analysis results. Currently, transcript reconstruction algorithms suffer from inaccuracies [1,2] and even the (much simpler) identification of splice events is associated with a high false discovery rate (FDR) [3]. Additionally, lack of strand information makes assignment of the correct strand difficult.
The FDR can be reduced by additional validations. Early and simplifying "validation" procedures include restriction to (for example) GT-AG-sites [4] and filtering for minimal exonic match length or filtering for minimal number of supporting alignments (as reported by RGASP, RNA-seq Genome Annotation Assessment Project [3]).
Recently 184 splice sites with non-canonical dinucleotides and U2/U12 like consensus sequences have been reported [5] and therefore, filter based on intronic dinucleotides are likely to be inappropriate. The rationale for usage of quality scores is to differentiate between biology (for example regulated splicing events) and artefacts (for example stochastic noise in splicing, sequencing and alignment). In the following, two approaches for validation of splicing events in transcriptome sequencing data are described and evaluated: An approach solely relying on alignment data (gqs) and a second approach which additionally includes information from genomic sequence (wgis).

Genomic Alignments and Gap-Sites
The following section describes the definition of gap-sites and the parameters collected from alignment data. Alignment results are reported in BAM files and structure of genomic alignments is described in BAM file format [6] (see also Appendix A.1). Alignment data contained in BAM files (currently) does not include strand information because it is lost during sequencing [7]. In genomic alignments of sequenced cellular mRNA, reads crossing splice sites cover at least two exons which are separated by an intron and thus result in gapped alignments ( Figure 1). Alignment gap locations possibly shared by multiple alignments define a gap-site. Assuming that alignment gaps result from splicing events, gap-site positions are candidates for splice junctions. A gap-site is characterised by two genomic positions: lend and rstart and the alignments are called "supporting alignments". The number of supporting alignments is called alignment coverage or nAligns-value for the respective gap-site. The number of samples in which a gap-site is identified is called the multiplicity (nProbes) value for this gap-site.

Left Exon
Right Exon gap−site lend rstart Alignments Genome Figure 1. Definition of gap-sites. Since no strand information is provided by alignments, gap-site positions are defined reading direction of genomic sequence (left to right). lend is defined as position of last exonic nucleotide on the left side. rstart is defined as position of first exonic nucleotide on right side. The "left" and "right" side are directions on genomic reference sequence, defined by lower and higher position coordinates respectively relative to the actual reading position (in accordance with common reading orientation). The gap-site is covered by three alignments (nAligns = 3).
For single alignments covering gap-sites, the minimum length of both (exonic) nucleotide matching regions (the minimum CIGAR length value or mcl value, described in Appendix A.1.3) is included into gqs. The mcl criterion provides a lower limit for support of an alignment gap by subsequent nucleotide matches. Two other alignment derived values, quartet sum of mcl (=qsm) and number of distinct (left) alignment start positions (nlstart) are described in detail in Sections Appendixs A.1.4 and A.1.5 respectively.
On the plus-strand the splice-site donor (or 5 splice-site) is located at lend position and the splice-site acceptor (or 3 splice-site) is located at rstart position ( Figure 1). On the minus-strand the opposite rule applies.

Intronic Genome Sequence at Gap-Sites
Statistical properties of gap-sites are evaluated using distribution of intronic dinucleotides (IDIN, for example GT) and IDIN-pairs (for example GT-AG) and sequence logos. Therefore, the two subsequent nucleotides following the lend position and preceding the rstart position are analysed.
The location of IDIN is called "left" or "right" as long as no strand information is considered or available. Using strand information exon-intron boundaries can further be classified into 5 -and 3 -splice-sites. For "+"-strand, the raw left IDIN are from 5 splice-junctions (called 5 IDIN) and the raw right IDIN are from 3 splice-sites (called 3 IDIN). For "−"-strand, the reverse-complement of right IDIN are from 5 splice-junctions (called 5 IDIN) and the reverse-complement of left IDIN are from 3 splice-junctions (called 3 IDIN).
Gap-sites are referred to by 5 IDIN (for example GT-sites) or by IDIN-pairs (for example GT-AG-sites or AT-AC-sites). From here on, IDIN and IDIN-pairs in strand-corrected orientation (5 to 3 ) are shown in italic letters. Non cursive printed nucleotides represent uncorrected genomic sequence (left to right).

General Considerations on Quality Scores
Recognition of rare events (for example non-canonical splicing) requires high sensitivity and therefore data is merged from multiple samples and large amounts of alignment data. Merging of data from multiple samples and the structure of BAM files implicate that validating information on gap-sites emerges sequentially. Therefrom, two basic requirements for gap-site scores derive: • Monotonicity: By adding new alignment information, the score must not decrease and should increase with the number of alignment matches to genomic nucleotides. Due to monotonicity, mean or median values cannot be used. This criterion precludes that an initially high score decreases when low scoring alignments are added. • Informational limit: Collection of information for each gap-site is restricted to a static size where splicing events are sufficiently confirmed. Using this limit, data from multiple alignments can be packed into integer variables allowing fast processing without necessity of specialised data containers.

Definition of Gap Quality Score (gqs)
The Gap Quality Score (gqs) solely uses information extracted from genomic alignments and therefore can directly be calculated from (multiple) BAM files. As shown in Equation (1), gqs essentially consists of two factors: qsm and nlstart. A calculated example is given in Appendix (Table A1). The gqs is distributed between 0 and 1000 (for read length of 100 and on a 64 bit operating system). A score of 1000 requires at least 8 supporting alignments and four maximal mcl values (at least 50 matching nucleotides on both sides in at least 4 alignment gaps for qsm= 200). Gap-sites attaining a gqs of 1000 are called "gqs-validated" (see Figure 6b for empirical base of definition).
The structure of gqs implies that the probability of being gqs-validated increases with higher alignment coverage.
The gqs is defined as: The number of different alignment start positions for a gap-site denoted nlstart value (see Appendix A. 1.5). qsm is the sum of the four largest minor alignment match lengths beside the alignment gap (see Appendix A. 1.4). n is the number of bytes in an integer (n = 4 on a 32 bit system and n = 8 on a 64 bit operating system). The gqs additionally is truncated to integral values.

Definition of Weighted Gap Information Score (wgis)
The Weighted Gap Information Score (wgis) is a successor of gqs because of the gqs being too insensitive on gap-sites supported by only few alignments. From alignments, the wgis basically utilises the same input values as gqs (qsm and nlstart). Additionally, the similarity of the genomic sequence with splice-sites is evaluated using MaxEnt scores: score5 for 5 splice-sites and score3 for 3 splice-sites [8,9]. All factors are weighted (using log2). Thresholds are applied to qsm, score5 and score3. Information on empirical base for thresholds is provided in supplemental material. For gap-sites with any quality below a given threshold, wgis = 0 is returned. All sites with wgis = 0 are called "wgis-validated". Strand information is reported by wgis via signature. The detailed definition of wgis is given in Equation (2).

Sequence Similarity between Gap-Sites and Splice-Sites
The probability of a gap-site being a true splice-site is rated using the MaxEnt score developed by Burge et al. The MaxEnt score uses a log-odd ratio (defined as ratio of sequence motif frequency in true splice sites and decoy). Sequence 9-mers and 23-mers are analysed for evaluation of 5 and 3 splice-sites respectively (positions −3 to +6 for 5 sites and −20 to +3 for 3 sites). The maximum-entropy is an estimation procedure for probability distributions originally developed in statistical mechanics [10,11]. In an iterative approach, the maximum entropy distribution has been estimated in advance. Scores are extracted from pre-calculated tables based on sequence related indices. In their paper, Burge et.al. describe that a training data set had been derived from a set of 1821 transcripts spanning 12,715 introns (5 and 3 splice sites).

Strand Information in wgis
The strand sign (s str ) of wgis validated gap-sites is determined by evaluation of MaxEntscore3. Calculation of score3 at the left exon-intron boundary (around lend position) in original orientation (left to right) yields score3 in "+"-strand orientation (score3(+)). Calculation of score3 at the right exon-intron boundary (around rstart position) in reverse-complement orientation (right to left) yields score3 in "−"-strand orientation (score3(−)). The strand sign is calculated by comparison of these two values as shown in Equation (3).
The biological interpretation is, that when the right boundary of a gap-site has stronger resemblance to a 3 splice-site than the left boundary of the gap-site (read in reversed direction), then "+"-strand is assumed.

GQL
The distribution of wgis, naturally separates gap-sites into four sub-populations. The derived categories consist of quality levels 0 to 3 named "Gap Quality Level" (gql): gql0 to gql3. Gap-sites assigned to gql-level i are denoted gqli-sites. The set of gql0 sites for example is a s "not wgis-validated" (wgis = 0). Details of gql definition are described later on in Section 2.3.1.

Annotation of Gap-Sites
During annotation of query objects derived from genomic alignments, overlap with annotated genomic features is examined and (if existent) an optimal matching pair is selected. The annotation process for gap-sites is implemented in (CRAN) R-package refGenome (see Figure 2). First, intron-connected exon pairs need to be identified because exons are included in GTF file format (the format down-loadable from Ensembl or UCSC) as distinct entities only related by transcript_id and exon_number.
In a second step, a query list (containing gap-site data) and a reference list (containing positions of connected exons) are traversed simultaneously. During traverse, for each gap-site a region containing possible overlaps is searched for an optimal match (minimising sod).
Overlap is defined as the intersection of element-ranges in the query and the reference list. Element-ranges in the query list range from the first nucleotide with an alignment match (leftmost C in query3 in Table A1) to the last nucleotide with an alignment match (rightmost G in query1 in Table A1). An element-range in the reference list range from the start position of the left exon to the end position of the right exon. When no overlap is found (i.e., all intersections are empty), no annotation data is provided for a (query) gap-site (The R implementation in refGenome reports missing overlaps as "NA"-values (Not available).). The annotation procedure reports exact matches (sod = 0) as well as inexact matches (sod > 0, Figure 2). Exact matches are gap-sites residing on annotated splice sites. Inexact matches may be due to biology, for example alternative splicing events (exon skipping, intron retention or alternative donor or acceptor sites) as well as splicing errors. However, inaccuracies may also be due to technical or bioinformatic issues like sequencing or alignment errors.

Validation Strategy for Quality Scores
As no reference method is available for validating gap-sites (the biochemical RT-PCR method allows validation of small numbers of splicing events and is not feasible for genome wide analysis), a systemic approach using global statistical properties is utilised. The values which are statistically analysed reflect the complementarity to the 5 end of U1 snRNA or known distribution of nucleotides around splice-sites. In detail, the used criteria are:
The reported left IDIN (without strand information) are CT, GT, GC and AT and the right IDIN are AG, AC, GC and AT (both decreasingly ordered by abundance). In total, TopHat reports only gap-sites with 6 (from possible 256) different IDIN-pairs (GTAG, CTAC, GCAG, CTGC, ATAC, GTAT; decreasingly ordered by abundance).
In order to provide an upper limit for percentage of GT-AG gap-sites, the proportions of GT-AG and CT-AG sites are added. In single samples, the observed percentages vary in the range from 95.21% to 97.34% ( Figure 3). The mean proportion is 96.4% (SD = 0.46%). When multiple samples are merged, the proportion of GT-AG sites will decline due to accumulation of noisy observations. The proportion in 54 merged transcriptomes is 89.6%. Thus, when multiple samples are merged, the proportion of GT-AG gap-sites can be expected to vary in the range between 89% to 98%.

Estimated GT−AG percentage in TopHat
Percentage of GT−AG gap−sites

Alignments from STAR Aligner
STAR reported in total 2.4 × 10 9 alignment gaps containing 6,487,577 gap-sites. Gap-sites from STAR aligner were supported in mean by 371 alignments. Therefrom 4,437,270 (68.4%) gap-sites are supported by only one alignment (nAligns = 1). STAR reports 129,758 (2.0%) gap-sites are present in all 54 samples and 256,044 (3.94%) annotated splice-sites. STAR reports all possible nucleotide combinations in left and right IDIN ( STAR also allows gap-sites with IDIN's containing N, for example AN and NT are present in left IDIN. From left IDIN, 31 were NN and from right IDIN, 74 were NN.) .
The proportion of GT-AG gap-sites in single samples (estimated by adding percentages of GT-AT-sites and CT-AG-sites) vary in the range from 57.72% to 82.35% (Figure 4). The mean proportion in single samples is 72.73% (SD = 5.43%). In merged data from 54 samples, 42.2% GT-AG (or CT-AG) sites are present. Thus, when multiple samples are merged, the proportion of GT-AG gap-sites can be expected to vary in the range between 42% and 83%.

Comparison of Alignment Numbers
In general, the number of gap-sites decreases with higher alignment coverage (Figure 5a). For gap-sites with nAligns >100, the gap-site numbers essentially are distributed equally in TopHat and STAR alignments.
There are 906,219 gap-sites which are reported by both aligners (TopHat and STAR). On these gap-sites, the nAligns numbers from STAR are approximately 76% of numbers from TopHat ( Figure 5b). Thus, the in mean 10 times lower nAligns numbers in STAR alignments result only in a 24% decrease on single gap-sites and is mainly caused by a 10 times larger number of gap-sites with nAligns <100 (which are only partially also reported by TopHat).

Distribution of gqs
The gqs values distribute in a characteristic U-like shape ( Figure 6a). Alignments from STAR contain more gap-sites with gqs <200 than alignments from TopHat.
The median gqs value was 195 from TopHat alignments and 12 from STAR alignments. In order to achieve a median sod of 0 (equivalent to: >50% of gap-sites are annotated) via the gqs-based filter, a gqs of ≥990 is needed in TopHat-alignments and gqs = 1000 in STAR-alignments. Thus, a threshold of gqs = 1000 is used as threshold for gqs-validation.

Number of gqs-Validated Gap-Sites
In alignments from TopHat, 156,251 gap-sites were validated by gqs (18.5% of all reported gap-sites). In alignments from STAR, 166.294 gap-sites were validated by gqs (2.6% of all reported gap-sites). The proportion of gqs-validated gap-sites increases uniformly with alignment depth, producing a sigmoidal increasing line on log 10 -transformed nAligns numbers (Figure 7). For validation of more than 50% gap-sites with a coverage of more than 468 alignments TopHat alignments and more than 240 alignments STAR alignments are required. The rising proportions of gqs-validated gap-sites reflects the fact that gqs-validation is more likely for higher alignment coverage.

Distribution of IDIN on gqs-Validated Gap-Sites
In gqs-validated gap-sites, the upper limit for the percentage of GT-AG-sites is 98.61% (49.54% + 49.07%) from TopHat alignments and 97.59% (49.13% + 48.46%) from STAR alignments ( Table 2). The analogue calculation for splice-sites from the minor spliceosome (AT-AC sites) yields estimates of 0.25% for TopHat and 0.08% for STAR.

Validation of wgis
Observed values for wgis ranged from −119.2 to +118.7. In total, 1,083,629 gap-sites were validated by wgis in either TopHat or STAR alignments. In alignments from TopHat, 770,327 (77.0%) gap-sites were validated by wgis. In alignments reported by STAR, 1,065,596 (16.43%) gap-sites were validated by wgis. In general, the proportion of wgis-validated gap-sites increases with number of supporting alignments (nAligns, Figure 8). Using wgis provided strand information, 50.77% and 50.49% of wgis validated gap-sites were assigned to "+"-strand in TopHat and STAR alignments respectively. The wgis-distribution is almost identical in gap-sites assigned to "+"-strand and to "−"-strand ( Figure 9).

Definition of GQL Limits
Parallel evaluation of median sod values and number of gap-sites (Figure 9b) in STAR alignments shows that wgis-validated gap-sites appear to be a heterogeneous population separated by two natural limits: • A limit at |wgis| = 30, where number of gap-sites and sod have a local minimum. • A second limit at |wgis| = 80, where median sod drops to <10 in STAR alignments (Median sod = 0 for |wgis| > 75 in TopHat alignments).
Together with the limit |wgis| > 0 (separating wgis-validated from not validated gap-sites), four different groups, called gql (gap-site quality level) can be separated. Definition of gql and proportions of assigned gap-sites are shown in Table 3.

Distribution of Intronic Dinucleotide Pairs
Distribution of intronic dinucleotide pairs in gap-sites with different gql levels are shown in Table 4. Intronic dinucleotide pairs associated with the minor spliceosome (AT-AC) constitute <1% of wgis-validated gap-sites.  [18]. Empty spaces indicate proportion <0.005%. Obviously GT-AT sites and CT-AC sites are assigned to the wrong direction.

Sequence Logos of Validated Gap-Sites
Sequence logos of wgis validated gap-sites for 5 -junctions (Figure 10a,c) and for 3 -junctions (Figure 10b,d) show high similarity between TopHat and STAR alignments. The sequence logos show presence of the second GT at intronic position 5 and 6 in 5 splice-junctions (positions 8 and 9 in Figure 10a,c) as well as pyrimidine rich 3 terminal intronic regions (positions 1 to 6 in Figure 10b,d). The sequence logos are also highly similar to those calculated on annotated splice sites (shown in supplemental material).

Comparison of TopHat and STAR Alignments and gqs and wgis Validation
For comparison of validated gap-sites from TopHat and STAR aligner, tables with (gqs and wgis) validated gap-sites from TopHat and STAR were merged (using genomic coordinates as identity criterion).

Global Statistics
In total, 6,581,738 gap-sites were identified by either TopHat or STAR in the 54 fibroblast samples. Therefrom, 5,581,358 (84.8%) are only present in STAR alignments, 906,219 (13.8%) are identified by both aligners and 94,161 (1.43%) are solely present in TopHat alignments (Table 5). From alignments reported exclusively by STAR, exclusively by TopHat or by both aligners, 97.6%, 82.9% and 65.0% of gap-sites are not validated by either gqs or wgis.
In result, STAR as well as TopHat do report gap-sites not seen by the other aligner. Some of these may even be validated by both scores.

Relation between Score Value and Gap-Site Multiplicity
The proportion of samples in which a gap-site is observed can be related to (absolute) score values (for gqs and |wgis|) which describe to which extent one criterion reflects the other one. Figure 11 shows that, in general, increasing score values (gqs and |wgis|) are associated with a higher proportion of samples in which a gap-site is identified.
There are gap-sites with low gqs which are present in a substantial fraction of samples. A high gqs is required (980 for TopHat and 975 for STAR alignments) in order to achieve presence in the majority (>50%) of samples.
Relations for |wgis| indicate a clearer relationship than observed for gqs. Gap-sites are present in the majority of samples (>50%) when their |wgis| value is >82. The median multiplicity becomes >1 when |wgis| is >26. Thus the gap-site multiplicities also support the gql classification criteria (In detail: The 50% limit is exceeded at |wgis| values of 81 in TopHat alignments and 82 in STAR alignments. The nProbes = 1 limit is exceeded at |wgis| values of 28 in TopHat alignments and 27 in STAR alignments.). Additionally, the straight and sigmoidal shape of the curve for |wgis| shows that wgis has a much better capability to predict gap-site multiplicity than gqs.

Dependence of Gap-Site Validation on Gap-Site Coverage
The distribution of gap-site numbers in TopHat and STAR alignments verified by gqs or wgis for different levels of alignment coverage is shown in Figure 12. Globally, STAR reports 38.3% more wgis validated gap-sites than TopHat and 6.4% more gqs-validated gap-sites (More details are provided in supplemental material.). A (local) maximum of gap-site numbers is present at ≈10 4 alignments coverage, a value essentially determined by the total sequencing depth (on 54 fibroblast samples). The majority of wgis validated and only a small minority of gqs validated gap-sites are supported by <100 alignments: For gqs-validation in 3.0% and 5.7% of gap-sites in TopHat and STAR alignments respectively. For wgis-validation in 74.1% and 71.4% of gap-sites in TopHat and STAR alignments respectively.

Relation between gqs and wgis Validation
Globally, the majority of verified gap-sites are verified by wgis alone (Table 5) but there may be differences in gap-sites with high alignment coverage. Therefore the verification with gqs and wgis is compared for different nAligns ranges (Figure 13). The percentage of gap-sites verified only by gqs is <3.2% in TopHat alignments and <6% in STAR alignments and thus very low.
The vast majority of gap-sites is verified by wgis alone or by both scores with the proportion of gqs validated sites increasing when alignment coverage is larger than 100. Proportions including gap-sites not validated by both scores (gqs and wgis) are shown in Supplemental material.

Unvalidated Gap-Sites
A gap-site is not gqs validated when nlstart < 8 or qsm < 200. A gap-site is not wgis validated when qsm ≤ 15 or when MaxEnt score5 ≤ 1 or score3 ≤ 1 in either strand direction. Low read coverage of gap-sites reduces likelihood of gqs validation while wgis validation still relies on MaxEnt scores as long as qsm > 15. While for gqs-validation, a coverage at least 8 alignments is required, wgis-validation can already be achieved with one single alignment. From wgis-unvalidated gap-sites 8431 (1.00%) in TopHat and 9803 (0.16%) in STAR alignments are found in all 54 samples. Also, TopHat and STAR report 8689 and 10,590 Ensembl annotated splice-sites respectively not validated by wgis as well as 7330 and 8186 splice-sites respectively not validated by gqs. Sequence logos for gap-sites reported by STAR aligner and not validated by wgis are shown in Figure 14. The sequence logos largely do not match splice-site consensus sequences demonstrating that validation is a critical prerequisite for further analysis.

Maximal Alignment Coverage on Unvalidated Gap-Sites
In gap-sites not validated by gqs, the maximal alignment coverage is 2,994,298 in alignments from TopHat and 1,664,581 in alignments from STAR comprising 45.18% (TopHat) and 30.67% (STAR) of the maximal number of supporting alignments.
In gap-sites not validated by wgis, the maximal alignment coverage is 5,880,547 in alignments from TopHat and 3,545,564 in alignments from STAR comprising 88.73% (TopHat) and 65.34% of the maximal number of supporting alignments.

gqs
Gap-sites collected from 54 fibroblast samples and validated by gqs contain a reasonable high proportion (≈98% in both aligners) of GT-AG sites. Using gqs, ≈156,000 gap-sites are validated in TopHat alignments and ≈ 166,000 in STAR alignments (6% more) in merged data from 54 samples. It may be advantageous that no additional information from genomic sequence is required for calculation of gqs. But for the validation of >50% of gap-sites, alignment coverage of more than ≈250-500 is required, a high number indicating limitations in sensitivity. This number is a static value independent of number of samples and depends solely on total alignment depth (on merged samples together). Also, gqs cannot provide strand information. Thus, validation of gap-sites with gqs shows, that suitable quality criteria can be derived from pure alignment data (BAM file content), but sensitivity is low and strand information must be provided from a second source which might be a challenging task.

WGIS
The wgis uses the similarity of a gap-site with known splice-sites as quality criterion additionally to the alignment data. Using wgis, ≈770,000 gap-sites are validated in TopHat alignments and ≈1,066,000 in STAR alignments (38% more) in merged data from 54 samples. Thus, wgis validates 4.9 times (in TopHat alignments) and 6.4 times (in STAR alignments) as much as gqs. Approximately 3 4 of gap-sites reported by TopHat are validated by wgis. Thus, the validation sensitivity is considerably higher in wgis than in gqs. Also, the strand information provided by wgis (without requirement of annotation) is a potentially valuable feature. Gap-sites validated by wgis can further be categorised into quality levels gql1 to gql3. Thus, usage of features in genomic sequence appears to be a valuable source of information for validation of gap-sites which may substantially increase sensitivity and also may provide strand information.

Sensitivity and FDR
TopHat is known to have a low mapping yield while STAR has been shown to report many alignments with a high base-wise accuracy [3]. RGASP further found the rate of correctly mapped gapped-reads in the range of 96.3% to 98.4% but also notes at the same time, that many erroneously reported gap-sites are a major obstacle in STAR alignments [3].
For detection of occasionally observed splice sites and low abundant events, merging of data from multiple samples is necessary in order to increase sensitivity [19]. But usage of this approach increases sensitivity as well as FDR, possibly leading to GT-AG proportions of <50% in alignments from STAR. Also, the fact that 68.4% of unique gap-site positions (≈4.4 × 10 6 in our 54 samples) are only supported by one single alignment indicates a low specificity (high FDR) for STAR aligner.
An FDR in this magnitude diminishes reliability of downstream analysis. It is easily conceivable how high FDR for gap-sites deteriorate results of dependent functions like intron detection. Thus, we propose that for transcript reconstruction, a high FDR for gap-sites may be an impeding factor additionally to the difficulties imposed by the complexity of the human genome [2].

Distribution of Gapped Alignments
In merged data from 54 RNA-seq samples, we found that although STAR aligner reports much more gap-sites than TopHat, the number of gapped alignments in STAR alignments actually is lower than in TopHat alignments. Thus, the higher mapping numbers reported by STAR (which partly are due to the ability to report truncated alignments [3]) seem to be not present in gapped alignments. Also, the higher number of gap-sites reported by STAR is mainly due to a different distribution of gapped alignments leading to a reduced alignment coverage on gap-sites (in mean 75.6% coverage in TopHat alignments). STAR reports 6.4% more gqs-validated gap-sites and 38.3% more wgis-validated gap-sites than TopHat. Thus the already reported higher sensitivity for STAR in general [3] is also present in (gqs or wgis) validated gap-sites.

Validation Strategies
As validation strategies for gap-sites may greatly enhance accuracy of the detection process, alternative strategies have already been proposed or implemented: • Restrict alignment gaps to a small subset of possible intronic dinucleotides (as implemented in TopHat) • Filter on alignment gaps supported by minimal 20 matching nucleotides on each side of the gap [3] • Filter junctions on number of supporting alignments [3] • Filter out alignments assessed as low confidence by regression model (basing on nucleotide distributions around splice junctions and intron size, as implemented in oLego) [20].
As the first filter bases on look-up of genomic sequence not present in BAM files and the second filter requires examination of CIGAR items adjacent to alignment gaps, both filters cannot be implemented with a simple algorithm. A filter simply basing on number of supporting alignments requires consideration of alignment depth (and a normalisation step). Also, in order to provide a consistent quality criterion, very high threshold values are required (at least 1.5 ×10 6 alignments; see Section 2.6.1). Thus, using this criterion alone aggravates the low sensitivity of gqs for gap-sites with low coverage.
Based on considerations that the first three strategies "as is" are insufficient, we combined alignment based criteria into gqs and included sequenced based information using a complex criterion (MaxEnt) into wgis. The results indicate that criteria solely based on BAM-file content seem to lack sensitivity, but this handicap may be overcome by assessment of splice-site similarity. Criteria which are more sophisticated than simply filtering on IDIN-pairs even have the potential to detect rare splice-events possibly blurred by statistical noise.

Usage of MaxEnt Score
Filtering gap-sites based on MaxEnt scores may greatly enhance percentage of GT-AG-sites although this method might have limitations. First, evaluation of genomic sequence is not absolutely reliable for prediction of splice-site utilisation [21]. Also, MaxEnt in its current form assigns low scores to splice-sites recognised by minor spliceosome and ignores positions 10 and 11 of the 5 splice-site. A correction cannot be introduced by shifting thresholds because this would lead to increased validation of gap-sites not similar to splice-sites recognised by the major spliceosome. Due to optimality properties of the MaxEnt estimation process, the only way for possible improvement strategies are: • Calculation of new MaxEnt score tables basing on larger (or modified) "standard" samples. • Creation of a second score solely basing on recognition by the minor spliceosome and using the maximum of both.

Limitations
Although it can be supposed, that reasonable proportions of nucleotide distributions (IDIN, IDIN-pairs and sequence logos) indicate that algorithms and decision rules act correctly in a majority of cases, these numbers should be interpreted with caution. As (potential) splice-sites are counted by occurrence and not by abundance, merging of data from multiple sources (for example by merging samples or by collecting data in an annotation data base) inevitably leads to rising numbers for rare events and thus may lead to overestimated proportions. The size of this effect is affected by the extensiveness of included information.

Fibroblast Samples
The transcriptome data analysed in this study originated from an investigation where effects of age, gender and UV exposition were studied in samples of dermal fibroblasts obtained from healthy human donors. A main result of the study is, that no consistent differential expression of genes is observable. The gene expression hence is assumed to be essentially homogeneous in these samples.
Details of sample preparation and the results differential expression analysis recently have been described elsewhere [12]. In short, the mRNAs of 54 samples were sequenced on an Illumina HiSeq 2000 sequencer yielding in total 8,785,501,333 reads. Subsequent alignments were calculated on unpreprocessed FASTQ files. Alignments were calculated using bowtie2 (2.2.5) [13], tophat (2.0.14) [14] and STAR (2.4.1d modified) [15]. Collection and processing of dermal samples from donors was approved by the Ethical Committee of the Medical Faculty of the University of Düsseldorf (# 3361) in 2011.

Software
All described algorithms are implemented in R and are available from CRAN or Bioconductor. The software interface to samtools, the container and extraction algorithms for gap-sites are written in C/C++ and contained in CRAN package rbamtools [16]. Implementations for import of annotation data (GTF) and the annotation procedure for gap-sites are written in C/C++ and contained in CRAN package refGenome. Implementations for calculation of MaxEnt scores were translated into C and are publicly available in Bioconductor package spliceSites (version ≥ 1.23.5). Algorithms for extraction of DNA sequence for exon/intron boundaries use Bioconductor framework (Biostrings). The source code for R (for analysis and package development) as well as the latex source code for this document was developed using RStudio [17].

Statistical Evaluation
For wgis-validated gap-sites, IDIN are reported after correction for strand orientation. Strand assignments are calculated as described in Section 1.3.2 (the wgis-reported strand). In sequence logos, letter height corresponds to nucleotide frequencies at given positions. Nucleotides are decreasingly ordered according to their frequency from top to bottom. All shown sequence logo's are calculated after correction for strand orientation.

Conclusions
The aligners TopHat and STAR report a high rate of unvalidated gap-sites emphasising the necessity for further validation. Validation of gap-sites without using genomic sequence data (gqs) requires high alignment coverages which can greatly be reduced when the similarity to known splice-sites is evaluated.
represented by a set of data items, mainly the reference sequence (chromosome number), the position and CIGAR items.

Appendix B.2. Weighting of qsm Value
A major obstacle for alignment based validation of splicing events is a too small number of aligned nucleotides (on either side of the gap-site). Therefore, a lower limit for the number of aligned nucleotides on either adjacent region is introduced, implemented as lower limit for qsm. The weight function for qsm is constructed so that qsm values of 17 and 29 result in a qsm-factor of 1 and 2 respectively. The maximal possible qsm-factor (for reads of length of 101) is 2.91. The qsm weight function describing the relation between qsm and qsm-factor is shown in Figure A2.