Hybrid Multitask Learning Reveals Sequence Features Driving Specificity in the CRISPR/Cas9 System

CRISPR/Cas9 technology is capable of precisely editing genomes and is at the heart of various scientific and medical advances in recent times. The advances in biomedical research are hindered because of the inadvertent burden on the genome when genome editors are employed—the off-target effects. Although experimental screens to detect off-targets have allowed understanding the activity of Cas9, that knowledge remains incomplete as the rules do not extrapolate well to new target sequences. Off-target prediction tools developed recently have increasingly relied on machine learning and deep learning techniques to reliably understand the complete threat of likely off-targets because the rules that drive Cas9 activity are not fully understood. In this study, we present a count-based as well as deep-learning-based approach to derive sequence features that are important in deciding on Cas9 activity at a sequence. There are two major challenges in off-target determination—the identification of a likely site of Cas9 activity and the prediction of the extent of Cas9 activity at that site. The hybrid multitask CNN–biLSTM model developed, named CRISP–RCNN, simultaneously predicts off-targets and the extent of activity on off-targets. Employing methods of integrated gradients and weighting kernels for feature importance approximation, analysis of nucleotide and position preference, and mismatch tolerance have been performed.


Introduction
Clustered regularly interspersed short palindromic repeats (CRISPR) and their associated nucleases (Cas) have revolutionized genetics by allowing easy modification of the genome as well as gene regulation. CRISPR-based genomic perturbations mainly fall into two categories-Cas9-mediated knockout and Cas9-mediated activation or inhibition. Cas9-mediated knockout is carried out by the Cas9 nuclease directed to the target site by the guide RNA, creating a double-stranded break resulting in loss of function [1][2][3]. CRISPR inhibition/activation, on the other hand, employs catalytically inactive Cas9, enabling localization of an effector domain to repress or activate transcription of the target gene without inducing heritable changes to DNA [4,5].
The optimal design of the single guide RNA (sgRNA) is paramount to the efficiency of a CRISPR-based experiment. Bound to Cas9, the gRNA contains a spacer sequencer complementary to the target DNA ("protospacer") and various loops to which Cas9 binds. It has, however, been reported in multiple studies that Cas9 acts at sequences similar to the target sequence as well, resulting in unwanted off-target effects [6,7]. The requirement of an NGG consensus PAM has been established for the recognition and cleavage of target DNA. Studies in the past have also revealed that the sequence composition of the 3 end of the spacer and downstream sequence of the DNA contribute to efficiency; however, the position-specific effect of various spacer DNA bases is not yet known [8,9].

Negative Off-Target Data Source
The sequence dataset required for training the classification mode of the multitask model was obtained using the off-target prediction method CRISPcut [16]. The off-targets obtained from CRISPcut were checked for duplicates and positive off-target sequences were eliminated from the negative set. Recent reports indicate that Cas9 tolerates DNA/RNA bulges as well, apart from mismatches, which are represented as gaps in terms of sequence alignment. To account for the bulges, the predicted off-targets were aligned using the pairwise2 module of biopython tools [19,20]. Pairwise alignment using a dynamic programming algorithm allows adjusting the parameters to obtain results close to those observed in experiments. The optimum parameters were obtained by testing various combinations simultaneously, then checking the number of allowed mismatches and gaps while retaining as many negative off-target sequences as possible.

sgRNA-Target Sequence Encoding
As input for the multitask convolutional neural network prediction model, the sequences were one-hot encoded and converted to images. Front padding of 'N' base was added to equalize the length of the sequences after accommodating the gaps. The scheme used to convert an example sequence to an input image can be understood from Supplementary Figure S1. As the Cas9 protein tolerates mismatches between the guide RNA and DNA, two guide RNAs may have some similar off-targets depending on the number of mismatches tolerated. The sequence of targets is relevant to the off-target prediction, and thus the intended target DNA, perfectly complementary to the gRNA, is also included in the prediction pipeline along with the off-target sequences for each guide RNA.
The intended target DNA sequence information, the on-target, is supplied with all possible sequences in the genome that could be potential off-targets. If the off-target is detected in any of the experimental methods mentioned earlier, it is a positive off-target. If a similar sequence is present in the genome, but no experimental proof of Cas9 activity is available at that site, it is labelled as a negative off-target.

Mismatch and Bulge Propensity Analysis
The number and position of mismatches were counted along the length of the guide RNA by implementing a Python script. The corresponding nucleotide at that position in the target sequence is simultaneously noted. The script was also employed to count the number and position of gaps in the sequences, indicating a DNA/RNA bulge, while noting the corresponding target nucleotide. The results from the script were plotted as a percentage of the total to comprehend the trends in mismatch and bulge tolerance.

Deep Learning
Techniques, Model Evaluation, and Feature Importance 2.3.1. Model Architecture A hybrid CNN-bidirectional LSTM network was trained; the architecture is depicted as a schematic in Figure 1 and a detailed representation is included as Supplementary Figure  S6. Similar to a Siamese network, the network contained two identical sister networks taking input from the off-target and target images, respectively [21,22]. Each sister network consisted of two CNN-batch normalization blocks in tandem, feeding their output to a bidirectional LSTM layer. The sister networks were then concatenated. The output of the concatenation layer was split and fed into two separate but fully connected dense layer blocks, one for each of the classification and regression tasks. The model architecture represented a hard-sharing multitask learning model with the advantage of reducing overfitting [23]. The size of the layers was optimized using grid search [24]. Representative target and off-target sequences are depicted as input for the model. The input is fed into convolution layers, followed by an LSTM layer. The two branches are then concatenated and fed into two separate but fully connected dense layers, which perform the classification and regression tasks.

Model Training
The dataset contained 153,924 samples and was split into training/validation/testing sets in the ratio of 80:10:10. Out of the 153,924 samples, the negative off-target and positive off-target datasets contained 146,734 and 7190 samples, respectively, which led to a data imbalance of approximately 20:1. To overcome this problem, we used the class_weights feature of keras model fitter, which balances the contribution of the minority class and the majority class to the model s overall loss [25]. The classification and regression task models were first trained separately for different architectures on the dataset to determine the range of the losses. The losses were then combined after multiplying the regression loss by a constant factor of 2 × 10 −4 to bring both losses to the same scale. The classification and regression losses were then added to give the final loss function, considering the model is a multitask network with hard parameter sharing.
The model encountered overfitting initially, which was later tackled by utilizing L2 regularization (optimum L2 value = 5 × 10 −4 ) in the fully connected dense layers. We used the Adam Optimizer [26] initialized with a learning rate of 10 −4 , and the learning rate was halved whenever the validation loss did not decrease in the previous five epochs. We trained the different architectures on an Nvidia Titan Black GPU and evaluated the model five times with different splits of the training and validation sets and assessed the average distribution of the evaluation metrics. The best-performing CNN-biLSTM model was selected for further analysis.

Evaluation Metrics
The model performs two tasks: (a) classification, predicting positive and negative offtargets; and (b) regression, predicting the extent of Cas9 activity at the positive off-targets. Hence, these two tasks were evaluated separately based on different performance metrics. The classification performance was tested on the basis of the following.
Area under the precision-recall curve is one among the standard criteria for classification performance evaluation. Precision measures how many of the predicted positives are correct, while recall measures how many actual positives are captured by the model. These measures are defined as follows: Representative target and off-target sequences are depicted as input for the model. The input is fed into convolution layers, followed by an LSTM layer. The two branches are then concatenated and fed into two separate but fully connected dense layers, which perform the classification and regression tasks.

Model Training
The dataset contained 153,924 samples and was split into training/validation/testing sets in the ratio of 80:10:10. Out of the 153,924 samples, the negative off-target and positive off-target datasets contained 146,734 and 7190 samples, respectively, which led to a data imbalance of approximately 20:1. To overcome this problem, we used the 'class_weights' feature of keras model fitter, which balances the contribution of the minority class and the majority class to the model's overall loss [25]. The classification and regression task models were first trained separately for different architectures on the dataset to determine the range of the losses. The losses were then combined after multiplying the regression loss by a constant factor of 2 × 10 −4 to bring both losses to the same scale. The classification and regression losses were then added to give the final loss function, considering the model is a multitask network with hard parameter sharing.
The model encountered overfitting initially, which was later tackled by utilizing L2 regularization (optimum L2 value = 5 × 10 −4 ) in the fully connected dense layers. We used the Adam Optimizer [26] initialized with a learning rate of 10 −4 , and the learning rate was halved whenever the validation loss did not decrease in the previous five epochs. We trained the different architectures on an Nvidia Titan Black GPU and evaluated the model five times with different splits of the training and validation sets and assessed the average distribution of the evaluation metrics. The best-performing CNN-biLSTM model was selected for further analysis.

Evaluation Metrics
The model performs two tasks: (a) classification, predicting positive and negative offtargets; and (b) regression, predicting the extent of Cas9 activity at the positive off-targets. Hence, these two tasks were evaluated separately based on different performance metrics. The classification performance was tested on the basis of the following.
Area under the precision-recall curve is one among the standard criteria for classification performance evaluation. Precision measures how many of the predicted positives are correct, while recall measures how many actual positives are captured by the model. These measures are defined as follows: where TP stands for true positives, TN stands for true negatives, and FP denotes false positives.
While selecting the optimal model for the purpose of classifying the off-targets, especially because the dataset is imbalanced and the positive class is the minority class, the overall recall as well as minority class recall were considered. Additionally, the Matthew's correlation coefficient (MCC), which is known to be a reliable metric in binary classification of unbalanced datasets, was also compared among the candidate models [27].
The F1-MCC plot was generated to evaluate the model performance, considering the imbalance in the data. The F1-MCC curves allow the integration of many aspects of performance over multiple classification thresholds [28].
The performance of the prediction model for the regression task was measured by the coefficient of determination (R 2 ). The R 2 measures the goodness-of-fit between the predicted values and actual values of the datapoints.

Identification of Important Sequence Features
To interpret CNN-based models, pixel attribution methods were used to locate regions in the input image that have a greater influence in deciding the model output. One such method is gradient-based saliency, which calculates the gradient of the model output with respect to input pixels and assigns importance scores to individual pixels [29]. Smoothened saliency advances the saliency approach by sampling similar images with some added noise. For each image in the samples, the saliency was calculated and the resulting saliency maps were averaged [17]. Though gradient-based saliency maps act as a great baseline, they suffer from gradient discontinuity and saturation because the negative gradients get zeroed out because of the non-linear activation layer [30]. Hence, gradient saliency methods fail at highlighting the features that contribute negatively to the model output. A gradient integration approach was used to mitigate the gradient discontinuity and saturation problem [31]. The gradients were integrated as the inputs are scaled from a reference value to the actual input value found in the dataset. A similar approach named DeepLIFT [30] computes the importance scores based on the difference between the user-defined reference input and actual inputs to the model. The positive and negative contributions were treated separately and shown to be closely connected with game-theorybased Shapely values. The 'tf-keras-vis v0.6.2' Python library was used to calculate the smoothened saliency and, for integrated gradient-based analysis, the 'shap v0.39.0' Python library was implemented.

Data Assembly and Preparation
A total of 153,924 off-target sequences-both positive and negative-corresponding to 10 unique target sites were employed for analysis and model training. The positive dataset consists of experimentally determined off-target sites by the technique of circularization for in vitro reporting of cleavage effects by sequencing (CIRCLE-seq) [18]. The negative dataset consists of similar sequences found in the genome, identified using the tool CRISPcut [16], but not found to be cleaved in the CIRCLE-seq experiment. For each individual target, the read count for each off-target sequence was normalized and denoted as the output variable for the regressor. The positive off-targets and negative off-targets were labelled to be the outputs for the classifier. The sequences were aligned to include the information of DNA/RNA bulges. The cleaned dataset was one-hot encoded and converted to images to be used for model training and visualization, respectively. An example of the image encoding is shown in Supplementary Figure S1.

Analysis of Mismatch Occurrence and Nucleotide-Specific Mismatch Propensity
The experimentally observed off-target sequences were checked for the fraction of mismatches found at each position of the 20-nucleotide guide RNA to determine the trend in mismatch tolerance for the Cas9-gRNA complex. As reported in previous studies, the number of mismatches tolerated was higher in the PAM-distal ends [32]. The overall mismatch tolerance decreases towards the PAM-proximal end. The least number of mismatches is observed in the middle region of the guide, ranging from nucleotides labelled 10-15, as demonstrated in Figure 2a. Position 14 was seen to tolerate the least number of mismatches.
Biomolecules 2023, 13, x 6 of 15 labelled to be the outputs for the classifier. The sequences were aligned to include the information of DNA/RNA bulges. The cleaned dataset was one-hot encoded and converted to images to be used for model training and visualization, respectively. An example of the image encoding is shown in Supplementary Figure S1.

Analysis of Mismatch Occurrence and Nucleotide-Specific Mismatch Propensity
The experimentally observed off-target sequences were checked for the fraction of mismatches found at each position of the 20-nucleotide guide RNA to determine the trend in mismatch tolerance for the Cas9-gRNA complex. As reported in previous studies, the number of mismatches tolerated was higher in the PAM-distal ends [32]. The overall mismatch tolerance decreases towards the PAM-proximal end. The least number of mismatches is observed in the middle region of the guide, ranging from nucleotides labelled 10-15, as demonstrated in Figure 2a. Position 14 was seen to tolerate the least number of mismatches. A similar count-based analysis was performed again, considering the normalized cleavage frequency observed in experiments to be a weight for each sequence and, consequently, the mismatches in that sequence. The sequences would be represented fairly depending on the fraction of times it was observed to be acted upon by Cas9 in the experiment. The observed trend in mismatches remained the same, with a decrease in the tolerance towards the PAM-proximal end (Figure 2b). In both sets of analyses, the PAM region also retains the canonical NGG motif and is among the least tolerant to mismatches. Consistent with previous literature, the most common non-canonical PAM motifs are NGA and NAG [33]. These two PAM sites account for more than 7% out of 11% of non-canonical motifs.
A count-based analysis was performed again for each of the four nucleotides observed at each position of the guide to determine if there is a nucleotide-specific preference for or against mismatches at certain positions of the guide. For example, considering all positions with an A at the corresponding site in the target sequence, the mismatches at each position were noted. The fraction of the other nucleotides that contributed to the total mismatches was also noted, and the trend observed was summarized in Figure 3a. With the dataset being limited, certain positions such as 1 and 10 did not show an A in any target sequence, thus mismatch tolerance could not be determined. For the PAM sequence, as the motif is NGG , any nucleotide present at the position of N is not counted as a mismatch. The positions that retained the expected A at more than 90% of all of the occurrences were 7, 12, and 16, and were said to have low mismatch tolerance. A similar count-based analysis was performed again, considering the normalized cleavage frequency observed in experiments to be a weight for each sequence and, consequently, the mismatches in that sequence. The sequences would be represented fairly depending on the fraction of times it was observed to be acted upon by Cas9 in the experiment. The observed trend in mismatches remained the same, with a decrease in the tolerance towards the PAM-proximal end (Figure 2b). In both sets of analyses, the PAM region also retains the canonical NGG motif and is among the least tolerant to mismatches. Consistent with previous literature, the most common non-canonical PAM motifs are 'NGA' and 'NAG' [33]. These two PAM sites account for more than 7% out of 11% of non-canonical motifs.
A count-based analysis was performed again for each of the four nucleotides observed at each position of the guide to determine if there is a nucleotide-specific preference for or against mismatches at certain positions of the guide. For example, considering all positions with an 'A' at the corresponding site in the target sequence, the mismatches at each position were noted. The fraction of the other nucleotides that contributed to the total mismatches was also noted, and the trend observed was summarized in Figure 3a. With the dataset being limited, certain positions such as 1 and 10 did not show an 'A' in any target sequence, thus mismatch tolerance could not be determined. For the PAM sequence, as the motif is 'NGG', any nucleotide present at the position of 'N' is not counted as a mismatch. The positions that retained the expected 'A' at more than 90% of all of the occurrences were 7, 12, and 16, and were said to have low mismatch tolerance. A similar count-based analysis was performed for the other three nucleotides-T , G , and C (Supplementary Figure S2). The nucleotides that showed the least number of mismatches at each position were determined, as summarized in Table 1. Certain positions-1, 2, 10, and 17-are not included in the table because examples of all four nucleotides at these positions in the target site are unavailable in the dataset considered. Table 1 also mentions the bases most likely to be found in the case of a mismatch at each position, even when the target contains the ideal nucleotide. The most likely mismatch at each position, given the target nucleotide, will help design better tools keeping in mind the likely threats of off-targets. Accounting for the number of times a unique sequence has been cleaved in an experiment by including it as a weight in the count-based analysis gives a more realistic understanding of the threats of the possible off-targets. Hence, the position-specific nucleotide- A similar count-based analysis was performed for the other three nucleotides-'T', 'G', and 'C' (Supplementary Figure S2). The nucleotides that showed the least number of mismatches at each position were determined, as summarized in Table 1 Table 1 also mentions the bases most likely to be found in the case of a mismatch at each position, even when the target contains the ideal nucleotide. The most likely mismatch at each position, given the target nucleotide, will help design better tools keeping in mind the likely threats of off-targets. Accounting for the number of times a unique sequence has been cleaved in an experiment by including it as a weight in the count-based analysis gives a more realistic understanding of the threats of the possible off-targets. Hence, the position-specific nucleotide-wise mismatch tolerance was weighted by the normalized frequency of cleavage and calculated. Table 2 highlights the key findings. The trend for the nucleotide that tolerates the least number of mismatches at each position is similar, with the exceptions of positions 7, 12, and 15. The nucleotides found to be most substituted in the off-targets also showed a similar trend to those obtained earlier. Hence, based on mismatch occurrences observed in experiments, position-nucleotide combinations were deduced that exhibit the least tolerance to mismatches. In the rare events that mismatches were tolerated at these pairs of combinations, the presumable mismatch at that position was also identified. These can be used to collate an ideal guide with the least probable off-targets as well as the ability to identify the off-target sequences that are expected to be more of a threat than others.
On the other hand, the study also indicates the base-position combinations that tolerate the most mismatches; these would be the least favourable for the ideal guide design. The bases at each position that show the least fidelity are summarized in Supplementary Table S1.

Analysis of Bulge Occurrence and Nucleotide-Specific Bulge Propensity
The single RNA-guided Cas9 protein is reported to tolerate not only mismatches but DNA/RNA bulges as well, resulting in off-target effects [7,18,34]. Represented as gaps in a sequence, DNA/RNA bulges also account for a certain fraction of differences from the target site. Hence, to ascertain whether there are patterns in the tolerance of gaps in the dataset, the sequences were aligned and then analysed for mismatch tolerance at each position. The fraction of gaps tolerated across the spacer region in the context of mismatches is summarised in Figure 2c.
Most of the bulges tend to be tolerated towards the middle of the RNA-DNA hybrid, as indicated by the plot (Figure 2c). A nucleotide-wise analysis of the position and type of mismatch revealed a consistency between the results obtained for this aligned sequence dataset and the unaligned dataset. The number of gaps observed at every position for each nucleotide was also noted; the position-base pairs tolerating the most gaps are summarized in Supplementary Table S2. The trends in mismatch occurrence and nucleotide and position preferences did not show much deviation from the earlier observations. Weighting the gap counts by the experimental read count also revealed a similar pattern in the tolerance of DNA/RNA bulges (Supplementary Figure S3).
However, as the observations mentioned before were based on a single experimental technique, the reproducibility of the observations needs to be probed. There is a need for a method that is both accurate on unseen data and can elucidate the importance of a nucleotide at a specific position. Therefore, a deep-learning-based approach was adopted to predict Cas9 activity at off-target sequences while retaining information about the contribution of the position and sequence features to the model output.

Predictive Performance of the CRISP-RCNN Model
Various neural networks were trained on the target and off-target sequences to explore the nucleotides and positions deemed important for off-target activity prediction. Convolutional neural nets (CNNs), CNN coupled with a long short-term memory (LSTM) layer, and CNNs with a bidirectional LSTM layer were optimized and tested. The performance of the multiple architectures tested is summarized in Supplementary Figure S4. The best-performing model was a convolutional network with a bi-LSTM layer, referred to henceforth as CRISP-RCNN. The negative dataset consisted of sequences similar to the target sequences, with up to nine mismatches and gaps, but not cleaved by Cas9 in experiments. CRISP-RCNN performed the classification of positive and negative off-targets with an MCC of 0.80, while other classification performance metrics are summarized in Figure 4 and Table 3. The CRISP-RCNN model also simultaneously predicted the extent of Cas9 activity on the target site with an R 2 value of 0.71 when evaluated on the held-out test dataset.
When compared against the performances of previously reported methods for performing similar tasks individually, that is, the classification performance against an off-target classifier and the regression performance against a guide RNA activity prediction model, CRISP-RCNN is shown to outperform both ( Figure 4) [35,36]. summarized in Supplementary Table S2. The trends in mismatch occurrence and nucleotide and position preferences did not show much deviation from the earlier observations. Weighting the gap counts by the experimental read count also revealed a similar pattern in the tolerance of DNA/RNA bulges (Supplementary Figure S3).
However, as the observations mentioned before were based on a single experimental technique, the reproducibility of the observations needs to be probed. There is a need for a method that is both accurate on unseen data and can elucidate the importance of a nucleotide at a specific position. Therefore, a deep-learning-based approach was adopted to predict Cas9 activity at off-target sequences while retaining information about the contribution of the position and sequence features to the model output.

Predictive Performance of the CRISP-RCNN Model
Various neural networks were trained on the target and off-target sequences to explore the nucleotides and positions deemed important for off-target activity prediction. Convolutional neural nets (CNNs), CNN coupled with a long short-term memory (LSTM) layer, and CNNs with a bidirectional LSTM layer were optimized and tested. The performance of the multiple architectures tested is summarized in Supplementary Figure S4. The best-performing model was a convolutional network with a bi-LSTM layer, referred to henceforth as CRISP-RCNN. The negative dataset consisted of sequences similar to the target sequences, with up to nine mismatches and gaps, but not cleaved by Cas9 in experiments. CRISP-RCNN performed the classification of positive and negative off-targets with an MCC of 0.80, while other classification performance metrics are summarized in Figure 4 and Table 3. The CRISP-RCNN model also simultaneously predicted the extent of Cas9 activity on the target site with an R 2 value of 0.71 when evaluated on the held-out test dataset.
When compared against the performances of previously reported methods for performing similar tasks individually, that is, the classification performance against an offtarget classifier and the regression performance against a guide RNA activity prediction model, CRISP-RCNN is shown to outperform both ( Figure 4) [35,36].   plots (a,b), the dashed line indicates random prediction, i.e., 50% accuracy. (c) The R 2 value for the CRISP-RCNN regressor is 0.71, showing better performance than a previous guide RNA prediction model built on sequence and physicochemical parameters, which shows an R 2 value of 0.49 [36].

Activation Maps
The representations of the weights in the layers of a CNN before performing the prediction task are called activation maps (AMs). AMs for a specific class would indicate the discriminative regions (in this case, nucleotide and the position of nucleotides) that the deep learning network deems critical for identification with that class [37]. AMs were implemented to interpret the feature importance in the images containing DNA sequence information in the context of understanding and improving CRISPR/Cas9 activity. The on-targets, positive off-targets, and negative off-target sequences would each generate unique AMs, highlighting the distinguishing features.
The smoothened saliency, DeepSHAP, and gradient explainer SHAP (GradExp) were employed to generate AMs of the various sequences, as detailed in Section 2. The three algorithms helped visualize the models to focus on mismatches and gaps when present. DeepSHAP and GradExp highlighted the gaps and mismatches in blue, indicating a negative impact of their presence on the model's output decision. In case of mismatches as well as gaps, the spot of the expected nucleotide is accentuated ( Figure 5). In the case of the NGG PAM, the trend observed indicates a preference against the presence of a GGG PAM, which has also been established in the literature [33].

Activation Maps
The representations of the weights in the layers of a CNN before performing the prediction task are called activation maps (AMs). AMs for a specific class would indicate the discriminative regions (in this case, nucleotide and the position of nucleotides) that the deep learning network deems critical for identification with that class [37]. AMs were implemented to interpret the feature importance in the images containing DNA sequence information in the context of understanding and improving CRISPR/Cas9 activity. The on-targets, positive off-targets, and negative off-target sequences would each generate unique AMs, highlighting the distinguishing features.
The smoothened saliency, DeepSHAP, and gradient explainer SHAP (GradExp) were employed to generate AMs of the various sequences, as detailed in Section 2. The three algorithms helped visualize the models to focus on mismatches and gaps when present. DeepSHAP and GradExp highlighted the gaps and mismatches in blue, indicating a negative impact of their presence on the model s output decision. In case of mismatches as well as gaps, the spot of the expected nucleotide is accentuated ( Figure 5). In the case of the NGG PAM, the trend observed indicates a preference against the presence of a GGG PAM, which has also been established in the literature [33].
The model could successfully distinguish between the negative impact of mismatches or gaps highlighted in blue and the positive impact of features highlighted in red. The residues and positions that drive model output in the positive direction, classifying to the positive class or predicting a higher regression score, are indicated by the algorithms in shades of yellow, orange, and red. The importance of the individual bases at each position was probed further by generating average AMs across the datasets.  The model could successfully distinguish between the negative impact of mismatches or gaps highlighted in blue and the positive impact of features highlighted in red. The residues and positions that drive model output in the positive direction, classifying to the positive class or predicting a higher regression score, are indicated by the algorithms in shades of yellow, orange, and red. The importance of the individual bases at each position was probed further by generating average AMs across the datasets.

Average AMs' Feature Importance
AMs generated for multiple off-target sets of both positive and negative classes were averaged to obtain an overview of the bases and positions that were consistently marked as important across the datasets. The average saliency, DeepSHAP, and GradExp maps were generated for a single target along with both the positive and negative off-targets individually. As an example, the EMX1 target sequence was considered ( Figure 6). The average saliency maps generated for the positive off-target set along with the reference target sequence are shown in Figure 6-most target bases and positions are seen to be marked similarly in both positive and negative classes, indicating a similar importance of the target sequence in both datasets. However, the DeepSHAP and GradExp average AMs indicate a similar magnitude but opposite importance of the nucleotides and positions of the target sequence in the positive class against the negative class. Although a particular base may be considered important at the same position in both classes, in the positive class, the most importance is given to bases that result in classification in the opposite class, as seen in Figure 6. Hence, most bases are marked in shades of blue, indicating a negative impact on class assignment. Although the pattern of importance in the negative class target reference sequence is similar, the bases are marked in red, indicating a positive score. The target sequence average AMs show an importance pattern mimicking the target sequence. class target reference sequence is similar, the bases are marked in red, indicating a positive score. The target sequence average AMs show an importance pattern mimicking the target sequence.
In the example considered, a clear preference for the variable nucleotide in the PAM region is observed in the order of A and T , with C and G being the least preferred. This observation is consistent across the other target and off-target sets considered and is in accordance with previous studies [38].
The positive off-target average saliency maps indicate important bases and positions, some of which could also be observed in the case of negative off-targets. Hence, it was important to note that the positions that were marked in one set but not in the other to derive critical features. In the case of EMX1 positive off-targets, positions 1 and 2, as well as positions 11 to 18, were seen to be mostly contributing negatively. A similar trend was noted in the average AMs generated for the other target and off-target sets. The observations seem to be reinforcing the idea that seed regions tolerate fewer to no mismatches, additionally indicating that the PAM-distal-most positions also play a role in efficacy.  In the example considered, a clear preference for the variable nucleotide in the PAM region is observed in the order of 'A' and 'T', with 'C' and 'G' being the least preferred. This observation is consistent across the other target and off-target sets considered and is in accordance with previous studies [38].
The positive off-target average saliency maps indicate important bases and positions, some of which could also be observed in the case of negative off-targets. Hence, it was important to note that the positions that were marked in one set but not in the other to derive critical features. In the case of EMX1 positive off-targets, positions 1 and 2, as well as positions 11 to 18, were seen to be mostly contributing negatively. A similar trend was noted in the average AMs generated for the other target and off-target sets. The observations seem to be reinforcing the idea that seed regions tolerate fewer to no mismatches, additionally indicating that the PAM-distal-most positions also play a role in efficacy.
The average DeepSHAP and GradExp AMs of the positive off-target sets indicated positions of mismatches that result in the sequence being classified in the opposite class. As an extrapolation, the mismatch regions marked in red associate the sequence with the positive class, i.e., mismatches that were tolerated and likely to be observed in experiments. In the example of EMX1, a match at positions 4-6 and 8-10, as well as in the seed region at 15 and 16, drive the model output to classify the sequence as a positive off-target. The matches at 5, 6, and 10 are indicated to be more important than those at the other positions. A similar trend was also observed in the sets analysed.
The negative off-target AMs averaged over multiple sequences showed many similarities in the importance patterns with the positive off-target average AMs. Among the regions uniquely crucial to the negative off-target set, as seen in the averaged saliency maps, regions 1 and 2, as well as regions 11 to 18, were found to be important. The observations of DeepSHAP and GradExp were consistent in denoting positive importance to these sites when classifying to the negative set. This trend is also consistent with the opposite denotation in the positive class AMs.

Discussion
In research and clinical applications, precise genetic manipulation is a major goal. Additionally, in the case of CRISPR/Cas9, potential off-target activity and consequent mutations hinder applications of the system. The on-target activity and off-target effects of CRISPR/Cas9 are dependent on the guide RNA sequence selection and design. Understanding the sequence features that determine Cas9 activity at any genomic locus has been a long-standing goal. Here, we present a count-based analysis of experimental data as well as a deep-learning-based feature importance estimation to investigate the position-specific and base-specific tolerances and preferences of the CRISPR/Cas9 system. The trends observed indicate a lower tolerance to mismatches towards the PAM-proximal end, concordant with the "seed" region hypothesis. The trends in preferred PAM sites were also in accordance with experimental reports that, after 'NGG' PAM, the preferred non-canonical PAMs are 'NAG' and 'NGA' [33,38].
The CRISP-RCNN model allows insights into the sequence feature importance while reliably predicting Cas9 activity. An advantage of such a predictor is the robustness to new target and off-target data. Analysing the average AMs of the negative and positive class target reference sequences allows insight into positions that tolerate more or fewer mismatches, leading to assignment in each class. The average AMs of the positive off-target sequences indicate the mismatches that are likely to be tolerated by the Cas9 system. The positions that were found less tolerant to mismatches were the eight positions proximal to the PAM and the two to three bases most distal to the PAM site, in accordance with recent reports [32,39]. In each case, the base preference depended on the target site sequence. In certain positions of the sequence, known to tolerate mismatches, the base that is rarely observed as a mismatch is less of a threat. Hence, at each position, knowing the leastreported bases represents important knowledge. Although the importance of the seed region has been acknowledged in multiple studies, the low mismatch tolerance of the PAM-distal end has not been extensively studied. However, molecular-dynamics-based approaches have indicated that mismatches at positions 1-3 are less likely to be tolerated, while those at positions 3-10 do not hinder Cas9 nuclease activity [39]. Hence, such knowledge can be incorporated into rule-based off-target prediction algorithms, improving target selection, leading to the efficient design of CRISPR/Cas9 experiments.
The negative class off-target sequences, when analysed by generating average AMs, suggest the mismatches that are not likely to be tolerated in an experimental set-up. Among the regions that contribute to model output, those present exclusively in the negative offtarget set-based AMs obtained for the various target sequences were studied. The average AMs were compared against the count-based study. Certain base-position combinations and their tolerance to a type of mismatch were found in both approaches; for example, if a 'C' is present at position 5, the most likely tolerated substitution is 'A', seen in the average AMs for EMX1 ( Figure 5) and VEGFA2 ( Figure S5). However, not all trends are consistent among the two analyses, implying a need for expanding the dataset to improve the countbased study. On the other hand, the consonance of the results of the deep-learning-based approach with the previous reports that aimed at establishing rules for Cas9 activity allows confidence in the predictions of the method.

Conclusions
The CRISPR/Cas9 genome editor has allowed rapid transformation of the field of biotechnology. Understanding and controlling the undesired outcomes of a CRISPR/Cas9 experiment has been an area of focus in the recent past. The study presented aims to allow an insight into the rules that steer the activity of Cas9 based on available experimental data. The CRISP-RCNN model was implemented for simultaneous classification and regression to understand position and sequence feature importance. Hence, this study will add to the rule-based guide RNA design and target site selection for efficacious Cas9 experiment design. Similar approaches can be used for optimizing other CRISPR-associated nucleases with the goal of improving research and clinical applications.

Supplementary Materials:
The following supporting information can be downloaded at https:// www.mdpi.com/article/10.3390/biom13040641/s1, Figure S1: Sequence to image encoding scheme; Figure S2: Mismatch tolerance across the length of the guide RNA; Figure S3: gap tolerance across the length of the guide RNA; Figure S4: Predictive performance of various neural networks tested; Figure S5: Average class activation maps for the VEGFA2 locus off-targets; Figure S6: Architecture used for the dual classification and regression tasks; Table S1: Base-position pairs that tolerate the most mismatches; Table S2: Gap tolerance across the length of the guide RNA.