1. Introduction
In eukaryotes, transcriptional regulation is essential to maintaining cell identity, responding to intra- and extra-cellular signals, and coordinating various gene activities [
1], whereas its dysregulation can cause a broad range of diseases [
2]. It is known that numerous
cis-regulatory elements (CREs), such as enhancers and promoters, and their complex interactions orchestrate precise spatiotemporal transcriptional regulation in a cell-type-specific manner. Variants in CREs can disrupt these regulatory systems, resulting in disease. Therefore, accurate CRE localization and precise variant impact quantification are crucial to understanding transcriptional regulation and linking genetic variation in CREs with phenotypic changes in human diseases. Fortunately, the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) has recently emerged for simultaneously measuring chromatin accessibility in thousands of individual cells [
3,
4], illuminating cellular heterogeneity at the epigenetic level and probing functional CREs at the single-cell resolution.
The general steps of analyzing scATAC-seq data include mapping reads to a reference genome, producing aligned reads, calling peaks, and returning a peak-by-cell matrix. The binary peak-by-cell matrix represents the peak accessibility for every cell. It can be used for downstream analyses including cell clustering, identifying differentially accessible peaks, constructing
cis-regulatory networks, and inferring trajectories [
5]. Most pipelines input a BAM file to MACS2 [
6] to call peaks, though the detailed steps to generate the peak-by-cell matrix can be different. SnapATAC generates a cell-by-bin count matrix first, then cells from the same cluster are pooled together, and MACS2 is used to call peaks for each cell type [
7]. ArchR calls peaks with MACS2 from the coverage files, then reads in fragments from each chromosome and computes the overlaps with the peaks from the same chromosome [
8]. Scasat [
9] uses MACS2 on aggregated BAM files, and peaks that are open in at least one single cell remain in the peak-by-cell matrix. scAND [
10] binarizes the peak-by-cell matrix and utilizes network diffusion methods to learn meaningful cell-type clusters. The peak-by-cell matrix can be aggregated into a peak-by-cluster matrix representing the binary peak accessibility for each cell type. Cell-type-specific analysis can be performed with this matrix to reveal the open chromatin characteristics for each cell type. For instance, we can identify motifs for each cell type by FIMO [
11] or CentriMo [
12]; predict
cis-regulatory DNA interactions and co-accessibility by Cicero [
13]; construct common or tissue-specific
cis-regulatory interactions by JRIM [
14]; etc.
Despite the continuous improvement in technology and analysis pipelines, two major challenges limit the application of scATAC-seq data to disease studies. First, unlike tissue-level sequencing where hundreds of millions of reads are mapped, the peak calling process in scATAC-seq experiments suffers from a large amount of noise due to the sparsity of single-cell sequencing data, resulting in fuzzy peak boundaries and long peaks, which could lead to false-positive discoveries. For instance, most current scATAC-seq analysis pipelines yield low-resolution peaks of at least 501 bp [
8], but the true functional parts of these CREs—e.g., binding sites of the transcription factors (TF)—are only 5–31 bp, with a mean of 9.9 bp for eukaryotes [
15]. The large difference in region length reduces statistical power in downstream analyses, e.g., CRE identification and disease-relevant variant mapping, and motivates refining the called peaks in scATAC-seq. Second, after locating variants in functional CRE regions, a daunting question remains: how can we quantify variant impacts in a cell-type-specific manner to pinpoint key variants perturbing transcriptional regulatory mechanisms? Previous variant prioritization methods fall into two main categories: (1) rank single nucleotide polymorphisms (SNPs) by the weighted integration of existing variant annotations, e.g., FunSeq2 [
16] and Combined Annotation-Dependent Depletion (CADD) [
17]; (2) train a machine learning or deep learning classifier to identify functional SNPs, e.g., GWAVA [
18] and CASAVA [
19]. However, these methods were not designed for scATAC-seq and cannot evaluate variants in a cell-type-specific manner. There is an urgent need to develop a variant impact quantification method taking advantage of scATAC-seq data.
Convolutional neural networks (CNNs) are a variant of deep neural networks that use a weight-sharing strategy to capture local patterns and usually reach high performance on classification tasks. There has been growing interest in analyzing sequencing data and dissecting epigenomic features via CNNs. For instance, DeepSEA [
20], DanQ [
21] and DeepATT [
22] are deep learning methods with convolutional layers for predicting functions of non-coding sequences. Recent research in developing explainable deep learning models allows for the visualization of input regions with high-resolution details that are important for predictions. Gradient-weighted Class Activation Mapping (Grad-CAM) uses the gradients of any target concept flowing into the final convolutional layer to produce a localization map highlighting the important regions in the input for predicting the concept [
23]. DECODE [
24] uses Grad-CAM to condense the enhancer regions; AgentBind [
25] incorporates Grad-CAM to predict the TF binding status. These studies inspired us to apply CNNs and Grad-CAM to scATAC-seq data for refining peaks and identifying functional variants.
We proposed a weakly supervised deep learning method, scEpiLock (
Figure 1), to precisely locate CREs and quantify variant impacts in a cell-type-specific manner with three modules: (i) a multi-label classifier module, (ii) an object detection module, and (iii) a variant impact quantification module. First, the multi-label classifier module uses a convolutional neural network to make accessibility predictions on unseen sequences for each cell type. Then, the object detection module uses Grad-CAM to produce a localization map that automatically highlights important regions in the predicted positive regions, which naturally refines the peaks’ boundaries and localizes the potential CREs. Lastly, the variant impact quantification module computes a base-wise accessibility score for the wild type (WT) and the mutant, and their difference, the delta score. A high delta score indicates that the accessibility of a given peak is sensitive to the variant.
To test the effectiveness of scEpiLock, we applied the model to two public datasets: the 5k peripheral blood mononuclear cells (PBMC) from 10× Genomics and the brain scATAC-seq [
26]. We showed that scEpiLock’s multi-label classifier module outperforms other models in predicting whether given peaks are accessible in certain cell types. In addition, the object detection module improves the peaks’ resolution by condensing the total peak length to only ⅓ of the original size. We also used scEpiLock to quantify the candidate variants’ impact on brain diseases. Its variant impact quantification module identified SNPs that largely alter peak accessibility and potentially disturb disease-related gene expression.
2. Materials and Methods
In this study, we proposed a novel weakly supervised learning scheme, named scEpiLock, for the scATAC-seq data. It can: (1) predict accessible peaks for each cell type via a multi-label classifier module, (2) refine the cis-regulatory element boundary via an object detection module, and (3) identify important variants via the variant impact quantification module.
2.1. Data Processing
We applied scEpiLock on two publicly available datasets—5k PBMC data with 8 cell types and brain data with 6 cell types—to evaluate the model’s performance on scATAC-seq data. We also tested its transfer learning scheme using Encyclopedia of DNA Elements (ENCODE) bulk ATAC-seq data.
2.1.1. PBMC Data from 10× Genomics
We downloaded the raw 5k PBMC scATAC-seq readout from the 10× Genomics website, mapped the reads to the hg19 reference genome, and generated the cell-by-peak matrix via SnapATAC [
7] following the pipeline tutorial. In the pre-processing step, we identified valid barcodes by requiring log10(UMI) to be between 3.5 and 5 and the promoter ratio to be within the range of 0.4 to 0.8. We removed unwanted chromosomes (e.g., chrM) and kept only chr1-22, chrX, and chrY. Next, we generated the cell-by-bin count matrix. The genome was segmented into uniform-sized bins (5 kb by default), and scATAC-seq profiles were represented by a cell-by-bin matrix. Each element indicated the number of sequencing fragments that overlapped a given bin in a given cell. We further removed any cells with a bin coverage of less than 1000, as suggested by SnapATAC. In the end, we kept only the cell types with at least 200 cells to ensure the called peaks were reliable. As a result, 8 cell types were identified, including 1486 CD14+ monocytes, 565 CD4 memory cells, 460 CD8 effectors, 349 CD4 naïve cells, 278 CD8 naïve cells, 278 pre-B cells, 261 double negative T cells, and 256 natural killer (NK) cells. A total of 114,538 peaks were identified, and each peak was accessible in at least one cell type. Among the called peaks, 53.1% of peaks (
n = 60,834 peaks) were unique to one cell type, while 15.9% of peaks (
n = 18,209 peaks) were shared among all 8 cell types. The remaining peaks were shared among two to seven cell types (
Figure S1).
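The barcode filter and the peak-sharing tally described above can be illustrated with a minimal sketch. It is not the SnapATAC implementation: the function names, the dictionary fields, and the choice of inclusive thresholds are assumptions made for illustration only.

```python
import math

def passes_qc(umi_count, promoter_ratio,
              log_umi_range=(3.5, 5.0), promoter_range=(0.4, 0.8)):
    """Return True if a barcode satisfies both QC thresholds quoted in the text.

    Treating the range boundaries as inclusive is an assumption.
    """
    if umi_count <= 0:
        return False
    log_umi = math.log10(umi_count)
    return (log_umi_range[0] <= log_umi <= log_umi_range[1]
            and promoter_range[0] <= promoter_ratio <= promoter_range[1])

def count_sharing(peak_labels):
    """Tally how many cell types each peak is accessible in.

    peak_labels: list of 0/1 lists, one row per peak, one column per cell type.
    Returns a dict {number_of_cell_types: number_of_peaks}.
    """
    tally = {}
    for row in peak_labels:
        n = sum(row)
        tally[n] = tally.get(n, 0) + 1
    return tally

# Hypothetical barcodes, for illustration only.
barcodes = [
    {"id": "bc1", "umi": 10_000, "promoter_ratio": 0.55},  # passes both filters
    {"id": "bc2", "umi": 500, "promoter_ratio": 0.55},     # log10(UMI) below 3.5
    {"id": "bc3", "umi": 20_000, "promoter_ratio": 0.90},  # promoter ratio above 0.8
]
kept = [b["id"] for b in barcodes if passes_qc(b["umi"], b["promoter_ratio"])]

# Toy peak-by-cell-type matrix with three cell types.
sharing = count_sharing([[1, 0, 0], [1, 1, 1], [0, 1, 1]])
```

With the real 114,538 × 8 matrix, the entries of `sharing` for 1 and 8 cell types would correspond to the unique and fully shared peak counts reported above.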
2.1.2. Brain scATAC-seq Data
We downloaded the brain cell-by-peak matrix [
26]. The process to generate the file is detailed in the paper. Briefly, scATAC-seq was performed on 10 samples spanning the isocortex (
n = 3), striatum (
n = 3), hippocampus (
n = 2), and substantia nigra (
n = 2). The sequencing reads were mapped to the hg38 human reference genome. A total of 6 main brain cell types (excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, astrocytes, and oligodendrocyte progenitor cells (OPCs)) were identified, ending with 221,062 peaks, 77.9% (
n = 172,111 peaks) of which were specific to a single cell type (
Figure S2).
2.1.3. ENCODE Bulk ATAC-Seq Data
We used the ATAC-seq data on the ENCODE portal [
27] (
https://www.encodeproject.org/) (accessed on 1 March 2021) as the positive regions to train the transfer learning base model. By applying the filters “ATAC-seq”, “Homo sapiens”, and “no perturbation”, we downloaded 379 BED files containing Irreproducible Discovery Rate (IDR) thresholded peaks (
Table S1). We then merged all the peaks together, which had a total coverage of 0.56 billion bp.
2.1.4. Genome Features as Model Inputs
To prepare the inputs for scEpiLock, we fixed the peak length at 1000 bp, as suggested by previous studies [
20,
21,
22] and then used one-hot encoding to format them into a 4 × 1000 binary matrix, with rows corresponding to A, G, C, and T, and columns corresponding to positions along the peak. Both forward and reverse complement strands were included in the training, which doubled the sample size. Each peak was paired with a multi-label vector indicating the cell types in which it was accessible. For instance, if a given peak was accessible in two cell types, there would be two ones in the label vector at the corresponding cell type positions. The other cell types were labelled as zeros. The training, validation, and testing sets were randomly split at a ratio of 8:1:1.
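The input encoding described above can be sketched as follows. This is an illustrative sketch rather than the released scEpiLock code: the helper names are hypothetical, and leaving ambiguous bases such as N as all-zero columns is an assumption the text does not address.

```python
import numpy as np

ROW_ORDER = "AGCT"  # row order as stated in the text: A, G, C, T
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}

def one_hot(seq, length=1000):
    """Encode a DNA string as a 4 x length binary matrix (rows A, G, C, T)."""
    if len(seq) != length:
        raise ValueError(f"expected {length} bp, got {len(seq)}")
    mat = np.zeros((4, length), dtype=np.uint8)
    for pos, base in enumerate(seq.upper()):
        row = ROW_ORDER.find(base)
        if row >= 0:  # assumption: ambiguous bases (e.g. N) stay all-zero
            mat[row, pos] = 1
    return mat

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT.get(b, "N") for b in reversed(seq.upper()))

def label_vector(accessible_in, cell_types):
    """Multi-label target: 1 where the peak is accessible, 0 elsewhere."""
    return np.array([1 if ct in accessible_in else 0 for ct in cell_types],
                    dtype=np.uint8)

cell_types = ["excitatory neurons", "inhibitory neurons", "microglia",
              "oligodendrocytes", "astrocytes", "OPCs"]
seq = "AACG" + "T" * 996                     # toy 1000 bp peak sequence
x_fwd = one_hot(seq)                         # forward strand sample
x_rev = one_hot(reverse_complement(seq))     # reverse complement sample
y = label_vector({"microglia", "astrocytes"}, cell_types)
```

Both `x_fwd` and `x_rev` would be paired with the same label vector `y`, which is how including both strands doubles the sample size.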
2.1.5. Negative Regions
To ensure our model recognized general non-accessible regions, we drew negative regions from the human reference genome during training and testing. We used hg19 for the PBMC data and hg38 for the brain data, matching the reference genomes used to map the corresponding scATAC-seq data. The human reference genome was downloaded and cut into 1000 bp genomic fragments. To train the base model for transfer learning, fragments overlapping any ENCODE peaks were excluded, leaving 1127k fragments. The label of each of these negative fragments was all zeros. For scATAC-seq data, negative fragments overlapping any scATAC-seq peaks were also excluded. Then, 10,000 fragments were randomly sampled to be included. The label of each negative control fragment was a vector of zeros, indicating its inaccessibility in every cell type.
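The negative-region procedure can be sketched on a toy chromosome as below. The function names, the half-open interval convention, the dropping of a trailing partial window, and the fixed random seed are all assumptions for illustration; on a real genome an interval tool such as bedtools would be a more practical way to remove overlapping fragments.

```python
import random

def tile(chrom_length, size=1000):
    """Yield half-open (start, end) windows; a trailing partial window is dropped."""
    for start in range(0, chrom_length - size + 1, size):
        yield (start, start + size)

def overlaps_any(window, peaks):
    """True if the window overlaps any (start, end) peak; intervals are half-open."""
    w_start, w_end = window
    return any(p_start < w_end and w_start < p_end for p_start, p_end in peaks)

def sample_negatives(chrom_length, peaks, n, size=1000, seed=0):
    """Tile the chromosome, drop windows that touch a peak, then sample n windows."""
    candidates = [w for w in tile(chrom_length, size)
                  if not overlaps_any(w, peaks)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))

# Toy 10 kb chromosome with two peaks; three of the ten windows touch a peak.
peaks = [(1500, 2200), (7000, 7400)]
negatives = sample_negatives(chrom_length=10_000, peaks=peaks, n=3)
```

Each sampled window would then be one-hot encoded and paired with an all-zero label vector.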
2.2. scEpiLock Module 1—Multi-Label Classifier Module
This module is designed to make multi-label predictions of the peak accessibility for each cell type via a CNN. Given a peak sequence, the output is a binary label vector representing the chromatin status in all cell types. Specifically, 0 means that the peak is inaccessible in a given cell type, while 1 means it is accessible.
2.2.1. Neural Network Architecture
As shown in
Figure 2, scEpiLock uses a multilayer CNN model, which contains four convolutional layers, two pooling layers, and two dense, fully connected layers. The filters in the convolutional layers are trained to recognize the CREs. The dense, fully connected layers combine the learnt information and output the cell-type-specific peak accessible score. The model is organized into a sequential layer-by-layer structure, with each layer acting as a functional transformation. The detailed model architectures and hyperparameters used in this study can be found in
Figure 2 and
Note S1.
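Our model contains four convolutional layers; each convolutional filter has the following form:
$$A_i^{(l)} = \sum_{j} W_{ij}^{(l)} * A_j^{(l-1)} + b_i^{(l)}$$
In the equation, $i$ and $j$ represent the filter indices; $l$ is the layer index; $W$ is the weight matrix; $A$ is the activation map; $b$ is the bias; and $*$ denotes convolution along the sequence.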
The rectified linear activation function (ReLU) is added to the end of each convolutional layer to transform the output into non-negative values. This mitigates the issue of vanishing gradients and allows the models to converge faster.
The first several convolutional layers are designed to extract features from high-dimensional data for each cell type, e.g., the TF binding motifs. Then, max-pooling layers are used to reduce the number of parameters and abstract the features learned in the previous convolutional layers. The results are pooled feature maps that highlight the most prominent feature in each patch. To avoid overfitting, we added dropout to randomly set a proportion of the neuron activations to 0. At the end, there are two dense, fully connected layers used to integrate the signals extracted from the previous convolutional layers. The first dense layer has 925 neurons, and the number of neurons in the second dense layer equals the number of cell types in each dataset, so that it outputs the accessible scores for each cell type. The dense layer is a nonlinear transformation function, which can be expressed as follows:
$$y = f(Wx + b)$$
where $x$ is the input vector, $W$ is the weight matrix, $b$ is the bias, and $f$ is the nonlinear activation function.
A sigmoid function is used to obtain the accessible probability for each cell type. The prediction is scaled into the 0–1 range by the sigmoid function. Its formulation can be expressed as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Multi-label classifier training configurations: the scEpiLock model was implemented using PyTorch [
28] version 1.7.1, and an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) was used to train our models. All weights were initialized by randomly drawing from a uniform distribution, which is PyTorch’s default. We minimized the binary cross entropy loss with Adaptive Moment Estimation (Adam) at a learning rate of 5 × 10⁻⁴ and with a batch size of 64. The validation loss was evaluated at the end of each training epoch to monitor convergence. The data were split 8:1:1 into training, validation, and testing sets. The model was fitted on the training set, and the hyper-parameters were evaluated on the validation set. After training, we selected the set of hyper-parameters that had the best performance on the validation set, which helped avoid overfitting. The final performance was reported on the unseen, independent test set.
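To make the output head and the training objective concrete, the following NumPy sketch computes sigmoid probabilities from a final dense layer and the binary cross entropy for one batch. It is not the PyTorch implementation used in the study: the random features stand in for the output of the 925-neuron dense layer, the weight scale is arbitrary, averaging the loss over all entries is an assumption, and the Adam update itself is not shown.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores into the 0-1 range."""
    return 1.0 / (1.0 + np.exp(-z))

def output_head(features, weights, bias):
    """Final dense layer followed by a sigmoid: one probability per cell type."""
    return sigmoid(features @ weights + bias)

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross entropy over all (sequence, cell type) entries."""
    p = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

rng = np.random.default_rng(0)
batch_size, hidden, n_cell_types = 64, 925, 8     # sizes quoted in the text (PBMC)
features = rng.normal(size=(batch_size, hidden))  # stand-in for the first dense layer output
weights = rng.normal(scale=0.01, size=(hidden, n_cell_types))  # arbitrary toy weights
bias = np.zeros(n_cell_types)
labels = rng.integers(0, 2, size=(batch_size, n_cell_types))   # toy multi-label targets

probs = output_head(features, weights, bias)
loss = binary_cross_entropy(probs, labels)
```

Because each cell type gets its own sigmoid output, a peak can be predicted accessible in several cell types at once, which is what distinguishes this multi-label setup from a softmax classifier.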
In some cases, the number of cells included in the scATAC-seq experiment might be small, resulting in a small number of peaks and high data sparsity. An alternative to random initialization is transfer learning: first, train a base model on ENCODE bulk ATAC-seq data, then use the weights that gave the best performance as the initial weights for training the model on the scATAC-seq data.
2.2.2. Multi-Label Classifier Performance Evaluations
We used two metrics to evaluate the model’s performance on the test data set. The first was the Area Under the Receiver Operating Characteristic curve (AUROC); the ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. Given the high proportion of negatives in the dataset, we also included the Area Under the Precision-Recall curve (AUPR), where the curve plots precision against recall. AUPR is a more informative metric than AUROC for assessing the model’s performance on an imbalanced dataset. For evaluating the performance on the test data set, the predicted probability for each sequence was computed as the maximum of the predicted probabilities for the forward and reverse complement sequence pair.
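The evaluation procedure can be sketched for a single cell type as follows. The function names are hypothetical, AUROC is computed through its rank interpretation, and AUPR is summarized as average precision, which is one common estimator and an assumption here since the text does not state how the area was computed.

```python
import numpy as np

def combine_strands(p_forward, p_reverse):
    """Per-sequence score: element-wise maximum of the two strand predictions."""
    return np.maximum(p_forward, p_reverse)

def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as 0.5.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive."""
    labels = np.asarray(labels).astype(int)
    if labels.sum() == 0:
        raise ValueError("need at least one positive")
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

# Toy predictions for four test sequences in one cell type.
labels = np.array([1, 0, 1, 0])
p_fwd = np.array([0.9, 0.2, 0.30, 0.60])
p_rev = np.array([0.7, 0.4, 0.35, 0.05])
scores = combine_strands(p_fwd, p_rev)   # [0.9, 0.4, 0.35, 0.6]
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

In a multi-label setting these metrics would be computed per cell type, that is, per column of the prediction matrix.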
2.2.3. Performance Benchmarking with Other Methods
For benchmark purposes, we also trained a DeepSEA model [
20], a DanQ model [
21], and a random forest (RF) model [
29], for comparison with scEpiLock. Both DeepSEA and DanQ are deep learning models developed for making functional predictions of genomic sequences. These two models are designed to predict 919 chromatin features. To make cell-type-specific multi-label predictions for each input peak, we changed the output of the last fully connected layer from n × 919 to n × m, where n is the number of peaks and m is the number of cell types. The implementation of DeepSEA and DanQ can be found in the GitHub repository we provided. Unlike the deep learning models, which take multi-dimensional input, RF can only take a one-dimensional input per sequence. Thus, we converted each 4 × 1000 binary matrix into a binary vector of length 4000 by concatenating the values together. The max depth and minimum samples per leaf were set to 40 and 20, respectively, to avoid overfitting. The other parameters were kept at the defaults of scikit-learn’s (v0.24.2) [
30] Random Forest classifier function.
2.3. scEpiLock Module 2—Object Detection for CRE Boundary Refinement
2.3.1. Grad-CAM
We framed the peak boundary localization problem as a classic object detection problem in computer vision and adopted a weakly supervised learning scheme, Grad-CAM [
23], to refine the very coarse peak definitions from the scATAC-seq data. Specifically, we used the gradients of any target (e.g., the positive peak label for a certain cell type) flowing into the last convolutional layer to produce a coarse localization map indicating the important regions in the peak for predicting the target (
Figure 2). In turn, scEpiLock refined our peak annotations by reporting the most salient epigenomic regions, e.g., the TF binding sites, within the peaks as core regions. Specifically, the convolutional layers in scEpiLock were trained to identify the key regions within given peaks and retain spatial information, which is lost in the fully connected layers. Thus, the last convolutional layer had the best balance between high-level representation and detailed spatial information. The gradient information flowing into the last convolutional layer of the CNN provided importance values for each neuron in predicting cell-type-specific accessibility. Thus, we calculated the gradient of the score for cell type $c$, $y^c$, with respect to each feature map activation $A^k$ of a convolutional layer, i.e., $\partial y^c / \partial A^k$. These gradients were global-average pooled to obtain the neuron importance weights $\alpha_k^c$:
$$\alpha_k^c = \frac{1}{Z} \sum_{p=1}^{Z} \frac{\partial y^c}{\partial A_p^k}$$
where $k$ indexes the feature maps of the layer and $p$ indexes the $Z$ positions within a feature map. To obtain the Grad-CAM score, we performed a weighted combination of the forwarded activation maps followed by a ReLU:
$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$
The output of Grad-CAM was a 1D map with the length of the last convolutional feature maps (57). To assess the position-wise feature importance over the original 1 kb input fragment, we upscaled the 1D map back to a length of 1000. Thus, one Grad-CAM value represented the importance of a bin of about 17.5 bp (1000/57 ≈ 17.5), which is wide enough to cover a typical TF binding site. To refine our predictions, we used the Grad-CAM score to select the subset of 17.5 bp bins with higher scores. Here, our cut-off was the 80th percentile of all the Grad-CAM scores computed. Such a weakly supervised learning scheme increased our model’s interpretability by revealing and visualizing the decision-making process in our network.
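The refinement step can be illustrated with a NumPy sketch that starts from already computed activations and gradients of the last convolutional layer. Several choices here are assumptions rather than details confirmed by the text: the function names, linear interpolation as the upscaling method, and applying the 80th percentile cut-off within a single fragment rather than across all computed scores.

```python
import numpy as np

def grad_cam_1d(activations, gradients):
    """Grad-CAM for 1D feature maps of shape (n_filters, n_positions)."""
    alpha = gradients.mean(axis=1)  # global average pooling of gradients per filter
    weighted = (alpha[:, None] * activations).sum(axis=0)
    return np.maximum(weighted, 0.0)  # ReLU keeps only positive evidence

def upscale(cam, length=1000):
    """Linearly interpolate the coarse map to one value per base pair."""
    centres = (np.arange(len(cam)) + 0.5) * length / len(cam)  # bin centres in bp
    return np.interp(np.arange(length) + 0.5, centres, cam)

def core_regions(scores, percentile=80):
    """Half-open (start, end) runs where the score exceeds the percentile cut-off."""
    mask = scores > np.percentile(scores, percentile)
    padded = np.concatenate(([False], mask, [False])).astype(int)
    edges = np.flatnonzero(np.diff(padded))  # alternating run starts and run ends
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))

# Toy example: 4 filters, 57 positions; only filter 0 supports the target cell type.
activations = np.zeros((4, 57))
activations[0, 20:26] = 1.0      # filter 0 fires on a short stretch
gradients = np.zeros((4, 57))
gradients[0] = 0.5               # positive gradient of the cell-type score

cam = grad_cam_1d(activations, gradients)
per_base = upscale(cam)
regions = core_regions(per_base)
```

In this toy case the single reported core region spans the stretch where filter 0 fires, rescaled from feature-map coordinates to base-pair coordinates.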
2.3.2. Conservation Scores for Refined Regions
We compared the cross-species conservation scores of the refined vs. raw peaks to test whether scEpiLock could detect true functional regions, as these scores were strong functionality indicators [
31,
32]. Specifically, we downloaded 100-way PhastCons [
33] and calculated the averaged conservation scores per raw peak vs. scEpiLock’s refinements. A one-sided
t test was used to calculate the
p-values.
2.4. scEpiLock Module 3—Variant Impact Quantification
To identify important variants that alter a cell type’s chromatin accessibility, we used a previously trained scEpiLock multi-label classifier to predict the peak accessible probability (accessible score) for the WT and candidate mutant sequences. We then computed the absolute difference between the WT and mutant scores to generate the delta score. A higher delta score indicates that the variant has a larger functional impact in a particular cell type.
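The delta score computation can be sketched as below. The trained classifier is replaced by `toy_predict`, a hypothetical stand-in that returns fixed scores depending on whether a short motif is present; the function names and the 0-based SNP coordinate are also assumptions for illustration.

```python
def apply_snp(sequence, position, ref, alt):
    """Return the mutant sequence with `alt` substituted at 0-based `position`."""
    if sequence[position].upper() != ref.upper():
        raise ValueError(f"reference allele mismatch at {position}: "
                         f"expected {ref}, found {sequence[position]}")
    return sequence[:position] + alt + sequence[position + 1:]

def delta_scores(predict, wt_sequence, position, ref, alt):
    """Absolute per-cell-type difference between WT and mutant accessible scores."""
    mutant = apply_snp(wt_sequence, position, ref, alt)
    wt_scores = predict(wt_sequence)
    mut_scores = predict(mutant)
    return {cell: abs(wt_scores[cell] - mut_scores[cell]) for cell in wt_scores}

def toy_predict(sequence):
    """Hypothetical stand-in for the trained multi-label classifier."""
    return {
        "microglia": 0.9 if "GGAA" in sequence else 0.1,  # motif-dependent score
        "astrocytes": 0.5,                                # motif-independent score
    }

wt = "A" * 498 + "GGAA" + "A" * 498   # toy 1000 bp fragment with one motif
scores = delta_scores(toy_predict, wt, position=499, ref="G", alt="C")
```

Here the variant destroys the only motif, so the delta score is large for the motif-dependent cell type and zero for the other, which is the cell-type-specific pattern the module is meant to surface.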
2.4.1. GWAS Data Used and SNP Extraction
We downloaded a list of 930 putative disease-relevant SNPs for Alzheimer’s disease (AD) and Parkinson’s disease (PD). The criteria and process of selecting the SNPs are detailed in the original paper [
26], and the full list of SNPs can be found in their
Supplementary Table S2. We first trained our scEpiLock model on the brain scATAC-seq data. Then, we used the trained model to predict the cell-type accessibility probability (accessible score) of both the WT and mutant for each cell type. Functional SNPs were those with large accessible score differences between the WTs and mutants.
2.4.2. Sequencing Tracks
All sequencing tracks were created using the UCSC Genome Browser [
34] and shared the same x axis with the hg38 reference genome. The scATAC-seq, SNP, co-accessibility, and HiChIP tracks were custom tracks, while the other tracks were selected from the Genome Browser. The track of the scATAC-seq was the brain scATAC-seq cell-type-specific peaks. The SNP tracks of rs1237999, rs636317, and rs10769263 were added in the format of a Personal Genome SNP. The co-accessibility-based peak links were created by Cicero [
13] using the brain scATAC-seq peaks. Both the co-accessibility and the HiChIP data were downloaded from the paper’s Supplementary Data 9 [
26] and formatted as interactive tracks viewed in the Genome Browser.
4. Discussion
We present scEpiLock, a weakly supervised deep learning framework to predict and refine chromatin accessible peaks and localize functional genomic variants in a cell-type-specific manner. Our model has three main modules: a multi-label classifier, an object detection module, and a variant impact quantification module.
For the multi-label classifier, we trained a deep learning model on the chromatin accessible peaks of each cell type from scATAC-seq data. The CNN layers in the model have the capacity to learn genomic patterns and cell-type-specific regulatory interactions. The dense layers integrate the learned information and compute cell-type-specific accessible scores. We evaluated the model’s performance on two public datasets: PBMC scATAC-seq data and brain scATAC-seq data. scEpiLock achieved state-of-the-art results on both datasets. With a pre-trained scEpiLock model, we can efficiently extract complex gene regulatory patterns and predict the cell-type-specific accessibility of any given peak, provided the cell type was used to train the model. The model opens the door for low-cost personalized chromatin accessibility predictions: once trained on enough data, scEpiLock can be used to predict the peak accessibility for multiple cell types, even when the peak sequence is patient-specific and has not been seen before. With more scATAC-seq datasets becoming available, we expect scEpiLock to be applicable to many more cell types.
In addition to multi-label predictions, the scEpiLock model also incorporates a boundary detection module via Grad-CAM. The position-wise importance scores can be used to refine the coarse peaks and identify the core functional regions. We showed that our condensed peaks have high PhastCons scores, indicating that they are highly conserved during evolution. Moreover, scEpiLock has a variant impact quantification module to prioritize putative disease-associated SNPs. It calculates the difference in accessible scores between WTs and mutants to identify SNPs that have a high impact on cell-type-specific chromatin accessibility.
There are several future directions to explore. One direction is to further improve the multi-label prediction module. For instance, bi-directional long short-term memory (bi-LSTM) layers can be incorporated to help capture long-range interactions. Attention-based models can also be evaluated. In addition, Grad-CAM can be replaced by other object boundary detection methods, such as Grad-CAM++ [
44], FullGrad [
45], etc. We reason that scEpiLock can serve as a base model to predict and refine scATAC-seq peaks and evaluate functional SNPs, while the specific model structure or refinement method can be changed.
Taken together, we present a deep learning tool that could be widely deployed for cell-type-specific open chromatin identification. In addition, scEpiLock can refine the open chromatin peaks by boundary detection and predict disease-related SNPs. scEpiLock is written as a Python package and can easily be incorporated into existing scATAC-seq analysis pipelines, e.g., SnapATAC [
7], ArchR [
8].