A Non-Binary Approach to Super-Enhancer Identification and Clustering: A Dataset for Tumor- and Treatment-Associated Dynamics in Mouse Tissues

Osintseva, Ekaterina D.; Ashniev, German A.; Orlov, Alexey V.; Nikitin, Petr I.; Zaitseva, Zoia G.; Volkov, Vladimir V.; Orlova, Natalia N.

doi:10.3390/data10050074

Open Access Data Descriptor

by Ekaterina D. Osintseva ¹, German A. Ashniev ¹,²,³, Alexey V. Orlov ¹,*, Petr I. Nikitin ¹,⁴, Zoia G. Zaitseva ¹, Vladimir V. Volkov ¹,⁴ and Natalia N. Orlova ¹,*

¹ Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia

² Faculty of Biology, Lomonosov Moscow State University, Leninskiye Gory, MSU, 1-12, 119991 Moscow, Russia

³ Institute for Information Transmission Problems RAS, 127051 Moscow, Russia

⁴ National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), 31 Kashirskoe Shosse, 115409 Moscow, Russia

* Authors to whom correspondence should be addressed.

Data 2025, 10(5), 74; https://doi.org/10.3390/data10050074

Submission received: 28 February 2025 / Revised: 1 May 2025 / Accepted: 12 May 2025 / Published: 14 May 2025

(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

Abstract

Super-enhancers (SEs) are large clusters of highly active enhancers that play key regulatory roles in cell identity, development, and disease. While conventional methods classify SEs in a binary fashion—super-enhancer or not—this threshold-based approach can overlook significant intermediate states of enhancer activity. Here, we present a dataset and accompanying framework that facilitate a more nuanced, non-binary examination of SE activation across mouse tissue types (mammary gland, lung tissue, and NMuMG cells) and various experimental conditions (normal, tumor, and drug-treated samples). By consolidating overlapping SE intervals and capturing continuous enhancer activity metrics (e.g., ChIP-seq signal intensities), our dataset reveals gradual transitions between moderate and high enhancer activity levels that are not captured by strictly binary classification. Additionally, the data include extensive functional annotations, linking SE loci to nearby genes and enabling immediate downstream analyses such as clustering and gene ontology enrichment. The flexible approach supports broader investigations of enhancer landscapes, offering a comprehensive platform for understanding how SE activation underpins disease mechanisms, therapeutic response, and developmental processes.

Dataset: Data are contained within the article or Supplementary Materials.

Dataset License: CC BY 4.0

Keywords: super-enhancers; non-binary classification; ChIP-seq; tumor biology; enhancer clustering; epigenetics; gene regulation; functional annotation

1. Summary

Super-enhancers (SEs) are large regulatory regions characterized by exceptionally strong enhancer activity and extensive transcription factor occupancy [1,2,3,4,5]. They have been associated with the regulation of key cell identity genes and dynamic transcriptional programs across diverse biological contexts [6,7,8,9,10]. While their exact mechanisms remain an area of active research, SEs represent intriguing elements of genome regulation with potential implications for development, disease, and cellular plasticity [3,5,11,12,13]. A widely used tool for identifying SEs is the ROSE (Rank Ordering of Super-Enhancers) algorithm, which distinguishes SEs from typical enhancers by integrating ChIP-seq signals and merging closely spaced enhancer regions [4,14,15,16]. This binary classification—where an enhancer is either designated as a super-enhancer or not—has been instrumental in highlighting the existence of SEs and their impact on gene expression in both normal and pathological states [17,18,19].

However, while the binary SE concept has been successful [20,21,22,23], a more nuanced perspective can provide deeper insights into tumor- and treatment-associated enhancer landscapes. For instance, in a normal tissue sample, a genomic locus may exhibit substantial enhancer activity but still fall just below the threshold required for SE classification by ROSE. Conversely, in a tumor sample, the same locus may surpass the threshold and be categorized as an SE. Strict presence/absence classification may therefore overlook intermediate states of enhancer activation. A non-binary perspective enables the capture of a broader spectrum of enhancer activity changes, covering not only on/off states but also transitions between moderate and high activity levels.

In this study, we present a dataset designed to facilitate such a non-binary approach to SE characterization, clustering, and functional annotation. Our dataset incorporates multiple tissue types and experimental conditions, including normal mammary gland, mammary tumors under various treatments, NMuMG cells treated with TGF-β at different time points, and lung samples encompassing postnatal, tumor, and adenocarcinoma phenotypes. By consolidating overlapping SE calls across samples and collecting continuous enhancer activity metrics (e.g., ChIP-seq signal intensities), we provide a platform for studying dynamic enhancer architecture changes. Additionally, our dataset includes information on nearby target genes, enabling downstream functional analyses such as gene set enrichment or pathway exploration.

In this work, we show an approach that moves beyond conventional binary SE classification, capturing gradual or partial shifts in enhancer activity to provide deeper insights into SEs as key regulatory elements in gene regulation and cellular processes. Our analysis focuses specifically on loci that are classified as SEs in at least one sample. While we recognize the limitation of excluding loci that exhibit only typical enhancers across the analyzed samples, we aim to balance the complexity of the analysis against a clear demonstration of the main ideas and overall concept of our study. The following sections describe our data organization, feature tables, clustering approaches, and functional annotations, outlining a framework for investigating SE activity dynamics and their functional relevance at the gene level.

2. Data Description

Each Sample Set consists of a structured collection of subsets that systematically organize and represent the data. The structure of these subsets is consistent across all Sample Sets, ensuring a standardized format for data analysis. Each Sample Set contains a specific group of mouse-derived samples, which differ in biological origin and experimental conditions, while the organization of the corresponding subsets remains the same. A detailed schematic of the full data-processing pipeline is provided in Figure S1. We selected a limited set of mouse tissues to provide a clear proof-of-concept demonstration of the non-binary framework; the same workflow can be applied to ChIP-seq peak sets from other species or tissues.

The following mouse Sample Sets are included in this study: Mammary Gland Tissue Samples, including both normal and tumor tissue samples (Set 1); Lung Tissue Samples, including postnatal, tumor, and adenocarcinoma samples (Set 2); NMuMG Cell Line: untreated and TGF-β-treated (4 h, 24 h) (Set 3). The detailed description of all included samples is provided in Section 3.1. To illustrate the structure of the subsets used in each Sample Set, we describe the organization of Set 2: Lung Tissue Samples as an example. This choice does not affect the general applicability of the described structure, as the same data organization is maintained across all Sample Sets.

2.1. Subset 1: SE Locus Consolidation Information

This subset contains the list of consolidated super-enhancer (SE) loci identified after merging overlapping or closely located SEs (within 12.5 kb) from multiple samples of the same tissue type. The subset provides information on the genomic position of each consolidated SE locus, the original SEs contributing to the locus, and their presence across different samples.

The columns are structured as follows:

Locus ID (se_locus_id): A unique identifier assigned to each consolidated SE locus.
Genomic Coordinates (chr, start, end): The chromosomal position (chromosome, start, end) of the consolidated SE locus.
Original SEs (se_id_list): A comma-separated list of SE identifiers that contributed to the consolidated locus, along with their originating samples.
Sample-Specific Presence Indicators (SE_11_0077, SE_12_0118, SE_12_0449, SE_12_0450): Binary values (0 or 1) indicating whether the SE locus is present (1) or absent (0) in each sample.

For example, in se_region_1, the consolidated locus on chr1 (20832643–20924232) is formed by SE_12_045000567 and is only present in Sample_12_0450 (0,0,0,1). In contrast, se_region_2 on chr1 (23159877–23215737) is formed by SEs from two different samples (SE_12_045000698, SE_12_044900640) and is present in both Sample_12_0450 and Sample_12_0449 (0,0,1,1).

2.2. Subset 2: Presence of Super-Enhancers and Typical Enhancers in Consolidated SE Loci

This subset provides information about all SE-locus-associated enhancer elements: both super-enhancers and elements classified as typical enhancers (TEs) by the ROSE algorithm, found within the identified consolidated super-enhancer loci. The TE category includes both individual enhancers and stitched enhancer elements with sub-threshold signals that do not meet SE criteria.

For each consolidated SE locus, we checked every individual sample to determine whether it contained SE or TE elements. This allows us to identify enhancer elements that, while not classified as super-enhancers, still map to SE loci identified in other samples.

Each row in the subset represents one SE-locus-associated enhancer element found within a consolidated SE locus and includes details about its genomic location, ranking, and activity level.

The columns are structured as follows:

Locus ID (se_locus_id): The identifier of the consolidated SE locus where the enhancer is located.
Sample ID (cell_id): The originating sample in which the enhancer is detected.
Enhancer ID (ste_id): A unique identifier for the SE/TE within the sample.
Genomic Coordinates (ste_chr, ste_start, ste_end): The chromosomal position (chromosome, start, end) of the enhancer.
Rank (ste_rank): The ranking of the enhancer within the consolidated SE locus, where a lower rank typically indicates higher activity.
ChIP-seq Signal (avg_rpm_diff): The average signal intensity of the enhancer, normalized against the control, representing enhancer activity.
Overlap (overlap): The extent of overlap between the enhancer and the consolidated SE locus.
Weight Within Locus (ste_weight_within_locus): The relative contribution of the enhancer to the weighted average signal intensity of the consolidated locus, calculated as the ratio of the enhancer’s overlap with the locus to the total sum of overlaps between the locus and all enhancers in the sample.

For example, in se_region_1, a consolidated SE locus was identified. Within this locus, enhancer elements were derived from multiple samples, resulting in a total of seven entries in the subset:

From Sample_12_0450: one super-enhancer (SE_12_045000567).
From Sample_12_0449: three typical enhancers (TE_12_044906558, TE_12_044906059, TE_12_044900944).
From Sample_12_0118: three typical enhancers (TE_12_011809209, TE_12_011810352, TE_12_011800952).
From Sample_11_0077: no enhancer elements were detected within this SE locus.

Each SE-locus-associated element is represented as a separate row in the subset, with details about its genomic coordinates, ranking, and signal intensity. This structure allows us to track all SE-locus-associated elements (e.g., TE elements that, while not classified as SEs in their respective samples, still belong to a larger SE locus identified in another sample).

2.3. Subset 3: Features of Consolidated Super-Enhancer Loci

This subset contains computed features for each consolidated super-enhancer (SE) locus per sample, helping to characterize their regulatory activity. Each row represents a specific SE locus in a given sample, including its activity measurements and presence indicators.

The columns are structured as follows:

Locus ID (se_locus_id): The identifier of the consolidated SE locus.
Sample ID (cell_id): The originating sample in which the SE locus is analyzed.
Max ChIP-seq Signal (avg_rpm_diff__max): The maximum signal intensity (avgRPM) observed within the SE locus in the given sample.
Weighted Mean ChIP-seq Signal (avg_rpm_diff__weighted): The weighted average ChIP-seq signal (avgRPM) across all enhancers within the SE locus, where weights are determined by the enhancer-locus overlap.
Max Enhancer Rank (max_rank): The highest (worst) rank among all enhancers within the SE locus in the given sample.
Min Enhancer Rank (min_rank): The lowest (best) rank among all enhancers within the SE locus in the given sample.
Binary SE Presence Indicator (active_SE): A binary value (1 or 0) indicating whether the SE locus contains at least one element classified as a super-enhancer by the ROSE algorithm in the given sample.
Active SE Count (active_SE_count): The number of super-enhancers detected within the SE locus in the given sample.
Active TE Count (active_TE_count): The number of typical enhancers identified within the SE locus in the given sample.

For example, in se_region_1, Sample_12_0450 exhibits the highest max ChIP-seq signal (40015.23) and is classified as an active SE (active_SE = 1), while the same region in Sample_12_0118 has a much lower max signal (6946.63) and is not classified as an active SE (active_SE = 0) but contains three stitched enhancers. Similarly, se_region_3 in Sample_12_0449 is classified as an active SE (active_SE = 1) with a max ChIP-seq signal of 35437.43, whereas the same region in Sample_12_0450 remains inactive (active_SE = 0) despite the presence of three stitched enhancers.

2.4. Subset 4: Preprocessing for Clustering

This subset is designed for clustering analysis, transforming the dataset into a structured format suitable for unsupervised learning techniques. Each row corresponds to a consolidated super-enhancer (SE) locus, while the columns represent activity measures across different samples before and after normalization. This preprocessing step facilitates comparative analysis by standardizing signal intensity values across samples, allowing for a more robust clustering approach.

The columns are structured as follows:

Locus ID (se_locus_id): The identifier of the consolidated SE locus.
Binary Presence Indicators (SE_11_0077_is, SE_12_0118_is, SE_12_0449_is, SE_12_0450_is): Binary values (0 or 1) indicating whether the SE locus is present (1) or absent (0) in each sample.
Raw ChIP-seq Signal (SE_11_0077, SE_12_0118, SE_12_0449, SE_12_0450): The original activity values (avgRPM) of the SE locus across the given samples.
Median-Normalized Signal (SE_11_0077_medianNormalized, SE_12_0118_medianNormalized, SE_12_0449_medianNormalized, SE_12_0450_medianNormalized): The activity values after median normalization, which adjusts the distribution to reduce sample-specific biases.
Imputed Median-Normalized Signal (SE_11_0077_medianNormalized_imputed, SE_12_0118_medianNormalized_imputed, SE_12_0449_medianNormalized_imputed, SE_12_0450_medianNormalized_imputed): Median-normalized values with imputed data to replace missing values (if any).
Log-transformed Normalized Median Signal (SE_11_0077_medianNormalized_imputed_log1p, SE_12_0118_medianNormalized_imputed_log1p, SE_12_0449_medianNormalized_imputed_log1p, SE_12_0450_medianNormalized_imputed_log1p): The median-normalized values after data imputation, transformed using log(1 + x).

Z-scaled Log-transformed Normalized Median Signal (SE_11_0077_medianNormalized_log1p_zscaled, SE_12_0118_medianNormalized_log1p_zscaled, SE_12_0449_medianNormalized_log1p_zscaled, SE_12_0450_medianNormalized_log1p_zscaled): The log-transformed imputed median-normalized values further standardized using Z-score normalization, ensuring that each sample has a mean of 0 and standard deviation of 1 for comparability across datasets.

For example, the se_region_10 locus is present in three out of four samples (SE_12_0118, SE_12_0449, and SE_12_0450) and absent in SE_11_0077, as indicated by the binary presence indicators (0,1,1,1). The raw ChIP-seq signal intensity varies significantly between samples: 1,830.14 in SE_11_0077, 16,544.05 in SE_12_0118, 33,916.11 in SE_12_0449, and 34,832.26 in SE_12_0450. After median normalization, the values adjust to 0.3979, 1.3920, 1.2751, and 0.9226. The subsequent log(1 + x) transformation further stabilizes variance, resulting in values of 0.3350, 0.8721, 0.8220, and 0.6537. Finally, the Z-score normalization standardizes the log-transformed values to −0.6782, 0.4236, 0.3564, and −0.1114, centering the data around zero for improved comparability across samples.
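The log(1 + x) step of this example can be checked with a few lines of Python. The sketch below only re-derives the transformed values from the median-normalized values quoted above; the dictionary names are illustrative and are not part of the dataset. The Z-scores are not re-derived here, because they depend on all loci in each sample rather than on this single locus.

```python
import math

# Median-normalized values for se_region_10, copied from the example above.
median_normalized = {
    "SE_11_0077": 0.3979,
    "SE_12_0118": 1.3920,
    "SE_12_0449": 1.2751,
    "SE_12_0450": 0.9226,
}

# log(1 + x) values reported in the example above.
reported_log1p = {
    "SE_11_0077": 0.3350,
    "SE_12_0118": 0.8721,
    "SE_12_0449": 0.8220,
    "SE_12_0450": 0.6537,
}

# Recompute log(1 + x) with math.log1p, which is accurate for small x.
recomputed_log1p = {s: math.log1p(v) for s, v in median_normalized.items()}

for sample, value in recomputed_log1p.items():
    print(f"{sample}: recomputed {value:.4f}, reported {reported_log1p[sample]:.4f}")
```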

2.5. Clustering and Gene Associations

This subset presents the clustering results of consolidated super-enhancer loci based on unsupervised learning techniques, along with their functional annotation. Each row corresponds to a specific SE locus, detailing its clustering assignment and the closest active gene associated with the SE (according to SEdb terminology). Notably, this table was generated only for Set 2, the Lung Tissue Samples (SE_11_0077, SE_12_0118, SE_12_0449, SE_12_0450), to demonstrate clustering and gene association patterns.

The columns are structured as follows:

Locus ID (se_locus_id): A unique identifier assigned to each consolidated SE locus.
Cluster Assignment (louvain_module): The cluster number assigned to the SE locus using the Louvain community detection algorithm, which groups SE loci based on shared enhancer activity patterns.
Degree Centrality (degree_centrality): The proportion of other loci that the SE locus is directly connected to within the network.
Betweenness Centrality (betweenness_centrality): A measure of how often the SE locus lies on the shortest path between other loci, indicating its importance in network connectivity.
Closeness Centrality (closeness_centrality): The inverse of the average shortest path distance from the SE locus to all other loci, representing how central it is within the network.
Gene Associations (SE_11_0077__gene_closest_active, SE_12_0118__gene_closest_active, SE_12_0449__gene_closest_active, SE_12_0450__gene_closest_active): If an SE was associated with the locus in a sample, its closest active gene is indicated in the subset.

3. Methods

3.1. Raw Data Collection

To evaluate our methodology and generate datasets derived from its application, we utilized publicly available mouse data from the SEdb 2.0 database [24]. We selected three mouse dataset groups based on tissue type and experimental conditions to demonstrate the applicability of our approach. Each dataset comprises individual mouse samples from SEdb 2.0, which we grouped into structured sets for analysis.

A summary of the Sample Sets along with their corresponding SEdb 2.0 IDs is presented in Table 1.

Each Sample Set consists of multiple samples that share a common biological context (e.g., mammary gland or lung tissue). These sets form the basis of the final dataset structure, as described in Section 2, where individual samples are referenced by their unique identifiers (e.g., Sample_12_0136).

3.2. Consolidation of Super-Enhancer Loci Across Sample Groups

Within each Sample Set (mammary gland tissue, NMuMG cell line, lung tissue), all super-enhancer intervals from the included samples were aggregated into a single dataset. These intervals were then consolidated and sorted by chromosome and start position. To ensure a unified representation of super-enhancer loci, overlapping or closely spaced intervals—defined as those within 12,500 bp of each other—were further consolidated using the “-d 12500” parameter in BEDTools merge (v2.30.0) [25]. This process standardized the enhancer regions within each group, facilitating further comparative analyses across Sample Sets.
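The consolidation step can be illustrated without BEDTools. The following is a minimal pure-Python sketch that approximates the behavior of merging sorted intervals with a 12,500 bp distance threshold and then derives per-sample presence indicators like those in Subset 1. The function names, the toy coordinates, and the sample labels are hypothetical, and the sketch assumes BED-style coordinates where the gap between two intervals is the next start minus the current end.

```python
from collections import defaultdict


def merge_intervals(intervals, max_gap=12500):
    """Merge same-chromosome intervals that overlap or lie within max_gap bp.

    intervals: iterable of (chrom, start, end, sample_id) tuples.
    Returns a list of (chrom, start, end, sorted_sample_ids) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, sample in intervals:
        by_chrom[chrom].append((start, end, sample))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end, cur_samples = None, None, set()
        # Sorting by start position lets a single pass build each locus.
        for start, end, sample in sorted(by_chrom[chrom]):
            if cur_start is not None and start - cur_end <= max_gap:
                # Close enough to the current locus: extend it.
                cur_end = max(cur_end, end)
                cur_samples.add(sample)
            else:
                if cur_start is not None:
                    merged.append((chrom, cur_start, cur_end, sorted(cur_samples)))
                cur_start, cur_end, cur_samples = start, end, {sample}
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end, sorted(cur_samples)))
    return merged


def presence_row(locus_samples, all_samples):
    """Return 0/1 indicators, one per sample, for a consolidated locus."""
    return [1 if s in locus_samples else 0 for s in all_samples]


# Toy input (hypothetical coordinates and sample labels).
toy = [
    ("chr1", 100, 200, "sample_A"),
    ("chr1", 12000, 13000, "sample_B"),  # 11,800 bp after the first: merged
    ("chr1", 40000, 41000, "sample_A"),  # 27,000 bp after the locus: separate
    ("chr2", 50, 60, "sample_B"),
]
loci = merge_intervals(toy)
for chrom, start, end, samples in loci:
    print(chrom, start, end, presence_row(samples, ["sample_A", "sample_B"]))
```

Keeping the contributing sample labels with each merged locus is what makes the presence indicators a by-product of the merge rather than a separate lookup.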

3.3. Identification of Enhancer-Based Elements Within Consolidated Super-Enhancer Loci

The original typical enhancer intervals from samples within the corresponding Sample Set were aggregated into a single dataset, retaining the Sample ID and Enhancer ID, in the same way as the super-enhancer intervals. SE and TE genomic intervals were concatenated and annotated with features characterizing enhancer activity, namely the average per-bp RPM difference between case and input (avg_rpm_diff), used as the signal intensity measure, and the enhancer rank according to ROSE. Enhancers located within consolidated super-enhancer loci were detected using BEDTools intersect with the “-wo -f 0.5 -F 1 -e” parameters, which require that at least half of the enhancer’s length overlap the locus or that the enhancer fully cover the locus. This produced a table describing the enhancers found within the consolidated SE loci for each sample.
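The overlap rule described above can be written as a small predicate. This is a sketch only: it assumes that the enhancer intervals play the role of the first input and the consolidated loci the second, which the text implies but does not state explicitly, and the function names are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of overlapping base pairs between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def enhancer_in_locus(enhancer, locus, min_enh_frac=0.5, min_locus_frac=1.0):
    """Decide whether an enhancer is assigned to a consolidated locus.

    enhancer, locus: (start, end) tuples on the same chromosome.
    The enhancer is kept if at least min_enh_frac of its own length overlaps
    the locus, OR if at least min_locus_frac of the locus is overlapped
    (with 1.0, the enhancer fully covers the locus).
    """
    ov = overlap_bp(enhancer[0], enhancer[1], locus[0], locus[1])
    if ov == 0:
        return False
    enh_frac = ov / (enhancer[1] - enhancer[0])
    locus_frac = ov / (locus[1] - locus[0])
    return enh_frac >= min_enh_frac or locus_frac >= min_locus_frac


locus = (1000, 2000)
print(enhancer_in_locus((900, 1100), locus))   # half of the enhancer overlaps
print(enhancer_in_locus((800, 1050), locus))   # only 20% of the enhancer overlaps
print(enhancer_in_locus((500, 3000), locus))   # enhancer fully covers the locus
```

The "either" condition matters for long stitched enhancers: an element much larger than the locus can fail the 50% test on its own length yet still be assigned because it spans the entire locus.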

The activity of the consolidated loci in each sample was assessed using two approaches. In the first approach, the locus activity was determined by the activity of the enhancer with the maximum signal intensity among all enhancers within the given locus. In the second approach, the activity was defined as the weighted average signal intensity of elements within the locus, where the weight of each enhancer was the ratio of the overlap size between the locus and the enhancer to the total sum of overlaps between the locus and all enhancers. A consolidated locus was classified as a super-enhancer region for a given sample if it overlapped with at least one original super-enhancer interval from that sample. The final classification results were stored in a summary table, where each row represented a unique consolidated super-enhancer locus, and dedicated columns indicated locus activity and enhancer status for each sample. This structured representation enabled a comparative assessment of enhancer activity across different Sample Sets.
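As an illustration of the two activity measures, the sketch below computes the maximum signal and the overlap-weighted mean signal for one locus in one sample. The function name and the toy numbers are hypothetical; only the weighting rule (overlap of each enhancer divided by the sum of overlaps) is taken from the description above.

```python
def locus_activity(elements):
    """Summarize enhancer activity for one consolidated locus in one sample.

    elements: list of (signal, overlap_bp) pairs, one per enhancer in the locus.
    Returns None when the sample has no enhancer in the locus.
    """
    if not elements:
        return None
    total_overlap = sum(ov for _, ov in elements)
    # Weight of each enhancer = its overlap / sum of all overlaps in the locus.
    weights = [ov / total_overlap for _, ov in elements]
    max_signal = max(sig for sig, _ in elements)
    weighted_signal = sum(w * sig for w, (sig, _) in zip(weights, elements))
    return {"max": max_signal, "weighted": weighted_signal, "weights": weights}


# Toy example: two enhancers with signals 10 and 20 and overlaps 100 and 300 bp.
result = locus_activity([(10.0, 100), (20.0, 300)])
print(result)
```

In this toy case the weights are 0.25 and 0.75, so the weighted mean (17.5) sits closer to the signal of the enhancer that occupies more of the locus, while the maximum (20.0) ignores overlap altogether.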

3.4. Feature Matrix Construction and Dimensionality Reduction

A feature matrix was constructed to represent each consolidated super-enhancer locus across all relevant samples. Each row corresponded to a single consolidated locus, and each column represented a specific measurement or indicator for one sample. This matrix included binary indicators denoting the presence or absence of super-enhancers in each sample, as well as continuous measures of enhancer activity. This structured representation enabled the integration of both binary and continuous variables for further computational analysis.

To mitigate the influence of technical variation between samples, the signal intensity in each sample was normalized to the median signal intensity of the consolidated SE loci in which activity was detected in that sample. The absence of a signal may indicate weak activity within the locus that is not detectable by ChIP-seq. All missing values were therefore imputed with a very small positive number, such as 10⁻⁶, which, in addition to reflecting the biological context, reduces the influence of zeroes on the signal distribution within the sample. Prior to imputation, the proportion of loci with missing values was assessed for each sample, since proportions above 50% significantly skew the distribution and may indicate poor sample quality, which is likely to affect further analysis. To mitigate the influence of extreme values, all continuous variables in the feature matrix were transformed using the log(1 + x) function and then standardized using z-scaling for subsequent clustering of the consolidated SE loci [26].
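The preprocessing chain for a single sample can be sketched with the Python standard library. This is an illustrative outline, not the authors' code: the function name is hypothetical, missing values are represented as None, and the z-scaling uses the population standard deviation, which is an assumption because the text does not say which estimator was used.

```python
import math
import statistics


def preprocess_sample(signals, fill=1e-6):
    """Median-normalize, impute, log-transform, and z-scale one sample.

    signals: list of per-locus signal values, with None where no enhancer
    was detected in the locus for this sample.
    """
    detected = [v for v in signals if v is not None]
    missing_fraction = 1 - len(detected) / len(signals)

    # Normalize to the median of loci with detected activity in this sample.
    median = statistics.median(detected)
    normalized = [v / median if v is not None else None for v in signals]

    # Replace missing values with a very small positive number.
    imputed = [fill if v is None else v for v in normalized]

    # log(1 + x) dampens extreme values.
    logged = [math.log1p(v) for v in imputed]

    # Z-scale within the sample (population SD is an assumption here).
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        raise ValueError("all values are identical; z-scaling is undefined")
    zscaled = [(v - mean) / sd for v in logged]

    return {
        "missing_fraction": missing_fraction,
        "normalized": normalized,
        "imputed": imputed,
        "log1p": logged,
        "zscaled": zscaled,
    }


# Toy sample with four loci, one of them without detected signal.
out = preprocess_sample([2.0, None, 4.0, 8.0])
print(out["missing_fraction"], out["zscaled"])
```

Reporting the missing fraction alongside the transformed values makes it straightforward to flag samples in which more than half of the loci lack a signal before they enter clustering.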

3.5. Clustering and Gene Association

For exploratory data analysis, Principal Component Analysis (PCA) with default parameters was applied to the scaled dataset, and the results were visualized in the coordinates of the first two principal components, which captured the highest variance [27]. To determine the optimal number of clusters for K-means clustering, the elbow method and the Davies–Bouldin index were used [28]. The K-means algorithm from scikit-learn 1.2.0 was then executed 1000 times with six clusters, using the k-means++ initialization method, a maximum of 300 iterations, and 10 different initializations. A consensus matrix was computed by aggregating clustering results across iterations [29]. To assess the presence of a well-defined block structure at the specified number of clusters, a clustermap was generated using seaborn, visualizing the distance matrix, defined as one minus the consensus matrix. If necessary, the number of clusters was adjusted based on these results. To further refine clustering, a graph representation was constructed from the consensus matrix using NetworkX 3.4.2 [30]. Nodes representing consolidated super-enhancer loci were clustered using the Louvain algorithm, which optimizes modularity to identify community structures. For dimensionality reduction, the scaled signal matrix was projected into two dimensions using Uniform Manifold Approximation and Projection (UMAP, v0.5.7) [31] with default parameters. The resulting UMAP coordinates were used to visualize consolidated SE loci, with coloring corresponding to Louvain cluster assignments. To associate consolidated loci with reference genes, we selected the closest active gene as indicated in SEdb 2.0.
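The consensus step can be illustrated independently of the clustering library. In the sketch below, each inner list stands for the cluster labels produced by one clustering run (in the workflow described above, one K-means run); the consensus value for a pair of loci is the fraction of runs in which they share a cluster, and the distance is one minus that value. The function names and the toy labelings are hypothetical.

```python
def consensus_matrix(labelings):
    """Fraction of runs in which each pair of items shares a cluster.

    labelings: list of label lists, one list per clustering run, all of the
    same length. Label values need not match between runs, because only
    co-membership within a run is counted.
    """
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]


def distance_matrix(consensus):
    """Distance defined as one minus the consensus value."""
    return [[1.0 - v for v in row] for row in consensus]


# Three toy runs over four loci; run 2 uses swapped label names on purpose.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
consensus = consensus_matrix(runs)
distance = distance_matrix(consensus)
for row in consensus:
    print([round(v, 3) for v in row])
```

Counting co-membership rather than comparing labels directly is what makes the result insensitive to the arbitrary numbering of clusters in each run. One plausible way to build the subsequent graph, not confirmed by the text, is to use the off-diagonal consensus values as edge weights.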

3.6. Functional Analysis of Protein Interaction Network

To investigate functional associations, we focused on the first module identified during clustering. From this module, we extracted only the genes associated with SEs from sample SE_12_0449. This refined gene list was then subjected to a multiple protein search in the STRING protein–protein interaction database [32] to explore potential functional interactions among the associated proteins.

3.7. Software Environment

All interval-based operations, such as identifying stitched enhancers within consolidated loci, were performed using bedtools intersect (BEDTools v2.31.1). Python-based analyses were conducted using PCA (scikit-learn v1.6.1: sklearn.decomposition.PCA) to reduce dimensionality for visualization and clustering input, UMAP (umap-learn v0.5.7: umap.umap_.UMAP) to generate a low-dimensional embedding while preserving local and global data structure, and KMeans (scikit-learn v1.6.1: sklearn.cluster.KMeans) to cluster loci based on enhancer activity profiles. The full workflow and codebase can be accessed at: https://github.com/kaaspen/NBa-SEic-dataset (accessed on 1 May 2025).

4. User Notes

The proposed approach and dataset are intended to support a more nuanced investigation of enhancer activity than is typically offered by binary classification methods. By consolidating super-enhancer calls across multiple lung tissue samples and incorporating non-binary features, users can observe gradual transitions between different enhancer activity states, rather than focusing exclusively on “present vs. absent” thresholds. For instance, a genomic locus that is weakly active in postnatal lung tissue but significantly upregulated in lung tumors may fall below the conventional super-enhancer cutoff in one sample but exceed it in another. Capturing such variations may help refine hypotheses regarding tumor-specific enhancer activation or the mechanistic effects of chromatin-modifying treatments.

To illustrate this multi-level approach, we provide a visualization of the clustering results for the Lung Sample Set (Figure 1). Each point represents a consolidated SE locus positioned in a two-dimensional embedding, with color indicating cluster membership. Figure 1 presents the distribution of median-normalized values transformed using log(1 + x) for the SE loci in cluster 1 identified in this study across the four samples: Sample_11_0077 (Lung normal), Sample_12_0118 (Lung tumor), Sample_12_0449 (Lung adenocarcinoma cell, Nkx2-1-positive), and Sample_12_0450 (Lung adenocarcinoma cell, Nkx2-1-negative).

The dataset also includes functional annotations for each SE locus, facilitating downstream analyses of gene regulation. For example, Figure 2a presents GO enrichment analysis for biological processes within the protein–protein interaction network, while Figure 2b depicts enriched KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways. By correlating SE activity patterns with these functional pathways, researchers can identify key regulatory elements potentially involved in oncogenic processes or treatment responses. This level of functional insight is particularly valuable for detecting disease-relevant enhancers that might be overlooked by strictly binary classification approaches.

Notably, although a comprehensive biological interpretation is beyond the scope of the current study, the clustering shown below for the lung set illustrates one possible application of the resource; the same analytical steps can be applied without modification to the mammary gland and NMuMG datasets.

Beyond these examples, the dataset accommodates various alternative workflows. Users can adopt different normalization strategies, clustering algorithms, or dimensionality reduction techniques; integrate multi-omic data (e.g., RNA-seq expression levels or DNA methylation patterns); or explore enhancer dynamics across different lung cell types and tumor subtypes. The modular table structure allows for flexible filtering based on criteria such as partial vs. complete activation, treatment-specific changes, or enhancer proximity to known oncogenes. Furthermore, because the consolidation and scaling procedures are designed for scalability, this approach can be extended to larger cohorts, different tissue types, or even single-cell epigenomic datasets.

Altogether, these user notes highlight both the flexibility of the dataset structure and the biological questions it can address. While the binary concept of SEs has been instrumental in many discoveries, exploring the gradual changes in enhancer activity often reveals additional layers of regulatory complexity. We anticipate that researchers studying lung development, tumor evolution, or therapeutic interventions will find this resource and workflow useful for generating novel hypotheses about enhancer-mediated gene regulation.

The non-binary representation of SE activity is intended as a practical roadmap for functional studies. By ranking loci on continuous H3K27ac intensity and clustering them across conditions, the workflow pinpoints context-specific regulatory ‘hot spots’ as prime candidates for CRISPR-based perturbation or reporter validation. The same pipeline can be applied iteratively to additional datasets to generate expanding panels of testable hypotheses.

5. Conclusions

In conclusion, this manuscript presents a generalizable workflow for intermediate-level super-enhancer identification, together with example datasets that demonstrate its applicability across multiple tissue types and treatment conditions. By integrating continuous signals, consolidating overlapping intervals, and employing clustering techniques, our approach illuminates subtle functional transitions between typical and super-enhancer states that would otherwise be lost in strictly binary classifications. It offers a versatile strategy for SE investigation and has the potential to provide deeper insights into the SE-driven emergence and progression of disease, the effects of various treatment types, and tissue-specific regulatory patterns. Looking ahead, the modular design of our datasets and workflows facilitates application to a wider range of research contexts, including single-cell epigenomics and multi-omics integration. By providing quantitative measures of enhancer activity and robust functional annotations, this dataset can support investigations of the dynamic roles of SEs in gene regulation, from developmental trajectories to therapeutic interventions. We anticipate that these insights into enhancer landscapes will deepen our understanding of cellular identity and disease progression, supporting the development of novel strategies for diagnosis, prognosis, and treatment. Thus, unlike conventional ROSE-based pipelines that impose a binary threshold, our workflow keeps the continuous signal measurements (e.g., H3K27ac ChIP-seq signal) and ranks each locus within a merged window, thereby preserving intermediate states of enhancer activation and revealing gradual, context-specific transitions between typical and super-enhancer status.
This non-binary framework offers a novel quantitative lens for prioritizing regulatory loci across tumors and treatments, enabling functional hypotheses that remain inaccessible under an on/off classification scheme.
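
The interval-consolidation step mentioned above can be pictured as a simple sweep over sorted intervals. The sketch below is only an illustration under stated assumptions: SE calls are given as (chromosome, start, end) tuples in half-open coordinates, overlapping or directly adjacent intervals are merged, and the function name and example coordinates are hypothetical rather than taken from the dataset or the authors' code.

```python
from collections import defaultdict


def consolidate(intervals):
    """Merge overlapping or adjacent (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                # First interval on this chromosome opens a new locus.
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlaps (or touches) the current locus: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current locus and open a new one.
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


# Hypothetical SE calls pooled from several samples.
se_calls = [
    ("chr1", 100, 500),
    ("chr1", 400, 900),
    ("chr1", 1500, 2000),
    ("chr2", 50, 80),
]
loci = consolidate(se_calls)
print(loci)
```

Each consolidated locus then serves as a shared window in which per-sample signal can be compared, which is what allows intermediate activity levels to be placed on a common scale.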

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/data10050074/s1. Figure S1: Overall data-processing workflow for non-binary super-enhancer identification and clustering; Set 1. Mammary Gland Tissue Samples.zip: data subsets for mammary gland tissue samples, including normal and tumor conditions under different treatments; Set 2. Lung Tissue Samples.zip: data subsets for lung tissue samples, covering postnatal lung, lung tumors, and lung adenocarcinoma subtypes; Set 3. NMuMG Cell Line.zip: data subsets for the NMuMG cell line, including different TGF-β treatment time points.

Author Contributions

Conceptualization, E.D.O., G.A.A., A.V.O. and N.N.O.; Methodology, E.D.O., G.A.A., A.V.O., Z.G.Z., P.I.N., V.V.V. and N.N.O.; Formal Analysis, E.D.O., G.A.A., A.V.O., Z.G.Z., P.I.N., V.V.V. and N.N.O.; Investigation, E.D.O., G.A.A., A.V.O., Z.G.Z., P.I.N., V.V.V. and N.N.O.; Project Administration, A.V.O. and N.N.O.; Resources, P.I.N. and N.N.O.; Visualization, E.D.O., G.A.A., A.V.O., Z.G.Z., P.I.N., V.V.V. and N.N.O.; Supervision, G.A.A., A.V.O. and N.N.O.; Data Curation, E.D.O., G.A.A., A.V.O., Z.G.Z., V.V.V. and N.N.O.; Writing—original draft, E.D.O., G.A.A., A.V.O., Z.G.Z., P.I.N. and N.N.O.; Writing—review and editing, E.D.O., G.A.A., A.V.O., Z.G.Z., P.I.N., V.V.V. and N.N.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Russian Science Foundation, grant number 22-74-10053, https://rscf.ru/en/project/22-74-10053/, accessed on 16 February 2025.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Clustering and signal distribution of super-enhancer loci in lung samples: two-dimensional embedding of consolidated SE loci colored by cluster membership (left); distribution of median-normalized log-transformed SE activity values for cluster 1 across four lung samples: Sample_11_0077 (Lung normal), Sample_12_0118 (Lung tumor), Sample_12_0449 (Lung adenocarcinoma cell, Nkx2-1-positive), and Sample_12_0450 (Lung adenocarcinoma cell, Nkx2-1-negative) (right).

Figure 2. Functional annotation of super-enhancer loci through gene ontology and pathway analysis: gene ontology enrichment analysis for biological processes within the protein–protein interaction network (a); enriched KEGG pathways associated with SE loci (b).

Table 1. Overview of selected samples from SEdb 2.0.

Sample Set	Sample Type	Tissue Type	Species	SEdb 2.0 ID	Experimental Condition
Set 1	Tissue	Mammary gland	Mouse	Sample_12_0136	Normal mammary gland (untreated)
Set 1	Tissue	Mammary gland	Mouse	Sample_12_0137	Mammary tumor (untreated)
Set 1	Tissue	Mammary gland	Mouse	Sample_12_0138	Mammary tumor (DMSO-treated)
Set 1	Tissue	Mammary gland	Mouse	Sample_12_0139	Mammary tumor (C646-treated)
Set 2	Tissue	Lung	Mouse	Sample_11_0077	Lung postnatal (day 0)
Set 2	Tissue	Lung	Mouse	Sample_12_0118	Lung tumor
Set 2	Tissue	Lung	Mouse	Sample_12_0449	Lung adenocarcinoma cell (Nkx2-1-positive)
Set 2	Tissue	Lung	Mouse	Sample_12_0450	Lung adenocarcinoma cell (Nkx2-1-negative)
Set 3	Cell line	NMuMG cells	Mouse	Sample_12_0763	Untreated
Set 3	Cell line	NMuMG cells	Mouse	Sample_12_0764	TGF-β-treated (4 h)
Set 3	Cell line	NMuMG cells	Mouse	Sample_12_0765	TGF-β-treated (24 h)

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
