The Idiopathic Pulmonary Fibrosis-Associated Single Nucleotide Polymorphism RS35705950 Is Transcribed in a MUC5B Promoter Associated Long Non-Coding RNA (AC061979.1)

LncRNAs are involved in regulatory processes in the human genome, including gene expression. The rs35705950 SNP, previously associated with IPF, overlaps with the recently annotated lncRNA AC061979.1, a 1712 nucleotide transcript located within the MUC5B promoter at chromosome 11p15.5. To document the expression pattern of the transcript, we processed 3.9 TBases of publicly available RNA-SEQ data across 27 independent studies involving lung airway epithelial cells. Epithelial lung cells showed expression of this putative pancRNA. The findings were independently validated in cell lines and primary cells. The rs35705950 is found within a conserved region (from fish to primates) within the expressed sequence indicating functional importance. These results implicate the rs35705950-containing AC061979.1 pancRNA as a novel component of the MUC5B expression control minicircuitry.


Introduction
Human DNA consists of protein-coding regions and non-coding regions. Proteincoding genomic regions are abundantly transcribed, evolutionarily conserved, mutationally sensitive sequences which impact cellular phenotype. These constitute approximately 1% of the human genome [1]. Non-coding regions of DNA, on the other hand, are more complex and can be divided into at least five structural types: (i) binding motifs for regulatory proteins, (ii) non-coding RNAs (ncRNAs), (iii) transposable elements, (iv) highly repetitive DNA-essential in gene regulation and chromosome maintenance, and (v) pseudogenes [2,3]. In 2003, the ENCODE (Encyclopedia of DNA Elements) project was launched to identify and classify functional elements in the human genome including non-coding transcripts. The project continues to grow with the results being made available on Ensembl and UCSC genome browser for both human and mouse [4].
Most of the ncRNAs predicted by ENCODE are expressed at low levels [4]. However, their abundance is not a proxy for their functionality [5]. For example, the predicted lncRNA ENST00000567151 or viability enhancing in lung Cancer transcript (VELUCT) was found at only 0.01 copies per cell. Despite its low copy number, VELUCT expression was reported to be upregulated by 5.2 fold in lung cancer cells, and its knockdown reduces the viability of multiple lung-cancer cell lines by as much as 90% [6].

Rediscovery of the Non-Coding Transcript AC061979.1 in the Promoter Region of the MUC5B Gene
To identify a putative non-coding transcript in a "dark" intergenic region on the pterminus of chromosome 11 on the human genome in the context of lung epithelia, we manually collected and interrogated a total of 3.9 TBases of publicly available RNA-SEQ data involving epithelial lung cells (see Table S1) to generate a summative transcriptional signal of the AC061979.1 locus (see Figure 1A). Concomitantly, the GENCODE [45] release 32 (GRCh38) described the putative transcript AC061979.1 (chromosome 11:1,218,530-1,220,242) mapping to the same region (see Figure 2). Interestingly, this novel lncRNA is reported to be subject to splicing, with rs35705950 mapping to the second nucleotide of exon 2 and, therefore, possibly altering AC061979.1 splicing; however, no evidence of splicing was immediately apparent through our analysis (see Figure 1) and no rs35705950-containing RNA-SEQ data from lung epithelia were found among the surveyed studies, with the exception of three donor samples across two independent studies (SRP096589/GSE93526 and SRP102483/GSE97036) where the expression of AC061979.1 was documented (see Table S1).
Manual inspection of each dataset indicated that almost all samples representing lung epithelial cell lines or primary lung epithelial cells showed evidence of expression in the locus. Of particular interest, however, was a dataset (SRP082973) that isolated only basal cells from the epithelium. Thus, within the same study, we compared basal cell and the epithelium (mix of basal, ciliated, columnar and secretory cells) RNA expression [46]. Interestingly, whilst reads from epithelial cell extracts mapped liberally to the AC061979.1 locus, sporadic alignments of only a couple of reads were detected among basal-cell RNA ( Figure 1). Given that basal cells are a sub-type of human-airway epithelial cells not involved in mucous production, which act as stem cells for the other sub-types (ciliated, columnar and secretory cells) [47], these results potentially show the activation of ncRNA AC061979.1 after cell differentiation. Analysis of data from IPF studies (see Figure 1C) involving lung tissue (GSE52463 or SRP033095), primary cells (GSE116086 or SRP151008), and at single-cell level (GSE124685 or SRP175341) demonstrated only limited coverage of the locus in line with the low expression level indicated elsewhere, and confirmed the absence of AC061979.1 expression in fibroblasts. Taken together, these results suggest that expression at AC061979.1 is detectable in lung epithelia irrespective of the biological origin of the data or the precise sequencing protocol used, minimising the risk of batchassociated effects.
To determine the evolutionary importance of the DNA sequence harbouring the rs35705950 polymorphism, the human reference genome (GRCh38.p13) was aligned against nine vertebrate genome seuqences: six mammals (Rhesus monkey, baboon, marmoset, pig, sheep, rat, mouse), one fish (zebrafish) and one bird (chicken). High similarity was observed in exonic regions across primates, with phylogenetically distant mammals showing conservation only at the 5 end of the second exon (see Figure 3A). This region appears to harbour at least three conserved loci, including a FOXA2 binding motif and four SMAD binding motifs, two of which reside in the putative intron (see Figure S1) and are found across mammalian species (see Figure 3B).
Although AC061979.1 transcript abundance in other species is limited by the lack of RNA-SEQ datasets to interrogate, taken together these results indicate a functional significance for this pancRNA, with the G/T rs35705950 SNP possibly being involved in differential splicing of AC061979.1.

MUC5B pancRNA Expression Validation
To independently validate the expression of AC061979.1, we first designed probe hydrolysis RT-qPCR assays for the putative spliced variant and holotranscript. These assays exhibited amplification efficiencies of 128.8% and 90.6% when tested against serial dilutions of a spliced AC061979.1 geneblock or A549 cell DNA extracts, respectively. Next, we obtained RNA extracts from adenocarcinoma human alveolar type-II epithelial cells (A549 cells) and cystic-fibrosis bronchial epithelial cells (CFBE41o-) representing alveolar and bronchial epithelial cells, respectively. To account for the potential impact of contact inhibition effects on AC061979.1 expression, total RNA was extracted at low (<30%) and high (>70%) confluence, and expression of the two AC061979.1 variants was assessed against 18S rRNA and MUC5B, across serial dilutions of total RNA. These analyses indicated that only the AC061979.1 full transcript was detectable, albeit at a very low copy number (see Figure 4). Thus, at 50 ng of RNA input per RT-PCR reaction, in A549 cells the paRNA ∆Ct to 18S was 22.44 (±5.94) at high confluence vs 24.12 (±3.16) at low confluence, whereas in CFBE41o-the ∆Ct was 21.92 (±5.53) at high confluence vs 28.59 (±0.81) at low confluence (n = 3). Of note is that where RNA extraction resulted in higher Ct values for 18S, the capacity to detect the pancRNA transcript was lost as concentrations dropped below the assay limit of detection, justifying the very high load of RNA template in the RT-PCR reactions. The same assays were perfomed with undifferentiated (basal) and ALI differentiated HAEpCs. The basal cells were cultivated to a confluence of 40-50% before extracting total RNA. Of the two AC061979.1 transcripts, only the full variant was detectable (36.19 ± 0.29), whereas the spliced variant was not detectable. Interestingly, MUC5B levels were below the assay detection limit; 18S was detected at a Ct of 10.63 (±0.14). ALI-differentiated HAEpCs in control conditions showed similar values to the basal cells (36.84 ± 1.38), whereas in IL-13-stimulated cells both AC061979.1 variants were below the detection limit. In differentiated epithelia, Cts for 18S were low for both controls and IL-13-treated cells (7.97 ± 0.06 and 8.53 ± 0.08, respectively). Similarly, to the basal cells, MUC5B was not detectable. Efforts to define the 5 and 3 ends of the transcript by RLM-RACE failed to produce sequencing-grade amplicons, probably due to the low expression level of the transcript.
To ascertain the relevance of the FOXA2 binding motif in exon 2 of AC061979.1, we examined FOXA2 expression levels in A549 cells and HAEpCs. This analysis indicated no statistically significant difference between A549 cells at different confluence levels (p = 0.1000), but a 10.6 ± 2.77-fold reduction in FOXA2 levels after IL-13 stimulation (p < 0.0001) consistent with the loss of AC061979.1 (see Figure S2).

Discussion
MUC5B dysregulation presently appears to be mechanistically involved in the development of the underlying pathology, particularly in the context of the IPF-associated SNP rs35705950. It contributes to mucus overproduction and expression in the alveolar microenvironment, leading to micro-injuries to alveolar epithelium and, across the lifetime of a carrier, excessive cell death and fibrosis [48]. Whilst in one study the polymorphism was found in 51% of the patients with IPF, but in only 23% of the control group [18], it is unclear at present if the onset of disease among rs35705950 positive controls is a matter of time, lifestyle, or additional genetic variability. However, the strong association and high incidence rate of the polymorphism in IPF make a compelling case for lifestyle management and preventative chronic or genome modifying treatments targeting MUC5B expression repression and IPF, such as small interfering RNA, antisense or genome/prime editing. The four putative SMAD binding sites within the AC061979.1 locus, four of which reside in the proposed intron, suggest complex interplay between SMAD as an inducer and FOXA2 as a repressor of MUC5B.
Helling et al. (2017) reported a binding motif for FOXA2, located 32 bp downstream of rs35705950, which overlaps with the second putative exon of the pancRNA AC061979.1, as reported in GENCODE v32 (see Figure 2). The protein-coding gene for FOXA2 originates on chromosome 20, p11.21, between the lncRNA LINC00261 and LNCNEF. Whilst LINC00261 (a.k.a. DEANR1 [49], FALCOR [50], and LCAL62 [51]) is widely studied for its role in non-small cell lung cancer, no study exists on LNCNEF to date. LINC00261 is an endoderm-associated lncRNA which recruits SMAD2/3 to induce the expression of FOXA2 [49,51]. FOXA2 transcription factor is known to have a role in lung development and homeostasis [52], MUC5B expression and IPF [53]; however, research in this area is limited.
Our RNA-SEQ-verse survey did not return an explicit splicing signal in line with RNA-SEQ observations associated with high copy number RNAs (see Figure 5); However, if the proposed splicing event is confirmed, the location of the SNP raises the possibility that aberrant AC061979.1 splicing might be occurring in the context of the rs35705950 SNP. In turn, improper AC061979.1 splicing could be driving aberrant biochemistry on the locus such as the FOXA2 association demonstrated by Helling et al. (2017). Such a finding would introduce the additional option of a splice-correcting treatment in preventing the onset of IPF among rs35705950 carriers. Importantly, this oligonucleotide therapeutic modality has been approved for clinical use in Duchenne's muscular dystrophies (eteplirsen) and spinal muscular atrophy (nusinersen) without the need for drug-delivery solutions that otherwise plague efficacious oligonucleotide therapies for the lung [54]. LncRNAs can form complex biological systems by binding to other RNA molecules, regulatory proteins, or DNA. FENDRR is an lncRNA expressed in the nascent lateral mesoderm, in the promoter of Forkhead Box F1 (FOXF1), where it forms a triple helix with double-stranded DNA and increases the occupancy of the Polycomb repressive complex 2 (PRC2) at this site. Rescue experiments on FENDRR-knockdown cells wherein a construct expressing the lncRNA was placed randomly in the genome showed its biological role and that the transcript acts in trans [55]. Similarly, LINC00261-null cells were rescued by viruses expressing FOXA2, in the transcriptional activation of FOXA2, which is upstream of LINC00261 [49]. It is, thus, possible that the mechanism behind MUC5B regulation involves an assembly between the pancRNA AC061979. showed that phosphorylated SMAD levels are low in mucosecretory cells, and the inflammatory TGF-beta-dependendent SMAD signaling inhibition enhanced mucin expression, as well as goblet-cell metaplasia and hyperplasia, supporting a role for SMAD proteins in MUC5B expression regulation [56]. Interactions with SMAD2/3 in the promoter of MUC5B and AC061979.1 are indeed possible due to the presence of the canonical SMAD binding element (SBE) CAGAC within the intronic region of the pancRNA, and the newly described GGC(GC)(CG) motif also known as 5GC SBE [57] within the first exon of AC061979.1 [56]. Moreover, SMAD2/3 does not necessarily need to occupy either of these SBEs on chromatin, because SMAD2/3 does not occupy the SBEs located within the LINC00261 gene [49]: instead, it interacts with LINC00261 directly at least under some experimental conditions [50]. LINC00261 is, therefore, an example of cis-acting ncRNA, whereas other lncRNAs such as EMT-associated lncRNA induced by TGFbeta1 (ELIT-1) act in trans to bind SMAD to SMAD binding elements (SBEs) such as the CAGAC box [58]. Disruption of this proposed SMAD2/3, AC061979.1, and possibly the FOXA2 ribonucleoprotein and chromatin interaction network at the MUC5B promoter by rs35705950, for example, due to aberrant splicing, could explain MUC5B overexpression in IPF, given the pivotal role of SMAD proteins in resolving goblet-cell metaplasia and hyperplasia in inflammatory pulmonary disease [56].
To date, the FOXA2 binding site 32 bp downstream of rs35705950 has been shown to bind FOXA2 in episomal reporter systems but not by genome editing or CHIP-SEQ [18]. Our own genome-editing efforts with three separate single-guide RNAs to introduce the rs35705950 G/T transversion at Chr11:1,219,991 in A549 cells in support of CHIP-SEQ, RIP-SEQ, and proteomic experiments to resolve the MUC5B transcriptional complex, have so far proven to irreparably affect cell viability or fail in generating any detectable editing either by T7-EI or sequencing assays. Furthermore, no verified G/T or T/T lung epithelial cell line is currently available to support such mechanistic studies. As lncRNA-protein interaction is a hot research topic, recent studies have focused on developing computational methods for predicting these complex networks [59][60][61][62]. It is, thus, anticipated that with increasing understanding of lncRNA biology and characterisation of lncRNA structures and families, additional insights into AC061979.1 function might be obtained.
In this study, we developed a simple-to-use method for the targeted mining of the RNA-SEQ dataverse for lncRNA transcripts irrespective of their polyadenylation status. Our method is achievable on a public server in Galaxy (galaxyproject.org) with an extensive easyto-follow guide available (see Scheme S1). It takes as input Sequence Read Archive (SRA) codes and the output is a .TXT file reporting the depth of coverage per position making end-user memory requirements compatible with standard desktop/laptop computers or even smartphones. However, it can be adapted to run on a cluster without a graphical user interface (GUI). Using this method, we have been able to amass evidence through the analysis of 3.9 TBase of RNA-SEQ data across 27 publications documenting the expression of a novel pancRNA overlapping the IPF-associated rs35705950 SNP implicated in MUC5B overexpression, annotated as AC061979.1 by GENCODE. The results were replicated by qRT-PCR in A549 cells and CFBE41o submerged cultures as well as in pHAECs.

RNA-SEQ Data Processing for Novel ncRNA Detection
To determine the existence of a MUC5B pancRNA, we manually collected publicly deposited RNA-SEQ data from 27 independent studies involving alveolar and bronchial samples from primary human tissue and in vitro experiments (see Table S1). RNA-SEQ reads above Q20 were mapped to the human reference genome GRCh38.p13 using HISAT2 [63]. Mapped reads were filtered with samtools view [64] and only read pairs mapping to chromosome 11, region 1,202,000-1,220,500, were kept. Subsequently, the depth of coverage per base was extracted from all datasets and collapsed. The results were visualised in R Studio (ggplot2). The pipeline can be performed in Galaxy (galaxyproject.org). An extensive step-by-step guide is available as a supplementary file (Scheme S1).

Multiple Sequence Alignment
To demonstrate the evolutionary importance of the region overlapping the promoter polymorphism rs35705950, we compared the human ncRNA with nucleotide sequences of 10 other species from fish to primates. Rhesus monkey (Macaca mulatta), baboon (Papio anubis), white-tufted-ear marmoset Callithrix jacchus, pig (Sus scrofa), sheep (Ovis aries), Norvegian rat (Rattus norvegicus), house mouse (Mus musculus), chicken (Gallus gallus) and zebrafish (Danio rerio). Alignments of genome sequences were undertaken using AVID and Shuffle-LAGAN programs implemented through mVISTA (http://genome.lbl. gov/vista/mvista/submit.shtml, accessed on 10 January 2022) [65] with a match criterion of 70% identity over 50bp [66]. All sequences used in analysis are included in Table S2. Subsequently, we aligned the same genomic sequences with ClustalOmega for nucleotideby-nucleotide approach.
Primary human airway epithelial primary cells (pHAECs) from several donors (n = 1 basal cells, n = 4 ALI differentiated cells) were isolated from fresh tissues that were obtained during tumor resections or lung transplantation with the full consent of patients (ethics approval: ethics committee Medical School Hannover, project no. 2701-2015).
pHAECs basal cell for air liquid interface (passage 2) were expanded as above in T75 flasks till 90% confluency. The cells were than trypsinized and seeded into Transwell filters (6.5 mm diameter, 4 µm pore size, Corning Costar, Kaiserslauten, Germany). Filters, prior to cell seeding, were coated with 100 µL collagen solution (StemCell Technologies, Saint Égrève, France), and left to dry under sterile hood overnight. Subsequently, the filters were exposed to UV light for 30 min and stored at 4 • C.
Cells were resuspended in growth medium, and 200 µL containing 4 × 10 4 cells were added apically to each filter, an additional 600 µL of the medium were added basolaterally. The medium was replaced every 48 h until 100% confluence was reached. Growth medium was then removed from apical side and on the basolateral side it was replaced with ALI differentiation medium ±10 ng/ml IL-13 (IL012; Merck Millipore). Once the ALI interface was established, medium was exchanged every second day till day 25-28 on ALI. At the endpoint of cultivation, RNA extraction was performed directly on the filter.

RNA Extraction
RNA extraction was carried out using the miRNeasy mini kit (Qiagen, Manchester, UK). Briefly, cells were detached by trypsinisation then resuspended in 0.7 mL Qiazol Lysis reagent with subsequent steps according to the supplier's total RNA extraction protocol.
For pHAECs, cells were detached by trypsinisation then resuspended in 2.1 mL Lysis Solution RL from my-Budget RNA Mini Kit (BioBudget, Krefeld, Germany), RNA isolation was performed following the manufacturer protocol. If not used immediately after lysis, the samples were stored at −80 • C. For pHAECs ALI cultures, 100 µL of Lysis Solution RL from the same kit was added to the filters apically and the samples were immediately frozen at −80 • C.

DNase Treatment and cDNA Synthesis
Total RNA was DNase treated using the Precision TM DNase kit Primer Design (Southampton, UK), following the manufacturer's protocol. cDNA synthesis was carried out using the High Capacity cDNA reverse Transcription Kit (Thermo Fisher scientific (ThermoFisher Scientific) following the manufacturers protocol. A total of 1000 ng of total RNA was loaded into each 20 µL cDNA synthesis reaction.
For pHAEC cells, the cDNA synthesis from the extracted was performed using Super-Script VILO cDNA Synthesis Kit (Thermo Fisher) following the manufacturer protocol. A total of 400 ng of RNA were used for each reaction.

Real-Time Quantitative PCR
Custom primers and probes (Table S3) were designed using the PrimerQuestTM tool (Integrated DNA Technologies BVBA, Leuven, Belgium) and validated against an AC061979.1 geneblock (Integrated DNA Technologies) corresponding to the predicted spliced transcript. Inventoried predesigned assays for 18S and MUC5B were purchased from Thermo Fisher Scientific, and for FOXA2 from Qiagen. Real-time quantitative PCR was performed in 10 µL reactions containing 5L TaqMan Fast Advanced Master Mix (2×) (ThermoFisher Scientific), 900 nM forward primer, 900 nM reverse primer and 250nM probe per reaction and 1µL template on a StepOnePlusTM real-time PCR system (Thermo Fisher Scientific). After a UNG incubation at 50 • C for 2 min, initial denaturation at 95 • C for 2 min was followed by 40 cycles of 95 • C denaturation for 1 s and 60 • C anneal extension for 20 s. Gene expression was calculated according to the delta Ct method [67]. Statistical analyses on gene expression were performed on data expressed as a fold difference to high confluence A549 samples, and control samples in a paired sample fashion for HAEpCs, respectively. GraphPad Prism v.9.4.1 (GraphPad Software. LLC, San Diego, CA, USA) was used for Kolmogorov-Smirnov tests for cell-line results and paired t tests for HAEpC results. For RNA ligase-mediated 5 and 3 rapid amplification of cDNA ends (RLM-RACE), the FirstChoice RLM-RACE kit was used according to the manufacturer's instructions (Thermo Fisher) using pancRNA gene-specific primers for RT-PCR.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/ncrna8060083/s1, Table S1: List of the studies used in the RNA-SEQ meta-analysis; Table S2: Overview of the intergenic region between MUC5AC and MUC5B and the rs35705950-MUC5B transcription-start-site (TSS) distance across four species; Table S3: Primers and probes used in the study; Scheme S1: Extensive workflow of the RNA-SEQ meta-analysis; Figure S1: AC061979.1-DNA Sequence. Uper case letters-exons; lower case letters-intron; green highlight-SMAD2/3 binding motif; pink highlight-FOXA2 binding motif; Figure S2: FOXA2 expression levels in A549 and HAEpC's. The levels of FOXA2 mRNA were determined in A549 cells at low and high con uence levels and HAEpC's with and without IL-13 stimulation. Expression, normalised to 18S rRNA levels, was calculated relative to high con uence A549s or paired unstimulated HAEpC's. ****: p < 0.0001.  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
Publicly available datasets were analyzed in this study. The SRA/GSM accession numbers for each study can be found in Supplementary Table S2.

Conflicts of Interest:
The authors declare no conflict of interest.