LncRNA-SNPs in a Brazilian Breast Cancer Cohort: A Case-Control Study

Long noncoding RNAs (lncRNAs) are a class of non-coding RNAs that contain more than 200 nucleotides and exhibit a versatile regulatory capacity. Genomic alterations in lncRNAs have already been investigated in several complex diseases, including breast cancer (BC). BC is a highly heterogeneous disease and is the most prevalent cancer type among women worldwide. Single nucleotide polymorphisms (SNPs) in lncRNA regions appear to have an important role in BC susceptibility; however, little is known about lncRNA-SNPs in the Brazilian population. This study used Brazilian tumor samples to identify lncRNA-SNPs with a biological role in BC development. We applied a bioinformatic approach intersecting lncRNAs that are differentially expressed in BC tumor samples using The Cancer Genome Atlas (TCGA) cohort data and looked for lncRNAs with SNPs associated with BC in the Genome Wide Association Studies (GWAS) catalog. We highlight four lncRNA-SNPs—rs3803662, rs4415084, rs4784227, and rs7716600—which were genotyped in Brazilian BC samples in a case-control study. The SNPs rs4415084 and rs7716600 were associated with BC development at higher risk. These SNPs were also associated with progesterone status and lymph node status, respectively. The rs3803662/rs4784227 haplotype GT was associated with BC risk. These genomic alterations were also evaluated in light of the lncRNA’s secondary structure and gain/loss of miRNA binding sites to better understand its biological functions. We emphasize that our bioinformatics approach could find lncRNA-SNPs with a potential biological role in BC development and that lncRNA-SNPs should be more deeply investigated in a highly heterogeneous disease population.


Introduction
Breast cancer (BC) is the most common cancer among women worldwide, representing almost 26% of the total number of new cases of cancer diagnosed in 2020 [1]. Several risk factors have been associated with disease development, including environmental factors such as a sedentary lifestyle and alcohol consumption and intrinsic factors related to genetics [2].
Genetic susceptibility to BC can be related to germline mutations with high penetrance, such as those of the BRCA1 and BRCA2 genes, and due to the presence of single nucleotide polymorphisms (SNPs) in coding and non-coding regions [3].
Incorporating SNPs into risk prediction models, in combination with classical risk factors, can contribute to improving risk-based BC screening. Population-based screening programs aim to detect the disease at an early stage since effective treatment at this stage can lead to improved disease outcomes and lower mortality rates [4].
To identify genomic regions associated with BC, genome-wide association studies (GWAS) have already identified approximately 170 loci associated with BC risk, with the vast majority of GWAS identifying SNPs located in noncoding regions [5]. Among them, several long noncoding RNA (lncRNA) SNPs have been associated with BC, although this area remains relatively unexplored. lncRNAs are a class of noncoding RNAs that have more than 200 nucleotides with a versatile regulatory ability [6]. lncRNAs have been described as participating in breast cancer hallmarks, such as inducing proliferation, repressing apoptosis, and contributing to subtype differentiation [7,8].
In order to deepen the understanding of lncRNA-SNPs associated with BC risk, we conducted a GWAS data mining analysis to select some potential SNP candidates and then evaluated the selected lncRNA-SNPs in a Brazilian BC cohort.

GWAS Data Mining and lncRNAs Selection
As the first step, we performed a differential expression analysis using RNA-Seq HTSeq-FPKM from the TCGA Breast Cohort using as cut-offs fold change |>1.5| and p < 0.001, and then identified the ones that were classified as lncRNA. This analysis was made by comparing the tumor sample and its normal counterpart, and data were downloaded from XenaBrowser (https://xenabrowser.net/, accessed on 11 November 2022).
After the selection of differentially expressed (DE) lncRNA, we retrieved data from the GWAS Catalog (https://www.ebi.ac.uk/gwas/, accessed on 11 November 2022) using "breast carcinoma" as the keyword and downloaded the SNPs associated with each report [5]. Based on genomic coordinates, we mapped regions that contained lncRNAs. We selected the lncRNAs that both contain an associated SNP and were differentially expressed in breast tumors relative to normal tissue. After this filtering based on differentially expressed lncRNAs, we selected lncRNA-SNPs that exhibited minor allele frequencies higher than 0.20 in order to exclude rare alleles.

Study Cohort
The case-control analyses were performed using tumor samples of 291 patients with sporadic breast cancer (including all immunohistochemical subtypes) from the Hospital Nossa Senhora das Graças (HNSG), located in Curitiba, Paraná state, in southern Brazil. As a control group, we used peripheral blood samples of 370 women who were recruited on a voluntary basis and who declared they had no personal or family history of cancer with a mean age of 47.66 ± 4.69 years.
Patients and control samples were from the same region in the south of Brazil, most living in the metropolitan region of Curitiba. In the control group, ancestry information was obtained from self-reported patients' records, with 84.7% white, 10.7% black or mixed-race, and 1.9% others. A subset of patients (n = 15) was genotyped using an SNP chip Illumina Infinium QC Array (Illumina Inc., San Diego, CA, USA), which contains 15,949 markers, including~3000 ancestry informative markers (AIMs). Based on these ancestry data [9], the patients of this study clustered with the European-defined group from the 1000 Genome Project and with the Admixed Americans main group, mainly composed of Colombians and Mexicans, highlighting the marked ancestral heterogeneity of the Brazilian population [9].
The samples were collected with the approval of the Human Research Ethics Committee of the Health Sciences Sector of UFPR (CAAE: 67029617.4.0000.0102). All samplings and experiments followed relevant guidelines, Brazilian regulations, and ethical principles for human research in the Declaration of Helsinki. The project was described to all participants, and a written informed consent and epidemiological questionnaire were obtained from all participants enrolled in the study. The clinical characteristics of the patients an-alyzed in this study are available in Supplementary Table S2. In addition, we organized  the data regarding environmental/genetic issues obtained from the analyzed patients in  Supplementary Table S3. 2.3. Sample Genotyping DNA extraction was performed by the phenol-chloroform method in tissue samples. The peripheral blood DNA from women with no cancer was extracted by the salting-out method [10].

Statistical Analysis
For the genotypic frequency tests of the control and patient groups, we used the test of deviations in the proportions of the Hardy-Weinberg theorem by Chi-square. Additionally, we used the odds ratio (OR) calculation and the Chi-square test to assess whether the variables (breast cancer and SNPs) were independent. We also performed a multivariate analysis considering BC prognostic markers, such as age, estrogen/progesterone status, tumor grade, and presence/absence of lymph node metastasis.
Statistical analyses were performed with R software with the Nortest and readxl packages [11,12]. For all tests described above, p-values < 0.05 were considered significant.

lncRNA-SNP Selection
In order to highlight lncRNAs that are DE and have an SNP associated with BC risk, we performed a two-step analysis. As the first step, we performed a differential expression analysis using RNA-Seq data from the TCGA breast cohort, comparing the sequences from cancer patients and the control group. Among all the lncRNAs, we found 1334 lncRNAs that were DE based on the filter criteria of fold change |>1.5| and p < 0.001 (Supplementary Figure S1).
Based on this, we compared DE lncRNAs and results from previously published GWAS [5] which evaluated 137,045 BC samples and 119,079 controls and found 38,134 BC risk SNPs. As a filter criterion, we also applied the selection of SNPs with minor allele frequency (MAF) > 0.20 to avoid rare alleles. This analysis resulted in 23 lncRNA-SNPs (Supplementary Table S1).
Since BC prognosis and treatment vary substantially across different BC subtypes, we carried out a final filter in which, from the 23 lncRNA-SNPs, we searched for ones that were DE among different BC subtypes. To accomplish this, we compared patients using TCGA RNA-Seq data based on molecular classifications, namely: luminal A, luminal B, HER2enriched, and basal-like. As a result, we highlighted 14 lncRNA-SNPs that exhibited the

Selected lncRNA-SNPs
Based on the described methodology, among the 14 lncRNA-SNPs, we selected four lncRNA-SNPs with the highest OR values to conduct a case-control study in a Brazilian cohort (Table 1). It is important to highlight that none of the lncRNA-SNPs mentioned here have been previously evaluated in a Brazilian cohort.
The lncRNA-SNP rs7716600 exhibited the highest OR value (1.24), located at the lncRNA named AC093297.2. Based on data in the literature, this SNP had already been associated with BC in different studies [13][14][15]. In our analysis, this lncRNA was upregulated in tumor samples compared to the control group. It is also downregulated in estrogen-positive subtypes and upregulated in HER2-positive subtypes.
Another selected lncRNA, LINC02224, was upregulated in tumor samples compared to the control group and is downregulated in estrogen-positive subtypes. The SNP rs4415084, significantly associated with a GWAS cohort, has already been associated with BC risk in different populations [16,17] but has never been studied in a Brazilian cohort.
Two SNPs (rs4784227/rs3803662) located in the lncRNA cancer susceptibility 16 (CASC16) have been investigated in BC in different populations [18][19][20][21], indicating their possible association with BC risk. Based on our expression analysis, CASC16 is upregulated in tumor samples (logFC = 1.92) and is specifically upregulated in estrogen-positive subtypes. The SNPs exhibit linkage disequilibrium in CEU (Utah Residents from North and West Europe) (D' = 0.97 and R 2 = 0.91), making them interesting candidates for exploration in haplotypes.

Hardy-Weinberg Equilibrium and Allele Frequency Comparison
Both case and control samples were in Hardy-Weinberg Equilibrium (HWE) (p-value < 0.05) for all the evaluated lncRNA-SNPs in our Brazilian cohort.
Gathering the allele frequency data together, we compared the observed frequency for the minor allele for the four lncRNA-SNPs of our cohort relative to the control group and the estimates of the 1000 Genomes Database for European and African populations. We organized these data in Table 2 below.

lncRNA-SNPs and BC Susceptibility
In order to evaluate the impact of the lncRNA-SNPs and BC susceptibility, we performed a case-control association study based on odds ratio (OR) calculation (Tables 3-6). Considering rs4415084, we found a significant association value considering the dominant model (Table 4), and for rs7716600 a significant case-control association was observed in the recessive model ( Table 6). Both of the lncRNA-SNPs-associated genotypes were related to high BC risk.  For rs3803662 (Table 3) and rs4784227 (Table 5), when analyzed individually, we did not find significant results testing the genotypes for codominant, dominant, recessive, and over-dominant models. It is important to highlight that we performed haplotype analysis (Table 7) for rs3803662/rs4784227 since these polymorphisms exhibit a linkage disequilibrium (LD), with rs3803662 considered as the Tag SNP. Using the LDlink database from NCBI (https://ldlink.nci.nih.gov/, accessed on 12 November 2022), the R 2 estimated for these two lncRNA-SNPs is 0.91. Regarding this analysis, we found that GT haplotype, which is the rare one, exhibited a significant association p-value.

lncRNA-SNPs and BC Clinical Parameters
After evaluating the association of lncRNA-SNPs in a case-control study, we divided the BC case samples according to some important clinical parameters in prognosis and treatment: age at diagnosis, estrogen/progesterone receptors status, tumor grade, and lymph node metastasis.
To evaluate age, we classified the samples into two groups: (1) higher than the mean age of diagnosis (57.90 ± 14.51) and (2) lower than the mean age of diagnosis. Estrogen, progesterone, and lymph node metastasis were stratified into two groups (1) positive and (2) negative, and tumor grade was divided into (1) grade I, (2) grade II, and (3) grade III. We found a significant association between rs4415084 and progesterone receptor status (Table 8). Given the recessive and over-dominant models of the genotypes, the presence of the C allele (CC/CT) is associated with the positivity of the progesterone receptor. This could be considered a good prognosis marker since therapy for hormone-positive BC cases is available. We also found a clinical correlation between rs7716600 and lymph node status (Table 9). In the dominant model, the presence of allele A is related to the absence of lymph node infiltration. This is also a good prognosis marker.

lncRNA TCGA Expression Data
We evaluated the expression data of the selected lncRNAs LINC02224, CASC16, and AC093297.2 using TCGA cohort data presented in the TANRIC database (https://ibl. mdanderson.org/tanric/_design/basic/main.html, accessed on 12 November 2022). We focused our analysis on the same clinical parameters evaluated in the association study. LINC02224 is downregulated in the progesterone-negative subtype (Figure 1), and based on our results, the presence of the C allele (CC/CT genotypes) is associated with progesterone status positivity.

lncRNA TCGA Expression Data
We evaluated the expression data of the selected lncRNAs LINC02224, CASC16, and AC093297.2 using TCGA cohort data presented in the TANRIC database (https://ibl.mdanderson.org/tanric/_design/basic/main.html, accessed on 12 November 2022). We focused our analysis on the same clinical parameters evaluated in the association study.
LINC02224 is downregulated in the progesterone-negative subtype (Figure 1), and based on our results, the presence of the C allele (CC/CT genotypes) is associated with progesterone status positivity.

lncRNA-SNPs Regulatory Effects
Using lncRNASNP2, we looked for potential miRNA site gain/loss in the analyzed lncRNA-SNPs. This information is important considering the regulatory roles of miRNAs and lncRNAs in ceRNA networks. We only found data for rs3803662 in the database, which exhibited a miRNA site gain for hsa-miR-196a-3p and miRNA site loss for hsa-miR-4524a-3p. Based on expression data of the TCGA BRCA cohort, miR-196a-3p is upregulated in tumor samples compared to the control group; moreover, miR-4524a-3p is shown to be upregulated in tumor samples with high grades compared to low-grade ones.
Another important aspect considering lncRNA function is based on its secondary structure. This can directly impact target interaction and can result in the gain/loss of target sites. With this in mind, using the RNAfold prediction database and the lncRNA-SNPs sequences, we investigated whether these genomic alterations could lead to structural variations. For this analysis, we used the parameter minimum free energy (MFE), and the result can be observed in Figure 2 below.

lncRNA-SNPs Regulatory Effects
Using lncRNASNP2, we looked for potential miRNA site gain/loss in the analyzed lncRNA-SNPs. This information is important considering the regulatory roles of miRNAs and lncRNAs in ceRNA networks. We only found data for rs3803662 in the database, which exhibited a miRNA site gain for hsa-miR-196a-3p and miRNA site loss for hsa-miR-4524a-3p. Based on expression data of the TCGA BRCA cohort, miR-196a-3p is upregulated in tumor samples compared to the control group; moreover, miR-4524a-3p is shown to be upregulated in tumor samples with high grades compared to low-grade ones.
Another important aspect considering lncRNA function is based on its secondary structure. This can directly impact target interaction and can result in the gain/loss of target sites. With this in mind, using the RNAfold prediction database and the lncRNA-SNPs sequences, we investigated whether these genomic alterations could lead to structural variations. For this analysis, we used the parameter minimum free energy (MFE), and the result can be observed in Figure 2 below. This information can be relevant considering the RNA partners these lncRNAs might have in a cellular context. For example, the lncRNA CASC16 is predicted to have several RNA Binding Protein (RBP) domains for TATA-Box Binding Protein Associated Factor 15 (TAF15), FUS, and UPF1. These proteins have essential regulatory roles and are also related to the BC context. Thus, disrupting the wild-type secondary structure can impact these protein bindings, deregulating some cellular processes. This information can be relevant considering the RNA partners these lncRNAs might have in a cellular context. For example, the lncRNA CASC16 is predicted to have several RNA Binding Protein (RBP) domains for TATA-Box Binding Protein Associated Factor 15 (TAF15), FUS, and UPF1. These proteins have essential regulatory roles and are also related to the BC context. Thus, disrupting the wild-type secondary structure can impact these protein bindings, deregulating some cellular processes.

Discussion
In recent years, lncRNAs have gained increased attention due to their versatile regulatory roles and use as biomarkers. These molecules have been implicated in several diseases, including BC [22]. In order to elucidate the role of lncRNA in the BC context, experiments are being conducted to evaluate lncRNA expression, epigenetic profiles, and genomic alterations, including SNPs.
A previous study from our group found a significant association of rs527616 (C > G), located in the lncRNA AQP4-AS1, with BC in a Brazilian population. We found that CG heterozygotes were above the expected, and the over-dominant model is the best one to explain our results (OR: 1.70, IC 95%: 1.23-2.34, p < 0.001). Furthermore, the SNP was associated with age at BC diagnosis, and the risk genotype was more frequent in the older age group [23]. Building on these results, we conducted a comprehensive integrative study to find lncRNA-SNPs associated with BC.
Taking together expression and genomic data, we selected lncRNA-SNPs that also exhibited significant differential expression between tumor and non-tumor tissue in the TCGA BRCA cohort and looked for ones that had been previously associated with BC susceptibility using the GWAS database. We chose the four lncRNA-SNPs with the highest OR values in the GWAS study for evaluation in a Brazilian BC population.
We observed that the lncRNA-SNPs rs4415084 and rs7716600 exhibited a significant

Discussion
In recent years, lncRNAs have gained increased attention due to their versatile regulatory roles and use as biomarkers. These molecules have been implicated in several diseases, including BC [22]. In order to elucidate the role of lncRNA in the BC context, experiments are being conducted to evaluate lncRNA expression, epigenetic profiles, and genomic alterations, including SNPs.
A previous study from our group found a significant association of rs527616 (C > G), located in the lncRNA AQP4-AS1, with BC in a Brazilian population. We found that CG heterozygotes were above the expected, and the over-dominant model is the best one to explain our results (OR: 1.70, IC 95%: 1.23-2.34, p < 0.001). Furthermore, the SNP was associated with age at BC diagnosis, and the risk genotype was more frequent in the older age group [23]. Building on these results, we conducted a comprehensive integrative study to find lncRNA-SNPs associated with BC.
Taking together expression and genomic data, we selected lncRNA-SNPs that also exhibited significant differential expression between tumor and non-tumor tissue in the TCGA BRCA cohort and looked for ones that had been previously associated with BC susceptibility using the GWAS database. We chose the four lncRNA-SNPs with the highest OR values in the GWAS study for evaluation in a Brazilian BC population.
We observed that the lncRNA-SNPs rs4415084 and rs7716600 exhibited a significant association p-value in the dominant and recessive models, respectively (Tables 4 and 6). Both are related to an increased risk of BC development in the analyzed population. The rs4415084 has already been highlighted as relevant in the development of BC in Caucasian Slavic [24], Chinese [25,26], and European populations [27]. Additionally, lncRNA-SNP rs4415084 was significantly associated with progesterone receptor status considering the recessive and over-dominant models (Table 8), the presence of the C allele (CC/CT) being associated with the positivity of the progesterone receptor. This may be a good prognosis marker since hormone-based therapy is available.
The rs7716600 is less explored in the literature. In our study, this lncRNA-SNP was associated with high BC risk (Table 6), also lymph node negative status (Table 9). Kim and colleagues [28] studied a Korean population and found that rs7716600 was significantly associated with breast cancer risk, and Quigley et al. [13] found an association with this lncRNA-SNP in estrogen-positive tumors.
The lncRNA-SNPs rs4784227/rs3803662 are in strong LD and were evaluated in haplotypes. We found a significant result for the haplotype GT in case-control study. These lncRNA-SNPs have already been investigated in the BC context. In a Chinese population, rs3803662 was associated with a protective role in breast cancer risk, while rs4784227 increased breast cancer susceptibility at age > 50 years [29].
Looking for lncRNA-SNPs rs4784227/rs3803662 and secondary structure changes, we found that rs4784227 alleles exhibited significantly distinct structures (Figure 1). This alteration can lead to the gain/loss of target interaction sites. CASC16, which contains both rs4784227 and rs3803662, seems to have several RBP sites. According to CLIP data, CASC16 interacts strongly with TAF15 [30], UPF1 [31], and FUS [32]. TAF15 plays a role in RNA polymerase II gene transcription as a component of a distinct subset of multisubunit transcription initiation factor TFIID complexes [33]. UPF1 is a protein that is part of a post-splicing multiprotein complex involved in both mRNA nuclear export and mRNA surveillance, and data evaluating UPF1 and other lncRNA interactions provide a fundamental basis for cell transformation and tumorigenic growth [34]. Finally, FUS is a multifunctional protein component of the heterogeneous nuclear ribonucleoprotein (hnRNP) complex, which is associated with triple-negative BC progression [35].
Looking deeper into the potential regulatory role of these alleles in the BC context, we found that rs3803662 exhibited a miRNA site gain for hsa-miR-196a-3p and miRNA site loss for hsa-miR-4524a-3p. miR-196a-3p is considered an estrogen-regulated miRNA and is a robust prognostic factor for patients with advanced and post-menopausal ER+ disease [36,37]. Molecular aspects of miR-4524a-3p have never been investigated in BC; however, based on expression data, this miRNA is upregulated in high-grade tumor samples compared to low-grade ones. Nevertheless, the case-control study failed to find a correlation in the Brazilian population.
Gathering all these data together, we can emphasize that the strategy of integrative analysis (differential expression and GWAS) in mapped regions of lncRNAs is a good strategy to suggest candidates for deeper analysis. These results suggest that lncRNA-SNPs have potential utility as prognostic markers of BC and must be investigated in greater detail, focusing on their molecular role in the cancer context.

Conclusions
SNP-type genomic changes in lncRNAs play an important role in disease development. In this work, based on an integrated bioinformatics methodology, we highlight four lncRNAs-SNPs with a potential role in breast carcinogenesis. We identified the high risk of developing the disease by analyzing the rs4415084 and rs7716600 in case-control study. These same SNPs also showed an association with important prognostic parameters of the disease. When analyzing the haplotypic region containing the SNPs of lncRNA CASC16, it was also possible to observe a significant association.
This study is the first to demonstrate the role of these lncRNA-SNPs in the Brazilian population, emphasizing the need to expand research in the area.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/genes14050971/s1, Figure S1: Heatmap expression of differently expressed lncRNAs in TCGA BRCA cohort data; Table S1: Information about the 23 lncRNAs-SNPs filtered in the study; Table S2: Clinical information about the samples analyzed in the study; Table S3: Social/genetic information of the patients analyzed in the study.