In the following section, we provide details on how to call functions and R commands of seq2R, which can be loaded into R using library(seq2R).
4.1. Human Mitogenome (Homo sapiens)
We can use the function read.genbank to retrieve the human mitochondrial sequence from the GenBank database. Below is an excerpt of the resulting character vector:
> library(seq2R)
> mtDNAhuman <- read.genbank("NC_012920")
> mtDNAhuman
[[1]]
[1] "g" "a" "t" "c" "a" "c" "a" "g" "g" "t" "c" "t" "a" "t" "c" "a"
[17] "c" "c" "c" "t" "a" "t" "t" "a" "a" "c" "c" "a" "c" "t" "c" "a"
[33] "c" "g" "g" "g" "a" "g" "c" "t" "c" "t" "c" "c" "a" "t" "g" "c"
[49] "a" "t" "t" "t" "g" "g" "t" "a" "t" "t" "t" "t" "c" "g" "t" "c"
…
[[2]]
[1] "NC_012920"
attr(,"species")
[1] "Homo_sapiens"
To begin the analysis, the sequence must be converted into binary variables, where 1 denotes the presence of a given nucleotide at position X in the sequence and 0 denotes its absence. This is done with the following command:
> DNA <- transform(mtDNAhuman)
> DNA
$AT
$AT[[1]]
X A T
[1,] 2 1 0
[2,] 3 0 1
[3,] 5 1 0
[4,] 7 1 0
…
$CG
$CG[[1]]
X C G
[1,] 1 0 1
[2,] 4 1 0
[3,] 6 1 0
[4,] 8 0 1
…
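The encoding shown above can be mimicked in a few lines of base R. The following is only a minimal sketch, assuming the layout printed in the output (one matrix per base pair, with the position X followed by two indicator columns); the helper name encode_pairs is hypothetical and is not part of seq2R.

```r
# Hypothetical helper (not a seq2R function): split a sequence into A/T and C/G
# indicator matrices, keeping the original position of each base in column X.
encode_pairs <- function(s) {
  s <- tolower(s)
  pos <- seq_along(s)
  at <- s %in% c("a", "t")   # positions holding an A or a T
  cg <- s %in% c("c", "g")   # positions holding a C or a G
  list(
    AT = cbind(X = pos[at],
               A = as.integer(s[at] == "a"),
               T = as.integer(s[at] == "t")),
    CG = cbind(X = pos[cg],
               C = as.integer(s[cg] == "c"),
               G = as.integer(s[cg] == "g"))
  )
}

# First eight bases of the human excerpt shown earlier
toy <- c("g", "a", "t", "c", "a", "c", "a", "g")
enc <- encode_pairs(toy)
enc$AT
enc$CG
```

For these eight bases, the rows of enc$AT and enc$CG coincide with the first four rows of each matrix in the transform output above.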
To carry out the compositional analyses, we use the following main function:
> seq1 <- find.points(DNA)
The results are printed on the screen using:
> seq1
Call:
find.points(x = DNA)
Number of A-T base pairs:9218
Number of C-G base pairs:7350
Number of binning nodes: 300
Number of bootstrap repeats: 100
Bandwidth: 0.5862069 0.1379310
Exists any critical point? TRUE
Note that at least one critical point exists, as indicated by the last line of this short summary. To show where the critical points are, the seq2R package also provides plots of the skews. First, the estimate of the CG skew and its first derivative are displayed in Figure 1 (upper and lower panel, respectively) with 95% bootstrap confidence intervals. These plots can be obtained with the following commands:
> plot(seq1, der = 0, base.pairs = "CG", CIcritical = TRUE,
ylim = c(0.08,0.67))
> plot(seq1, der = 1, base.pairs = "CG", CIcritical = TRUE,
ylim = c(-0.0005,0.00045))
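As a rough illustration of what a critical point means in these plots, the sketch below smooths a simulated binary indicator with a generic kernel smoother (base R ksmooth) and reports where the numerical first derivative changes sign. The simulated data, the bandwidth, and the helper name find_sign_changes are assumptions made for illustration; this is not the estimator or the bootstrap procedure implemented in seq2R.

```r
# Simulated stand-in for one indicator column (e.g., C versus G); not real data.
set.seed(1)
x <- sort(sample.int(16000, 8000))           # positions carrying the base pair
p <- 0.5 + 0.3 * sin(2 * pi * x / 16000)     # assumed underlying probability curve
y <- rbinom(length(x), 1, p)                 # binary response

# Generic Nadaraya-Watson smoother evaluated on a regular grid
grid <- seq(500, 15500, by = 50)
fit <- ksmooth(x, y, kernel = "normal", bandwidth = 3000, x.points = grid)

# Hypothetical helper: grid positions where the numerical derivative changes sign
find_sign_changes <- function(gx, gy) {
  d1 <- diff(gy) / diff(gx)
  gx[which(diff(sign(d1)) != 0) + 1]
}

crit <- find_sign_changes(fit$x, fit$y)
crit
```

The assumed curve has a maximum near position 4000 and a minimum near position 12,000, so sign changes are expected close to those positions. A smaller bandwidth would follow the noise more closely and could add spurious sign changes, which is one reason confidence intervals for the critical points are useful.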
At this point, the main question is: what are the positions of the detected change points in our sequence? To answer it, we use the following command:
> critical(seq1, base.pairs = "CG")
Critical Low_CI Up_CI
[1,] 721.3478 499.7023 1275.462
[2,] 3325.6823 2966.8935 3713.562
[3,] 5542.1371 5127.9370 5845.514
[4,] 7647.7692 7041.0144 8091.060
[5,] 10861.6288 10639.9833 11027.863
[6,] 13465.9632 12801.0268 14020.077
[7,] 15017.4816 14878.9532 15622.851
From the compositional analysis of the human mitogenome focusing on the CG content, we made the following observations: the two typical origins of replication, OH and OL, are located in regions where critical points were also identified (indicated by red lines in Figure 1). In the human mitogenome, OL is a very small genomic region located at positions 5730 to 5760. Interestingly, the 95% confidence interval of one of the critical points identified in the CG analysis (position 5542, interval 5128 to 5846) contains that region. Additionally, the OH, located at the beginning (1–576) and terminus (16,024–16,569) of the genomic sequence, was also identified in our analysis: the lower confidence limit of the first critical point (about 500) falls within positions 1–576.
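The comparison between confidence intervals and annotated regions can be checked directly from the printed table. The sketch below copies the values from the critical(seq1, base.pairs = "CG") output and the region coordinates quoted in the text; the helper overlaps and the region labels are hypothetical names introduced for illustration.

```r
# Values copied from the critical(seq1, base.pairs = "CG") output above
cg <- data.frame(
  Critical = c(721.3478, 3325.6823, 5542.1371, 7647.7692,
               10861.6288, 13465.9632, 15017.4816),
  Low_CI   = c(499.7023, 2966.8935, 5127.9370, 7041.0144,
               10639.9833, 12801.0268, 14878.9532),
  Up_CI    = c(1275.462, 3713.562, 5845.514, 8091.060,
               11027.863, 14020.077, 15622.851)
)

# Hypothetical helper: TRUE when [lo, hi] and [start, end] share a position
overlaps <- function(lo, hi, start, end) lo <= end & hi >= start

# Region coordinates as quoted in the text
regions <- list(OL = c(5730, 5760),
                OH_start = c(1, 576),
                OH_end = c(16024, 16569))

hits <- sapply(regions, function(r) overlaps(cg$Low_CI, cg$Up_CI, r[1], r[2]))
hits
```

With these values, only the third interval overlaps OL and only the first overlaps the 1–576 segment of OH; no interval reaches positions 16,024–16,569.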
Secondly, for the AT analysis, we can obtain the estimated curve and its first derivative with 95% bootstrap confidence intervals in the same way as for the CG skew (Figure 2, upper and lower panel, respectively). We can also locate the change points using the same approach as for the CG analysis.
> plot(seq1, der = 0, base.pairs = "AT", CIcritical = TRUE)
> plot(seq1, der = 1, base.pairs = "AT", CIcritical = TRUE)
> critical(seq1, base.pairs = "AT")
Critical Low_CI Up_CI
[1,] 3049.258 1996.569 3963.435
[2,] 8755.940 8645.130 8922.154
[3,] 13077.505 11914.007 13963.980
Additionally, the skew analysis shown in Figure 2 appears to detect alternative origins of replication: one in the region upstream of the OL (i.e., before position 5000) and another around position 15,000. The latter inferred OL (around position 15,000) may be homologous to the OL recently mapped in the mouse mitogenome [54]. If the identified compositional bias reflects the replication process and the change points correspond to origins of replication, more than one replication mechanism (using different origins of replication) may exist in the human mitogenome, since we find evolutionary signatures of more than one potential mechanism. Indeed, the most recent biochemical findings support this hypothesis [55,56].
4.2. Nematode Mitogenome (Radopholus similis)
Nematodes, or roundworms, are a highly diverse phylum of animals. While some species are free-living, others are parasitic and cause diseases in both animals and plants. The nematode analyzed in this study, Radopholus similis, is a plant parasite, and little is known about the replication mechanism of its mitogenome. Compositional analyses of the relative amounts of T and G versus A and C reveal a pattern consistent with the existence of two different replication origins [13] located around positions 4100 and 11,000. Additionally, the authors of [13] highlighted a compositional change after position 1000, which was suggested to be related to the replication mechanism but did not correspond to a replication origin. Instead, it was attributed to a collision between DNA polymerases during mitogenome replication [13].
We can obtain the nematode mitogenome from the GenBank database with accession number NC_013253. Below is an excerpt of this sequence. We then convert the character vector into binary variables and calculate the estimates of the skew profile using the following input commands:
> library(seq2R)
> nematode <- read.genbank("NC_013253")
> nematode
[[1]]
[1] "t" "a" "a" "a" "g" "a" "a" "a" "a" "t" "a" "t" "t" "t" "t" "a"
[17] "a" "t" "t" "t" "t" "a" "g" "a" "a" "t" "g" "t" "t" "t" "c" "a"
[33] "t" "t" "g" "t" "t" "a" "a" "t" "g" "a" "a" "a" "a" "g" "g" "t"
[49] "t" "t" "t" "t" "t" "c" "t" "t" "t" "g" "a" "t" "a" "t" "t" "a"
…
[[2]]
[1] "NC_013253"
attr(,"species")
[1] "Radopholus_similis"
> nem <- transform(nematode)
> (seq2<-find.points(nem, kbin = 450, n.bandwidths = 10))
Call:
find.points(x = nem, kbin = 450, n.bandwidths = 10)
Number of A-T base pairs:14340
Number of C-G base pairs:2451
Number of binning nodes: 450
Number of bootstrap repeats: 100
Bandwidth: 0.3333333 0.8888889
Exists any critical point? TRUE
For illustrative purposes, we report the estimate and first derivative of the AT skew, along with 95% bootstrap confidence intervals, in Figure 3. The values of the change points, along with their 95% bootstrap confidence intervals, can be obtained with the following code:
> par(mfrow=c(2,1))
> plot(seq2, der = 0, base.pairs = "AT", CIcritical = TRUE)
> plot(seq2, der = 1, base.pairs = "AT", CIcritical = TRUE,
ylim = c(-0.0002,0.0002))
> critical(seq2, base.pairs = "AT")
Critical Low_CI Up_CI
[1,] 561.5676 169.1703 561.5676
[2,] 3756.8030 2579.6110 4596.2530
[3,] 8325.4290 7891.6898 8325.4290
[4,] 12669.8280 11662.2076 13495.9646
In the case of the CG skew, we can obtain the estimated curve and its first derivative with 95% bootstrap confidence intervals in the same way as for the AT skew (Figure 4, upper and lower panel, respectively). Our method detects two change points at positions 4264 and 11,186, which overlap with the change points located in the AT skew, as also reported in [13].
> par(mfrow = c(2,1))
> plot(seq2, der = 0, base.pairs = "CG", CIcritical = TRUE)
> plot(seq2, der=1, base.pairs = "CG", CIcritical = TRUE)
> critical(seq2, base.pairs = "CG")
Critical Low_CI Up_CI
[1,] 4264.553 4124.436 4264.553
[2,] 11186.326 8524.105 13456.219
After applying our methodology to this nematode mitogenome, we were able to identify three significant change points, which correspond to the three points highlighted in [13]. Therefore, we not only confirmed the presence of change points related to the replication process but also provided statistical support and confidence intervals for their location.
4.4. Compositional Analysis with Another R Package: Seqinr
The aim of this section is to compare our results with those obtained using the seqinr R package, which is considered the most complete R package for genomic sequence analysis and nucleotide compositional analysis. To accomplish this, we analyzed the same mitogenomes as in the previous subsections (human and nematode sequences, as discussed in Section 4.1 and Section 4.2, respectively) using seqinr.
Here, we focus our discussion on the change points associated with the origins of replication. In the human mtDNA, two origins of replication were identified, one for each complementary strand. The first is located around position 5730, and the other lies in the major non-coding region that spans the start (positions 1 to 576) and the end (positions 16,024 to 16,569) of the circular genome. As mentioned above, our method detects change points that overlap with these two origins of replication: the lower confidence limit of the first CG skew critical point falls within the major non-coding region, and the confidence interval of the third CG skew change point contains the other origin of replication.
After applying seqinr to this human sequence, it is possible to associate a minor change detected near position 6000 with the first origin of replication. However, the method is not able to detect the other origin of replication at the start/end of the genome because, as mentioned in Table 2, this analysis can only be performed in coding regions (i.e., regions of protein-coding genes), whereas this origin of replication is located in a major non-coding region.
The nematode genome has three major change points located at positions 1100, 4100, and 11,100, with the last two identified as the two origins of replication found in this genome [13]. Our method detects change points that overlap with the replication origins. Specifically, the confidence interval of the second AT skew change point contains the change point around position 4100, whereas that of the fourth AT skew change point misses the other change point/replication origin by approximately 500 nucleotides. Interestingly, both change points are detected by the CG skew analysis, with two critical points identified at positions 4264 (whose confidence interval, 4124 to 4265, lies within about 25 nucleotides of position 4100) and 11,186 (whose confidence interval contains position 11,100). In contrast, the cumulative composite skew index analysis using seqinr fails to detect any of the change points associated with the origins of replication.
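For readers unfamiliar with cumulative skew approaches, the sketch below computes a plain cumulative GC skew in base R on a toy sequence and locates its maximum. It is a simplified illustration under stated assumptions, not the composite skew index computed by seqinr (which, as noted above, is restricted to coding regions); the function name cumulative_skew and the toy sequence are hypothetical.

```r
# Hypothetical helper: running sum of +1 for each "plus" base and -1 for each
# "minus" base; other bases contribute 0.
cumulative_skew <- function(s, plus = "g", minus = "c") {
  s <- tolower(s)
  cumsum((s == plus) - (s == minus))
}

# Toy sequence: G-rich first half, C-rich second half (200 bases each)
toy <- c(rep(c("g", "g", "c", "a"), 50),
         rep(c("c", "c", "g", "t"), 50))

cs <- cumulative_skew(toy)
turning_point <- which.max(cs)   # position where the G excess stops accumulating
turning_point
```

In this toy example the cumulative curve rises through the G-rich half and falls through the C-rich half, so its maximum sits next to the composition switch at position 200. Such extrema are read off the curve without confidence intervals, in contrast to the critical points reported by critical().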
Overall, in these examples our method performs well in identifying the change points associated with the origins of replication, whereas seqinr misses some of them.