Chloroplast Genome Annotation Tools: Prolegomena to the Identification of Inverted Repeats

The development of next-generation sequencing technology and the increasing amount of sequencing data have brought the bioinformatic tools used in genome assembly into focus. The final step of the process is genome annotation, which works on assembled genome sequences to identify the location of genome features. In the case of organelle genomes, specialized annotation tools are used to identify organelle genes and structural features. Numerous annotation tools target chloroplast sequences. Most chloroplast DNA genomes have a quadripartite structure caused by two copies of a large inverted repeat. We investigated the strategies of six annotation tools (Chloë, Chloroplot, GeSeq, ORG.Annotate, PGA, Plann) for identifying inverted repeats and analyzed their success using publicly available complete chloroplast sequences of taxa belonging to the asterid and rosid clades. The annotation tools use two different approaches to identify inverted repeats, using existing general search tools or implementing stand-alone solutions. The chloroplast sequences studied show that there are different types of imperfections in the assembled data and that each tool performs better on some sequences than the others.


Introduction
Chloroplasts are organelles of plants and green algal cells specialized for photosynthesis. They are likely descended from freshwater cyanobacteria that enter a eukaryotic cell by endosymbiosis [1,2]. Accordingly, chloroplasts have their own genome, known as the plastome. The chloroplast genome is a valuable source of data for assessing plant evolution and also has potential for plant breeding programs to meet the needs of a growing world population for food, fiber, energy, and new medicines [3]. Chloroplast DNA is generally described as circular, but in some cases, it has been described as a multimeric circular, linear, or branched double-stranded molecule [4][5][6], typically 120 to 170 kbp long and consisting of 120 to 130 genes (e.g., [7,8]). It is mainly maternally inherited, but biparental and paternal inheritance is also possible. Maternal inheritance is the most common form of inheritance in ferns [9], horsetails [10], most angiosperms, cycads, and gnetophytes [11]. On the other hand, paternal inheritance has been observed in conifers [12], while biparental inheritance has been noted in the eudicot families Geraniaceae [13], Campanulaceae, and Fabaceae [14].
Four specific regions can be well distinguished in the chloroplast genome maps. The large (LSC) and small single copies (SSC) contain only single gene copies and are separated by inverted repeats (IRa and IRb) that contain completely identical genes, but in opposite or reversed orientations [15]. In most angiosperms, IRs are 20 to 30 kb long, but there are many lineages of seed and nonseed plants that have shorter or longer IRs [15][16][17]. In some extreme cases, IRs as short as 114 bp (Cryptomeria [18]) or as long as 76 kb (Pelargonium [19]) have to expect. It is likely that both short and long substitutions and indels will occur during assembly. Two other points concerning implementation are as follows: (a) the chloroplast genome is usually circular, but the sequence is represented in linear form, and (b) there may be ambiguous base pairs in the sequence. If oversight is made in the implementation, the results may be of lower quality. In most cases, such implementation errors can be easily detected and corrected. Although studies have been performed on the quality and consistency of gene annotation of several cpDNA annotation tools [64,65], we are not aware of any studies that have analyzed and compared the methods used to identify IRs.
The aim of this study was to analyze the methods used to identify IRs in six existing cpDNA annotation tools. The tools analyzed were Chloë [66], Chloroplot [67], GeSeq [59], ORG.Annotate (https://git.metabarcoding.org/org-asm/org-annotate, accessed 31 December 2021), PGA [68], and Plann [69]. We investigated how annotation tools overcome problems in identifying IRs. The methods were analyzed in two ways: through qualitative analysis of the program code and the articles in which it is described and through quantitative analysis of the results obtained with these methods. We tested the results on more than 4000 complete chloroplast genomes from the asterid and rosid clades.

Comparative Code Analysis
For each annotation tool, we extracted the description of the IR identification strategy from the respective publications and analyzed the source code to find the exact strategy that was implemented. In general, bioinformatic solutions for identifying IR can be implemented by using existing general search tools or as stand-alone solutions. Table 1 provides descriptions of the IR identification strategies described in the original publication and what we deduced from the code analysis.
Three tools (Chloë, Chloroplot, GeSeq) implement stand-alone solutions. The strategies of Chloë and Chloroplot are very similar. They merge small blocks (up to 100 bp) of exact inverted repeats into larger regions and tolerate small differences. GeSeq implements a different strategy by finding relatively large blocks (at least 2000 bp) of identical inverted repeats that define the IR ends and considers the regions in between as IRs. Chloë and Chloroplot focus on the circularity of the chloroplast genome resulting from code analysis. In the case of GeSeq, it is not possible to deduce how it handles genome circularity. As mentioned earlier, GeSeq uses the OGDRAW tool to identify IR. However, OGDRAW does not handle circularity by itself. With appropriate formatting of the input data (e.g., the input of a duplicated cpDNA genome) or with additional processing of the OGDRAW results (e.g., expansion of the regions found), it is possible to cover genome circularity, but we do not know the exact application because no code is available for GeSeq.
The method implemented in Chloë identifies IRs with three numbers: start position of IRa, start position of IRb, and IR length. Thus, it is not possible for the IRa and IRb annotated by Chloë to differ in length. The results of the GeSeq method depend strongly on the input data. It can allow very large differences in the inner region of IRs, but it can also truncate IRs found, even with a single base difference or a single ambiguous character. From the GeSeq source code, it appears that end blocks of at least 2000 base pairs in length must not contain ambiguous characters.
The original publication describing GeSeq [59] states that the strategy is to identify identical IR pairs, while the method used tolerates mismatches. This is because GeSeq now uses the OGDRAW method for IR identification.
The strategy implemented in ORG.Annotate involves detecting approximate IR positions and then refining the results based on querying a database included in the package, which contains a set of 45 LSC and 72 SSC regions. The implementation uses repseek [70] and blast [60] for IR identification and to process the results of both methods. Both repseek and blastn queries are performed on the original cpDNA genome (i.e., not on the duplicated genome as in PGA) and therefore ORG. Annotate ignores the circularity of the genome. Table 1. The description of IR identification strategies implemented in six chloroplast annotation tools presented in publications and derived from code analysis.

Method Description in Publication Code Analysis
Chloë Does not mention IR identification. Chloë assembles reverse-complement blocks of~20 bp in size into long sequences allowing up to a few bases (≤10 bp) of mismatches.

Chloroplot
The publication contains a description of the main steps. The method can detect nonidentical IR regions allowing small stretches of mismatches.
Chloroplot assembles reverse-complement blocks of size 100 bp into long sequences, sorts the assembled sequences by position, and finds the longest gap between sequences. Assembled sequences downstream of the gap are merged into IRa, and sequences upstream of the gap are merged into IRb. Identified IRa and IRb are checked to ensure that they do not differ in length by more than 1000 bp. In addition, mismatches between the regions are calculated and checked to support the length differences between the regions. If the procedure fails, the same procedure is attempted with blocks of 1000 bp in length.

GeSeq
The minimum length for IR annotation is 200 bp. GeSeq only annotates the longest identical IR pair that matches within a submitted sequence.
GeSeq finds leftmost and rightmost exact matches with a size of at least 2000 bp and designates the regions between the match positions as IRa and IRb.
ORG.Annotate -ORG.Annotate uses repseek to find approximate long matches, scores these matches according to the results of a query of the input sequence in a database containing 45 LSC and 72 SSC regions, and considers the highest scoring matches as IRs.

PGA
The IR boundary annotation is performed with a self-blastn search. One parameter can be adjusted to determine IR boundaries: the minimum allowed IR length (default = 1000).
PGA performs self-blasts (-perc_identity 99) on a duplicated sequence to find the maximum inverted match with a length of at least 1000 bp.

Plann
Does not mention IR identification.
Plann performs self-blasts (-evalue 1 × 10 −200 ) on an original sequence to find the maximum inverted match with a length between 10,000 and 50,000 bp.
Two tools (PGA, Plann) use the blastn program to identify repeats in a complete chloroplast sequence via a self-blastn search. Both programs use very similar arguments, except that PGA implements the search for repeats in the duplicated genome, while Plann restricts the search to the original genome. Therefore, similarly to ORG.Annotate, Plann ignores genome circularity.
Regarding the flexibility of programs in tolerating sequence differences between IRa and IRb, methods using existing software to identify IR (ORG.Annotate, PGA, Plann) are expected to allow differences in a consistent manner, as this depends on the argument values passed to specialized tools (blastn, repseek). For the other methods, it is difficult to determine exactly how much difference they tolerate, as this is neither stated in the documentation nor clear from the code.
Each method has a theoretical minimum length of the putative IRs. For Chloë, Chloroplot, and GeSeq, the minimum length is equal to the length of the block size used, i.e., 100, 1000, and 2000 bp, respectively. ORG.Annotate does not filter repeats by size, so the minimum length is a few dozen bp, which corresponds to the shortest blastn output. When searching for IRs, PGA and Plann exclude repeats smaller than 1000 and 10,000 bp, respectively. proportion of different outcomes of IR identification: six annotation tools produced the same outcome (green), two different outcomes (yellow), and three or more different outcomes (orange). Outcomes were treated as the same if the lengths of the identified IR differed by less than 10 bp.

Comparative Results Analysis: Overview
As expected, most complete chloroplast sequences have been published recently, with more than 85% of all sequences published in the last five years ( Figure 2). Continuous sequencing efforts were observed in most families (Figure 1b), especially in the families Asteraceae (asterids, 382 sequences) and Fabaceae (rosids, 384 sequences), with some exceptions where almost all sequences were published within a short period of time. Examples are the families Campanulaceae (asterids, 121 sequences) and Chrysobalanaceae (rosids, 50 sequences), where most sequences were produced in 2016 and 2017, respectively, probably as a result of a single, rather isolated research effort. proportion of different outcomes of IR identification: six annotation tools produced the same outcome (green), two different outcomes (yellow), and three or more different outcomes (orange). Outcomes were treated as the same if the lengths of the identified IR differed by less than 10 bp.
Identical results of all six annotation tools were observed in 75.16% of the sequences, while two and three or more different results were obtained in 17.97% and 6.87% of the sequences, respectively (Figure 1d). Identical results per family dataset ranged from 40% to 100% of the sequences, with the lowest agreement among the six annotation tools observed in the Geranicaceae (rosids).

Figure 2.
Number of complete chloroplast sequences added to GenBank per year (leftmost chart) and the relative ratio of the three types of sequences identified using six annotation tools (and those annotated in NCBI): blue-identical IRs, orange-different IRs, and red-no IRs.

Comparative Results Analysis: IRL dataset
The number of sequences in the IRL dataset in which IRs were identified is shown in Table 2, while detailed identification results can be found in Table S1.
ORG.Annotate identified IRs in almost all sequences, Chloroplot identified IRs in 32% of sequences, while the other methods identified IRs in 20% or fewer sequences. In most cases, however, the regions identified as IRs were between a few hundred and a few thousand bases long. The number of sequences in which the regions of at least 10,000 bp in length ranged from four to eleven, depending on the method ( Table 2). As expected, no sequence containing IRs longer than 10,000 bp was identified in most families included in the IRL dataset. The exceptions were the family Ericaceae (asterids) and the genus Erodium (Geraniaceae, rosids).
The family Ericaceae was represented by 16 complete chloroplast sequences of species belonging to nine genera. IRs were identified in eight of 16 sequences using three annotation tools (Chloë, ORG.Annotate, Plann). The genera without identified IRs were The genus Erodium was reported to lack IRs because of evidence of loss of IR in E. carvifolium Boiss. & Reut. [38] and E. texanum A.Gray [39], for which complete chloroplast sequences were available at that time. We confirmed that IRs were missing in an The median of the complete chloroplast sequence length ranged from 149,100.5 (Gentianaceae) to 165,063 (Campanulaceae) in the asterids and from 151,362 (Passifloraceae) to 173,148 (Thymelaeaceae) in the rosid clade ( Figure 1c). In general, the interquartile range (IQR) length within a family was low, ranging from 59.5 (Vitaceae, rosids) to 8463 (Apiaceae, asterids). However, five families were clear outliers with IQRs of 14,852 (Passifloraceae, rosids), 24,753.5 (Geraniaceae, rosids), 26,909.5 (Fabaceae, rosids), 64,236.5 (Orobanchaceae, asterids), and 75,590.5 (Convolvulaceae, asterids).
Identical results of all six annotation tools were observed in 75.16% of the sequences, while two and three or more different results were obtained in 17.97% and 6.87% of the sequences, respectively ( Figure 1d). Identical results per family dataset ranged from 40% to 100% of the sequences, with the lowest agreement among the six annotation tools observed in the Geranicaceae (rosids).

Comparative Results Analysis: IRL Dataset
The number of sequences in the IRL dataset in which IRs were identified is shown in Table 2, while detailed identification results can be found in Table S1. ORG.Annotate identified IRs in almost all sequences, Chloroplot identified IRs in 32% of sequences, while the other methods identified IRs in 20% or fewer sequences. In most cases, however, the regions identified as IRs were between a few hundred and a few thousand bases long. The number of sequences in which the regions of at least 10,000 bp in length ranged from four to eleven, depending on the method ( Table 2). As expected, no sequence containing IRs longer than 10,000 bp was identified in most families included in the IRL dataset. The exceptions were the family Ericaceae (asterids) and the genus Erodium (Geraniaceae, rosids).
The family Ericaceae was represented by 16 complete chloroplast sequences of species belonging to nine genera. IRs were identified in eight of 16 sequences using three annotation tools (Chloë, ORG.Annotate, Plann). The genera without identified IRs were Allotropa  Table 2. Identification of inverted repeats in sequences of the inverted repeat-lacking (IRL) dataset and the list of species with IRs larger than 10 kb detected using six annotation tools (and those annotated in NCBI). Species (asterids/Ericaceae) with IRs larger than 10 kb

Comparative Results Analysis: IR Dataset
The IR dataset, containing 3996 complete chloroplast sequences, was processed to analyze the following properties of the six annotation tools: (a) type of IR regions identified, (b) treatment of circularity of the chloroplast genome, (c) treatment of sequences with ambiguous bases, and (d) differences between IRa and IRb. Finally, we analyzed the length distribution of the identified regions at different taxonomic levels. The detailed identification results for the IR dataset can be found in Table S2.

Type of IR Regions Identified
After processing 3996 sequences of the IR dataset using six annotation tools (and those annotated in NCBI), we classified them into three types: (a) identical IRs, (b) different IRs, and (c) no IRs, as explained in the Section 4. The IR identification results in the IR dataset are shown in Table 3.
The Chloë and Chloroplot methods found a large number of identical IRs and a relatively small number of different IRs, which is related to their strategies tolerating only small differences. The result of the GeSeq method was similar, but this is due to the property of the method that IRs must have ends that match by at least 2000 bp. The ORG.Annotate method annotated almost all sequences with 69.6% identical IRs and 30.3% different IRs. This is because the tool follows the most liberal strategy for tolerating differences. The PGA and Plann methods found a relatively small number of identical IRs and a large number of different IRs, which is related to their similar use of blastn. It is expected that the recently added complete chloroplast sequences will be of better quality than the older sequences because sequencing technologies and assembly software have rapidly improved. Figure 2 shows the relative ratio of sequences with identified IRs (identical and different) and without IRs per year of sequence publication. Nevertheless, the performance of all annotation tools was consistent, and newer sequences were annotated at a similar ratio to older ones. This is especially true for the last five years, when more than 85% of the sequences in the entire dataset were added. However, a large proportion of the complete chloroplast sequences stored in NCBI does not yet have annotated IRs (43.37%; Table 3).

Treatment of Circularity of the Chloroplast Genome
Another bioinformatic problem for the annotation tools is how to deal with the circularity of the chloroplast genome, since the complete chloroplast sequences are represented in linear form. Therefore, in some cases, the sequence must be wrapped to identify IRs. To check how the different annotation tools handle circularity, we counted how many cases identified IRs at the first attempt (without wrapping) and with wrapping (Table 4). The methods implemented in ORG.Annotate and Plann did not find additional sequences with IRs because these tools cannot handle the circularity of the chloroplast genome, as also shown by code analysis.
Chloë, Chloroplot, and PGA performed similarly well, finding IRs in an additional 500 to 700 sequences by wrapping. GeSeq annotated significantly fewer IRs by wrapping. As mentioned earlier, it is not possible to infer the properties of GeSeq when dealing with chloroplast sequences that need to be wrapped. The results show that all four methods (Chloë, Chloroplot, GeSeq, PGA) annotated a very similar number of sequences (~30) that needed to be wrapped for more than 2000 bp (Table S2). In the case of sequences that needed to be wrapped less than 2000 bp, GeSeq identified IRs in only 41, while Chloë, Chloroplot, and PGA annotated more than 500 sequences. This outcome suggests that GeSeq does not use a duplicated cpDNA genome as an input to ODGRAW methods, but extends the found regions around the sequence origin.

Treatment of Sequences with Ambiguous Characters
In the IR dataset, there were 526 (13.16%) complete chloroplast sequences that contained ambiguous characters ( Table 5). As expected, for all methods, the proportion of sequences with different IRs or no IRs was considerably larger in the dataset containing ambiguous characters than in the dataset without ambiguous characters. All methods performed similarly well, failing to identify IRs in only 0 to 2.09% of sequences, with the exception of GeSeq. The GeSeq method did not annotate IRs in 43.2% of sequences, consistent with the results of the code analysis that IR end blocks (~2000 bp) may not contain ambiguous characters.

Differences between IRa and IRb
All analyzed annotation tools can identify different IRs. The number of sequences with different IRs ranged from 264 (Chloë) to 1212 (ORG.Annotate) ( Table 6). We analyzed differences in lengths, between IRa and IRb, in identified IRs. The annotation tools differed significantly in the number of sequences identified, where the IRs differed in length. The Chloë method did not annotate IRs with different lengths in agreement with the results of the code analysis. The remaining annotation tools can be divided into two groups. The first group, consisting of Chloroplot and GeSeq, annotated relatively few IRs of different lengths compared to the second group (ORG.Annotate, PGA, Plann). This can be explained by the fact that the methods of the first group implemented a stand-alone identification strategy, whereas the methods of the second group used specialized software to identify repeats (repseek, blastn).

Method Agreement
Six annotation tools were compared based on the agreement in identified IRs. For a pair of methods, we counted the number of sequences where the resulting IRs of the first method were longer than those of the second method. To exclude very similar results, we assumed that one IR was longer than the other if its length differed from the other by more than 10 bp (Table 7).
Pairwise comparisons showed that the annotation tools gave the same result for 3150 to 3663 of 3996 sequences in the IR dataset. This result is consistent with the results presented in Figure 1d, which shows that the overall agreement between the methods is 75.16%. The number of longer IRs identified ranged from 10 (0.25%) sequences in the case of Chloë/PGA to 833 (20.85%) sequences when comparing PGA and GeSeq. PGA identified the longest IRs compared to any other tool.
For all method pairs, there were sequences where using one method resulted in longer IRs than using another, suggesting that each method covers some instances of input data better than another.

IR-Length Statistics
Box plots of the IR length distributions in complete chloroplast genome sequences grouped by type of IRs identified (identical or different) using six annotation tools are shown in Figure 3. The length distributions of the identical IRs were very similar for all methods. The medians ranged from 25,992 to 26,034 bp, and the interquartile range (IQR) was between 827.75 and 865.75 bp, with Chloë having a slightly wider IQR (968.25 bp), whereas the Plann method did not annotate any IRs longer than 50 kb, in agreement with the results of the code analysis. For the different IRs, the length distribution of Chloë differed considerably from the other methods. The median for Chloë was 25,494.5 bp, whereas for the other methods, the medians ranged from 25,807.5 to 26,050.5 bp. Similarly, the IQR for Chloë was much wider (7926.5 bp) than that of the other methods (1008.00-1678.25 bp). Using all methods, the length distribution of identical IRs was narrower than the distribution of different IRs. This was expected, since IRs that differ were more likely to be problematic, and these annotations are probably shorter than they should be in the case of perfect matches. This was most pronounced for Chloë, where the length distribution of different IRs was considerably stretched toward lower lengths.
The distributions show two distinguishable groups of sequences with longer IRs as outliers, the first with IRs longer than 70 kb and the second with lengths between 43 and 52 kb.
The second group of outliers contained sequences from the genera Cyphia (Campanulaceae) and Passiflora (Passifloraceae). The genus Cyphia was represented by nine sequences, seven of which contained IRs longer than 43 kb, and two contained slightly shorter IRs (40-42 kb). There were 48 sequences from the genus Passiflora, of which five contained IRs of 44 kb or more, while most of the other sequences had IRs between 20 and 30 kb.

Discussion
Most cpDNA genomes have a quadripartite structure. This structure is established with two copies of a large inverted repeat (IR) separating large (LSC) and small (SSC) single copy regions [15,73]. In the sequencing of the cpDNA genome, the identification of IRs is performed in the annotation step.
In general, the identification of IRs in the genome is considered a simple task, but in some cases, the results obtained may not be satisfactory. There are two reasons for this.
The first reason is related to the treatment of the biological properties of the chloroplast genome, including the circularity of the genome and the decision on the minimum and the maximum length of the inverted repeat. This treatment is not always adequately implemented in annotation tools. The second reason is related to data quality, which is very difficult to assess or even detect because there are no gold standards for genome assembly [74]. Biologically, IRs should be regions of exact inverted complements. However, it would be advisable to allow cases where small differences are found between regions rather than being too rigid. It is quite difficult to specify how many differences can be tolerated, and the same is true for ambiguous characters. From a bioinformatic point of view, these problems are usually relatively easy to find and correct once thresholds are established.
The existence of specialized chloroplast annotation tools that do not annotate IRs at all (AGORA, CpGAVAS2, and MFannot) suggest that knowledge of chloroplast structure is less important than knowledge of the genes present in cpDNA. This indicates that gene data are more widely used in subsequent research. In the IR dataset, which consists of Three methods identified IRs of 72 kb in the sequence of Vitis romanetii Rom.Caill. The genus Vitis contained 52 sequences in the dataset. All sequences, except V. romanetii and V. yunnanensis C.L.Li, were of very similar length (161,100 ± 300 bp) and contained uniform IRs of 26,340 ± 50 bp. The sequences of V. romanetii and V. yunnanensis were 232 and 167 kb long, containing IRs of 72 and 32 kb, respectively. Similar results were demonstrated by [72].
The second group of outliers contained sequences from the genera Cyphia (Campanulaceae) and Passiflora (Passifloraceae). The genus Cyphia was represented by nine sequences, seven of which contained IRs longer than 43 kb, and two contained slightly shorter IRs (40-42 kb). There were 48 sequences from the genus Passiflora, of which five contained IRs of 44 kb or more, while most of the other sequences had IRs between 20 and 30 kb.

Discussion
Most cpDNA genomes have a quadripartite structure. This structure is established with two copies of a large inverted repeat (IR) separating large (LSC) and small (SSC) single copy regions [15,73]. In the sequencing of the cpDNA genome, the identification of IRs is performed in the annotation step.
In general, the identification of IRs in the genome is considered a simple task, but in some cases, the results obtained may not be satisfactory. There are two reasons for this.
The first reason is related to the treatment of the biological properties of the chloroplast genome, including the circularity of the genome and the decision on the minimum and the maximum length of the inverted repeat. This treatment is not always adequately implemented in annotation tools. The second reason is related to data quality, which is very difficult to assess or even detect because there are no gold standards for genome assembly [74]. Biologically, IRs should be regions of exact inverted complements. However, it would be advisable to allow cases where small differences are found between regions rather than being too rigid. It is quite difficult to specify how many differences can be tolerated, and the same is true for ambiguous characters. From a bioinformatic point of view, these problems are usually relatively easy to find and correct once thresholds are established.
The existence of specialized chloroplast annotation tools that do not annotate IRs at all (AGORA, CpGAVAS2, and MFannot) suggest that knowledge of chloroplast structure is less important than knowledge of the genes present in cpDNA. This indicates that gene data are more widely used in subsequent research. In the IR dataset, which consists of sequences expected to contain IRs, IR annotations are missing in 43.37% of the cases in the NCBI data, including 40.74% of the sequences published in 2021. This suggests that there is room for improvement in the stored data, and that caution should be exercised when submitting a sequence to GenBank.

State of the Art
Chloroplast annotation tools identify IRs with different strategies. The main reason for trying different approaches to address a relatively simple bioinformatic problem is the awareness that IRs can be found that are not exact copies. In general, annotation tools allow for small differences between IRa and IRb. This is usually omitted from the articles describing the tools. The only exception is the documentation for Chloroplot, which clearly refers to the possibility of detecting nonidentical IRs and whose web application visualizes the differences found [75], but without documenting the parameters that control the process.
Classification of the identified IR regions by type (identical IRs, different IRs, no IRs) reveals two groups of annotation tools. The first group implements standalone solutions (Chloë, Chloroplot, GeSeq), while the second group uses blastn (ORG.Annotate, PGA, Plann). The annotation tools of the first group identify a higher percentage of identical IRs (~90%) than those of the second group (70-80%). This is because the tools of the first group use specific matching techniques with a limited number of possible misalignment patterns, while the tools of the second group use existing alignment solutions based on the mismatch score, which results in tolerating more different misalignment patterns.
Two (ORG.Annotate, Plann) of the six annotation tools do not treat sequences as circular, and one does so only partially (GeSeq). This suggests that implementation problems are more common than we might expect. It is also possible that at the time these tools were developed, there were few sequences to test, whereas currently, there are many complete cpDNA data to work with and test different input data.
There is a notable difference in the results of IR identification for sequences with and without ambiguous characters. All annotation tools, but especially the GeSeq method, have a lower success rate in dealing with sequences with ambiguous characters. This is to be expected, as ambiguous characters lead to uncertainties in the data, and this situation is more difficult to handle.
Among the six tools analyzed (Chloë, Chloroplot, GeSeq, ORG.Annotate, PGA, and Plann), there are six different approaches that have their own expectations about the kinds of differences that might be found. Each of these approaches addresses some types of differences, but none of them addresses all differences that can reasonably be tolerated. Considering these findings, it is not possible to state the best strategy for identifying IRs, since all strategies cover some situations with cpDNA data.

Further Research
Although there are many reports of taxa without inverted repeats, further research is clearly needed on the mechanism of IR loss and its taxonomic implications, as well as on the relationship between taxonomic relatedness and IR sequence length. In general, the analysis of taxa without inverted repeats confirmed all previous results [38,39,[76][77][78][79][80][81][82][83][84] The distributions of the lengths of the identified IRs show that, in general, all annotation tools give similar results, especially in cases where identical IRs are detected. The median length of the IRs was~26 kb with an interquartile range (IQR) of~800 bp. Further analysis of IR length distributions at lower taxonomic levels (i.e., within families or even genera) would be required to assess the relationship between taxonomic relatedness and the IR length of sequences.
Finally, in view of the standardization of cpDNA sequence data and the possible influence of nonstandardized data on phylogenetic analysis, it is advisable to standardize these sequences according to [50].
Although sequencing and assembling the complete chloroplast genome is a far simpler task than assembling nuclear DNA [47], the sequences obtained are not error-free. This problem could be further investigated by checking the length differences between IRa and IRb. All annotation tools are able to identify IRs of different lengths, sometimes by even more than 100 bp. Additional studies can be performed to determine the types of differences between the regions. Since IRs should be biologically identical, these differences may indicate possible errors in sequencing or assembly. Similarly, the quality of an assembly is checked by mapping to a reference genome [85]. The differences are either substitutions or indels. It is likely that short substitutions indicate sequencing errors, whereas indels and long substitutions indicate assembly errors.
If a new annotation tool is to be introduced, we think it best to adopt ideas from all existing strategies. First, self-alignment covers all cases of identical or nearly identical IRs (e.g., PGA and Plann). If IRs cannot be identified in this way, the self-aligned parts can be merged by checking the intervening sequences for allowable differences (e.g., Chloë and ORG.Annotate). If merging does not result in IRs, an inspection of the larger self-aligned fragments can find the junction points (e.g., Chloroplot and GeSeq) of the IRs, and at least indicate the possible location of the IRs. Further data analysis can be performed when developing a new method to determine what thresholds should be used for the alignment parameters and what types of differences are acceptable. Fortunately, a growing number of complete cpDNA sequences are available for the analysis and verification of possible solutions.

Data Acquisition
The complete chloroplast genome sequences were downloaded from NCBI (http: //www.ncbi.nlm.nih.gov, accessed on 31 December 2021). We acquired all genomes for asterid and rosid clades available at this time and created two datasets. The inverted repeat-lacking (IRL) dataset contained the sequences of taxa for which the loss of IR had been documented. The remaining sequences, which we assumed to have the standard quadripartite structure, formed the IR dataset.
A total of 4137 sequences were used for the analyses, of which 1924 were from asterids and 2213 were from rosids. From the data, we created IRL and IR datasets containing 141 and 3996 sequences, respectively ( Table 8). The list of sequences included in the IRL dataset with their taxonomic information and references is given in Table 9. The genus Monsonia (Geraniaceae, rosids) was included in the IRL dataset because [82] found no IR in M. emarginata (L.f.) L'Her. or M. marlothii (Engl.) F.Albers. Although [39] identified IR in M. speciosa L., it was reduced to only 7 kb.
On the other hand, the genus Passiflora was included in the IR set, although [42] reported that two of 46 species of the genus lacked IR, namely, P. capsularis L. and P. costaricensis Killip.

Comparative Code Analysis
We examined chloroplast annotation tools that were available at the time of writing. Other tools implemented as web applications, such as CGAP [86], CpGAVAS [87], and Verdant [88], are no longer functional (as of 23 November 2021). In addition, the software DOGMA [58], which has been widely used over the past decade, is "being sunsetted after 15 years and will not be available for use in the near future," according to its website (https://dogma.ccbb.utexas.edu/). Some tools are limited to gene annotation and do not annotate inverted repeats (accessed on 23 November 2021), such as AGORA [89], CpGAVAS2 [90], and MFannot [91]. Note that in [87] on CpGAVAS, the predecessor of CpGAVAS2, it is stated that "the inverted repeats are identified using the vmatch software tool with default parameters" while in [90] on CpGAVAS2, there is no mention of the IR identification strategy implemented.
To investigate how annotation tools accomplish the identification of IRs, we focused our investigation on six annotation tools (Chloë, Chloroplot, GeSeq, ORG.Annotate, PGA, and Plann) and collected the relevant articles and available source codes. We also scanned IRs from NCBI sequences using airpg [51], a tool that includes functions to identify and parse the IR annotations of plastid genomes regardless of their format. Detailed information and references on the annotation tools used in the analysis can be found in Table 10. For each annotation tool, we reviewed the implementation code to determine the general strategy for identifying IRs. From the general strategy and implementation specifics, we inferred the behavior of the method as a function of a range of possible inputs, particularly with respect to the treatment of genome circularity and the presence of ambiguous characters in the input data. We also investigated how possible differences in IR regions are handled and what types of mismatches are tolerated.

Comparative Results Analysis
To annotate cpDNA sequences with all tools in a standardized way, we implemented a set of scripts, one for each tool, that have the same format for inputs and results. The input is a fasta or GenBank file with a sequence to be annotated. The results are four integer positions of IR junction point locations (i.e., start and end points of IRa and IRb), or none if no IRs were identified. The scripts were implemented as stand-alone executables in the project (https://github.com/CroP-BioDiv/irs_wrappers, accessed on 27 June 2022).
We processed the acquired cpDNA sequences using six annotation tools to identify the locations of IR. We analyzed the obtained results for properties related to identification problems, genome circularity, and mismatches in the detected regions.
For the sequences in the IRL dataset, we examined the number of identified and unidentified IRs and how the results corresponded to the documented IR loss. For the IR dataset, we analyzed the results for the following properties: (a) type of IR regions identified, (b) treatment of the circularity of the chloroplast genome, (c) treatment of sequences with ambiguous bases, and (d) differences between IRa and IRb. In addition, we investigated the distribution of the lengths of the identified regions.
Regarding the type of IR regions, we classified the sequences as follows: (a) the sequences in which the IR regions were identified and verification showed that the IRa and IRb were identical (identical IRs), (b) the sequences in which the IR regions were identified but differed (different IRs), and (c) the sequences in which the IR regions were not identified (no IRs).
The pipeline to perform the analysis was implemented in Python using the Biopython package [95]. The code is maintained in a public repository (https://github.com/CroP-BioDiv/zcitools; accessed on 31 December 2021). Figures were generated using the Python library Matplotlib [96].

Conclusions
Long inverted repeats are present in most chloroplast genomes. The location of IRs within the sequence is valuable information because cpDNA data have been used extensively in studies of plant phylogenetics. Bioinformatically, the problem of identifying IRs is easy to solve, as there are a variety of software programs designed to identify exact or similar long reverse-complement repeats. Working with public cpDNA genome sequences shows us that the assembled data are not always perfect and that there are situations where we are very confident that IRs exist, but two exact or similar long repeats cannot be found easily. The annotation tools analyzed (Chloë, Chloroplot, GeSeq, ORG.Annotate, PGA, Plann) use two approaches to solve the problem, using existing general search tools (ORG.Annotate, PGA, Plann) or implementing stand-alone solutions (Chloë, Chloroplot, GeSeq). The development of annotation tools that implement stand-alone solutions to this problem is evidence that researchers encountered data-imperfection problems before. In terms of results, all methods differ from each other, and for each combination of methods there are examples of sequences where one performs better than the other. This suggests that there is room for improvement in the identification of IRs and that a pool of ideas for this can be found in the implementation of existing annotation tools.