Article

In Silico Proof of Concept: Conditional Deep Learning-Based Prediction of Short Mitochondrial DNA Fragments in Archosaurs

by Dimitris Angelakis *, Dionisis Cavouras, Dimitris Th. Glotsos, Spiros A. Kostopoulos, Emmanouil I. Athanasiadis, Ioannis K. Kalatzis and Pantelis A. Asvestas
Department of Biomedical Engineering, University of West Attica, 122 43 Athens, Greece
* Author to whom correspondence should be addressed.
Submission received: 10 December 2025 / Revised: 9 January 2026 / Accepted: 10 January 2026 / Published: 14 January 2026
(This article belongs to the Special Issue Transforming Biomedical Innovation with Artificial Intelligence)

Abstract

This study presents an in silico proof of concept exploring whether deep learning models can perform conditional mitochondrial DNA (mtDNA) sequence prediction across species boundaries. A CNN–BiLSTM model was trained under a leave-one-species-out (LOSO) scheme on complete mitochondrial genomes from 21 vertebrate species, primarily archosaurs. Model behavior was evaluated through multiple complementary tests. Under context-conditioned settings, the model performed next-nucleotide prediction using overlapping 200 bp windows to assemble contiguous 2000 bp fragments for held-out species; the resulting high token-level accuracy (>99%) under teacher forcing is reported as a diagnostic of conditional modeling capacity. To assess leakage-free performance, a two-flank masked-span imputation task was conducted as the primary evaluation, requiring free-running reconstruction of 500 bp interior spans using only distal flanking context; in this setting, the model consistently outperformed nearest-neighbor and demonstrated competitive performance relative to flank-copy baselines. Additional robustness analyses examined sensitivity to window placement, genomic region (coding versus D-loop), and random initialization. Biological plausibility was further assessed by comparing predicted fragments to reconstructed ancestral sequences and against composition-matched null models, where observed identities significantly exceeded null expectations. Using the National Center for Biotechnology Information (NCBI) BLAST web interface, BLASTn species identification was performed solely as a biological plausibility check, recovering the correct species as the top hit in all cases. Although limited by dataset size and the absence of ancient DNA damage modeling, these results demonstrate the feasibility of conditional mtDNA sequence prediction as an initial step toward more advanced generative and evolutionary modeling frameworks.

1. Introduction

Ancient DNA (aDNA) has revolutionized evolutionary biology by enabling the reconstruction of phylogenetic relationships, population histories, and adaptive traits of extinct species. However, the recovery of aDNA is severely constrained by post-mortem degradation processes, including hydrolysis, oxidation, and depurination, which fragment and chemically modify DNA over time. Mitochondrial DNA (mtDNA) has an estimated half-life of ~521 years under temperate, optimal conditions [1], limiting the recovery of authentic sequences to specimens typically younger than 1–2 million years. Consequently, genomic data from Mesozoic organisms such as non-avian dinosaurs, which became extinct over 66 million years ago, remain inaccessible to direct sequencing. Classical paleogenomic studies have primarily focused on Pleistocene and Holocene samples, where partial DNA recovery is still feasible. These efforts rely on next-generation sequencing (NGS), read mapping, and phylogenetic inference to reconstruct partial genomes or impute missing regions based on closely related species [2,3,4]. Traditional approaches to ancestral sequence reconstruction (ASR) employ maximum likelihood (ML) or Bayesian frameworks, implemented in tools such as PAML [5] and BEAST [6]. While robust, these methods depend on the availability of closely related extant sequences and assume predefined evolutionary models, potentially limiting their predictive scope for deeply divergent or extinct lineages. In parallel, deep learning has emerged as a powerful paradigm in computational biology, capable of learning complex, high-dimensional sequence patterns without explicit evolutionary modeling. Convolutional neural networks (CNNs) and recurrent architectures, including bidirectional long short-term memory (BiLSTM) networks, have been successfully applied to promoter prediction, enhancer detection, and variant effect analysis [7,8,9].
More recently, deep learning has achieved breakthroughs in protein structure prediction [10] and RNA secondary structure modeling [11]. Despite this progress, applications to ancestral DNA reconstruction or paleogenomic inference remain largely unexplored. Here, this study presents a proof-of-concept in silico feasibility study that explores whether deep learning models can learn non-random, biologically structured patterns in mitochondrial genomes. The study compiled a dataset of 21 complete vertebrate mtDNA sequences, including archosaurs (birds and crocodilians) and non-archosaur outgroups (turtles, lepidosaurs, and amphibians). The mtDNA was selected as the focus because of its higher copy number per cell, lack of recombination, and relatively small, circular genome, which make it more amenable to recovery and complete assembly than nuclear DNA, especially in degraded or low-yield samples. A CNN-BiLSTM model was trained in a LOSO scheme to perform next-nucleotide prediction on overlapping windows, and the resulting predicted 2000 bp fragments were compared to ancestral sequences inferred with MUSCLE alignments in MEGA X. In this setup, the model predicts each base using the true preceding context from the held-out species; this is not unconditional de novo sequence generation, but a test of whether the learned cross-species sequence dependencies transfer to unseen species. This study is purely computational and does not involve wet-lab experiments or the recovery of actual aDNA. Instead, it provides an initial computational assessment of whether deep learning can generate sequences that retain detectable evolutionary patterns. By bridging machine learning and classical phylogenetics, this exploratory framework introduces AI-assisted paleogenomic inference as a complementary direction for future studies.
This work is a preliminary proof of concept with a deliberately limited taxonomic and sequence scope, intended to assess feasibility rather than provide immediate paleogenomic reconstruction. While purely computational and not involving the generation or analysis of authentic aDNA, the ability to recover non-random, biologically structured patterns in short mtDNA fragments suggests potential future applications in paleogenomics. The core question is therefore not the model’s accuracy as a ‘completer’, but its ability to generalize these learned rules to a species it has never seen, which is what the LOSO design tests.
Beyond its phylogenetic scope, AI-driven reconstruction of short mtDNA fragments could support several domains of biomedical research, including the recovery of evolutionary signal in highly fragmented samples, the study of selective pressures that shape mtDNA variants implicated in human disease, and the augmentation of bioinformatics workflows that operate on incomplete or low-coverage mitochondrial data.
Real-world deployment would require the integration of simulated or empirical aDNA, including post-mortem damage patterns and high fragmentation, which remains a direction for future work.

Related Work

Deep learning has been increasingly applied to sequence modeling in genomics, with early work exploring generative frameworks such as GANs and activation maximization for synthetic DNA design [12] and, more recently, latent diffusion models tailored for discrete nucleotide sequences [13]. In the field of ancestral sequence reconstruction (ASR), recent studies have proposed autoregressive deep models for proteins, capturing epistatic constraints and improving reconstruction accuracy over maximum-likelihood baselines [14,15]. However, these generative ASR efforts have focused almost exclusively on proteins rather than nucleotide sequences. For mtDNA, deep learning has been primarily applied to sequence classification and variant effect prediction, often using hybrid convolutional–recurrent architectures such as DanQ [16], which predicts functional elements like regulatory regions, but not to generative reconstruction of missing fragments. Transformer-based approaches for phylogenetic inference, including Phyloformer [17] and Fusang [18], predict evolutionary distances or tree topologies from alignments but do not generate explicit ancestral or taxon-specific sequences. The potential for deep learning to capture complex biological and environmental patterns is increasingly recognized across diverse scientific domains. In hyperspectral imaging (HSI), Yang et al. (2021) demonstrated that standard CNNs often produce “discontinuous” features due to fixed receptive fields, a problem they solved with the Enhanced Multiscale Feature Fusion Network (EMFFN), which integrates spectral and spatial information across multiple parallel scales [19]. Building on these principles of high-dimensional data processing, HyperSIGMA [20] was presented as a vision transformer-based foundation model scalable to over one billion parameters. HyperSIGMA utilizes a Sparse Sampling Attention (SSA) mechanism to overcome data redundancy.
This poses a challenge directly analogous to capturing the intricate dependencies and redundant motifs within mitochondrial DNA (mtDNA) sequences by intelligently sampling the most informative contextual features [20].
While these advancements highlight the generative and predictive capacity of deep learning in high-dimensional data, their application to evolutionary biology remains a critical frontier. Recent genomic perspectives on the adaptation of Galápagos iguanas illustrate the importance of identifying specific selective sweeps and ancestral signals to understand lineage evolution and divergence [21].
Motivated by these gaps, this study applies a lightweight CNN–BiLSTM architecture to explore the feasibility of conditional mtDNA sequence prediction across species boundaries. Rather than performing phylogenetic inference or de novo genome reconstruction, the proposed framework evaluates whether multiscale feature extraction and context-aware modeling can support the completion of missing mtDNA fragments under controlled conditioning. The model is assessed using a leakage-free evaluation design and alignment-based supporting sanity checks to verify biological plausibility and non-random evolutionary similarity, without claiming phylogenetic placement or inference.
The major contributions of this study are as follows:
(i) a computational proof of concept demonstrating context-conditioned mtDNA sequence prediction under a leave-one-species-out (LOSO) training scheme;
(ii) evidence that deep learning models can capture transferable sequence regularities under controlled conditioning that are shared across species;
(iii) a leakage-free evaluation using masked-span imputation that consistently improves over simple nearest-neighbor heuristics and performs competitively with flank-copy baselines; and
(iv) a validation framework employing composition-matched null models and ancestral alignments as sanity checks for non-random sequence similarity.

2. Materials and Methods

2.1. Data Collection

Complete mitochondrial genome sequences from 21 extant vertebrate species were retrieved from the National Center for Biotechnology Information (NCBI, National Library of Medicine 8600 Rockville Pike Bethesda, MD 20894, USA) GenBank database in FASTA format. The dataset was designed to include primarily archosaur species (birds and crocodilians) alongside non-archosaur reptiles and a single amphibian outgroup to provide phylogenetic context for model evaluation. The species included are listed in Table 1.
Only complete, circularized mitochondrial genomes were included to ensure consistency in sequence length for alignment and modeling. The dataset of 21 species was selected based on the availability of complete, high-quality mitochondrial genomes representing archosaurs (birds and crocodilians) and key non-archosaur outgroups (turtles, lepidosaurs, and a single amphibian, Xenopus laevis). This selection was designed to capture both in-clade conservation and outgroup contrast, providing a phylogenetically informative yet computationally manageable starting point for a resource-constrained proof-of-concept study on standard workstation hardware.

2.2. Data Preprocessing

All raw mtDNA sequences were processed using Biopython v1.83 to ensure compatibility with the deep learning workflow. Sequences were converted to uppercase to maintain uniformity, and any ambiguous nucleotide characters (N) were removed to avoid uncertainty in downstream modeling. Following sequence cleaning, each mtDNA genome was segmented into overlapping windows of 200 bp with a stride of 50 bp to generate partially overlapping fragments that preserved local sequence continuity. These short windows served as the basic inputs for the deep learning model, enabling it to capture local sequence dependencies while increasing the effective dataset size. To generate model predictions at the desired fragment length, the deep learning pipeline was designed to reconstruct 2000 bp predicted sequences for the species left out under the LOSO evaluation scheme. Input windows were one-hot encoded in an L × 4 matrix representation (where L = 200 bp and columns correspond to A, C, G, and T) for CNN-BiLSTM model ingestion. This preprocessing procedure produced a species-dependent set of training windows standardized for model input, enabling the CNN-BiLSTM to learn predictive sequence patterns across all taxa while accommodating interspecies variability. The window size of 200 bp was selected to balance computational efficiency with the capture of sufficient local sequence context. This length is consistent with established deep learning frameworks in genomics, such as DanQ [16] and DeepSEA [22], which utilize 200 bp intervals to model regulatory chromatin profiles effectively. The choice of a 200 bp window size with a 50 bp stride for training was a trade-off between computational feasibility and the ability to capture local mtDNA sequence motifs. A 200 bp context is sufficient to encompass common mitochondrial features such as conserved coding segments and structural motifs, while the 50 bp stride maintains local sequence continuity without excessive data redundancy.
The 2000 bp predicted fragment length was chosen as a fixed comparison window to ensure uniform evaluation across species and to keep Smith–Waterman alignments computationally tractable. While the corresponding MEGA X ancestral sequences were typically much longer, restricting the analysis to the first 2000 bp allowed for consistent benchmarking of all predictions without incurring the quadratic runtime and memory costs associated with aligning entire mitochondrial genomes. The average sequence length was approximately 17,191 bp (range: 15,181 bp for Sphenodon punctatus to 18,905 bp for Boa constrictor). Given the circular nature of the mitochondrial genome, the sliding window yielded a total of 7220 input fragments across the entire dataset. This volume of data provided sufficient coverage for the model to learn localized sequence patterns under the LOSO framework.
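The cleaning, windowing, and encoding steps of this section can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, dropping every non-A/C/G/T character is an assumed reading of the removal of ambiguous bases, and wrapping windows around the end of the sequence is an assumption based on the stated circular treatment of the genome.

```python
from typing import List

BASES = "ACGT"  # column order of the one-hot matrix

def clean(seq: str) -> str:
    """Uppercase the sequence and drop any character that is not A, C, G, or T."""
    return "".join(b for b in seq.upper() if b in BASES)

def circular_windows(genome: str, width: int = 200, stride: int = 50) -> List[str]:
    """Return overlapping windows; windows starting near the end wrap to the start.

    Assumes len(genome) >= width, which holds for complete mitogenomes.
    """
    doubled = genome + genome[:width]  # lets a window run past the last base
    return [doubled[i:i + width] for i in range(0, len(genome), stride)]

def one_hot(window: str) -> List[List[int]]:
    """Encode a window as an L x 4 matrix with columns A, C, G, T."""
    return [[1 if b == base else 0 for base in BASES] for b in window]
```

Under these assumptions, a genome of the average length (17,191 bp) yields 344 windows at a 50 bp stride, which is in line with the roughly 7220 fragments reported for 21 species.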

2.3. Feature Extraction

Each mtDNA window was transformed into a one-hot encoded matrix of size L × 4, where L is the nucleotide sequence length and the four columns represent the bases A, C, G, and T. This encoding treats nucleotides as categorical variables and is widely used in deep learning applications in genomics. No biologically engineered features (e.g., GC content or codon bias) were included, in line with the study’s primary objective: to assess whether intrinsic sequence patterns can be learned directly from raw nucleotide data, without incorporating explicit evolutionary assumptions or handcrafted inputs. To explore this, a CNN-BiLSTM model was implemented using TensorFlow/Keras v2.15. The input consisted of 200 bp one-hot encoded windows (L = 200), generated as described in Section 2.2. For each window, the model was trained to predict the next nucleotide at every position: the input comprised all nucleotides of the window except the final position (positions 1–199), while the targets were the nucleotides immediately following each input position (positions 2–200), enabling next-base prediction across the mtDNA.
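The construction of one training example can be illustrated with a short sketch. It follows the within-window formulation given in Section 2.4 (positions 1–199 as input, positions 2–200 as targets); the function name and the integer class labels are illustrative assumptions, not taken from the study's code.

```python
from typing import List, Tuple

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed class order

def next_base_pair(window: str) -> Tuple[str, List[int]]:
    """Split a window into (input, targets) for next-base prediction.

    The input is every base except the last one; the target at each
    input position is the class index of the base that follows it.
    """
    context = window[:-1]
    targets = [BASE_INDEX[b] for b in window[1:]]
    return context, targets
```

For a 200 bp window this yields a 199-base input and 199 target labels, one per input position.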

2.4. Model Development

Model evaluation followed a LOSO scheme to assess generalization to unseen species. For each species in the dataset (21 complete mtDNA genomes in our runs), all remaining species were used for training, and the excluded species served as the independent test set. Within the training data, 10% of windows were held out for validation, yielding a train/validation/test split in each LOSO iteration: 90% of windows from the training species for fitting, 10% from those same species for validation, and 100% of the left-out species for testing. Input windows were 200 bp in length and one-hot encoded over {A, C, G, T}. Training windows were generated with a 50 bp stride, whereas test windows were generated with a 100 bp stride to reduce redundancy during evaluation. A 200 bp context was chosen to balance local mtDNA motif capture with computational feasibility on a modest workstation. The network architecture was a compact CNN-BiLSTM: a 1D convolutional layer (64 filters, kernel size = 5, ReLU) for local motif extraction; a bidirectional LSTM layer (64 units, returning sequences) to capture bidirectional context; a dropout layer (rate = 0.3) for regularization; and a dense softmax output (4 units) providing per-nucleotide predictions. Models were trained with categorical cross-entropy and Adam (learning rate = 0.001) for up to 10 epochs per LOSO fold, using Early Stopping (patience = 3, restore best weights) and ReduceLROnPlateau (factor = 0.5, patience = 2). For each LOSO test species, within-window sequence predictions (positions 2–200) were concatenated and truncated to 2000 bp to form the computational fragment. 
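The leave-one-species-out split described above can be sketched as a simple generator. This is an illustration under stated assumptions: the function name is hypothetical, the two dictionaries are assumed to hold windows produced beforehand with the 50 bp (training) and 100 bp (testing) strides, and the 10% validation subset is drawn by a seeded random shuffle of windows, which the text does not specify.

```python
import random
from typing import Dict, Iterator, List, Tuple

Fold = Tuple[str, List[str], List[str], List[str]]

def loso_folds(train_windows: Dict[str, List[str]],
               test_windows: Dict[str, List[str]],
               val_fraction: float = 0.1,
               seed: int = 0) -> Iterator[Fold]:
    """Yield (held_out, train, validation, test) for each species in turn."""
    for held_out in sorted(train_windows):
        # Pool windows from every species except the held-out one.
        pool = [w for sp in sorted(train_windows) if sp != held_out
                for w in train_windows[sp]]
        random.Random(seed).shuffle(pool)
        n_val = int(round(val_fraction * len(pool)))
        # Validation comes from the training species only; the test set
        # is the complete window set of the held-out species.
        yield held_out, pool[n_val:], pool[:n_val], list(test_windows[held_out])
```

Because validation windows come from the same species as the training windows, only the test set measures generalization to an unseen species.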
Performance metrics included nucleotide-level accuracy, precision, recall, macro-averaged F1, and percent identity between the predicted fragment and (i) the ground-truth sequence from the left-out species and (ii) reconstructed ancestral sequences (Section 2.5). For an initial all-species snapshot, MUSCLE multiple sequence alignments were paired with maximum likelihood (ML) ancestral reconstruction in MEGA X to provide a well-understood baseline that is easy to audit and reproduce from a graphical environment. For fair leave-one-species-out (LOSO) evaluation, MAFFT was used to re-align the remaining 20 taxa per fold because of its speed and robustness in batch mode, followed by an attachment-node ancestral reconstruction on the reduced set. Tree inference in the LOSO pipeline used a simple, reproducible guide tree (for example, neighbor-joining with midpoint rooting) that is sufficient to locate the ancestral attachment point for the held-out lineage in this proof of concept. The configuration described in this section corresponds to the primary teacher-forced CNN–BiLSTM used for 200 bp next-nucleotide prediction and 2000 bp fragment assembly. A separate, compact CNN–BiLSTM configuration is used for the two-flank masked-span imputation task and is described explicitly in Section 2.7.
To generate a predicted mtDNA fragment of length 2000 bp for a held-out species, we employ a deterministic sliding-window procedure under teacher-forced conditioning. Given the full mitochondrial genome of the held-out species, overlapping windows of fixed length 200 bp are extracted using a constant sampling stride.
For each window, the trained CNN–BiLSTM model performs sequence-to-sequence prediction within the window: the true nucleotide sequence at positions 1–199 is provided as input, and the model predicts the nucleotide identities at positions 2–200. Importantly, the true genomic sequence is always used as input context (teacher forcing), and predicted nucleotides are never fed back into subsequent predictions.
The predicted subsequences (199 bases per window) are concatenated in genomic order to form a long predicted sequence, which is then truncated to obtain a final fragment of exactly 2000 bp. As a result, the reported fragment represents an aggregate of position-wise predictions generated under true-context conditioning across sampled windows, rather than a free-running autoregressive reconstruction. The window sampling strides reported elsewhere (50 bp during training and 100 bp during testing) control the degree of redundancy among windows used for model fitting and evaluation, while fragment assembly itself is performed by concatenating all predicted within-window outputs prior to truncation. Algorithm 1 formalizes this reconstruction procedure.
Algorithm 1: Context-Conditioned mtDNA Fragment Prediction
Input:
  • G: full mtDNA genome sequence of the held-out species (string)
  • L_w: window length (200 bp)
  • S: window sampling stride for the held-out species (e.g., 100 bp in testing)
  • L_out: desired output fragment length (2000 bp)
  • M: trained CNN–BiLSTM model (sequence-to-sequence predictor over window positions)
Output:
  • P: predicted mtDNA fragment of length L_out
Procedure:
  • Initialize empty list P_preds
  • Extract overlapping windows from the held-out genome:
    For start positions i = 0, S, 2S, … while i + L_w ≤ |G|:
    W = G[i : i + L_w]
    One-hot encode W as X_full ∈ {0,1}^{L_w×4}
    Form teacher-forced input sequence (true context within window):
    X_in = X_full [0 : L_w−1] (positions 1..199)
    Predict per-position next-base distribution within the window:
    Ŷ = M(X_in), where Ŷ has length L_w−1 (199 outputs)
    Convert Ŷ to discrete bases (e.g., argmax per position):
    ŷ_seq = decode(Ŷ) (a 199-nt string)
    Append ŷ_seq to P_preds
  • Concatenate predictions in window order:
    P_long = concatenate(P_preds)
  • Truncate (or pad, if needed) to required length:
    P = P_long[0 : L_out]
  • Return P
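Algorithm 1 translates into a few lines of Python. In the sketch below the trained network is replaced by a hypothetical callable `model` that maps a 199-base context string to a 199-base prediction string; in practice this callable would wrap one-hot encoding, the forward pass, and the per-position argmax decoding. The function name and this string-level interface are assumptions made for illustration.

```python
from typing import Callable, List

def predict_fragment(genome: str,
                     model: Callable[[str], str],
                     window: int = 200,
                     stride: int = 100,
                     out_len: int = 2000) -> str:
    """Assemble a predicted fragment under teacher-forced conditioning."""
    preds: List[str] = []
    i = 0
    while i + window <= len(genome):
        true_window = genome[i:i + window]
        context = true_window[:-1]   # true bases at positions 1..window-1
        y_hat = model(context)       # predicted bases at positions 2..window
        if len(y_hat) != window - 1:
            raise ValueError("model must return one base per input position")
        preds.append(y_hat)          # predictions are never fed back as input
        i += stride
    # Concatenate in window order and truncate to the requested length.
    return "".join(preds)[:out_len]
```

With a 100 bp stride, eleven windows (a stretch of 1200 bp) already supply 2189 predicted bases, so the 2000 bp fragment is drawn from the first windows of the genome; the returned string is shorter than `out_len` only if the genome provides too few windows.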

2.5. Biological Validation

To evaluate the biological plausibility and database consistency of the CNN–BiLSTM–predicted mtDNA fragments, we implemented a multi-step in silico validation pipeline. This pipeline comprised multiple sequence alignment, reference-based ancestral reconstruction for contextual comparison, local alignment scoring, and similarity search. These analyses were designed as diagnostic sanity checks to verify that predicted fragments were non-random and compatible with known mitochondrial sequence structure, rather than to perform phylogenetic inference. Figure 1 summarizes the validation workflow.

2.5.1. Multiple Sequence Alignment (MSA)

Complete mitochondrial genomes from all 21 species were aligned using MUSCLE, as implemented in MEGA X v12. Default gap penalties and a codon-independent nucleotide scoring scheme were applied. The resulting alignments were manually inspected to verify positional homology, ensuring the absence of misaligned terminal regions, excessive gap clustering, or spurious insertions/deletions before proceeding to ancestral sequence reconstruction.

2.5.2. Ancestral Sequence Reconstruction

Using the curated MUSCLE alignment, ancestral nodes were inferred in MEGA X under a Maximum Likelihood (ML) framework employing the Tamura–Nei substitution model. Internal nodes corresponding to major phylogenetic splits (e.g., within Archosauria and between outgroup clades) were identified and exported as FASTA sequences. These reconstructed ancestral sequences served as reference points for assessing the phylogenetic plausibility of the machine learning–predicted fragments. Figure 2 illustrates the inferred phylogenetic tree with highlighted ancestral nodes.

2.5.3. Leave-One-Species-Out Ancestral Sequence Reconstruction with MAFFT

To ensure a fair evaluation aligned with the LOSO scheme used for model training, an additional ancestral sequence reconstruction was performed excluding the held-out species from the MSA and tree for each fold. For each species, the remaining 20 mtDNA sequences were aligned using MAFFT v7.526 with automatic strategy selection and support for non-standard symbols. A Neighbor-Joining (NJ) tree was constructed using Biopython’s DistanceTreeConstructor with an identity-based distance matrix and midpoint rooting. To identify the equivalent internal node for reconstruction (approximating the held-out species’ parent), we first built a reference NJ tree from all 21 species and computed the descendant leaf set of each species’ parent node (excluding the species itself). In the LOSO-pruned tree, the internal node with the closest matching descendant set (exact match preferred; otherwise maximized Jaccard similarity) was selected. Fitch parsimony was then applied to reconstruct the ancestral sequence at this node, resolving ambiguities by preferring bases from the Fitch candidate set or the most frequent base in the column. The model’s 2000 bp predicted fragment for the held-out species was compared to this LOSO-reconstructed ancestral sequence using Smith–Waterman local alignment.
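The node-matching and parsimony steps can be sketched as follows. This is a simplified illustration, not the study's implementation: trees are represented as nested two-element tuples of leaf names (a rooted binary tree is assumed), only the bottom-up Fitch pass for a single alignment column is shown, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two descendant leaf sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def best_matching_node(target: Set[str], nodes: Dict[str, Set[str]]) -> str:
    """Pick the internal node whose leaf set best matches the target.

    An exact match has similarity 1.0 and therefore always wins.
    """
    return max(sorted(nodes), key=lambda name: jaccard(target, nodes[name]))

def fitch_set(tree, column: Dict[str, str]) -> Set[str]:
    """Bottom-up Fitch candidate set for one column at the root of `tree`."""
    if isinstance(tree, str):          # leaf: its observed base
        return {column[tree]}
    left, right = (fitch_set(child, column) for child in tree)
    common = left & right
    return common if common else left | right

def resolve(candidates: Set[str], column: Dict[str, str]) -> str:
    """Break ties by choosing the candidate most frequent in the column."""
    counts = Counter(column.values())
    return max(sorted(candidates), key=lambda base: counts[base])
```

Applying `fitch_set` and `resolve` to every column of the subtree rooted at the selected node gives one ancestral base per alignment position.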
MUSCLE and MAFFT are both widely used for whole-mitogenome alignment and produce highly similar columns for conserved regions. ML in MEGA X provides a single, consistent all-species reference, while an NJ guide on reduced sets minimizes runtime and scripting complexity during LOSO. The biological validation in this study focuses on short, contiguous fragments and species-level identification rather than on model choice testing per se. The conclusions within this validation context rely on relative comparisons.
To assess whether methodological choices drive the results, sensitivity checks were performed by re-running selected folds with alternative settings: (i) MUSCLE in place of MAFFT in the LOSO pipeline, (ii) ML trees in place of NJ for the reduced 20-taxon sets, and (iii) recomputation of LOSO ancestral sequences under these variants. For each variant, predicted fragments were re-scored against the corresponding LOSO ancestor with the same Smith–Waterman parameters, and BLAST species identification was repeated under the fixed megablast configuration. A zero gap-open penalty was intentionally chosen for the Smith–Waterman local alignments to avoid penalizing short indels or alignment ‘jitter’ potentially introduced at the boundaries of the 200 bp sliding-window predictions. This setting, paired with a −2.5 extension penalty, prioritizes the identification of homologous blocks while remaining permissive of local uncertainties inherent in this window-based proof of concept. Importantly, identical parameters were applied to all DNA-mimicking nulls and baselines to ensure a controlled statistical comparison.

2.5.4. Local Alignment (Smith-Waterman)

Each predicted fragment, generated under the LOSO scheme, was locally aligned against the corresponding MEGA-inferred ancestral sequence using the Smith–Waterman (SW) algorithm. SW was implemented with Parasail using match/mismatch = 1/−2 and affine gaps (open = 0, extend = −2.5). Percent identity and alignment scores were calculated to assess whether predicted sequences shared detectable homology with ancestral nodes beyond random expectation. Both forward and reverse complement orientations were evaluated to capture the best possible local match. As an orthogonal validation step, all predicted sequences were queried using BLASTn against the NCBI nt database, restricted to vertebrate mitochondrial sequences. High-scoring segment pairs (HSPs) corresponding to the expected species or clade confirmed that the CNN-BiLSTM-generated fragments were consistent with known mitochondrial sequence structure, even in the absence of explicit evolutionary priors during model training. As a basic control, each predicted fragment was also compared against a randomly shuffled version of its corresponding ancestral sequence. This null model produced a baseline local identity of approximately 25% (Section 2.5.6), consistent with a uniform nucleotide distribution, supporting that the CNN–BiLSTM predictions reflect conserved sequence regularities rather than random composition.
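To make the scoring scheme concrete, the sketch below computes a Smith–Waterman score with affine gaps in pure Python and takes the better of the two strand orientations. It is a didactic stand-in for the optimized library used in the study and returns the score only, without a traceback. One convention is assumed here and should be checked against the library in use: the first position of a gap is scored with `gap_open` and every further position with `gap_extend`, so that open = 0 makes a single-base gap free.

```python
def sw_score(a: str, b: str, match: float = 1.0, mismatch: float = -2.0,
             gap_open: float = 0.0, gap_extend: float = -2.5) -> float:
    """Best local alignment score (Gotoh recursion, linear memory)."""
    neg = float("-inf")
    prev_h = [0.0] * (len(b) + 1)
    prev_f = [neg] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        cur_h = [0.0] * (len(b) + 1)
        cur_f = [neg] * (len(b) + 1)
        e = neg  # best score ending in a gap that consumes bases of b
        for j in range(1, len(b) + 1):
            e = max(cur_h[j - 1] + gap_open, e + gap_extend)
            # best score ending in a gap that consumes bases of a
            cur_f[j] = max(prev_h[j] + gap_open, prev_f[j] + gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(0.0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def best_orientation_score(pred: str, ref: str, **params: float) -> float:
    """Score both orientations of the prediction and keep the higher one."""
    return max(sw_score(pred, ref, **params), sw_score(revcomp(pred), ref, **params))
```

Under the assumed convention, aligning ACGT against AGT scores 3 with open = 0 (the single-base gap costs nothing) but only 2 with open = −5, which shows how permissive the zero gap-open setting is toward short indels.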

2.5.5. Sensitivity to Alignment and Tree/ASR Choices

To quantify how methodological choices affect the predicted-to-ancestor identity, we compared two pipelines on the same 21-species panel. The MEGA arm used the curated MUSCLE alignment and maximum-likelihood (Tamura–Nei) tree from the original analyses to obtain the ancestral sequence for each species. The LOSO arm re-aligned the remaining 20 species with MAFFT (--auto) and built a Neighbor-Joining (NJ) tree using an identity-based distance in Biopython. To target the same ancestral position as the excluded taxon, we computed the descendant leaf set of that taxon’s parent in a 21-species reference NJ tree and, in each LOSO tree, selected the internal node with the highest Jaccard similarity of descendant sets. We then reconstructed the sequence at that node with Fitch parsimony. Predicted fragments were scored against both ancestors with Smith–Waterman (Biopython PairwiseAligner; match = 1, mismatch = −2, gap-open = 0, gap-extend = −2.5), and we report gap-inclusive % identity (matches/alignment columns), alignment length (including gap columns), and alignment score. The objective here is relative sensitivity (Δ%ID between arms) rather than a new absolute benchmark; both arms are evaluated with the same scoring engine and parameters.

2.5.6. DNA-Mimicking Null Simulations

To address potential bias from Smith–Waterman selecting the best local match (potentially inflating identities above a 25% random baseline), we simulated two DNA-mimicking null models using only the training-species mtDNA (excluding the held-out species). The first null trained an order-4 (k = 5) Markov model on nucleotide frequencies and transitions, sampling 2000 bp sequences. The second null extracted random 2000 bp circular windows from training mtDNA. For each species, we computed Monte Carlo p-values as the proportion of 2000 null samples with Smith–Waterman %ID ≥ observed (vs. the LOSO ancestral sequence), using adaptive sampling (up to 2000 per null) with early stopping if the one-sided 99.9% Wilson confidence interval excluded α = 0.05. p-values were FDR-corrected (Benjamini–Hochberg) across species.
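The two null generators and the Monte Carlo p-value can be sketched as below. The sketch makes several labeled assumptions: function names are hypothetical, the Markov chain is seeded with a randomly chosen observed context, an unseen context falls back to the distribution of a random observed context, and the adaptive early-stopping rule based on the Wilson interval is omitted for brevity.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

def train_markov(seqs: Sequence[str], order: int = 4) -> Dict[str, Counter]:
    """Count next-base occurrences for every context of length `order`."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts

def sample_markov(counts: Dict[str, Counter], length: int,
                  order: int = 4, seed: int = 0) -> str:
    """Sample one null sequence from the fitted Markov model."""
    rng = random.Random(seed)
    contexts = sorted(counts)
    out: List[str] = list(rng.choice(contexts))
    while len(out) < length:
        nxt = counts.get("".join(out[-order:]))
        if not nxt:  # unseen context: assumed fallback to a random seen one
            nxt = counts[rng.choice(contexts)]
        bases = sorted(nxt)
        out.append(rng.choices(bases, weights=[nxt[b] for b in bases])[0])
    return "".join(out[:length])

def random_circular_window(genome: str, length: int, rng: random.Random) -> str:
    """Draw a window at a random start, wrapping around the genome end."""
    start = rng.randrange(len(genome))
    return (genome + genome)[start:start + length]

def mc_p_value(observed: float, null_values: Sequence[float]) -> float:
    """Proportion of null samples scoring at least as high as the observation."""
    return sum(v >= observed for v in null_values) / len(null_values)
```

In use, each null sequence would be aligned to the LOSO ancestral sequence with the same Smith–Waterman parameters as the predicted fragment, and the resulting identities passed to `mc_p_value`.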

2.5.7. Statistical Evaluation of Sequence Identity

To assess whether the predicted fragments retained significantly greater identity to reconstructed ancestral nodes than expected by chance, a one-sided binomial test was performed. For each species, the observed number of matching bases was derived from the percent identity between the 2000 bp predicted fragment and the corresponding MEGA X ancestral fragment. The null hypothesis assumed a baseline identity of 25%, corresponding to random base guessing under uniform nucleotide distribution. P-values were computed using scipy.stats.binomtest (SciPy v1.11.4). All predicted fragments exhibited significantly higher identity than the random baseline (p < 0.001). As an additional control, a 2-mer (first-order Markov) baseline model was applied under the same LOSO scheme.
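The one-sided binomial test reduces to an upper-tail probability, which the following dependency-free sketch evaluates in log space to avoid underflow for n = 2000. It is intended only to make the computation explicit; with SciPy available, `scipy.stats.binomtest(k, n, p, alternative="greater")` is the corresponding call. The helper names and the rounding of percent identity to a match count are assumptions.

```python
import math

def binom_upper_tail(k: int, n: int, p: float = 0.25) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    logs = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    top = max(logs)
    total = math.exp(top + math.log(sum(math.exp(x - top) for x in logs)))
    return min(1.0, total)

def identity_p_value(percent_identity: float, length: int = 2000,
                     baseline: float = 0.25) -> float:
    """Convert a percent identity into a match count and test it."""
    matches = int(round(percent_identity / 100.0 * length))
    return binom_upper_tail(matches, length, baseline)
```

Note that this test treats positions as independent and uses the uniform 25% baseline; the composition-matched null models of Section 2.5.6 provide the more conservative comparison.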
To strengthen the comparison beyond the 2-mer Markov baseline, we evaluated additional baselines using only training-species mtDNA: (i) k-mer language models (k = 3–6, order 2–5 Markov) sampling 2000 bp sequences, (ii) position-specific scoring matrices (PSSM, akin to profile-HMM) from LOSO MSAs (MAFFT-aligned training species), sampling sequences and reporting best/mean %ID over 50 samples, (iii) per-position consensus (majority vote) from LOSO MSAs, and (iv) nearest-neighbor copy (2000 bp window from the training species with highest full-mtDNA identity to the held-out species, selected via random fixed seed). For each baseline, we computed Smith–Waterman %ID vs. the LOSO ancestral sequence, with Monte Carlo p-values (p_obs = Pr[baseline %ID ≥ observed]) for k-mer models (up to 2000 samples). We post-processed results to add lift (%ID observed minus baseline mean) and z-scores for k = 5 and k = 6, significance flags (p_obs ≤ 0.05), FDR q-values (Benjamini–Hochberg), and aggregate metrics (median lift, mean z, Stouffer’s Z with one-sided p).
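The post-processing quantities named above (lift, z-score, Benjamini–Hochberg q-values, and Stouffer's Z) can be computed with the standard library alone. The sketch below is illustrative; function names are hypothetical, and the z-score is assumed to use the sample standard deviation of the baseline identities.

```python
import math
from statistics import NormalDist, fmean, stdev
from typing import List, Sequence, Tuple

def lift_and_z(observed: float, baseline: Sequence[float]) -> Tuple[float, float]:
    """Lift (observed minus baseline mean) and the corresponding z-score."""
    lift = observed - fmean(baseline)
    return lift, lift / stdev(baseline)

def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """FDR q-values, returned in the same order as the input p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q

def stouffer(zs: Sequence[float]) -> Tuple[float, float]:
    """Unweighted Stouffer combined Z and its one-sided p-value."""
    z = sum(zs) / math.sqrt(len(zs))
    return z, 1.0 - NormalDist().cdf(z)
```

For example, the p-values 0.01, 0.04, 0.03, and 0.005 map to q-values 0.02, 0.04, 0.04, and 0.02, and four per-species z-scores of 1 combine to a Stouffer Z of 2.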

2.5.8. Definition of Sequence Identity Metrics

To comprehensively evaluate the predicted fragments, this study employs three distinct percent identity (%ID) metrics, each suited for a specific analytical purpose. Readers should note that the values generated by these metrics are calculated differently and are not directly comparable.
  • Ungapped, Equal-Length Identity: Used for initial comparisons against the ground-truth sequence and the all-species ancestral nodes. This is calculated as (Matches/SequenceLength) × 100. It provides a straightforward measure of nucleotide similarity without considering insertions or deletions.
  • Gap-Excluded Smith-Waterman Identity: Used in the fair LOSO validation against the re-calculated ancestors. This is calculated as (Matches/(Matches + Mismatches)) × 100. This metric focuses on the accuracy of the aligned regions only, ignoring gaps, which is useful for assessing the conservation of homologous blocks.
  • Gap-Inclusive Smith-Waterman Identity: Used for the DNA-mimicking null simulations. This is calculated as (Matches/TotalAlignmentColumns) × 100. By including gap columns in the denominator, this metric penalizes for insertions and deletions, providing the most stringent assessment of overall alignment quality. This was chosen for the null comparison because it holistically accounts for the structure that the Smith-Waterman algorithm optimizes.
The three identity formulations address different biological and algorithmic questions, so each is paired to a specific comparison. Ungapped, equal-length %ID quantifies pure substitutional agreement at a fixed locus and is alignment-free; it avoids artifacts from variable alignment lengths and is appropriate when the intended span length is fixed or when sequences have been trimmed to equal length. Gap-excluded SW %ID isolates conservation within confidently aligned blocks by removing gaps from the denominator; this emphasizes base substitutions while reducing sensitivity to lineage-specific indel processes or to local-alignment boundary effects, which is desirable in LOSO validation against re-estimated ancestors. Gap-inclusive SW %ID measures overall alignment fidelity, penalizing insertions/deletions by counting all alignment columns; this more stringent score is well suited for null comparisons, where gappy alignments could otherwise inflate match rates. Because denominators differ (span length vs. matched + mismatched bases vs. total alignment columns), the absolute values are not directly comparable across metrics; conclusions should be drawn within a metric, chosen to match the scientific question and the expected indel behavior of the comparison.
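The difference between the three denominators is easiest to see on a small example. The sketch below is illustrative only: the function names are hypothetical, and the gapped rows stand in for the output of a Smith–Waterman alignment, which is not performed here.

```python
def ungapped_identity(seq_a, seq_b):
    """(Matches / SequenceLength) x 100 for equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def sw_identities(row_a, row_b):
    """Gap-excluded and gap-inclusive %ID from two gapped alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gaps += 1          # gap column
        elif a == b:
            matches += 1       # base-base match
        else:
            mismatches += 1    # base-base mismatch
    gap_excluded = 100.0 * matches / (matches + mismatches)
    gap_inclusive = 100.0 * matches / (matches + mismatches + gaps)
    return gap_excluded, gap_inclusive


# Toy alignment: 7 matches, 1 mismatch, 2 gap columns.
excluded, inclusive = sw_identities("ACGT-ACGTA", "ACGTTAC-TT")
```

On this toy alignment the same seven matches give 87.5% under the gap-excluded definition but 70.0% under the gap-inclusive one, which is why values are compared only within a metric.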

2.6. Robustness to Starting Position and Region Stratification

To verify that model performance was not an artifact of specific genomic coordinates or regional biases, we conducted two complementary robustness evaluations. First, positional invariance was assessed via randomized circular shifting of predicted 2000 bp fragments to ensure that results were not dependent on an arbitrary window start position. Specifically, each predicted fragment was circularly shifted (N = 10 shifts per species) and re-aligned to the corresponding LOSO ancestral sequence using Smith–Waterman local alignment. Comparable alignment statistics across shifts indicated that the observed non-random similarity was robust to starting-position selection.
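The circular-shift step can be sketched as below. The helper names, the seed, and the choice to draw offsets without replacement are assumptions made for illustration; the subsequent Smith–Waterman re-alignment is not shown.

```python
import random


def circular_shift(seq, offset):
    """Rotate a sequence so that position `offset` becomes the new start."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def random_shifts(seq, n_shifts=10, seed=0):
    """Return (offset, shifted sequence) pairs for distinct random offsets."""
    rng = random.Random(seed)
    offsets = rng.sample(range(1, len(seq)), n_shifts)
    return [(off, circular_shift(seq, off)) for off in offsets]


fragment = "ACGTTGCA" * 5  # toy 40 bp fragment, not real data
shifted = random_shifts(fragment)
```

Each shifted copy contains the same bases in the same cyclic order, so any change in alignment statistics reflects the start position alone.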
Second, a regional stratification analysis was performed to assess whether model performance was confined to particular mitochondrial regions. Using GenBank annotations retrieved via Entrez, base-wise masks were constructed for the control region (D-loop) and coding regions. These masks were projected onto the raw mtDNA sequences via MAFFT pairwise alignments, and predicted fragment hits were identified using Bio.Align.PairwiseAligner, with regions assigned based on a ≥70% overlap criterion. This analysis confirmed that conditional prediction performance was not restricted to a single genomic region but extended across both regulatory and protein-coding portions of the mitochondrial genome.
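The ≥70% overlap rule can be illustrated with base-wise boolean masks. This is a sketch under stated assumptions: the function name, the 0-based end-exclusive coordinates, and the "mixed" fallback label are hypothetical, and the projection of masks through MAFFT alignments is not reproduced.

```python
def label_region(hit_start, hit_end, dloop_mask, coding_mask, threshold=0.70):
    """Label a hit by the fraction of its bases inside each annotated mask."""
    span = hit_end - hit_start
    frac_dloop = sum(dloop_mask[hit_start:hit_end]) / span
    frac_coding = sum(coding_mask[hit_start:hit_end]) / span
    if frac_dloop >= threshold:
        label = "D-loop"
    elif frac_coding >= threshold:
        label = "coding"
    else:
        label = "mixed"  # assumed fallback when neither mask reaches 70%
    return label, frac_dloop, frac_coding


# Toy 100 bp genome: first 20 bases control region, the rest coding.
dloop = [i < 20 for i in range(100)]
coding = [i >= 20 for i in range(100)]
result = label_region(15, 100, dloop, coding)
```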

2.7. Two-Flank Masked-Span Imputation

Two-flank masked-span imputation of mitochondrial DNA was evaluated under a leave-one-species-out (LOSO) protocol across 21 vertebrate species (one FASTA per species). This task constitutes the most stringent and methodologically critical evaluation in the study, as it requires fully free-running reconstruction of long mtDNA spans in the absence of ground-truth conditioning. Sequences were upper-cased, U was converted to T, non-ACGT characters were replaced with A, and genomes were treated as circular for slicing. For each fold, one species was held out for evaluation and models were trained on all remaining species.
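The preprocessing and circular slicing described above amount to a few string operations. The following is a minimal sketch with hypothetical function names; FASTA parsing is omitted.

```python
def normalize(seq):
    """Upper-case, map U to T, and replace any non-ACGT character with A."""
    seq = seq.upper().replace("U", "T")
    return "".join(c if c in "ACGT" else "A" for c in seq)


def circular_slice(genome, start, length):
    """Take `length` bases from `start`, wrapping past the genome end."""
    n = len(genome)
    return "".join(genome[(start + i) % n] for i in range(length))


clean = normalize("acgunRy")
window = circular_slice("ACGTAC", 4, 4)  # wraps around the end of the genome
```

Replacing ambiguous characters with A keeps every position encodable as one of four classes, at the cost of introducing a small number of artificial A bases.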
Two compact TensorFlow/Keras next-base predictors were used: a forward model (F) that predicts left → right and a reverse model (R) trained on reverse-complemented windows so that, at test time, it predicts right → left from the right flank; R outputs were reverse-complemented back to forward orientation. Both models shared the architecture Conv1D (48, k = 7, ReLU) → Conv1D (48, k = 5, dilation = 2, ReLU) → BiLSTM (48, return_sequences) → Dropout (0.25) → Dense (4, softmax), optimized with categorical cross-entropy (label smoothing 0.05) and Adam (learning rate 1 × 10⁻³; clipnorm 1.0).
Training employed on-the-fly random 400-nt windows (left context 399 nt) via tf.data, with denoising (5% random base flips confined to the left context) and context dropout (retaining a random trailing tail of 200–399 nt and zero-padding the remainder) to mitigate exposure bias. The “FAST” budget per fold was batch size 128, 6 epochs, 80 steps per epoch (validation 12 steps), early stopping (patience 3) and ReduceLROnPlateau (patience 2, factor 0.5); seeds were fixed for Python (v 3.13.5)/NumPy (v 2.3.2)/TensorFlow (v 2.20.0-rc0) and CPU multithreading was enabled.
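The two augmentations applied to the left context can be sketched without TensorFlow. Several details below are assumptions made for illustration: a "flip" is taken to mean substitution by a different base, padding is represented as all-zero one-hot rows on the side farthest from the target, and the function name is hypothetical. The actual tf.data pipeline is not reproduced.

```python
import random

BASES = "ACGT"


def augment_context(context, rng, flip_rate=0.05, min_keep=200):
    """Apply denoising flips, then context dropout, to a left-context string."""
    bases = list(context)
    # Denoising: substitute a different base at roughly flip_rate of positions.
    for i, b in enumerate(bases):
        if rng.random() < flip_rate:
            bases[i] = rng.choice([x for x in BASES if x != b])
    # Context dropout: keep only a trailing tail adjacent to the target base.
    keep = rng.randint(min_keep, len(bases))
    n_pad = len(bases) - keep
    one_hot = [[0.0] * 4 for _ in range(n_pad)]  # zero rows = dropped context
    one_hot += [[1.0 if x == b else 0.0 for x in BASES] for b in bases[n_pad:]]
    return one_hot, keep


rng = random.Random(1)
left_context = "ACGT" * 99 + "ACG"  # toy 399-nt context
encoded, kept = augment_context(left_context, rng)
```

Exposing the model to corrupted and shortened contexts during training is intended to make it less brittle when, at test time, it must condition on its own earlier predictions.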
In the held-out genome, three fixed, deterministic anchor positions defined masked regions, and 500 bp spans were imputed. The F reconstruction was generated by free-running from the left flank; the R reconstruction was generated by free-running in reverse-complement space from the right flank and reverse-complemented back; a per-base consensus (CONS) selected between F and R by higher predicted probability (ties to F).
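The orientation handling and the per-base consensus rule can be sketched as follows. Function names and the toy probabilities are hypothetical; the model calls that would produce the bases and probabilities are not shown.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus(f_bases, f_probs, r_bases, r_probs):
    """Pick, per position, the base with the higher probability (ties to F)."""
    out = []
    for fb, fp, rb, rp in zip(f_bases, f_probs, r_bases, r_probs):
        out.append(rb if rp > fp else fb)
    return "".join(out)


# R predictions are assumed to be already mapped back to forward orientation.
cons = consensus("AAAA", [0.9, 0.5, 0.4, 0.6],
                 "CCCC", [0.8, 0.5, 0.7, 0.6])
```

In this toy case only the third position takes the R base, because that is the only position where R is strictly more confident than F.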
Two baselines were evaluated: Nearest-Neighbor (NN), which selected the training species with maximum 3-mer cosine similarity computed over the full mtDNA and copied the 500 bp span from that genome at the matched circular start; and Flank-Copy (FlankCopy), which tiled the left or right flank, reporting the higher-identity result. The primary metric was equal-length percent identity (character-wise identity without gaps); for detailed inspection only (not used for headline comparisons or the 5-column summary), global Needleman–Wunsch percent identity was also computed with match +2, mismatch −1, gap-open −5, and gap-extend −1. For each species, results were summarized as the median across the three masked spans in a compact table.
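Both baselines and the primary metric are simple enough to sketch with the standard library. The helper names are hypothetical, k-mer counting here ignores circular wrap-around, and the tiling direction for the flank is an assumption; the sketch illustrates the idea rather than the exact procedure used.

```python
import math
from collections import Counter


def kmer_profile(seq, k=3):
    """Count overlapping k-mers (linear, no circular wrap)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(p, q):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def nearest_species(target, training):
    """Name of the training genome most similar to the target by 3-mers."""
    tp = kmer_profile(target)
    return max(training, key=lambda name: cosine(tp, kmer_profile(training[name])))


def tile_flank(flank, span_len):
    """Repeat the flank until it covers the masked span, then truncate."""
    reps = -(-span_len // len(flank))  # ceiling division
    return (flank * reps)[:span_len]


def percent_identity(a, b):
    """Equal-length, character-wise percent identity."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


training = {"sp1": "ACGTACGTACGT", "sp2": "TTTTTTTTTTTT"}  # toy genomes
best = nearest_species("ACGTACGAACGT", training)
```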

3. Results

Section 3 reports results from evaluating a compact CNN–BiLSTM framework for context-conditioned reconstruction of fixed-length mtDNA fragments under a leave-one-species-out (LOSO) design. Section 3.1 summarizes next-nucleotide performance (accuracy, macro-F1) and equal-length identity to the corresponding 2000 bp region of each held-out genome. Section 3.2 provides biological validation via Smith–Waterman alignment to maximum-likelihood ancestral nodes and BLASTn species checks. Section 3.3 calibrates identities against DNA-mimicking nulls. Section 3.4 compares predictions to a 2-mer Markov baseline. Section 3.5 extends baseline analyses to consensus-of-training, profile/PSSM, and nearest-neighbor copy models. Section 3.6 assesses robustness to starting position and regional stratification (D-loop vs. coding). Section 3.7 evaluates sensitivity to alignment engine, tree inference, and ancestral-state reconstruction choices. Section 3.8 presents two-flank masked-span imputation as the primary leakage-free evaluation, providing the most stringent test of free-running conditional reconstruction. Taken together, the results indicate that, under leave-one-species-out conditioning, the CNN–BiLSTM framework is likely to recover mtDNA fragments that retain detectable evolutionary patterns within the intended scope of this proof of concept. Throughout the Results, each percent identity metric is reported and interpreted exclusively within the specific evaluation context for which it is defined, and numerical comparisons are made only within the same identity formulation.

3.1. Model Performance on Next-Nucleotide Prediction

Under teacher-forced conditioning, the CNN-BiLSTM model demonstrated consistently high predictive performance across all 21 species in LOSO evaluation. In this sliding-window next-nucleotide prediction task, accuracy remained above 99.4% for all species, with macro-averaged precision, recall, and F1 scores closely matching overall accuracy (Table 2). The percent identity (%ID) of reconstructed sequences to their ground-truth sequences over the first 2000 bp ranged from 99.55% to 99.75%, with correspondingly low mismatch counts (5–9 per 2000 bp segment). These results indicate that the model effectively captures short-range sequence dependencies and reproduces species-specific nucleotide patterns with minimal error.

3.2. Biological Validation of Predicted Sequences

To assess whether the deep learning–predicted mtDNA fragments exhibited non-random similarity to reference mitochondrial sequences, each 2000 bp fragment generated under the LOSO scheme was subjected to a two-step in silico validation pipeline. First, predicted sequences were locally aligned to their corresponding MEGA X–inferred ancestral nodes using the Smith–Waterman algorithm (Table 3). Local percent identity ranged from 63.73% (Caiman crocodilus) to 67.31% (Accipiter gentilis), substantially exceeding both the ~26% identity observed for a 2-mer Markov baseline and the ~25% random baseline (p < 0.001; Section 2.5.6). Mismatch counts (654–725) are reported relative to the best local alignment, not full-length genome comparison. These results indicate that the CNN-BiLSTM preserved detectable non-random evolutionary similarity in short mtDNA fragments. As an additional biological plausibility check, the predicted 2000 bp mitochondrial fragments were queried against the NCBI database using BLASTn. Identification was repeated under a fixed megablast configuration, with the search space restricted to vertebrate mitochondrial sequences to focus the evaluation on relevant evolutionary targets. Across a representative vertebrate panel (encompassing birds, crocodylians, additional reptile lineages, and an amphibian), all predicted fragments returned the expected species as the top hit (21/21 cases; Supplementary Table S1). Searches were performed against the core_nt database (updated 31 August 2025) using a word size of 28, an E-value cutoff of 0.05, match/mismatch scores of 1/−2, gap costs of 0/2.5, and low-complexity filtering (DUST; mask for lookup table only, filter string L;m;). Top hits typically exhibited 94–100% query coverage and 97.29–100.00% BLAST-reported pairwise identity, with E-values ranging from approximately 1 × 10⁻⁹⁸ to 2 × 10⁻⁹⁶ and maximum scores between 368 and 375, consistent with standard mitogenome alignments.
Representative examples include Accipiter gentilis (99% coverage, 97.29% identity), Alligator mississippiensis (99% coverage, 100.00% identity), and Gekko gecko (99% coverage, 100.00% identity). Where informative, secondary hits corresponded to the closest non-target references. In specific cases, such as Gallus gallus, both primary and secondary hits represented intra-specific database entries. Lower apparent coverage values (e.g., 26%) occurred when the BLAST algorithm prioritized statistically superior alignments to partial mitochondrial regions—such as isolated D-loop entries—over alignments to complete mitogenomes. Complete results are provided in Supplementary Table S2. These BLASTn results are reported solely as sanity checks confirming the biological plausibility and database consistency of the predicted fragments under conditional prediction; they are not interpreted as evidence of phylogenetic inference or broader evolutionary generalization.
Table 3 reports ungapped percent identity (%ID) and mismatch counts for equal-length (2000 bp) sequence comparisons, as described in Section 2.5.3. In the initial run, gap-related metrics such as alignment length, number of gaps, and gap-open events were not calculated, as the analysis focused on direct nucleotide similarity without considering gaps. Consequently, these metrics were omitted from Table 3 by design. Table 4 reports identities between the predicted ~2 kb fragments and the LOSO ancestral sequences reconstructed after removing the test species from the alignment and tree. Using a gap-excluded Smith–Waterman identity (matches divided by base–base aligned pairs), local identities span 58.04% (Gekko gecko) to 88.95% (Dromaius novaehollandiae), with a mean of 63.82% across the 21 species. Although the identity definition here differs from the all-species snapshot in Table 3 (which uses ungapped, equal-length position-wise identity), the central tendency is similar (≈63.73–67.31%), indicating that the observed similarity is not dependent on inclusion of the test taxon in the ancestral reconstruction. The elevated Dromaius value reflects SW aligning a shorter, unusually low-divergence core (1186 base–base columns), which raises gap-excluded %ID relative to species with longer aligned cores. Minor per-species deviations are expected from (i) changes in tree context/branch lengths when one taxon is removed, (ii) SW’s local trimming (which can raise or lower %ID relative to equal-length comparisons), and (iii) small differences in homologous coverage near indel-rich regions. Overall, concordance between LOSO and all-species identities supports the interpretation that the predicted fragments retain non-random, phylogenetically consistent information under a fair, leakage-free ancestral reconstruction.
Sensitivity analyses indicated that toolchain choice did not alter the qualitative conclusions. Replacing MAFFT with MUSCLE in the LOSO pipeline and replacing NJ with ML on the reduced sets yielded identities within a narrow band relative to the primary configuration, with no change to the 21/21 BLAST top-hit assignments. Minor per-species fluctuations are consistent with expected differences in gap placement and branch-length estimation rather than systematic bias.
Note on definitions. Table 4 uses gap-excluded SW %ID and reports the corresponding SW base–base alignment length. Table 3 reports ungapped, equal-length identities. Because the denominators differ, absolute values are not directly comparable; SW local trimming can increase or decrease %ID relative to equal-length metrics. (Table 4 also lists MSA columns and MSA gap columns for the LOSO multiple alignment; these are not used to compute SW %ID).

3.3. Results for the DNA-Mimicking Null Simulations

Table 5 summarizes, for each LOSO fold, the observed Smith–Waterman identities of the predicted fragments to their LOSO ancestors (50.18–53.50%) alongside two DNA-mimicking backgrounds. For the k-mer language-model nulls (order-(k − 1), k ≈ 5–6), the null distributions are tightly concentrated, with means of approximately 50.4–50.6% (σ ≈ 0.5%). Consequently, raw excesses of +0.5 to +3.1 percentage points correspond to non-trivial tail probabilities; per-species empirical p-values ranged up to 0.17, with Benjamini–Hochberg–adjusted q-values ≤ 0.188. Although multiple-testing correction attenuates formal significance for some species, effect sizes are positive in nearly all cases, and the direction of deviation is consistent with enrichment beyond composition- and context-matched chance levels.
For the real-DNA background (length-matched windows sampled from non-homologous vertebrate mtDNA), inference is based on empirical percentiles and confidence intervals derived from the Monte Carlo null. Across species, empirical tail probabilities reach the minimum attainable value under the discrete null (p ≈ 1/(N + 1); ≈ 5 × 10⁻⁴ when N = 2000), resulting in strong saturation effects. Under these conditions, percentile-based reporting provides a more informative and interpretable summary of deviation from the null than multiple-testing correction. Accordingly, we report empirical percentiles and confidence intervals as the primary inferential measures for the real-DNA background.
In LOSO runs, SW alignment length reflects the number of alignment columns, i.e., L = M + X + G, where M is the number of matches, X the number of mismatches, and G the number of gap columns. When the LOSO ancestral sequence contains insertions relative to the ~2 kb prediction, SW introduces gaps on the query side so those extra ancestral bases can be aligned, which increases G and therefore L beyond ~2000 bp even though the number of non-gap query letters aligned never exceeds ~2000. Because our DNA-mimicking nulls (k-mer LMs and real-DNA windows) are aligned with the same SW parameters and the same definition of L, any length inflation from gap handling is present in both observation and null. As a direct consequence, the SW percent identity %ID_SW = M/L is lower than an ungapped, position-wise identity (the denominator includes G), yet comparisons to the null remain fair: z-scores and percentiles measure excess identity beyond what SW would achieve by chance on composition- and context-matched DNA. Put simply, the difference between the ~2 kb query length and the larger SW alignment length arises from gap columns introduced by SW, not from extra predicted bases; and because our simulation nulls use the identical alignment and scoring, this effect is built into the null and does not inflate significance.

3.4. Baseline Comparison

To assess whether the CNN-BiLSTM model outperforms trivial sequence models, a 2-mer (first-order Markov) baseline was applied using a LOSO scheme. For each excluded species, a 2000 bp fragment was generated from the 2-mer model trained on the remaining species and compared to the closest MEGA X ancestral node. Baseline fragments showed only ~26–27% identity to the corresponding ancestral sequences, consistent with near-random performance. Results are shown in Table 6.
By contrast, CNN–BiLSTM predictions achieved 63.7–67.3 percent identity to the same ancestral nodes, approximately doubling the performance of the 2-mer Markov baseline. These preliminary results indicate that the model captures structured sequence regularities shared across related mitochondrial genomes within the limited scope of this proof-of-concept study. However, because low-order Markov models are known to be weak comparators, this contrast alone should not be interpreted as strong evidence of generalization and instead motivated the additional, more stringent analyses reported below.

3.5. Expanded Baseline Comparisons

Supplementary Table S3 contrasts the model’s LOSO identities to a suite of stronger baselines computed on the training taxa only. As expected, profile-derived references that reuse training information provide high-performance comparative baselines: the consensus-of-train achieves 83.85–100% identity to the LOSO ancestor, PSSM (best sample) reaches 50.00–73.07%, and the nearest-neighbor (copy) baseline attains 53.83–85.91%. These baselines are treated as strong comparative methods; performance is calibrated against composition-controlled nulls: for k-mer language-model mimics, the null means are tightly centered at ~50.4–50.6%. Relative to these nulls, the observed identities show positive, consistent enrichment with median raw lift (obs − μ) of 1.16 percentage points (k = 5) and 1.14 pp (k = 6), mean z-scores of 2.35 (k = 5) and 2.33 (k = 6) across species, and a cross-species Stouffer’s Z = 10.76 (two-sided p ≈ 3.4 × 10⁻²⁷) for both k values. After Benjamini–Hochberg correction, 14/21 species are significant at q ≤ 0.05 for k = 5 (13/21 for k = 6). In aggregate, the evidence indicates that the model’s fragments are significantly above what realistic composition-matched chance would yield, while profile/consensus/nearest-neighbor baselines, by design, remain higher.

3.6. Robustness to Starting Position and Region Stratification Results

Table 7 reports the unshifted baseline per species. SW identity to the LOSO ancestor (obs_%ID_vs_ASR) spans 50.18–53.50% across the 21 taxa, and the corresponding SW alignment lengths (obs_aln_len) range 2049–2221 bp. Alignment length exceeds ~2 kb because SW counts gap columns. Table 7 also carries the region annotation for each 2000 bp window (region_label, frac_dloop, frac_coding) and the mapped local-block coordinates (hit_start, hit_end). Four windows are labeled D-loop with high control-region occupancy (Accipiter gentilis frac_dloop 1.000; Anas platyrhynchos 0.946; Gallus gallus 1.000; Xenopus laevis 1.000). The remaining windows are coding-dominated, with only minor control-region overlap in a few cases (for example, Crocodylus porosus frac_dloop 0.005; Caiman crocodilus 0.001). On this unshifted baseline, D-loop cases show 50.97–53.50% identity (from Gallus gallus to Accipiter gentilis), while coding cases show 50.18–53.33% (for example, Gekko gecko 50.18%, Chelonia mydas 53.33%). D-loop alignment lengths are 2105–2166 bp (from Xenopus laevis to Anas platyrhynchos), and coding lengths are 2049–2221 bp (for example, Dromaius novaehollandiae 2049 bp, Struthio camelus 2221 bp).
Table 8 provides the reference accessions and the genomic-context labels and fractions that underpin Table 7 (accession, region_label, frac_dloop, frac_coding) together with the start–end coordinates of the top-scoring local block. The D-loop windows align over extended control-region tracts, for example, Xenopus laevis 1–1.205 and Accipiter gentilis 1–1.109, consistent with their frac_dloop values.
Table 9 summarizes the randomized-start experiment (10 circular offsets per species) and reports the unshifted identity as obs_%ID in a ×10 scale (for example, 535.002 corresponds to 53.500%). On that same scale, the mean across shifts (shift_mean_%ID) ranges 497.757–538.685 (that is, 49.7757–53.8685%), the standard deviation across shifts (shift_sd_%ID) is 0.2157–0.7944 (that is, 0.0216–0.0794 percentage points), and the within-species range from shift_min_%ID to shift_max_%ID corresponds to 0.0656–2.3410 percentage points overall (shift_range_%ID 0.6560–23.410 in table units). Examples of small variability include Sphenodon punctatus (shift_range_%ID 0.6560) and Melopsittacus undulatus (0.9034), whereas larger yet still modest variability is seen in Meleagris gallopavo (23.410), Dromaius novaehollandiae (21.716), and Anas platyrhynchos (20.585) in table units, i.e., ≤2.341 percentage points on the original scale.
Table 10 lists the full per-offset results for the same experiment, including the circular offset used (shift_offset) and the corresponding identity (shift_%ID, ×10 scale) for each of the 10 shifts per species. All values fall within the narrow bands summarized in Table 9. For both Table 9 and Table 10 the SW engine is parasail, and scoring and masking settings match those used elsewhere.
Collectively, Tables 7–10 show that within-species variation in SW identity across randomized start positions is small and that coding versus D-loop windows exhibit closely similar identity and alignment-length distributions. These results indicate that the conclusions do not depend on the exact window boundary or on regional context under the constant SW configuration applied here. Identities computed with parasail are not directly comparable to pairwise2-based figures because of engine and scoring differences.

3.7. Sensitivity Analysis

For 21 species, identities of predicted fragments to LOSO (MAFFT + NJ + Fitch) ancestors were similar to those to MEGA (MUSCLE + ML Tamura–Nei) ancestors (Table 11). The per-species change, Δ%ID = LOSO − MEGA, had a median of +0.743 percentage points (range −2.610 to +2.264 pp; 16/21 positive). Smith–Waterman alignment lengths were comparable between arms (~3.0–3.1 kb), and alignment scores tracked identity. These observations indicate that estimates are robust to substituting MUSCLE/ML with MAFFT/NJ/Fitch for ancestor reconstruction when scoring is otherwise identical. Absolute identity values in this table are not directly comparable to earlier parasail-based reports because Biopython PairwiseAligner was used here; the interpretation is based on within-table comparisons.

3.8. LOSO Two-Flank Imputation

As the most stringent evaluation in this study, the two-flank masked-span imputation task was assessed across 21 LOSO folds (500 bp spans; three anchors per species). The consensus reconstruction exceeded the Nearest-Neighbor (NN) baseline in 17/21 species (81%), with an across-species median Δ(CONS–NN) of +3.2 percentage points (IQR ≈ 4.4 pp) and a two-sided sign-test p-value of approximately 0.007, indicating a statistically consistent improvement over NN under this training budget. In comparison with the Flank-Copy baseline, the consensus reconstruction was higher in 13/21 species (62%), with a median Δ(CONS–Flank) of +2.2 percentage points (IQR ≈ 5.2 pp); however, this difference did not reach statistical consistency (sign-test p ≈ 0.38), indicating broadly comparable performance between the two approaches rather than uniform dominance. The largest gains relative to NN were observed in Boa constrictor (+13.8 pp), Sphenodon punctatus (+9.6 pp), Accipiter gentilis (+7.2 pp), Anas platyrhynchos (+6.0 pp), and Crocodylus porosus and Nerodia sipedon (+5.4 pp). Conversely, a small subset of avian species underperformed NN (e.g., Aptenodytes forsteri −5.8 pp, Struthio camelus −5.4 pp, Dromaius novaehollandiae −4.4 pp), consistent with strong nearest-neighbor signal when closely related training genomes are available (Table 12).
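The sign-test p-values quoted above can be reproduced from the win counts alone. The sketch below uses an exact binomial tail with a hypothetical function name and assumes that none of the 21 per-species differences is a tie.

```python
from math import comb


def sign_test_two_sided(wins, n):
    """Exact two-sided sign test under H0: P(win) = 0.5, assuming no ties."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_vs_nn = sign_test_two_sided(17, 21)     # consensus higher in 17/21 species
p_vs_flank = sign_test_two_sided(13, 21)  # consensus higher in 13/21 species
```

Under these assumptions the two calls give approximately 0.007 and 0.38, matching the values reported for the comparisons against NN and Flank-Copy.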
Overall, two-flank masked-span imputation demonstrates a statistically consistent advantage over NN while achieving performance broadly comparable to Flank-Copy, suggesting that the model captures mtDNA regularities beyond simple compositional matching without implying systematic superiority over homology-tiling baselines. As expected, baselines that explicitly reuse homologous sequence information provide strong reference points for masked-span recovery; accordingly, we interpret the two-flank results as demonstrating a consistent advantage over nearest-neighbor heuristics and performance comparable to simple homology reuse, rather than a claim of dominance over alignment-driven methods.

4. Discussion

This study presents a purely in silico proof of concept exploring whether deep learning models can learn non-random evolutionary sequence regularities from mitochondrial DNA (mtDNA) sequences under controlled conditioning. By combining a CNN–BiLSTM next-nucleotide prediction framework with a leave-one-species-out (LOSO) evaluation scheme, the study demonstrates that the network can accurately predict short mtDNA fragments for species excluded from training in a context-dependent next-nucleotide framework, generating predictions that retain non-random similarity to biologically meaningful reference sequences. The model achieved greater than 99 percent nucleotide-level accuracy and near-perfect macro-averaged F1 scores across 21 vertebrate species, reflecting a strong capacity to learn local sequence dependencies. This behavior is plausibly supported by the bidirectional LSTM layers, which capture context in both directions, and by overlapping 200 bp windows that preserve local sequence continuity. High predictive accuracy in a next-nucleotide framework is expected for relatively conserved mitochondrial genomes and is therefore interpreted here as a diagnostic prerequisite rather than a standalone indicator of generalization. Each LOSO iteration required approximately 60–90 min of training on a standard workstation (HP ProBook 450 G9, Intel Core i5-1235U, 16 GB RAM), totaling roughly 50 h for the full 21-species cycle. This computational efficiency reflects the compact model architecture and modest dataset size, rendering the approach feasible for a small-scale proof of concept. At the same time, retraining the model for each LOSO fold is not scalable to substantially larger or more diverse datasets; future extensions could instead employ multi-species training, transfer learning, or k-fold cross-validation to improve scalability. 
Biological plausibility was assessed using a combination of complementary analyses, including MUSCLE alignments, MEGA X ancestral node reconstruction, Smith–Waterman local alignments, BLASTn searches, and shuffled-sequence null models. Predicted fragments exhibited 63.73–67.31 percent local identity to reconstructed ancestral nodes, substantially exceeding the approximately 25 percent identity expected under random base distributions (p much less than 0.001). These values also exceeded those obtained with a first-order Markov (2-mer) baseline, which achieved only 26–27 percent identity, indicating that the CNN–BiLSTM predictions capture sequence structure beyond simple k-mer statistics. Higher identities (approximately 67 percent) were observed for Accipiter gentilis and Xenopus laevis, whereas lower identities (approximately 63 percent) were observed in some avian taxa. These variations are reported descriptively and may reflect lineage-specific substitution dynamics, while remaining dependent on alignment quality and the assumptions inherent to maximum-likelihood ancestral reconstruction in MEGA X.
Additional ancestral sequence reconstructions performed with MAFFT yielded local identities ranging from 58.04 percent (Gekko gecko) to 88.95 percent (Dromaius novaehollandiae), with a mean of 63.82 percent across species. These values are comparable to those obtained using the original all-species ancestral reconstruction (63.73–67.31 percent), indicating that predicted-to-ancestor similarities remain stable even when the reference ASR excludes the held-out species. Replacing MUSCLE with MAFFT and maximum-likelihood trees with neighbor-joining plus Fitch parsimony altered predicted-to-ancestor identities by a median of +0.743 percentage points (range −2.610 to +2.264 percentage points, 16 of 21 higher), suggesting that the reported observations are not sensitive to reasonable alignment or tree/ASR choices.
Expanded baseline analyses further contextualized performance. Relative to composition-matched nulls (k-mer language models and real-DNA windows), observed percent identities typically fell at or above the 98th–99th percentile, with a small number around the 96.5th–97.5th percentile. Median lift values were 1.16 for k = 5 and 1.14 for k = 6, with mean z-scores of 2.35 (k = 5) and 2.33 (k = 6). A meta-analytic Stouffer’s Z of 10.76 (p approximately 3.4 × 10⁻²⁷) was observed across species, with 14 of 21 species significant at q ≤ 0.05 for k = 5 and 13 of 21 for k = 6. Profile/consensus and nearest-neighbor baselines, which explicitly exploit homology and alignment-derived information unavailable to the generative model, consistently yielded higher absolute identities overall (consensus 83.85–100 percent; PSSM best 50.00–73.07 percent; nearest neighbor 53.83–85.91 percent). These methods therefore serve as strong homology-informed reference points rather than direct performance targets, and results are interpreted relative to the null models that control for compositional and short-range sequence effects. Identities to LOSO ancestral sequences were stable across 10 random starts per species, with median within-species ranges of approximately 1–2 percentage points and median standard deviations below 0.5 percentage points. Results were also consistent across genomic regions, with windows overlapping coding regions versus the D-loop showing comparable identities and alignment lengths, and BLAST top-hit assignments remaining correct in 21 of 21 cases in both strata. Together, these controls indicate that the reported observations do not depend on window placement or genomic region. BLASTn searches further confirmed that in all 21 cases, the top hit corresponded to the correct species, serving as a biological plausibility check rather than evidence of phylogenetic inference.
For example, a predicted Caiman crocodilus fragment aligned to “Caiman crocodilus, mitochondrion, complete genome” (NC_002744.2) with 99.51 percent identity and 100 percent query coverage. However, a targeted search against the C. crocodilus cytochrome b gene did not yield a significant match, indicating that while the model preserves broad mtDNA sequence structure, it does not necessarily predict specific functional loci without explicit guidance.
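The meta-analytic summary reported above can be illustrated with a short sketch. It assumes the unweighted form of Stouffer's method, Z = Σ z_i / √k, and a one-sided upper normal tail; the text does not state which weighting or sidedness was used, so both are assumptions, and the function names are hypothetical. Because the unweighted statistic equals the mean z-score multiplied by √k, a mean z-score of 2.35 over 21 species yields a combined value close to the reported 10.76.

```python
import math

def stouffer_z(z_scores):
    """Unweighted Stouffer combination: sum of z-scores divided by sqrt(k)."""
    k = len(z_scores)
    if k == 0:
        raise ValueError("need at least one z-score")
    return sum(z_scores) / math.sqrt(k)

def upper_tail_p(z):
    """One-sided upper-tail probability of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Illustration only: 21 species, each set to the reported mean z-score (k = 5).
# Per-species z-scores are not listed in this section, so this is a stand-in.
z_combined = stouffer_z([2.35] * 21)
p_combined = upper_tail_p(z_combined)
print(f"Z = {z_combined:.2f}, one-sided p = {p_combined:.1e}")
```

The exact p-value depends on the unrounded per-species z-scores and on the tail convention, so the sketch is only expected to reproduce the order of magnitude of the combined statistic.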
Two-flank masked-span imputation under LOSO provided a leakage-free evaluation of flank-conditioned reconstruction in unseen species. The observed gains over the nearest-neighbor baseline, together with performance comparable to the flank-copy baseline, constitute the core proof of concept that a compact CNN–BiLSTM architecture can support conditional mtDNA span completion beyond simple flank-tiling heuristics. This approach is intended as a complement, not a replacement, for traditional phylogenetic and paleogenomic tools. Maximum-likelihood and Bayesian ancestral reconstruction methods, such as PAML and BEAST, remain the gold standard due to their explicit evolutionary modeling and reliance on high-quality alignments. In contrast, the present deep learning framework operates directly on raw nucleotide sequences and offers a data-driven exploratory pathway for conditional sequence completion.
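As a rough illustration of the two-flank masked-span setup described above, the sketch below cuts an interior span out of a sequence, returns the two flanks together with the hidden target, and scores a candidate reconstruction by position-wise identity. The 500 bp span length comes from the text. The 200 bp flank length, the choice to take flanks immediately adjacent to the span, the function names, and the random toy sequence are all assumptions made for illustration and are not taken from the study's code.

```python
import random

def make_masked_span(seq, span_start, span_len=500, flank_len=200):
    """Split seq into (left_flank, hidden_target, right_flank).

    span_len=500 follows the text; flank_len=200 is a hypothetical choice.
    """
    span_end = span_start + span_len
    if span_start < flank_len or span_end + flank_len > len(seq):
        raise ValueError("span and flanks must fit inside the sequence")
    left = seq[span_start - flank_len:span_start]
    target = seq[span_start:span_end]
    right = seq[span_end:span_end + flank_len]
    return left, target, right

def percent_identity(pred, target):
    """Position-wise identity between equal-length strings, in percent."""
    if len(pred) != len(target):
        raise ValueError("sequences must have equal length")
    matches = sum(p == t for p, t in zip(pred, target))
    return 100.0 * matches / len(target)

# Toy stand-in for a held-out mitochondrial genome (random bases, fixed seed).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(3000))

left, target, right = make_masked_span(genome, span_start=1000)
# A model would see only `left` and `right`; a trivial guess is scored here instead.
trivial_guess = "A" * len(target)
print(len(left), len(target), len(right))
print(round(percent_identity(trivial_guess, target), 2))
```

Any baseline or model can be dropped in where `trivial_guess` is built, provided it receives only the two flanks and returns a string of the same length as the hidden span.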
While multiple identity metrics are reported for different analytical purposes, conclusions in each Results subsection are drawn within the corresponding percent identity definition, and differences across sections reflect distinct analytical questions rather than directly comparable identity scales.
Although the predicted fragments are short, they could serve as alignment anchors or exploratory guide sequences in highly fragmented mtDNA datasets, potentially bridging AI-generated sequences with classical phylogenetic workflows. Overall, this work should be regarded as an exploratory computational proof of concept, providing preliminary evidence that deep learning can identify non-random sequence patterns in mtDNA that are consistent with evolutionary structure, without performing phylogenetic inference. The specific constraints of this study, including short fragment length, limited dataset size, and the absence of ancient DNA damage modeling, are discussed in Section 4.1 (Limitations). Potential directions for addressing these issues and extending the approach are outlined in Section 4.2 (Suggestions for Future Work), where larger and more taxonomically diverse datasets, simulated or empirical ancient DNA, longer predicted sequences, and functional annotation could help evolve this framework into a practical complement to established evolutionary analysis pipelines.

4.1. Limitations

Several limitations must be acknowledged. The dataset is small (21 species) and biased toward archosaurs, which increases the risk of clade-specific overfitting and limits generalizability. The predicted fragment length (2000 bp) is short compared to complete mitochondrial genomes (16–20 kb), constraining immediate value for de novo reconstruction. The study does not incorporate aDNA damage modeling, such as cytosine deamination or realistic fragmentation, which is critical for practical paleogenomic applications. From a computational standpoint, LOSO training scales poorly; for example, extending to 50 species could require ~96 h on comparable hardware. This underscores the need for more scalable strategies such as k-fold cross-validation or transfer learning. In addition, the use of overlapping 200 bp windows with a 50 bp stride may introduce redundancy between training and test inputs, potentially inflating reported accuracy (>99%). Biological validation is subject to further caveats. Ancestral references were reconstructed from multiple-sequence alignments (MUSCLE/MEGA X ML for the all-species snapshot; MAFFT/NJ/Fitch for LOSO), and absolute identity values depend on alignment and tree/ASR choices and are therefore meaningful only within the specific identity formulation used in each analysis. Misalignments, particularly among divergent outgroups such as Xenopus laevis, or model limitations could introduce error into the inferred nodes, potentially affecting the reported identity values. Finally, the predicted fragments were not assessed for functional content. For example, a targeted BLASTn search for the Caiman crocodilus cytochrome b gene did not yield a significant match, indicating that predicted fragments may not consistently capture coding regions or other functionally conserved loci. This limits biological interpretation, since functional conservation is a key aspect of mitochondrial evolution.
In summary, this work should be regarded as a computational proof of concept. The findings suggest that deep learning can potentially recover non-random sequence patterns consistent with phylogenetic relationships, but they should not be interpreted as benchmarks for ancestral reconstruction. The model has not been tested on empirical aDNA, does not simulate damage patterns, and has not been applied to phylogenetic placement. Its relevance to paleogenomics therefore remains hypothetical until validated with simulated or empirical data.
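The redundancy concern raised above for overlapping windows can be made concrete with a small sketch. Using the window length (200 bp) and stride (50 bp) stated in the text, neighboring windows share 150 positions, so adjacent inputs are highly correlated. The function names and the 2000 bp example length are illustrative assumptions, not the study's implementation.

```python
def window_starts(seq_len, window=200, stride=50):
    """Start indices of all full-length windows over a sequence of seq_len bases."""
    if seq_len < window:
        return []
    return list(range(0, seq_len - window + 1, stride))

def shared_positions(start_a, start_b, window=200):
    """Number of positions that two equal-length windows have in common."""
    return max(0, window - abs(start_a - start_b))

starts = window_starts(2000)
print(len(starts))                             # windows covering 2000 bp
print(shared_positions(starts[0], starts[1]))  # overlap of neighboring windows
```

For a 2000 bp stretch this gives 37 windows, with 150 shared positions between neighbors; only windows whose starts differ by at least 200 bp are fully disjoint, which is what a blocked or non-overlapping test would require.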

4.2. Suggestions for Future Work

This PoC is intentionally scoped and demonstrates a novel, context-conditioned reconstruction framework for mitochondrial DNA under in-species flanks. As a preliminary study, it delivers clear feasibility signals with transparent reporting of assumptions and controls, and it can serve as a compact, reusable template for subsequent paleogenomic modeling efforts. Building on this foundation, several directions naturally extend the work while maintaining transparency about current scope. Broadening taxonomic coverage to include mammals, fish, and other vertebrates would allow assessment of generalizability across more divergent lineages. Incorporating realistic ancient-DNA damage patterns (e.g., fragmentation and cytosine deamination) would enable more direct evaluation in paleogenomic contexts. Generating longer predictions toward full mitochondrial reconstructions could proceed via iterative sequence reconstruction or architectures capable of long-range dependencies (e.g., transformers). Adding functional annotation of predicted fragments, covering coding genes, tRNAs, and conserved motifs, would provide biological context beyond sequence identity. Computational scalability could be enhanced with k-fold cross-validation and transfer learning, supporting application to larger and more diverse datasets; exploration of advanced deep-learning architectures may further capture subtle evolutionary sequence regularities. Design choices in this PoC are documented to aid interpretability. The model targets archosaur-specific mitochondrial patterns; accordingly, leave-one-clade-out analysis was not included in the present scope. To assess learning beyond this clade, future work includes leave-one-clade-out experiments (e.g., training on crocodilians to predict birds) and the addition of non-archosaur outgroups (e.g., squamates, mammals). 
The PoC evaluates context-conditioned reconstruction and does not claim de novo or cross-species generation; the two-flank masked-span test is included as the primary leakage-free evaluation, providing a stringent free-running assessment without excessive error accumulation. Because overlapping windows can overstate token-level accuracy, the >99% next-base figure is presented as an optimistic high-end estimate, with blocked or non-overlapping tests designated for future work. For sequence comparison, sensitive Smith–Waterman penalties tuned for short fragments were used consistently across baselines and nulls; robustness to more standard gap costs and global alignment is future work. Given small, discrete nulls, emphasis is placed on empirical percentiles and effect sizes rather than asymptotic p-values; expanded permutations, larger k-mer samples, and a pre-registered alpha are future work. Biological checks (BLAST hits and simple gene-context inspection) are positioned as qualitative validation checks appropriate to a PoC; comprehensive phylogenetic placement and functional analyses are future work. The N = 21, archosaur-leaning dataset reflects a focused feasibility setting; broader taxonomic coverage and cross-clade hold-outs to assess generalization are future work. Prediction length was fixed by design in this proof of concept rather than treated as a tunable variable. Teacher-forced fragment assembly was evaluated at 2000 bp to demonstrate stable, position-resolved next-base prediction under full conditioning, whereas leakage-free masked-span imputation was evaluated at 500 bp to provide a stringent free-running test without excessive error accumulation. Systematic sweeps over prediction length were not performed in this study. Based on known behavior of autoregressive sequence models, longer free-running generations are expected to accumulate errors, whereas teacher-forced assembly is substantially less sensitive to length. 
Accordingly, the two-flank masked-span results are interpreted as a conservative estimate of generalization performance. Quantifying the relationship between prediction length and reconstruction accuracy (e.g., 250 bp, 500 bp, 1000 bp, and longer spans) is designated as future work. Importantly, none of these extensions is required to support the present feasibility claims; they are outlined to clarify the boundaries between this proof of concept and future methodological development. Future work may consider incorporating additional alignment-informed or homology-based imputation baselines to further refine the contextual interpretation of two-flank masked-span performance. Such baselines would enable a more granular comparison against methods that explicitly exploit positional homology, complementing the present analysis, which is focused on demonstrating statistically consistent gains over nearest-neighbor heuristics and performance comparable to simple homology reuse within a controlled proof-of-concept framework. An important direction for future work is the evaluation of fully closed-loop autoregressive generation over longer genomic spans. Unlike the present study, which focuses on context-conditioned reconstruction anchored by known flanking sequence, unconstrained free-running generation is subject to error accumulation and span-length-dependent degradation as predicted bases are recursively fed back as input. Assessing reconstruction fidelity as a function of generated span length and evolutionary divergence would provide a more complete characterization of generative limits but lies beyond the scope of this proof of concept, which is intentionally restricted to conditional sequence completion.

5. Conclusions

This study presents a focused computational proof of concept demonstrating that deep learning models can learn non-random sequence regularities in mitochondrial genomes under controlled, context-conditioned prediction. Without relying on predefined substitution models, the proposed framework shows that neural sequence models can perform conditional mtDNA fragment completion for species excluded from training, producing biologically plausible sequences that exceed simple homology-based and stochastic baselines. Under controlled and leakage-free evaluation using two-flank masked-span imputation, the consensus reconstruction outperformed a nearest-neighbor baseline in 17 out of 21 species (81%), providing quantitative support for transferable sequence regularities beyond trivial sequence reuse.
Rather than replacing classical phylogenetic or evolutionary inference methods, this work illustrates how modern sequence modeling can serve as an exploratory and complementary tool for studying mitochondrial DNA under partial-information scenarios. The results indicate that deep learning can generate mtDNA fragments that retain non-random evolutionary similarity under leakage-free evaluation, supporting further investigation into conditional reconstruction tasks without claiming phylogenetic placement or ancestral inference. Beyond evolutionary genomics, the underlying approach may have longer-term biomedical relevance. Models capable of reconstructing short mtDNA fragments from partial context could eventually support low-coverage clinical mitochondrial sequencing workflows, where recovering uncertain or missing positions may improve variant calling. Similar principles may be applicable to forensic or archaeological mtDNA samples that are highly degraded or incomplete. In addition, identifying conserved sequence regularities may, in the longer term, contribute to studies of evolutionary constraints overlapping with mtDNA loci implicated in metabolic or neurodegenerative disorders.
Although these applications lie beyond the scope of the present proof of concept, they highlight how AI-assisted, context-conditioned reconstruction could ultimately complement established mitochondrial genomics pipelines in both research and applied settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7010027/s1, Table S1. Species and NCBI Accession Numbers used in this Study. Table S2. BLASTN Results. Table S3. Stronger baselines computed on the training taxa only.

Author Contributions

Conceptualization, D.A.; methodology, D.A.; software, D.A.; validation, D.A., D.C., P.A.A., D.T.G., S.A.K., E.I.A., and I.K.K.; formal analysis, D.A.; investigation, D.A.; resources, D.A.; data curation, D.A.; writing—original draft preparation, D.A.; writing—review and editing, D.A. and P.A.A.; visualization, D.A.; supervision, D.A.; project administration, D.A.; funding acquisition, D.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All mtDNA sequences were retrieved from the NCBI Nucleotide database. No new sequence data were generated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Allentoft, M.E.; Collins, M.; Harker, D.; Haile, J.; Oskam, C.L.; Hale, M.L.; Campos, P.F.; Samaniego, J.A.; Gilbert, M.T.P.; Willerslev, E.; et al. The half-life of DNA in bone: Measuring decay kinetics in 158 dated fossils. Proc. R. Soc. B Biol. Sci. 2012, 279, 4724–4733.
  2. Recalibrating Equus Evolution Using the Genome Sequence of an Early Middle Pleistocene Horse. Available online: https://www.researchgate.net/publication/242333094_Recalibrating_Equus_evolution_using_the_genome_sequence_of_an_early_Middle_Pleistocene_horse (accessed on 6 August 2025).
  3. Meyer, M.; Fu, Q.; Aximu-Petri, A.; Glocke, I.; Nickel, B.; Arsuaga, J.-L.; Martínez, I.; Gracia, A.; de Castro, J.M.B.; Carbonell, E.; et al. A mitochondrial genome sequence of a hominin from Sima de los Huesos. Nature 2014, 505, 403–406.
  4. van der Valk, T.; Pečnerová, P.; Díez-Del-Molino, D.; Bergström, A.; Oppenheimer, J.; Hartmann, S.; Xenikoudakis, G.; Thomas, J.A.; Dehasque, M.; Sağlıcan, E.; et al. Million-year-old DNA sheds light on the genomic history of mammoths. Nature 2021, 591, 265–269.
  5. Yang, Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol. Biol. Evol. 2007, 24, 1586–1591.
  6. Bouckaert, R.; Vaughan, T.G.; Barido-Sottani, J.; Duchêne, S.; Fourment, M.; Gavryushkina, A.; Heled, J.; Jones, G.; Kühnert, D.; De Maio, N.; et al. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 2019, 15, e1006650.
  7. Zeng, H.; Edwards, M.D.; Liu, G.; Gifford, D.K. Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics 2016, 32, i121–i127.
  8. Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831–838.
  9. Avsec, Ž.; Agarwal, V.; Visentin, D.; Ledsam, J.R.; Grabska-Barwinska, A.; Taylor, K.R.; Assael, Y.; Jumper, J.; Kohli, P.; Kelley, D.R. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 2021, 18, 1196–1203.
  10. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
  11. Singh, J.; Hanson, J.; Paliwal, K.; Zhou, Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat. Commun. 2019, 10, 5407.
  12. Generating and Designing DNA with Deep Generative Models. ResearchGate. Available online: https://www.researchgate.net/publication/321902509_Generating_and_designing_DNA_with_deep_generative_models (accessed on 8 August 2025).
  13. Kenneweg, P.; Dandinasivara, R.; Luo, X.; Hammer, B.; Schönhuth, A. Generating synthetic genotypes using diffusion models. Bioinformatics 2025, 41, i484–i492.
  14. De Leonardis, M.; Pagnani, A.; Barrat-Charlaix, P. Reconstruction of Ancestral Protein Sequences Using Autoregressive Generative Models. Mol. Biol. Evol. 2025, 42, msaf070.
  15. Matthews, D.S.; Spence, M.A.; Mater, A.C.; Nichols, J.; Pulsford, S.B.; Sandhu, M.; Kaczmarski, J.A.; Miton, C.M.; Tokuriki, N.; Jackson, C.J. Leveraging ancestral sequence reconstruction for protein representation learning. Nat. Mach. Intell. 2024, 6, 1542–1555.
  16. Quang, D.; Xie, X. DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016, 44, e107.
  17. Nesterenko, L.; Blassel, L.; Veber, P.; Boussau, B.; Jacob, L. Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks. Mol. Biol. Evol. 2025, 42, msaf051.
  18. Wang, Z.; Sun, J.; Gao, Y.; Xue, Y.; Zhang, Y.; Li, K.; Zhang, W.; Zhang, C.; Zu, J.; Zhang, L. Fusang: A framework for phylogenetic tree inference via deep learning. Nucleic Acids Res. 2023, 51, 10909–10923.
  19. Yang, J.; Wu, C.; Du, B.; Zhang, L. Enhanced Multiscale Feature Fusion Network for HSI Classification. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 10328–10347.
  20. Wang, D.; Hu, M.; Jin, Y.; Miao, Y.; Yang, J.; Xu, Y.; Qin, X.; Ma, J.; Sun, L.; Li, C.; et al. HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6427–6444.
  21. Paradiso, C.; Gratton, P.; Trucchi, E.; López-Delgado, J.; Gargano, M.; Garizio, L.; Carr, I.M.; Colosimo, G.; Sevilla, C.; Welch, M.E.; et al. Genomic insights into the biogeography and evolution of Galápagos iguanas. Mol. Phylogenetics Evol. 2025, 204, 108294.
  22. Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 2015, 12, 931–934.
Figure 1. Analysis pipeline.
Figure 2. Maximum Likelihood phylogenetic tree of the 21 vertebrate mitochondrial genomes used in this study. The tree was reconstructed in MEGA X v12 using the Tamura–Nei substitution model from the MUSCLE-aligned complete mtDNA sequences. Numbers in parentheses correspond to species identifiers from Supplementary Table S1, followed by their GenBank accession numbers. The “NC” codes refer to the accession numbers listed in Supplementary Table S1. Letters “A” denote internal ancestral nodes inferred by MEGA X; “C” indicates consensus sequences at ambiguous nodes; “AC” reflects positions with multiple equally likely base calls; and “–” indicates missing data at specific ancestral positions. These reconstructed ancestral sequences served as reference nodes for evaluating the phylogenetic plausibility of CNN-BiLSTM predictions via local alignment. The scale bar (0.20) represents substitutions per site.
Table 1. Species.
| Birds (Aves) | Crocodilians (Crocodylia) | Non-Archosaur Reptiles (Outgroups) |
|---|---|---|
| Gallus gallus (chicken) | Crocodylus porosus (saltwater crocodile) | Chelonia mydas (green sea turtle) |
| Struthio camelus (ostrich) | Alligator mississippiensis (American alligator) | Pelodiscus sinensis (Chinese softshell turtle) |
| Archilochus colubris (ruby-throated hummingbird) | Caiman crocodilus (spectacled caiman) | Varanus salvator (water monitor) |
| Accipiter gentilis (northern goshawk) | | Gekko gecko (tokay gecko) |
| Dromaius novaehollandiae (emu) | | Boa constrictor (boa constrictor) |
| Meleagris gallopavo (wild turkey) | | Nerodia sipedon (northern water snake) |
| Anas platyrhynchos (mallard) | | Sphenodon punctatus (tuatara) |
| Columba livia (rock dove) | | Single amphibian outgroup |
| Aptenodytes forsteri (emperor penguin) | | Xenopus laevis (African clawed frog) |
| Melopsittacus undulatus (budgerigar) | | |
Table 2. Model performance on next nucleotide prediction under LOSO evaluation.
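| Species | %Accuracy | %Precision | %Recall | %F1 | %ID | Mismatches |
|---|---|---|---|---|---|---|
| Accipiter gentilis | 99.65 | 99.70 | 99.67 | 99.68 | 99.65 | 7 |
| Alligator mississippiensis | 99.60 | 99.65 | 99.60 | 99.62 | 99.60 | 8 |
| Anas platyrhynchos | 99.65 | 99.69 | 99.67 | 99.68 | 99.65 | 7 |
| Aptenodytes forsteri | 99.70 | 99.71 | 99.69 | 99.70 | 99.70 | 6 |
| Archilochus colubris | 99.60 | 99.66 | 99.58 | 99.62 | 99.60 | 8 |
| Boa constrictor | 99.65 | 99.66 | 99.54 | 99.60 | 99.65 | 7 |
| Caiman crocodilus | 99.60 | 99.60 | 99.59 | 99.59 | 99.60 | 8 |
| Chelonia mydas | 99.65 | 99.70 | 99.60 | 99.65 | 99.65 | 7 |
| Columba livia | 99.55 | 99.60 | 99.56 | 99.58 | 99.55 | 9 |
| Crocodylus porosus | 99.65 | 99.66 | 99.61 | 99.63 | 99.65 | 7 |
| Dromaius novaehollandiae | 99.65 | 99.71 | 99.64 | 99.67 | 99.65 | 7 |
| Gallus gallus | 99.60 | 99.62 | 99.61 | 99.62 | 99.60 | 8 |
| Gekko gecko | 99.60 | 99.66 | 99.57 | 99.61 | 99.60 | 8 |
| Meleagris gallopavo | 99.60 | 99.66 | 99.57 | 99.61 | 99.60 | 8 |
| Melopsittacus undulatus | 99.70 | 99.68 | 99.68 | 99.68 | 99.70 | 6 |
| Nerodia sipedon | 99.70 | 99.70 | 99.74 | 99.72 | 99.70 | 6 |
| Pelodiscus sinensis | 99.60 | 99.60 | 99.64 | 99.62 | 99.60 | 8 |
| Sphenodon punctatus | 99.75 | 99.79 | 99.78 | 99.79 | 99.75 | 5 |
| Struthio camelus | 99.65 | 99.67 | 99.61 | 99.64 | 99.65 | 7 |
| Varanus salvator | 99.65 | 99.64 | 99.63 | 99.63 | 99.65 | 7 |
| Xenopus laevis | 99.75 | 99.75 | 99.74 | 99.75 | 99.75 | 5 |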
Note: Model performance on sliding-window next-nucleotide prediction under LOSO evaluation. Metrics include overall accuracy, macro-averaged precision, recall, and F1 score, as well as percent identity (%ID) computed over a total of 2000 bp reconstructed in a teacher-forced, position-resolved manner from overlapping windows. Each reconstructed position is compared to the corresponding ground-truth mtDNA position of the left-out species, rather than to a single contiguous reconstructed locus. The corresponding mismatch count is also reported.
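The metrics named in the note above can be sketched as follows. The code assumes one common convention for macro-averaging (per-base precision, recall, and F1 averaged over A, C, G, and T with equal weight); the study's exact implementation is not given in the text, and the function name and toy sequences are hypothetical. In a position-resolved comparison of this kind, percent identity coincides with token-level accuracy, which agrees with the identical %Accuracy and %ID columns in Table 2.

```python
from collections import Counter

BASES = "ACGT"

def token_metrics(truth, pred):
    """Accuracy (= %ID), macro precision/recall/F1 in percent, and mismatch count."""
    if len(truth) != len(pred) or len(truth) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truth, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # base p was predicted but is wrong here
            fn[t] += 1  # true base t was missed here
    per_class = []
    for base in BASES:
        predicted = tp[base] + fp[base]
        actual = tp[base] + fn[base]
        precision = tp[base] / predicted if predicted else 0.0
        recall = tp[base] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    n = len(truth)
    mismatches = sum(fn.values())
    macro = [100.0 * sum(col) / len(BASES) for col in zip(*per_class)]
    return {
        "accuracy": 100.0 * (n - mismatches) / n,
        "macro_precision": macro[0],
        "macro_recall": macro[1],
        "macro_f1": macro[2],
        "mismatches": mismatches,
    }

# Toy example: 2000 positions with 7 substitutions (A predicted as C).
truth = "ACGT" * 500
pred = list(truth)
for i in range(0, 700, 100):
    pred[i] = "C"
metrics = token_metrics(truth, "".join(pred))
print(metrics)
```

With 7 mismatches over 2000 positions the accuracy is 99.65%, the same value that Table 2 lists for species with 7 mismatches.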
Table 3. Biological Validation Metrics for Predicted 2000 bp Fragments.
| Species | Predicted Length (bp) | % Identity | Mismatches |
|---|---|---|---|
| Accipiter gentilis | 2000 | 67.31 | 654 |
| Xenopus laevis | 2000 | 66.98 | 660 |
| Sphenodon punctatus | 2000 | 65.90 | 682 |
| Melopsittacus undulatus | 2000 | 65.06 | 699 |
| Chelonia mydas | 2000 | 64.85 | 703 |
| Varanus salvator | 2000 | 64.76 | 705 |
| Pelodiscus sinensis | 2000 | 64.74 | 705 |
| Boa constrictor | 2000 | 64.69 | 706 |
| Nerodia sipedon | 2000 | 64.58 | 708 |
| Meleagris gallopavo | 2000 | 64.51 | 710 |
| Anas platyrhynchos | 2000 | 64.43 | 711 |
| Struthio camelus | 2000 | 64.40 | 712 |
| Dromaius novaehollandiae | 2000 | 64.19 | 716 |
| Gallus gallus | 2000 | 64.15 | 717 |
| Crocodylus porosus | 2000 | 64.06 | 719 |
| Archilochus colubris | 2000 | 63.98 | 720 |
| Alligator mississippiensis | 2000 | 63.92 | 722 |
| Aptenodytes forsteri | 2000 | 63.89 | 722 |
| Gekko gecko | 2000 | 63.87 | 723 |
| Columba livia | 2000 | 63.86 | 723 |
| Caiman crocodilus | 2000 | 63.73 | 725 |
Table 4. Ancestral Sequence Reconstruction MAFFT.
| Species | pred_len | MSA Columns | Identity | SW Alignment (Base–Base Pairs) | Mismatches | MSA Gap Columns | SW gap_opens | Score |
|---|---|---|---|---|---|---|---|---|
| Accipiter gentilis | 2000 | 27,298 | 0.613193 | 1895 | 733 | 25,508 | 102 | 917.00 |
| Alligator mississippiensis | 2000 | 29,719 | 0.586746 | 1856 | 767 | 28,007 | 107 | 720.00 |
| Anas platyrhynchos | 2000 | 29,701 | 0.593453 | 1894 | 770 | 27,913 | 113 | 762.00 |
| Aptenodytes forsteri | 2000 | 29,393 | 0.701721 | 1569 | 468 | 28,255 | 92 | 812.00 |
| Archilochus colubris | 2000 | 30,112 | 0.787385 | 1411 | 300 | 29,290 | 79 | 856.00 |
| Boa constrictor | 2000 | 29,375 | 0.610521 | 1882 | 733 | 27,611 | 113 | 792.00 |
| Caiman crocodilus | 2000 | 29,593 | 0.595745 | 1786 | 722 | 28,021 | 97 | 720.00 |
| Chelonia mydas | 2000 | 31,672 | 0.617834 | 1884 | 720 | 29,904 | 113 | 869.00 |
| Columba livia | 2000 | 29,890 | 0.590393 | 1936 | 793 | 28,018 | 105 | 805.00 |
| Crocodylus porosus | 2000 | 29,738 | 0.596234 | 1912 | 772 | 27,914 | 123 | 728.00 |
| Dromaius novaehollandiae | 2000 | 29,926 | 0.889545 | 1186 | 131 | 29,554 | 67 | 850.00 |
| Gallus gallus | 2000 | 29,915 | 0.587595 | 1838 | 758 | 28,239 | 109 | 688.00 |
| Gekko gecko | 2000 | 30,059 | 0.580376 | 1916 | 804 | 28,227 | 111 | 691.00 |
| Meleagris gallopavo | 2000 | 29,762 | 0.697360 | 1553 | 470 | 28,656 | 85 | 834.00 |
| Melopsittacus undulatus | 2000 | 29,350 | 0.605306 | 1847 | 729 | 27,656 | 100 | 847.00 |
| Nerodia sipedon | 2000 | 29,458 | 0.586207 | 1856 | 768 | 27,746 | 105 | 751.00 |
| Pelodiscus sinensis | 2000 | 30,059 | 0.651532 | 1762 | 614 | 28,535 | 107 | 856.00 |
| Sphenodon punctatus | 2000 | 29,168 | 0.597790 | 1810 | 728 | 27,548 | 93 | 796.00 |
| Struthio camelus | 2000 | 29,880 | 0.711608 | 1654 | 477 | 28,572 | 96 | 934.00 |
| Varanus salvator | 2000 | 29,612 | 0.593174 | 1846 | 751 | 27,920 | 98 | 756.00 |
| Xenopus laevis | 2000 | 29,867 | 0.608275 | 1861 | 729 | 28,145 | 103 | 876.00 |
Table 5. DNA-mimicking null simulations results.
| Species | obs_%ID | obs_aln_len | klm_mu | klm_sd | klm_p | klm_q | klm_CI_low99.9 | klm_CI_up99.9 | klm_n | real_mu | real_sd | real_percentile | real_CI_low99.9 | real_CI_up99.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accipiter gentilis | 53.50 | 2157 | 50.47 | 0.52 | <0.001 | <0.001 | 0.000000 | 0.045572 | 200 | 68.04 | 7.98 | 0.965000 | 0.899219 | 0.988399 |
| Alligator mississippiensis | 51.20 | 2127 | 50.46 | 0.51 | 0.072000 | 0.088941 | 0.050585 | 0.101513 | 1000 | 65.83 | 6.91 | 1.000000 | 0.954428 | 1.000000 |
| Anas platyrhynchos | 51.89 | 2166 | 50.44 | 0.54 | 0.005000 | 0.011667 | 0.000755 | 0.032329 | 400 | 68.63 | 8.13 | 0.985000 | 0.928812 | 0.996983 |
| Aptenodytes forsteri | 51.51 | 2120 | 50.48 | 0.54 | 0.034000 | 0.051000 | 0.023527 | 0.048902 | 2000 | 69.53 | 8.10 | 0.990000 | 0.936851 | 0.998489 |
| Archilochus colubris | 51.30 | 2160 | 50.44 | 0.53 | 0.057500 | 0.075469 | 0.043418 | 0.075788 | 2000 | 67.84 | 7.80 | 0.990000 | 0.936851 | 0.998489 |
| Boa constrictor | 51.75 | 2201 | 50.49 | 0.52 | 0.007500 | 0.015750 | 0.001507 | 0.036461 | 400 | 66.57 | 6.93 | 0.985000 | 0.928812 | 0.996983 |
| Caiman crocodilus | 50.89 | 2089 | 50.39 | 0.47 | 0.170000 | 0.187895 | 0.103452 | 0.266625 | 200 | 66.20 | 7.11 | 1.000000 | 0.954428 | 1.000000 |
| Chelonia mydas | 53.33 | 2162 | 50.44 | 0.50 | <0.001 | <0.001 | 0.000000 | 0.045572 | 200 | 67.22 | 7.78 | 0.985000 | 0.928812 | 0.996983 |
| Columba livia | 51.67 | 2212 | 50.45 | 0.54 | 0.018333 | 0.032083 | 0.007470 | 0.044289 | 600 | 69.26 | 8.40 | 1.000000 | 0.954428 | 1.000000 |
| Crocodylus porosus | 51.59 | 2206 | 50.34 | 0.51 | 0.010000 | 0.019091 | 0.002415 | 0.040435 | 400 | 66.06 | 8.04 | 0.995000 | 0.945320 | 0.999564 |
| Dromaius novaehollandiae | 51.39 | 2049 | 50.43 | 0.54 | 0.038500 | 0.053900 | 0.027250 | 0.054136 | 2000 | 68.56 | 8.21 | 1.000000 | 0.954428 | 1.000000 |
| Gallus gallus | 50.97 | 2109 | 50.46 | 0.52 | 0.145000 | 0.169167 | 0.084292 | 0.238064 | 200 | 69.33 | 7.95 | 1.000000 | 0.954428 | 1.000000 |
| Gekko gecko | 50.18 | 2198 | 50.50 | 0.51 | 0.720000 | 0.720000 | 0.613601 | 0.806347 | 200 | 65.15 | 6.55 | 0.995000 | 0.945320 | 0.999564 |
| Meleagris gallopavo | 51.90 | 2079 | 50.49 | 0.51 | 0.002500 | 0.006562 | 0.000218 | 0.027982 | 400 | 68.37 | 7.93 | 0.995000 | 0.945320 | 0.999564 |
| Melopsittacus undulatus | 52.86 | 2115 | 50.37 | 0.47 | <0.001 | <0.001 | 0.000000 | 0.045572 | 200 | 67.38 | 8.52 | 0.980000 | 0.921089 | 0.995162 |
| Nerodia sipedon | 51.81 | 2077 | 50.59 | 0.47 | <0.001 | <0.001 | 0.000000 | 0.045572 | 200 | 67.67 | 6.85 | 0.995000 | 0.945320 | 0.999564 |
| Pelodiscus sinensis | 53.20 | 2175 | 50.61 | 0.55 | <0.001 | <0.001 | 0.000000 | 0.045572 | 200 | 67.29 | 6.83 | 0.980000 | 0.921089 | 0.995162 |
| Sphenodon punctatus | 51.75 | 2085 | 50.57 | 0.51 | 0.020000 | 0.032308 | 0.008451 | 0.046589 | 600 | 68.05 | 7.17 | 1.000000 | 0.954428 | 1.000000 |
| Struthio camelus | 52.99 | 2221 | 50.49 | 0.54 | <0.001 | <0.001 | 0.000000 | 0.045572 | 200 | 67.37 | 8.36 | 0.975000 | 0.913612 | 0.993095 |
| Varanus salvator | 50.84 | 2140 | 50.62 | 0.54 | 0.370000 | 0.388500 | 0.272687 | 0.479161 | 200 | 69.56 | 7.35 | 1.000000 | 0.954428 | 1.000000 |
| Xenopus laevis | 53.35 | 2105 | 50.40 | 0.54 | <0.001 | <0.001 | 0.000000 | 0.045572 | 200 | 69.39 | 8.94 | 0.980000 | 0.921089 | 0.995162 |
Note. Empirical tail probabilities are computed as p = (r + 1)/(N + 1), where r is the number of null samples with identity greater than or equal to the observed value and N is the number of Monte Carlo null samples (up to 2000). Values shown as <0.001 indicate attainment of the minimum possible empirical p-value under this discrete null. For the k-mer language-model nulls, Benjamini–Hochberg correction is applied across species and both p- and q-values are reported. For the real-DNA background, inference is based on empirical percentiles and 99.9% confidence intervals, which indicate the position of the observed identity within the null distribution.
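The two calculations named in the note above, the add-one empirical tail probability and the Benjamini–Hochberg adjustment, can be sketched in a few lines. The function names and the toy inputs are hypothetical, and the sketch follows the formula as written in the note rather than any code from the study.

```python
def empirical_p(null_values, observed):
    """Add-one empirical tail probability p = (r + 1) / (N + 1)."""
    r = sum(v >= observed for v in null_values)
    return (r + 1) / (len(null_values) + 1)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Toy null identities (percent) and a toy observed identity.
null_ids = [50.1, 50.4, 50.6, 51.0, 49.8, 50.3, 50.9, 50.2, 50.5]
p = empirical_p(null_ids, observed=50.9)

# Toy p-values standing in for a handful of species.
qs = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, qs)
```

Because the empirical p-value is discrete, its smallest attainable value is 1/(N + 1), which is why the number of null samples bounds how small a reported p-value can be.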
Table 6. First Baseline Comparison.
| Species | CNN-BiLSTM %ID vs. Ground-Truth | Markov 2-mer %ID vs. Ground-Truth | CNN-BiLSTM %ID vs. Ancestral Node | Markov 2-mer %ID vs. Ancestral Node | Predicted Length (bp) | Ancestor Length (bp) |
|---|---|---|---|---|---|---|
| Accipiter gentilis | 99.65 | 65.29 | 67.31 | 26.72 | 2000 | 12,484 |
| Alligator mississippiensis | 99.60 | 65.30 | 66.98 | 26.98 | 2000 | 13,631 |
| Anas platyrhynchos | 99.65 | 64.90 | 65.90 | 26.05 | 2000 | 13,652 |
| Aptenodytes forsteri | 99.70 | 64.92 | 65.06 | 27.26 | 2000 | 13,652 |
| Archilochus colubris | 99.60 | 64.87 | 64.85 | 27.09 | 2000 | 13,649 |
| Boa constrictor | 99.65 | 65.17 | 64.76 | 26.81 | 2000 | 13,590 |
| Caiman crocodilus | 99.60 | 64.93 | 64.74 | 27.53 | 2000 | 13,635 |
| Chelonia mydas | 99.65 | 65.38 | 64.69 | 26.78 | 2000 | 13,641 |
| Columba livia | 99.55 | 65.20 | 64.58 | 26.21 | 2000 | 13,651 |
| Crocodylus porosus | 99.65 | 65.30 | 64.51 | 27.03 | 2000 | 13,636 |
| Dromaius novaehollandiae | 99.65 | 65.29 | 64.43 | 26.19 | 2000 | 13,651 |
| Gallus gallus | 99.60 | 64.84 | 64.40 | 27.09 | 2000 | 13,648 |
| Gekko gecko | 99.60 | 66.27 | 64.19 | 27.77 | 2000 | 13,548 |
| Meleagris gallopavo | 99.60 | 65.41 | 64.15 | 26.98 | 2000 | 13,651 |
| Melopsittacus undulatus | 99.70 | 65.05 | 64.06 | 27.20 | 2000 | 13,652 |
| Nerodia sipedon | 99.70 | 65.18 | 63.98 | 26.99 | 2000 | 13,546 |
| Pelodiscus sinensis | 99.60 | 64.81 | 63.92 | 26.69 | 2000 | 13,638 |
| Sphenodon punctatus | 99.75 | 65.16 | 63.89 | 27.15 | 2000 | 11,162 |
| Struthio camelus | 99.65 | 64.87 | 63.87 | 26.77 | 2000 | 13,652 |
| Varanus salvator | 99.65 | 65.00 | 63.86 | 26.95 | 2000 | 13,594 |
| Xenopus laevis | 99.75 | 64.64 | 63.73 | 26.53 | 2000 | 13,603 |
Table 7. Starting-position robustness: unshifted baseline per species.
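| Species | pred_len | obs_%ID_vs_ASR | obs_aln_len | region_label | frac_dloop | frac_coding | hit_start | hit_end | %ID_vs_speciesRAW |
|---|---|---|---|---|---|---|---|---|---|
| Accipiter gentilis | 2000 | 53.50 | 2157 | DLOOP | 1.000 | 0.000 | 1 | 1109 | 55.03 |
| Alligator mississippiensis | 2000 | 51.20 | 2127 | CODING | 0.000 | 0.999 | 1 | 1099 | 55.05 |
| Anas platyrhynchos | 2000 | 51.89 | 2166 | DLOOP | 0.946 | 0.054 | 1 | 1109 | 55.03 |
| Aptenodytes forsteri | 2000 | 51.51 | 2120 | CODING | 0.000 | 0.990 | 1 | 1099 | 55.15 |
| Archilochus colubris | 2000 | 51.30 | 2160 | CODING | 0.000 | 1.000 | 1 | 1099 | 55.05 |
| Boa constrictor | 2000 | 51.75 | 2201 | CODING | 0.000 | 1.000 | 1 | 1109 | 55.10 |
| Caiman crocodilus | 2000 | 50.89 | 2089 | CODING | 0.001 | 0.999 | 1 | 1109 | 55.13 |
| Chelonia mydas | 2000 | 53.33 | 2162 | CODING | 0.000 | 1.000 | 1 | 1099 | 55.10 |
| Columba livia | 2000 | 51.67 | 2212 | CODING | 0.000 | 1.000 | 1 | 1099 | 55.10 |
| Crocodylus porosus | 2000 | 51.59 | 2206 | CODING | 0.005 | 0.995 | 1 | 1108 | 55.15 |
| Dromaius novaehollandiae | 2000 | 51.39 | 2049 | CODING | 0.000 | 1.000 | 1 | 1100 | 55.18 |
| Gallus gallus | 2000 | 50.97 | 2109 | DLOOP | 1.000 | 0.000 | 1 | 1110 | 55.15 |
| Gekko gecko | 2000 | 50.18 | 2198 | CODING | 0.000 | 0.997 | 1 | 1107 | 55.03 |
| Meleagris gallopavo | 2000 | 51.90 | 2079 | CODING | 0.000 | 1.000 | 1 | 1105 | 55.20 |
| Melopsittacus undulatus | 2000 | 52.86 | 2115 | CODING | 0.000 | 1.000 | 1 | 1100 | 55.18 |
| Nerodia sipedon | 2000 | 51.81 | 2077 | CODING | 0.000 | 0.965 | 1 | 1109 | 54.98 |
| Pelodiscus sinensis | 2000 | 53.20 | 2175 | CODING | 0.000 | 1.000 | 1 | 1100 | 55.13 |
| Sphenodon punctatus | 2000 | 51.75 | 2085 | CODING | 0.000 | 1.000 | 1 | 1111 | 55.22 |
| Struthio camelus | 2000 | 52.99 | 2221 | CODING | 0.000 | 1.000 | 1 | 1100 | 55.03 |
| Varanus salvator | 2000 | 50.84 | 2140 | CODING | 0.000 | 1.000 | 1 | 1100 | 55.08 |
| Xenopus laevis | 2000 | 53.35 | 2105 | DLOOP | 1.000 | 0.000 | 1 | 1205 | 55.95 |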
Table 8. Genomic context and local-block coordinates.
| Species | Accession | hit_start | hit_end | region_label | frac_dloop | frac_coding |
|---|---|---|---|---|---|---|
| Accipiter gentilis | NC_011818.1 | 1 | 1109 | DLOOP | 1.000 | 0.000 |
| Alligator mississippiensis | NC_001922.1 | 1 | 1099 | CODING | 0.000 | 0.999 |
| Anas platyrhynchos | NC_009684.1 | 1 | 1109 | DLOOP | 0.946 | 0.054 |
| Aptenodytes forsteri | NC_027938.1 | 1 | 1099 | CODING | 0.000 | 0.990 |
| Archilochus colubris | NC_010094.1 | 1 | 1099 | CODING | 0.000 | 1.000 |
| Boa constrictor | NC_007398.1 | 1 | 1109 | CODING | 0.000 | 1.000 |
| Caiman crocodilus | NC_002744.2 | 1 | 1109 | CODING | 0.001 | 0.999 |
| Chelonia mydas | NC_000886.1 | 1 | 1099 | CODING | 0.000 | 1.000 |
| Columba livia | NC_013978.1 | 1 | 1099 | CODING | 0.000 | 1.000 |
| Crocodylus porosus | NC_008143.1 | 1 | 1108 | CODING | 0.005 | 0.995 |
| Dromaius novaehollandiae | NC_002784.1 | 1 | 1100 | CODING | 0.000 | 1.000 |
| Gallus gallus | NC_001323.1 | 1 | 1110 | DLOOP | 1.000 | 0.000 |
| Gekko gecko | NC_007627.1 | 1 | 1107 | CODING | 0.000 | 0.997 |
| Meleagris gallopavo | NC_010195.2 | 1 | 1105 | CODING | 0.000 | 1.000 |
| Melopsittacus undulatus | NC_009134.1 | 1 | 1100 | CODING | 0.000 | 1.000 |
| Nerodia sipedon | NC_015793.1 | 1 | 1109 | CODING | 0.000 | 0.965 |
| Pelodiscus sinensis | NC_068236.1 | 1 | 1100 | CODING | 0.000 | 1.000 |
| Sphenodon punctatus | NC_004815.1 | 1 | 1111 | CODING | 0.000 | 1.000 |
| Struthio camelus | NC_002785.1 | 1 | 1100 | CODING | 0.000 | 1.000 |
| Varanus salvator | NC_010974.1 | 1 | 1100 | CODING | 0.000 | 1.000 |
| Xenopus laevis | NC_001573.1 | 1 | 1205 | DLOOP | 1.000 | 0.000 |
Table 9. Starting-position robustness: across-shift summary.
| Species | Pred len | obs_%ID | shift_mean_%ID | shift_sd %ID | shift_min %ID | shift_max %ID | shift_range %ID | n Shifts |
|---|---|---|---|---|---|---|---|---|
| Accipiter gentilis | 2000 | 535.002 | 527.450 | 0.5612 | 520.055 | 536.563 | 16.507 | 10 |
| Alligator mississippiensis | 2000 | 511.989 | 505.175 | 0.2728 | 501.892 | 511.080 | 0.9188 | 10 |
| Anas platyrhynchos | 2000 | 518.929 | 516.238 | 0.7002 | 502.793 | 523.378 | 20.585 | 10 |
| Aptenodytes forsteri | 2000 | 515.094 | 517.512 | 0.5896 | 507.531 | 525.797 | 18.267 | 10 |
| Archilochus colubris | 2000 | 512.963 | 515.607 | 0.3139 | 511.510 | 519.667 | 0.8157 | 10 |
| Boa constrictor | 2000 | 517.492 | 518.680 | 0.5296 | 511.855 | 526.392 | 14.537 | 10 |
| Caiman crocodilus | 2000 | 508.856 | 502.821 | 0.4685 | 496.070 | 508.310 | 12.240 | 10 |
| Chelonia mydas | 2000 | 533.302 | 531.647 | 0.2796 | 526.437 | 534.411 | 0.7974 | 10 |
| Columba livia | 2000 | 516.727 | 517.629 | 0.4275 | 508.024 | 521.800 | 13.776 | 10 |
| Crocodylus porosus | 2000 | 515.866 | 523.659 | 0.7944 | 512.749 | 532.449 | 19.700 | 10 |
| Dromaius novaehollandiae | 2000 | 513.909 | 524.185 | 0.7652 | 513.896 | 535.611 | 21.716 | 10 |
| Gallus gallus | 2000 | 509.720 | 497.757 | 0.6165 | 488.899 | 506.548 | 17.649 | 10 |
| Gekko gecko | 2000 | 501.820 | 506.012 | 0.2900 | 501.590 | 511.683 | 10.093 | 10 |
| Meleagris gallopavo | 2000 | 519.000 | 529.260 | 0.6319 | 515.691 | 539.101 | 23.410 | 10 |
| Melopsittacus undulatus | 2000 | 528.605 | 529.779 | 0.3075 | 523.585 | 532.619 | 0.9034 | 10 |
| Nerodia sipedon | 2000 | 518.055 | 517.793 | 0.2972 | 513.858 | 521.574 | 0.7716 | 10 |
| Pelodiscus sinensis | 2000 | 531.954 | 529.123 | 0.4390 | 521.159 | 534.138 | 12.979 | 10 |
| Sphenodon punctatus | 2000 | 517.506 | 519.960 | 0.2157 | 517.092 | 523.652 | 0.6560 | 10 |
| Struthio camelus | 2000 | 529.941 | 526.857 | 0.3905 | 521.024 | 532.680 | 11.656 | 10 |
| Varanus salvator | 2000 | 508.411 | 514.951 | 0.3618 | 507.914 | 519.327 | 11.413 | 10 |
| Xenopus laevis | 2000 | 533.492 | 538.685 | 0.5428 | 530.784 | 546.494 | 15.711 | 10 |
Table 10. Starting-position robustness: per-offset identities.
| Species | shift_index | shift_offset | shift_%ID (scaled ×10) |
|---|---|---|---|
| Accipiter gentilis | 0 | 1265 | 526.603 |
| Accipiter gentilis | 1 | 1895 | 536.563 |
| Accipiter gentilis | 2 | 1092 | 527.548 |
| Accipiter gentilis | 3 | 1453 | 520.092 |
| Accipiter gentilis | 4 | 749 | 528.311 |
| Accipiter gentilis | 5 | 1169 | 523.701 |
| Accipiter gentilis | 6 | 1199 | 535.974 |
| Accipiter gentilis | 7 | 1498 | 520.055 |
| Accipiter gentilis | 8 | 339 | 529.078 |
| Accipiter gentilis | 9 | 1587 | 526.577 |
| Alligator mississippiensis | 0 | 1688 | 511.080 |
| Alligator mississippiensis | 1 | 681 | 501.892 |
| Alligator mississippiensis | 2 | 785 | 503.817 |
| Alligator mississippiensis | 3 | 1300 | 502.557 |
| Alligator mississippiensis | 4 | 1617 | 507.069 |
| Alligator mississippiensis | 5 | 736 | 504.556 |
| Alligator mississippiensis | 6 | 1887 | 505.254 |
| Alligator mississippiensis | 7 | 630 | 503.700 |
| Alligator mississippiensis | 8 | 801 | 504.274 |
| Alligator mississippiensis | 9 | 1427 | 507.554 |
| Anas platyrhynchos | 0 | 419 | 514.023 |
| Anas platyrhynchos | 1 | 1559 | 519.095 |
| Anas platyrhynchos | 2 | 1344 | 502.793 |
| Anas platyrhynchos | 3 | 1769 | 523.378 |
| Anas platyrhynchos | 4 | 1868 | 521.938 |
| Anas platyrhynchos | 5 | 747 | 517.593 |
| Anas platyrhynchos | 6 | 223 | 519.159 |
| Anas platyrhynchos | 7 | 871 | 517.734 |
| Anas platyrhynchos | 8 | 1680 | 521.533 |
| Anas platyrhynchos | 9 | 1300 | 505.131 |
| Aptenodytes forsteri | 0 | 816 | 519.511 |
| Aptenodytes forsteri | 1 | 1843 | 511.993 |
| Aptenodytes forsteri | 2 | 133 | 518.325 |
| Aptenodytes forsteri | 3 | 1035 | 525.797 |
| Aptenodytes forsteri | 4 | 1431 | 509.451 |
| Aptenodytes forsteri | 5 | 1368 | 521.292 |
| Aptenodytes forsteri | 6 | 712 | 518.571 |
| Aptenodytes forsteri | 7 | 1819 | 507.531 |
| Aptenodytes forsteri | 8 | 826 | 521.327 |
| Aptenodytes forsteri | 9 | 826 | 521.327 |
| Archilochus colubris | 0 | 1641 | 519.667 |
| Archilochus colubris | 1 | 1510 | 511.510 |
| Archilochus colubris | 2 | 1638 | 519.667 |
| Archilochus colubris | 3 | 636 | 512.459 |
| Archilochus colubris | 4 | 1359 | 517.383 |
| Archilochus colubris | 5 | 36 | 514.479 |
| Archilochus colubris | 6 | 1704 | 516.129 |
| Archilochus colubris | 7 | 723 | 514.880 |
| Archilochus colubris | 8 | 1217 | 518.332 |
| Archilochus colubris | 9 | 242 | 511.561 |
| Boa constrictor | 0 | 943 | 524.793 |
| Boa constrictor | 1 | 504 | 521.615 |
| Boa constrictor | 2 | 1961 | 519.067 |
| Boa constrictor | 3 | 350 | 526.392 |
| Boa constrictor | 4 | 1629 | 513.702 |
| Boa constrictor | 5 | 1744 | 520.626 |
| Boa constrictor | 6 | 642 | 522.233 |
| Boa constrictor | 7 | 1391 | 511.855 |
| Boa constrictor | 8 | 1414 | 512.905 |
| Boa constrictor | 9 | 1887 | 513.612 |
| Caiman crocodilus | 0 | 1874 | 501.415 |
| Caiman crocodilus | 1 | 1238 | 496.070 |
| Caiman crocodilus | 2 | 936 | 508.310 |
| Caiman crocodilus | 3 | 1235 | 496.070 |
| Caiman crocodilus | 4 | 542 | 507.435 |
| Caiman crocodilus | 5 | 1646 | 506.824 |
| Caiman crocodilus | 6 | 411 | 504.753 |
| Caiman crocodilus | 7 | 1121 | 497.925 |
| Caiman crocodilus | 8 | 137 | 503.646 |
| Caiman crocodilus | 9 | 1307 | 505.758 |
| Chelonia mydas | 0 | 1076 | 534.202 |
| Chelonia mydas | 1 | 1985 | 534.411 |
| Chelonia mydas | 2 | 1267 | 529.813 |
| Chelonia mydas | 3 | 1986 | 534.411 |
| Chelonia mydas | 4 | 1182 | 526.437 |
| Chelonia mydas | 5 | 1584 | 529.086 |
| Chelonia mydas | 6 | 89 | 530.790 |
| Chelonia mydas | 7 | 147 | 532.689 |
| Chelonia mydas | 8 | 834 | 530.220 |
| Chelonia mydas | 9 | 1971 | 534.411 |
| Columba livia | 0 | 1337 | 519.157 |
| Columba livia | 1 | 1847 | 518.312 |
| Columba livia | 2 | 1269 | 518.934 |
| Columba livia | 3 | 110 | 515.525 |
| Columba livia | 4 | 1383 | 508.024 |
| Columba livia | 5 | 35 | 521.252 |
| Columba livia | 6 | 1300 | 517.774 |
| Columba livia | 7 | 403 | 521.800 |
| Columba livia | 8 | 403 | 521.800 |
| Columba livia | 9 | 214 | 513.715 |
| Crocodylus porosus | 0 | 1918 | 518.194 |
| Crocodylus porosus | 1 | 1250 | 530.835 |
| Crocodylus porosus | 2 | 683 | 525.234 |
| Crocodylus porosus | 3 | 885 | 525.224 |
| Crocodylus porosus | 4 | 1171 | 532.398 |
| Crocodylus porosus | 5 | 1618 | 513.953 |
| Crocodylus porosus | 6 | 468 | 530.303 |
| Crocodylus porosus | 7 | 1125 | 532.449 |
| Crocodylus porosus | 8 | 93 | 515.250 |
| Crocodylus porosus | 9 | 1589 | 512.749 |
| Dromaius novaehollandiae | 0 | 1137 | 521.458 |
| Dromaius novaehollandiae | 1 | 1487 | 520.935 |
| Dromaius novaehollandiae | 2 | 1322 | 523.902 |
| Dromaius novaehollandiae | 3 | 629 | 535.611 |
| Dromaius novaehollandiae | 4 | 1737 | 518.221 |
| Dromaius novaehollandiae | 5 | 1825 | 516.325 |
| Dromaius novaehollandiae | 6 | 437 | 534.253 |
| Dromaius novaehollandiae | 7 | 1252 | 524.479 |
| Dromaius novaehollandiae | 8 | 1995 | 513.896 |
| Dromaius novaehollandiae | 9 | 370 | 532.768 |
| Gallus gallus | 0 | 1083 | 497.530 |
| Gallus gallus | 1 | 1078 | 497.530 |
| Gallus gallus | 2 | 1082 | 497.530 |
| Gallus gallus | 3 | 378 | 502.770 |
| Gallus gallus | 4 | 1587 | 502.719 |
| Gallus gallus | 5 | 1357 | 488.919 |
| Gallus gallus | 6 | 103 | 506.548 |
| Gallus gallus | 7 | 1528 | 492.169 |
| Gallus gallus | 8 | 1013 | 502.956 |
| Gallus gallus | 9 | 1441 | 488.899 |
| Gekko gecko | 0 | 1508 | 511.683 |
| Gekko gecko | 1 | 1195 | 503.461 |
| Gekko gecko | 2 | 932 | 505.564 |
| Gekko gecko | 3 | 1993 | 501.590 |
| Gekko gecko | 4 | 1840 | 506.132 |
| Gekko gecko | 5 | 50 | 506.996 |
| Gekko gecko | 6 | 1648 | 508.204 |
| Gekko gecko | 7 | 1069 | 503.076 |
| Gekko gecko | 8 | 998 | 507.685 |
| Gekko gecko | 9 | 1110 | 505.731 |
| Meleagris gallopavo | 0 | 1540 | 527.725 |
| Meleagris gallopavo | 1 | 1339 | 536.316 |
| Meleagris gallopavo | 2 | 1492 | 530.769 |
| Meleagris gallopavo | 3 | 1910 | 526.676 |
| Meleagris gallopavo | 4 | 722 | 515.691 |
| Meleagris gallopavo | 5 | 1504 | 531.114 |
| Meleagris gallopavo | 6 | 1143 | 525.912 |
| Meleagris gallopavo | 7 | 1635 | 530.732 |
| Meleagris gallopavo | 8 | 1165 | 528.565 |
| Meleagris gallopavo | 9 | 1320 | 539.101 |
| Melopsittacus undulatus | 0 | 526 | 532.619 |
| Melopsittacus undulatus | 1 | 536 | 531.875 |
| Melopsittacus undulatus | 2 | 136 | 531.966 |
| Melopsittacus undulatus | 3 | 1451 | 527.700 |
| Melopsittacus undulatus | 4 | 267 | 529.661 |
| Melopsittacus undulatus | 5 | 1941 | 525.968 |
| Melopsittacus undulatus | 6 | 120 | 531.966 |
| Melopsittacus undulatus | 7 | 1850 | 523.585 |
| Melopsittacus undulatus | 8 | 348 | 530.317 |
| Melopsittacus undulatus | 9 | 647 | 532.131 |
| Nerodia sipedon | 0 | 1679 | 516.651 |
| Nerodia sipedon | 1 | 160 | 520.665 |
| Nerodia sipedon | 2 | 392 | 515.539 |
| Nerodia sipedon | 3 | 900 | 516.729 |
| Nerodia sipedon | 4 | 1951 | 513.988 |
| Nerodia sipedon | 5 | 1106 | 513.858 |
| Nerodia sipedon | 6 | 1707 | 517.241 |
| Nerodia sipedon | 7 | 518 | 520.299 |
| Nerodia sipedon | 8 | 1797 | 521.384 |
| Nerodia sipedon | 9 | 109 | 521.574 |
| Pelodiscus sinensis | 0 | 1135 | 531.063 |
| Pelodiscus sinensis | 1 | 307 | 521.159 |
| Pelodiscus sinensis | 2 | 874 | 523.062 |
| Pelodiscus sinensis | 3 | 715 | 525.599 |
| Pelodiscus sinensis | 4 | 1241 | 532.934 |
| Pelodiscus sinensis | 5 | 1468 | 530.658 |
| Pelodiscus sinensis | 6 | 1804 | 534.138 |
| Pelodiscus sinensis | 7 | 1856 | 530.575 |
| Pelodiscus sinensis | 8 | 1906 | 532.715 |
| Pelodiscus sinensis | 9 | 87 | 529.328 |
| Sphenodon punctatus | 0 | 880 | 520.561 |
| Sphenodon punctatus | 1 | 588 | 517.466 |
| Sphenodon punctatus | 2 | 1517 | 519.542 |
| Sphenodon punctatus | 3 | 1989 | 517.092 |
| Sphenodon punctatus | 4 | 318 | 521.236 |
| Sphenodon punctatus | 5 | 114 | 518.776 |
| Sphenodon punctatus | 6 | 1298 | 518.818 |
| Sphenodon punctatus | 7 | 944 | 519.535 |
| Sphenodon punctatus | 8 | 711 | 523.652 |
| Sphenodon punctatus | 9 | 779 | 522.919 |
| Struthio camelus | 0 | 140 | 528.578 |
| Struthio camelus | 1 | 426 | 532.680 |
| Struthio camelus | 2 | 721 | 524.816 |
| Struthio camelus | 3 | 1345 | 522.769 |
| Struthio camelus | 4 | 1386 | 521.024 |
| Struthio camelus | 5 | 1318 | 526.145 |
| Struthio camelus | 6 | 1230 | 524.966 |
| Struthio camelus | 7 | 539 | 529.412 |
| Struthio camelus | 8 | 1469 | 525.564 |
| Struthio camelus | 9 | 437 | 532.619 |
| Varanus salvator | 0 | 392 | 517.098 |
| Varanus salvator | 1 | 278 | 514.967 |
| Varanus salvator | 2 | 991 | 515.306 |
| Varanus salvator | 3 | 802 | 518.451 |
| Varanus salvator | 4 | 1255 | 507.914 |
| Varanus salvator | 5 | 879 | 513.182 |
| Varanus salvator | 6 | 1132 | 511.153 |
| Varanus salvator | 7 | 517 | 519.327 |
| Varanus salvator | 8 | 327 | 513.590 |
| Varanus salvator | 9 | 453 | 518.519 |
| Xenopus laevis | 0 | 1456 | 545.098 |
| Xenopus laevis | 1 | 1725 | 539.915 |
| Xenopus laevis | 2 | 683 | 540.246 |
| Xenopus laevis | 3 | 804 | 546.494 |
| Xenopus laevis | 4 | 1278 | 530.784 |
| Xenopus laevis | 5 | 398 | 543.160 |
| Xenopus laevis | 6 | 1642 | 537.827 |
| Xenopus laevis | 7 | 1562 | 537.234 |
| Xenopus laevis | 8 | 1961 | 535.305 |
| Xenopus laevis | 9 | 1279 | 530.784 |
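Note: Percent identity values in Tables 9 and 10 are reported at ×10 scale (e.g., 535.002 = 53.500%). The exceptions are in Table 9: the shift_sd column and the shift_range entries printed below 1 (e.g., 0.9188) are in unscaled percentage points, as can be checked against the per-offset minima and maxima in Table 10.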
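The scale conversion in the note, and the way the across-shift summary in Table 9 follows from the per-offset values in Table 10, can be illustrated with a minimal Python sketch. It uses the ten *Accipiter gentilis* identities copied from Table 10; the helper name `to_percent` and the choice of the sample standard deviation (n − 1) are assumptions made for illustration, not details confirmed by the article.

```python
import statistics

# Per-offset identities for Accipiter gentilis, copied from Table 10 (x10 scale).
scaled_ids = [526.603, 536.563, 527.548, 520.092, 528.311,
              523.701, 535.974, 520.055, 529.078, 526.577]


def to_percent(scaled):
    """Convert a x10-scaled identity (e.g., 535.002) to percent (53.5002)."""
    return scaled / 10.0


percent_ids = [to_percent(v) for v in scaled_ids]

# Across-shift summary in true percent units.
summary = {
    "mean": statistics.mean(percent_ids),
    "sd": statistics.stdev(percent_ids),  # sample SD (n - 1); an assumption
    "min": min(percent_ids),
    "max": max(percent_ids),
    "range": max(percent_ids) - min(percent_ids),
    "n": len(percent_ids),
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}" if name != "n" else f"{name}: {value}")
```

Under these assumptions the recomputed mean, minimum, maximum, and range agree with the Table 9 row for this species once the ×10 scale is removed, and the sample standard deviation agrees with the printed shift_sd value of 0.5612 without rescaling.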
Table 11. Sensitivity of results to alignment and tree/ASR choices.
| Species_Key | mega_pid | mega_len | mega_score | loso_pid | loso_len | loso_score | delta_pid_pp |
|---|---|---|---|---|---|---|---|
| Accipiter gentilis | 43.46 | 3104 | 1315.5 | 42.238 | 3092 | 1295.5 | −1.222 |
| Alligator mississippiensis | 41.106 | 3092 | 1248.5 | 41.145 | 3128 | 1273.0 | 0.038 |
| Anas platyrhynchos | 42.741 | 3065 | 1278.5 | 43.055 | 3038 | 1279.5 | 0.314 |
| Aptenodytes forsteri | 42.16 | 3074 | 1265.0 | 43.828 | 2973 | 1285.0 | 1.668 |
| Archilochus colubris | 42.166 | 3019 | 1265.5 | 42.848 | 3006 | 1272.5 | 0.681 |
| Boa constrictor | 42.034 | 3107 | 1292.0 | 42.778 | 3053 | 1288.0 | 0.743 |
| Caiman crocodilus | 42.775 | 2955 | 1248.5 | 42.453 | 3001 | 1259.0 | −0.322 |
| Chelonia mydas | 42.631 | 3094 | 1299.0 | 43.902 | 2968 | 1287.0 | 1.271 |
| Columba livia | 41.608 | 3134 | 1271.5 | 43.307 | 2988 | 1278.0 | 1.698 |
| Crocodylus porosus | 41.754 | 3056 | 1260.0 | 42.742 | 2983 | 1271.0 | 0.988 |
| Dromaius novaehollandiae | 43.34 | 2958 | 1273.5 | 44.166 | 2957 | 1297.0 | 0.826 |
| Gallus gallus | 42.145 | 3030 | 1252.5 | 41.934 | 3081 | 1268.0 | −0.211 |
| Gekko gecko | 43.148 | 2948 | 1254.0 | 41.643 | 3129 | 1272.0 | −1.505 |
| Meleagris gallopavo | 41.863 | 3060 | 1260.5 | 42.28 | 3070 | 1277.5 | 0.417 |
| Melopsittacus undulatus | 41.68 | 3131 | 1286.0 | 43.394 | 3035 | 1294.0 | 1.714 |
| Nerodia sipedon | 41.672 | 3086 | 1265.5 | 42.998 | 3042 | 1287.5 | 1.326 |
| Pelodiscus sinensis | 42.396 | 3038 | 1274.5 | 43.595 | 3060 | 1301.0 | 1.198 |
| Sphenodon punctatus | 44.534 | 2964 | 1282.5 | 41.924 | 3108 | 1279.0 | −2.61 |
| Struthio camelus | 42.031 | 3043 | 1265.0 | 44.295 | 2971 | 1296.5 | 2.264 |
| Varanus salvator | 42.768 | 3042 | 1279.0 | 42.797 | 3103 | 1296.0 | 0.029 |
| Xenopus laevis | 43.662 | 3037 | 1310.0 | 44.477 | 3051 | 1352.5 | 0.816 |
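The delta_pid_pp column in Table 11 appears to be the difference loso_pid − mega_pid in percentage points; this reading is inferred from the tabulated values rather than stated in the table itself. A short Python sketch, using three rows copied from Table 11 and a hypothetical helper `delta_pp`, checks that interpretation:

```python
# (mega_pid, loso_pid, reported delta_pid_pp) for three species from Table 11.
rows = {
    "Accipiter gentilis": (43.46, 42.238, -1.222),
    "Struthio camelus": (42.031, 44.295, 2.264),
    "Sphenodon punctatus": (44.534, 41.924, -2.61),
}


def delta_pp(mega_pid, loso_pid):
    """Assumed definition: LOSO identity minus MEGA identity, in percentage points."""
    return loso_pid - mega_pid


# Recompute the difference for each species and compare with the reported value.
recomputed = {sp: round(delta_pp(m, l), 3) for sp, (m, l, _) in rows.items()}
max_abs_diff = max(abs(recomputed[sp] - rows[sp][2]) for sp in rows)

for sp, value in recomputed.items():
    print(f"{sp}: recomputed {value:+.3f}, reported {rows[sp][2]:+.3f}")
```

A positive value therefore means the LOSO-based identity is higher than the MEGA-based one. For some other rows the recomputed difference deviates from the printed value by about 0.001, which is consistent with the printed identities having been rounded to three decimals.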
Table 12. Summary of per-species median %ID (Equal-length) for consensus two-flank imputation and baselines.
| Species | C_EqualLen_%ID_median | NN_EqualLen_%ID_median | FlankCopy_EqualLen_%ID_median | MaskReps |
|---|---|---|---|---|
| Accipiter gentilis | 34.6 | 27.4 | 29.4 | 3 |
| Alligator mississippiensis | 28.2 | 26.2 | 26.2 | 3 |
| Anas platyrhynchos | 33.6 | 27.6 | 31.0 | 3 |
| Aptenodytes forsteri | 25.8 | 31.6 | 30.8 | 3 |
| Archilochus colubris | 26.6 | 25.2 | 27.0 | 3 |
| Boa constrictor | 38.2 | 24.4 | 33.6 | 3 |
| Caiman crocodilus | 29.0 | 23.8 | 27.8 | 3 |
| Chelonia mydas | 28.6 | 23.8 | 26.4 | 3 |
| Columba livia | 27.0 | 25.6 | 30.2 | 3 |
| Crocodylus porosus | 33.6 | 28.2 | 30.6 | 3 |
| Dromaius novaehollandiae | 26.6 | 31.0 | 27.2 | 3 |
| Gallus gallus | 31.8 | 28.6 | 34.2 | 3 |
| Gekko gecko | 25.6 | 24.6 | 30.2 | 3 |
| Meleagris gallopavo | 31.0 | 30.2 | 27.0 | 3 |
| Melopsittacus undulatus | 29.6 | 27.8 | 30.6 | 3 |
| Nerodia sipedon | 39.8 | 34.4 | 34.2 | 3 |
| Pelodiscus sinensis | 29.6 | 30.2 | 24.8 | 3 |
| Sphenodon punctatus | 32.2 | 22.6 | 28.0 | 3 |
| Struthio camelus | 25.0 | 30.4 | 22.8 | 3 |
| Varanus salvator | 27.2 | 23.8 | 30.4 | 3 |
| Xenopus laevis | 31.4 | 28.0 | 26.0 | 3 |
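To read Table 12 at a glance, the per-species medians can be tallied into simple win counts of the consensus imputation (C) against the nearest-neighbor (NN) and flank-copy baselines. The Python sketch below does this with the values copied from Table 12; the tally is an illustrative summary derived here, not a statistic reported by the authors, and the variable names are hypothetical.

```python
# species: (C, NN, FlankCopy) equal-length median %ID, copied from Table 12.
medians = {
    "Accipiter gentilis": (34.6, 27.4, 29.4),
    "Alligator mississippiensis": (28.2, 26.2, 26.2),
    "Anas platyrhynchos": (33.6, 27.6, 31.0),
    "Aptenodytes forsteri": (25.8, 31.6, 30.8),
    "Archilochus colubris": (26.6, 25.2, 27.0),
    "Boa constrictor": (38.2, 24.4, 33.6),
    "Caiman crocodilus": (29.0, 23.8, 27.8),
    "Chelonia mydas": (28.6, 23.8, 26.4),
    "Columba livia": (27.0, 25.6, 30.2),
    "Crocodylus porosus": (33.6, 28.2, 30.6),
    "Dromaius novaehollandiae": (26.6, 31.0, 27.2),
    "Gallus gallus": (31.8, 28.6, 34.2),
    "Gekko gecko": (25.6, 24.6, 30.2),
    "Meleagris gallopavo": (31.0, 30.2, 27.0),
    "Melopsittacus undulatus": (29.6, 27.8, 30.6),
    "Nerodia sipedon": (39.8, 34.4, 34.2),
    "Pelodiscus sinensis": (29.6, 30.2, 24.8),
    "Sphenodon punctatus": (32.2, 22.6, 28.0),
    "Struthio camelus": (25.0, 30.4, 22.8),
    "Varanus salvator": (27.2, 23.8, 30.4),
    "Xenopus laevis": (31.4, 28.0, 26.0),
}

# Count species where the consensus median strictly exceeds each baseline.
wins_vs_nn = sum(c > nn for c, nn, _ in medians.values())
wins_vs_flank = sum(c > fc for c, _, fc in medians.values())

print(f"C > NN in {wins_vs_nn} of {len(medians)} species")
print(f"C > FlankCopy in {wins_vs_flank} of {len(medians)} species")
```

Because each median is based on only three mask replicates (MaskReps = 3), such counts are best treated as a descriptive aid rather than a significance test.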
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Angelakis, D.; Cavouras, D.; Glotsos, D.T.; Kostopoulos, S.A.; Athanasiadis, E.I.; Kalatzis, I.K.; Asvestas, P.A. In Silico Proof of Concept: Conditional Deep Learning-Based Prediction of Short Mitochondrial DNA Fragments in Archosaurs. AI 2026, 7, 27. https://doi.org/10.3390/ai7010027
