In Silico Proof of Concept: Conditional Deep Learning-Based Prediction of Short Mitochondrial DNA Fragments in Archosaurs

Angelakis, Dimitris; Cavouras, Dionisis; Glotsos, Dimitris Th.; Kostopoulos, Spiros A.; Athanasiadis, Emmanouil I.; Kalatzis, Ioannis K.; Asvestas, Pantelis A.

doi:10.3390/ai7010027


by Dimitris Angelakis *, Dionisis Cavouras, Dimitris Th. Glotsos, Spiros A. Kostopoulos, Emmanouil I. Athanasiadis, Ioannis K. Kalatzis and Pantelis A. Asvestas

Department of Biomedical Engineering, University of West Attica, 122 43 Athens, Greece

* Author to whom correspondence should be addressed.

AI 2026, 7(1), 27; https://doi.org/10.3390/ai7010027

Submission received: 10 December 2025 / Revised: 9 January 2026 / Accepted: 10 January 2026 / Published: 14 January 2026

(This article belongs to the Special Issue Transforming Biomedical Innovation with Artificial Intelligence)


Abstract

This study presents an in silico proof of concept exploring whether deep learning models can perform conditional mitochondrial DNA (mtDNA) sequence prediction across species boundaries. A CNN–BiLSTM model was trained under a leave-one-species-out (LOSO) scheme on complete mitochondrial genomes from 21 vertebrate species, primarily archosaurs. Model behavior was evaluated through multiple complementary tests. Under context-conditioned settings, the model performed next-nucleotide prediction using overlapping 200 bp windows to assemble contiguous 2000 bp fragments for held-out species; the resulting high token-level accuracy (>99%) under teacher forcing is reported as a diagnostic of conditional modeling capacity. To assess leakage-free performance, a two-flank masked-span imputation task was conducted as the primary evaluation, requiring free-running reconstruction of 500 bp interior spans using only distal flanking context; in this setting, the model consistently outperformed nearest-neighbor and demonstrated competitive performance relative to flank-copy baselines. Additional robustness analyses examined sensitivity to window placement, genomic region (coding versus D-loop), and random initialization. Biological plausibility was further assessed by comparing predicted fragments to reconstructed ancestral sequences and against composition-matched null models, where observed identities significantly exceeded null expectations. Using the National Center for Biotechnology Information (NCBI) BLAST web interface, BLASTn species identification was performed solely as a biological plausibility check, recovering the correct species as the top hit in all cases. Although limited by dataset size and the absence of ancient DNA damage modeling, these results demonstrate the feasibility of conditional mtDNA sequence prediction as an initial step toward more advanced generative and evolutionary modeling frameworks.

Keywords:

deep learning; CNN-BiLSTM; mitochondrial DNA; ancestral sequence reconstruction; archosaurs; AI for genomics

1. Introduction

Ancient DNA (aDNA) has revolutionized evolutionary biology by enabling the reconstruction of phylogenetic relationships, population histories, and adaptive traits of extinct species. However, the recovery of aDNA is severely constrained by post-mortem degradation processes, including hydrolysis, oxidation, and depurination, which fragment and chemically modify DNA over time. Mitochondrial DNA (mtDNA) has an estimated half-life of ~521 years under temperate, optimal conditions [1], limiting the recovery of authentic sequences to specimens typically younger than 1–2 million years. Consequently, genomic data from Mesozoic organisms such as non-avian dinosaurs, which became extinct over 66 million years ago, remain inaccessible to direct sequencing. Classical paleogenomic studies have primarily focused on Pleistocene and Holocene samples, where partial DNA recovery is still feasible. These efforts rely on next-generation sequencing (NGS), read mapping, and phylogenetic inference to reconstruct partial genomes or impute missing regions based on closely related species [2,3,4]. Traditional approaches to ancestral sequence reconstruction (ASR) employ maximum likelihood (ML) or Bayesian frameworks, implemented in tools such as PAML [5] and BEAST [6]. While robust, these methods depend on the availability of closely related extant sequences and assume predefined evolutionary models, potentially limiting their predictive scope for deeply divergent or extinct lineages. In parallel, deep learning has emerged as a powerful paradigm in computational biology, capable of learning complex, high-dimensional sequence patterns without explicit evolutionary modeling. Convolutional neural networks (CNNs) and recurrent architectures, including bidirectional long short-term memory (BiLSTM) networks, have been successfully applied to promoter prediction, enhancer detection, and variant effect analysis [7,8,9].
More recently, deep learning has achieved breakthroughs in protein structure prediction [10] and RNA secondary structure modeling [11]. Despite this progress, applications to ancestral DNA reconstruction or paleogenomic inference remain largely unexplored. Here, we present a proof-of-concept in silico feasibility study that explores whether deep learning models can learn non-random, biologically structured patterns in mitochondrial genomes. We compiled a dataset of 21 complete vertebrate mtDNA sequences, including archosaurs (birds and crocodilians) and non-archosaur outgroups (turtles, lepidosaurs, and an amphibian). The mtDNA was selected as the focus because of its higher copy number per cell, lack of recombination, and relatively small, circular genome, which make it more amenable to recovery and complete assembly than nuclear DNA, especially in degraded or low-yield samples. A CNN-BiLSTM model was trained in a LOSO scheme to perform next-nucleotide prediction on overlapping windows, and the resulting predicted 2000 bp fragments were compared to ancestral sequences inferred from MUSCLE alignments in MEGA X. In this setup, the model predicts each base using the true preceding context from the held-out species; this is not unconditional de novo sequence generation, but a test of whether the learned cross-species sequence dependencies transfer to unseen species. This study is purely computational and does not involve wet-lab experiments or the recovery of actual aDNA. Instead, it provides an initial computational assessment of whether deep learning can generate sequences that retain detectable evolutionary patterns. By bridging machine learning and classical phylogenetics, this exploratory framework introduces AI-assisted paleogenomic inference as a complementary direction for future studies.
This work is a preliminary proof of concept with a deliberately limited taxonomic and sequence scope, intended to assess feasibility rather than provide immediate paleogenomic reconstruction. Although no authentic aDNA is generated or analyzed, the ability to recover non-random, biologically structured patterns in short mtDNA fragments suggests potential future applications in paleogenomics. The core question is therefore not the model’s accuracy as a sequence ‘completer’, but its ability to generalize the learned sequence regularities to a species it has never seen, which is what the LOSO design tests.

Beyond its phylogenetic scope, AI-driven reconstruction of short mtDNA fragments could support several domains of biomedical research, including the recovery of evolutionary signal in highly fragmented samples, the study of selective pressures that shape mtDNA variants implicated in human disease, and the augmentation of bioinformatics workflows that operate on incomplete or low-coverage mitochondrial data.

Real-world deployment would require the integration of simulated or empirical aDNA, including post-mortem damage patterns and high fragmentation, which remains a direction for future work.

Related Work

Deep learning has been increasingly applied to sequence modeling in genomics, with early work exploring generative frameworks such as GANs and activation maximization for synthetic DNA design [12] and, more recently, latent diffusion models tailored for discrete nucleotide sequences [13]. In the field of ancestral sequence reconstruction (ASR), recent studies have proposed autoregressive deep models for proteins, capturing epistatic constraints and improving reconstruction accuracy over maximum-likelihood baselines [14,15]. However, these generative ASR efforts have focused almost exclusively on proteins rather than nucleotide sequences. For mtDNA, deep learning has been primarily applied to sequence classification and variant effect prediction, often using hybrid convolutional–recurrent architectures such as DanQ [16], which predicts functional elements like regulatory regions, but not to the generative reconstruction of missing fragments. Transformer-based approaches for phylogenetic inference, including Phyloformer [17] and Fusang [18], predict evolutionary distances or tree topologies from alignments but do not generate explicit ancestral or taxon-specific sequences. The potential for deep learning to capture complex biological and environmental patterns is increasingly recognized across diverse scientific domains. In hyperspectral imaging (HSI), Yang et al. (2021) demonstrated that standard CNNs often produce “discontinuous” features due to fixed receptive fields, a problem they solved with the Enhanced Multiscale Feature Fusion Network (EMFFN), which integrates spectral and spatial information across multiple parallel scales [19]. Building on these principles of high-dimensional data processing, HyperSIGMA [20] is a vision transformer-based foundation model scalable to over one billion parameters. HyperSIGMA uses a Sparse Sampling Attention (SSA) mechanism to overcome data redundancy by sampling the most informative contextual features, a challenge directly analogous to capturing the intricate dependencies and redundant motifs within mitochondrial DNA (mtDNA) sequences [20].

While these advancements highlight the generative and predictive capacity of deep learning in high-dimensional data, their application to evolutionary biology remains a critical frontier. Recent genomic perspectives on the adaptation of Galápagos iguanas illustrate the importance of identifying specific selective sweeps and ancestral signals to understand lineage evolution and divergence [21].

Motivated by these gaps, this study applies a lightweight CNN–BiLSTM architecture to explore the feasibility of conditional mtDNA sequence prediction across species boundaries. Rather than performing phylogenetic inference or de novo genome reconstruction, the proposed framework evaluates whether multiscale feature extraction and context-aware modeling can support the completion of missing mtDNA fragments under controlled conditioning. The model is assessed using a leakage-free evaluation design and alignment-based supporting sanity checks to verify biological plausibility and non-random evolutionary similarity, without claiming phylogenetic placement or inference.

The major contributions of this study are as follows:

(i) a computational proof of concept demonstrating context-conditioned mtDNA sequence prediction under a leave-one-species-out (LOSO) training scheme;

(ii) evidence that deep learning models can capture transferable sequence regularities under controlled conditioning that are shared across species;

(iii) a leakage-free evaluation using masked-span imputation that consistently improves over simple nearest-neighbor heuristics and performs competitively with flank-copy baselines; and

(iv) a validation framework employing composition-matched null models and ancestral alignments as sanity checks for non-random sequence similarity.

2. Materials and Methods

2.1. Data Collection

Complete mitochondrial genome sequences from 21 extant vertebrate species were retrieved from the National Center for Biotechnology Information (NCBI, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA) GenBank database in FASTA format. The dataset was designed to include primarily archosaur species (birds and crocodilians) alongside non-archosaur reptiles and a single amphibian outgroup to provide phylogenetic context for model evaluation. The species included are listed in Table 1.

Only complete, circularized mitochondrial genomes were included to ensure consistency in sequence length for alignment and modeling. The dataset of 21 species was selected based on the availability of complete, high-quality mitochondrial genomes representing archosaurs (birds and crocodilians) and key non-archosaur outgroups (turtles, lepidosaurs, and a single amphibian, Xenopus laevis). This selection was designed to capture both in-clade conservation and outgroup contrast, providing a phylogenetically informative yet computationally manageable starting point for a resource-constrained proof-of-concept study on standard workstation hardware.

2.2. Data Preprocessing

All raw mtDNA sequences were processed using Biopython v1.83 to ensure compatibility with the deep learning workflow. Sequences were converted to uppercase to maintain uniformity, and any ambiguous nucleotide characters (N) were removed to avoid uncertainty in downstream modeling. Following sequence cleaning, each mtDNA genome was segmented into overlapping windows of 200 bp with a stride of 50 bp to generate partially overlapping fragments that preserved local sequence continuity. These short windows served as the basic inputs for the deep learning model, enabling it to capture local sequence dependencies while increasing the effective dataset size. To generate model predictions at the desired fragment length, the deep learning pipeline was designed to reconstruct 2000 bp predicted sequences for the species left out under the LOSO evaluation scheme. Input windows were one-hot encoded in an L × 4 matrix representation (where L = 200 bp and columns correspond to A, C, G, and T) for CNN-BiLSTM model ingestion. This preprocessing procedure produced a species-dependent set of training windows standardized for model input, enabling the CNN-BiLSTM to learn predictive sequence patterns across all taxa while accommodating interspecies variability. The window size of 200 bp was selected to balance computational efficiency with the capture of sufficient local sequence context. This length is consistent with established deep learning frameworks in genomics, such as DanQ [16] and DeepSEA [22], which utilize 200 bp intervals to model regulatory chromatin profiles effectively. The 50 bp training stride was likewise a trade-off between computational feasibility and the ability to capture local mtDNA sequence motifs. A 200 bp context is sufficient to encompass common mitochondrial features such as conserved coding segments and structural motifs, while the 50 bp stride (150 bp of overlap between consecutive windows) maintains local sequence continuity without excessive data redundancy.
The 2000 bp predicted fragment length was chosen as a fixed comparison window to ensure uniform evaluation across species and to keep Smith–Waterman alignments computationally tractable. While the corresponding MEGA X ancestral sequences were typically much longer, restricting the analysis to the first 2000 bp allowed for consistent benchmarking of all predictions without incurring the quadratic runtime and memory costs associated with aligning entire mitochondrial genomes. The average sequence length was approximately 17,191 bp (range: 15,181 bp for Sphenodon punctatus to 18,905 bp for Boa constrictor). Given the circular nature of the mitochondrial genome, the sliding window yielded a total of 7220 input fragments across the entire dataset. This volume of data provided sufficient coverage for the model to learn localized sequence patterns under the LOSO framework.
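The cleaning and windowing steps described in this section can be illustrated with a minimal sketch. The function names are hypothetical, and two details are assumptions rather than statements of the authors' implementation: every character outside A/C/G/T is dropped (the text mentions only N), and windows wrap around the genome end to reflect circularity.

```python
def clean_sequence(seq: str) -> str:
    """Uppercase the sequence and keep only unambiguous bases.

    Assumption: anything other than A/C/G/T (e.g. N) is dropped.
    """
    return "".join(base for base in seq.upper() if base in "ACGT")


def sliding_windows(genome: str, window: int = 200, stride: int = 50,
                    circular: bool = True) -> list[str]:
    """Return overlapping windows of fixed length.

    With circular=True (an assumption), windows that start near the end
    of the genome wrap around to its beginning.
    """
    n = len(genome)
    if n < window:
        return []
    if circular:
        # Append the first window-1 bases so late windows can wrap.
        extended = genome + genome[: window - 1]
        starts = range(0, n, stride)
    else:
        extended = genome
        starts = range(0, n - window + 1, stride)
    return [extended[i : i + window] for i in starts]
```

With a 50 bp stride, a circular genome of length n yields about n/50 windows, whereas a purely linear scan yields roughly three fewer per genome.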

2.3. Feature Extraction

Each mtDNA window was transformed into a one-hot encoded matrix of size L × 4, where L is the nucleotide sequence length and the four columns represent the bases A, C, G, and T. This encoding treats nucleotides as categorical variables and is widely used in deep learning applications in genomics. No biologically engineered features (e.g., GC content or codon bias) were included, in line with the study’s primary objective: to assess whether intrinsic sequence patterns can be learned directly from raw nucleotide data, without incorporating explicit evolutionary assumptions or handcrafted inputs. To explore this, a CNN-BiLSTM model was implemented using TensorFlow/Keras v2.15. The input consisted of 200 bp one-hot encoded windows (L = 200), generated as described in Section 2.2. For each window, the model was trained for next-nucleotide prediction: the input comprised all nucleotides of the window except the final position (positions 1–199), and the targets were the nucleotides at positions 2–200, so that each output position corresponds to the base immediately following its input position (Section 2.4).
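As an illustration of the encoding and of the shifted input/target pairing, the following sketch builds one training example from a single window. It assumes the shift-by-one layout stated in Section 2.4 (input positions 1–199, targets 2–200); the helper names are hypothetical.

```python
BASES = "ACGT"
BASE_INDEX = {base: i for i, base in enumerate(BASES)}


def one_hot(window: str) -> list[list[int]]:
    """Encode a nucleotide string as an L x 4 matrix (columns A, C, G, T)."""
    rows = []
    for base in window:
        row = [0, 0, 0, 0]
        row[BASE_INDEX[base]] = 1
        rows.append(row)
    return rows


def next_base_pair(window: str) -> tuple[list[list[int]], list[int]]:
    """Return (input, target) for next-nucleotide prediction.

    Input: one-hot rows for all positions except the last.
    Target: class index of the base that follows each input position.
    """
    x = one_hot(window)[:-1]
    y = [BASE_INDEX[base] for base in window[1:]]
    return x, y
```

For a 200 bp window this gives 199 input rows and 199 target labels.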

2.4. Model Development

Model evaluation followed a LOSO scheme to assess generalization to unseen species. For each species in the dataset (21 complete mtDNA genomes in our runs), all remaining species were used for training, and the excluded species served as the independent test set. Within the training data, 10% of windows were held out for validation, yielding a train/validation/test split in each LOSO iteration: 90% of windows from the training species for fitting, 10% from those same species for validation, and 100% of the left-out species for testing. Input windows were 200 bp in length and one-hot encoded over {A, C, G, T}. Training windows were generated with a 50 bp stride, whereas test windows were generated with a 100 bp stride to reduce redundancy during evaluation. A 200 bp context was chosen to balance local mtDNA motif capture with computational feasibility on a modest workstation. The network architecture was a compact CNN-BiLSTM: a 1D convolutional layer (64 filters, kernel size = 5, ReLU) for local motif extraction; a bidirectional LSTM layer (64 units, returning sequences) to capture bidirectional context; a dropout layer (rate = 0.3) for regularization; and a dense softmax output (4 units) providing per-nucleotide predictions. Models were trained with categorical cross-entropy and Adam (learning rate = 0.001) for up to 10 epochs per LOSO fold, using Early Stopping (patience = 3, restore best weights) and ReduceLROnPlateau (factor = 0.5, patience = 2). For each LOSO test species, within-window sequence predictions (positions 2–200) were concatenated and truncated to 2000 bp to form the computational fragment. 
Performance metrics included nucleotide-level accuracy, precision, recall, macro-averaged F1, and percent identity between the predicted fragment and (i) the ground-truth sequence from the left-out species and (ii) reconstructed ancestral sequences (Section 2.5). For an initial all-species snapshot, MUSCLE multiple sequence alignments were paired with maximum likelihood (ML) ancestral reconstruction in MEGA X to provide a well-understood baseline that is easy to audit and reproduce from a graphical environment. For fair leave-one-species-out (LOSO) evaluation, MAFFT was used to re-align the remaining 20 taxa per fold because of its speed and robustness in batch mode, followed by an attachment-node ancestral reconstruction on the reduced set. Tree inference in the LOSO pipeline used a simple, reproducible guide tree (for example, neighbor-joining with midpoint rooting) that is sufficient to locate the ancestral attachment point for the held-out lineage in this proof of concept. The model configuration described above corresponds to the primary teacher-forced CNN–BiLSTM used for 200 bp next-nucleotide prediction and 2000 bp fragment assembly. A separate, compact CNN–BiLSTM configuration is used for the two-flank masked-span imputation task and is described explicitly in Section 2.7.
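The LOSO partitioning described in this section can be sketched as a small generator. This is a minimal illustration under stated assumptions: the function name and dictionary layout are hypothetical, and the 10% validation set is drawn by a seeded random shuffle of pooled training windows, which the text does not specify.

```python
import random


def loso_folds(windows_by_species: dict[str, list[str]],
               val_fraction: float = 0.1, seed: int = 0):
    """Yield one train/validation/test split per held-out species.

    All windows of the held-out species form the test set; the windows
    of the remaining species are pooled and split 90/10.
    """
    rng = random.Random(seed)
    for held_out in sorted(windows_by_species):
        pooled = [
            window
            for species, windows in sorted(windows_by_species.items())
            if species != held_out
            for window in windows
        ]
        rng.shuffle(pooled)  # assumption: random window-level split
        n_val = int(round(val_fraction * len(pooled)))
        yield {
            "held_out": held_out,
            "train": pooled[n_val:],
            "val": pooled[:n_val],
            "test": list(windows_by_species[held_out]),
        }
```

Because training windows overlap, a random window-level split leaves validation windows partly shared with training windows; only the held-out species is fully independent of the fitted model.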

To generate a predicted mtDNA fragment of length 2000 bp for a held-out species, we employ a deterministic sliding-window procedure under teacher-forced conditioning. Given the full mitochondrial genome of the held-out species, overlapping windows of fixed length 200 bp are extracted using a constant sampling stride.

For each window, the trained CNN–BiLSTM model performs sequence-to-sequence prediction within the window: the true nucleotide sequence at positions 1–199 is provided as input, and the model predicts the nucleotide identities at positions 2–200. Importantly, the true genomic sequence is always used as input context (teacher forcing), and predicted nucleotides are never fed back into subsequent predictions.

The predicted subsequences (199 bases per window) are concatenated in genomic order to form a long predicted sequence, which is then truncated to obtain a final fragment of exactly 2000 bp. As a result, the reported fragment represents an aggregate of position-wise predictions generated under true-context conditioning across sampled windows, rather than a free-running autoregressive reconstruction. The window sampling strides reported elsewhere (50 bp during training and 100 bp during testing) control the degree of redundancy among windows used for model fitting and evaluation, while fragment assembly itself is performed by concatenating all predicted within-window outputs prior to truncation. Algorithm 1 formalizes this reconstruction procedure.

Algorithm 1: Context-Conditioned mtDNA Fragment Prediction

Input:

G: full mtDNA genome sequence of the held-out species (string)
L_w: window length (200 bp)
S: window sampling stride for the held-out species (e.g., 100 bp in testing)
L_out: desired output fragment length (2000 bp)
M: trained CNN–BiLSTM model (sequence-to-sequence predictor over window positions)

Output:

P: predicted mtDNA fragment of length L_out

Procedure:

1. Initialize empty list P_preds.
2. Extract overlapping windows from the held-out genome. For start positions i = 0, S, 2S, … while i + L_w ≤ |G|:
   a. W = G[i : i + L_w]
   b. One-hot encode W as X_full ∈ {0,1}^{L_w×4}
   c. Form teacher-forced input sequence (true context within window): X_in = X_full[0 : L_w−1] (positions 1..199)
   d. Predict per-position next-base distribution within the window: Ŷ = M(X_in), where Ŷ has length L_w−1 (199 outputs)
   e. Convert Ŷ to discrete bases (e.g., argmax per position): ŷ_seq = decode(Ŷ) (a 199-nt string)
   f. Append ŷ_seq to P_preds
3. Concatenate predictions in window order: P_long = concatenate(P_preds)
4. Truncate (or pad, if needed) to required length: P = P_long[0 : L_out]
5. Return P
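Algorithm 1 can be expressed in a few lines of Python. The sketch below is illustrative only: the model is passed in as any callable that maps a context string to one row of four probabilities per position, and `copy_previous_model` is a hypothetical stand-in (not the trained CNN–BiLSTM) used so that the example runs without a deep learning framework.

```python
BASES = "ACGT"


def decode(probabilities) -> str:
    """Argmax over each row of 4 class probabilities, returned as a string."""
    return "".join(
        BASES[max(range(4), key=row.__getitem__)] for row in probabilities
    )


def predict_fragment(genome: str, model, window: int = 200,
                     stride: int = 100, out_len: int = 2000) -> str:
    """Teacher-forced fragment assembly following Algorithm 1."""
    pieces = []
    for start in range(0, len(genome) - window + 1, stride):
        true_window = genome[start : start + window]
        context = true_window[:-1]           # true positions 1..199
        pieces.append(decode(model(context)))  # predictions for 2..200
    return "".join(pieces)[:out_len]          # concatenate, then truncate


def copy_previous_model(context: str):
    """Hypothetical stand-in: predicts that each next base equals the current one."""
    return [[1.0 if b == base else 0.0 for b in BASES] for base in context]
```

Note that when the stride is shorter than the window, consecutive 199-base pieces overlap in genomic coordinates, so the concatenated output is an aggregate of window-level predictions rather than a single contiguous genomic span.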

2.5. Biological Validation

To evaluate the biological plausibility and database consistency of the CNN–BiLSTM–predicted mtDNA fragments, we implemented a multi-step in silico validation pipeline. This pipeline comprised multiple sequence alignment, reference-based ancestral reconstruction for contextual comparison, local alignment scoring, and similarity search. These analyses were designed as diagnostic sanity checks to verify that predicted fragments were non-random and compatible with known mitochondrial sequence structure, rather than to perform phylogenetic inference. Figure 1 summarizes the validation workflow.

2.5.1. Multiple Sequence Alignment (MSA)

Complete mitochondrial genomes from all 21 species were aligned using MUSCLE, as implemented in MEGA X v12. Default gap penalties and a codon-independent nucleotide scoring scheme were applied. The resulting alignments were manually inspected to verify positional homology, ensuring the absence of misaligned terminal regions, excessive gap clustering, or spurious insertions/deletions before proceeding to ancestral sequence reconstruction.

2.5.2. Ancestral Sequence Reconstruction

Using the curated MUSCLE alignment, ancestral nodes were inferred in MEGA X under a Maximum Likelihood (ML) framework employing the Tamura–Nei substitution model. Internal nodes corresponding to major phylogenetic splits (e.g., within Archosauria and between outgroup clades) were identified and exported as FASTA sequences. These reconstructed ancestral sequences served as reference points for assessing the phylogenetic plausibility of the machine learning–predicted fragments. Figure 2 illustrates the inferred phylogenetic tree with highlighted ancestral nodes.

2.5.3. Leave-One-Species-Out Ancestral Sequence Reconstruction with MAFFT

To ensure a fair evaluation aligned with the LOSO scheme used for model training, an additional ancestral sequence reconstruction was performed excluding the held-out species from the MSA and tree for each fold. For each species, the remaining 20 mtDNA sequences were aligned using MAFFT v7.526 with automatic strategy selection and support for non-standard symbols. A Neighbor-Joining (NJ) tree was constructed using Biopython’s DistanceTreeConstructor with an identity-based distance matrix and midpoint rooting. To identify the equivalent internal node for reconstruction (approximating the held-out species’ parent), we first built a reference NJ tree from all 21 species and computed the descendant leaf set of each species’ parent node (excluding the species itself). In the LOSO-pruned tree, the internal node with the closest matching descendant set (exact match preferred; otherwise maximized Jaccard similarity) was selected. Fitch parsimony was then applied to reconstruct the ancestral sequence at this node, resolving ambiguities by preferring bases from the Fitch candidate set or the most frequent base in the column. The model’s 2000 bp predicted fragment for the held-out species was compared to this LOSO-reconstructed ancestral sequence using Smith–Waterman local alignment.
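The node-matching rule (exact descendant-set match preferred, otherwise maximum Jaccard similarity) can be sketched as follows. The function names and the representation of internal nodes as a dictionary of descendant leaf sets are assumptions made for illustration; extracting those sets from a tree object is left out.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two leaf sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def match_attachment_node(target_leaves: frozenset,
                          candidate_nodes: dict[str, frozenset]):
    """Pick the internal node of the pruned tree whose descendant set is
    most similar to the reference parent's descendant set.

    An exact match has similarity 1.0 and is therefore always preferred.
    Ties are resolved by dictionary order (an arbitrary choice).
    """
    best_name = max(
        candidate_nodes,
        key=lambda name: jaccard(target_leaves, candidate_nodes[name]),
    )
    return best_name, jaccard(target_leaves, candidate_nodes[best_name])
```

For example, with a reference set {sp_A, sp_B, sp_C} and candidates {sp_A, sp_B} and {sp_A, sp_B, sp_C, sp_D}, the similarities are 2/3 and 3/4, so the second node is selected.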

MUSCLE and MAFFT are both widely used for whole-mitogenome alignment and produce highly similar columns for conserved regions. ML in MEGA X provides a single, consistent all-species reference, while an NJ guide on reduced sets minimizes runtime and scripting complexity during LOSO. The biological validation in this study focuses on short, contiguous fragments and species-level identification rather than on model choice testing per se. The conclusions within this validation context rely on relative comparisons.

To assess whether methodological choices drive the results, sensitivity checks were performed by re-running selected folds with alternative settings: (i) MUSCLE in place of MAFFT in the LOSO pipeline, (ii) ML trees in place of NJ for the reduced 20-taxon sets, and (iii) recomputation of LOSO ancestral sequences under these variants. For each variant, predicted fragments were re-scored against the corresponding LOSO ancestor with the same Smith–Waterman parameters, and BLAST species identification was repeated under the fixed megablast configuration. A zero gap-open penalty was intentionally chosen for the Smith–Waterman local alignments to avoid penalizing short indels or alignment ‘jitter’ potentially introduced at the boundaries of the 200 bp sliding-window predictions. This setting, paired with a −2.5 extension penalty, prioritizes the identification of homologous blocks while remaining permissive of local uncertainties inherent in this window-based proof of concept. Importantly, identical parameters were applied to all DNA-mimicking nulls and baselines to ensure a controlled statistical comparison.

2.5.4. Local Alignment (Smith-Waterman)

Each predicted fragment, generated under the LOSO scheme, was locally aligned against the corresponding MEGA-inferred ancestral sequence using the Smith–Waterman algorithm. SW was implemented with Parasail using match/mismatch = 1/−2 and affine gaps (open = 0, extend = −2.5). Percent identity and alignment scores were calculated to assess whether predicted sequences shared detectable homology with ancestral nodes beyond random expectation. Both forward and reverse complement orientations were evaluated to capture the best possible local match. As an orthogonal validation step, all predicted sequences were queried using BLASTn against the NCBI nt database, restricted to vertebrate mitochondrial sequences. High-scoring segment pairs (HSPs) corresponding to the expected species or clade confirmed that the CNN-BiLSTM-generated fragments were consistent with known mitochondrial sequence structure, even in the absence of explicit evolutionary priors during model training. As a basic control, each predicted fragment was also compared against a randomly shuffled version of its corresponding ancestral sequence. This null model produced a baseline local identity of approximately 25% (Section 2.5.6), consistent with a uniform nucleotide distribution, supporting that the CNN–BiLSTM predictions reflect conserved sequence regularities rather than random composition.

2.5.5. Sensitivity to Alignment and Tree/ASR Choices

To quantify how methodological choices affect the predicted-to-ancestor identity, we compared two pipelines on the same 21-species panel. The MEGA arm used the curated MUSCLE alignment and maximum-likelihood (Tamura–Nei) tree from the original analyses to obtain the ancestral sequence for each species. The LOSO arm re-aligned the remaining 20 species with MAFFT (--auto) and built a Neighbor-Joining (NJ) tree using an identity-based distance in Biopython. To target the same ancestral position as the excluded taxon, we computed the descendant leaf set of that taxon’s parent in a 21-species reference NJ tree and, in each LOSO tree, selected the internal node with the highest Jaccard similarity of descendant sets. We then reconstructed the sequence at that node with Fitch parsimony. Predicted fragments were scored against both ancestors with Smith–Waterman (Biopython PairwiseAligner; match = 1, mismatch = −2, gap-open = 0, gap-extend = −2.5), and we report gap-inclusive % identity (matches/alignment columns), alignment length (including gap columns), and alignment score. The objective here is relative sensitivity (Δ%ID between arms) rather than a new absolute benchmark; both arms are evaluated with the same scoring engine and parameters.

2.5.6. DNA-Mimicking Null Simulations

To address potential bias from Smith–Waterman selecting the best local match (potentially inflating identities above a 25% random baseline), we simulated two DNA-mimicking null models using only the training-species mtDNA (excluding the held-out species). The first null trained an order-4 (k = 5) Markov model on nucleotide frequencies and transitions, sampling 2000 bp sequences. The second null extracted random 2000 bp circular windows from training mtDNA. For each species, we computed Monte Carlo p-values as the proportion of 2000 null samples with Smith–Waterman %ID ≥ observed (vs. the LOSO ancestral sequence), using adaptive sampling (up to 2000 per null) with early stopping if the one-sided 99.9% Wilson confidence interval excluded α = 0.05. p-values were FDR-corrected (Benjamini–Hochberg) across species.
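The first null model and the Monte Carlo p-value can be sketched as below. This is a simplified illustration, not the authors' code: function names are hypothetical, unseen contexts are handled by restarting from a random observed context (an assumption), and the adaptive early stopping based on the Wilson interval is omitted. The p-value follows the definition in the text (proportion of null samples with %ID ≥ observed).

```python
import random
from collections import defaultdict


def fit_markov(sequences, order: int = 4):
    """Count next-base occurrences for every context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i : i + order]][seq[i + order]] += 1
    return counts


def sample_markov(counts, length: int, order: int, rng: random.Random) -> str:
    """Sample a sequence of the requested length from the fitted model."""
    contexts = sorted(counts)
    out = list(rng.choice(contexts))
    while len(out) < length:
        successors = counts.get("".join(out[-order:]))
        if not successors:
            # Assumption: restart from a random observed context.
            out.extend(rng.choice(contexts))
            continue
        bases = sorted(successors)
        weights = [successors[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])


def monte_carlo_p(observed: float, null_values) -> float:
    """Proportion of null statistics that are >= the observed statistic."""
    null_values = list(null_values)
    return sum(v >= observed for v in null_values) / len(null_values)
```

In use, each sampled 2000 bp sequence would be aligned to the LOSO ancestral sequence with the same Smith–Waterman settings as the prediction, and the resulting identities passed to `monte_carlo_p`.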

2.5.7. Statistical Evaluation of Sequence Identity

To assess whether the predicted fragments retained significantly greater identity to reconstructed ancestral nodes than expected by chance, a one-sided binomial test was performed. For each species, the observed number of matching bases was derived from the percent identity between the 2000 bp predicted fragment and the corresponding MEGA X ancestral fragment. The null hypothesis assumed a baseline identity of 25%, corresponding to random base guessing under uniform nucleotide distribution. P-values were computed using scipy.stats.binomtest (SciPy v1.11.4). All predicted fragments exhibited significantly higher identity than the random baseline (p < 0.001). As an additional control, a 2-mer (first-order Markov) baseline model was applied under the same LOSO scheme.
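For readers who want to see the arithmetic behind the one-sided binomial test, the upper-tail probability can be computed directly. The sketch below is a dependency-free illustration of the same quantity that a one-sided ("greater") binomial test reports; it is not the SciPy routine used in the study, and the function names are hypothetical.

```python
import math


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    peak = max(log_terms)
    return math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)


def identity_p_value(percent_identity: float, length: int = 2000,
                     baseline: float = 0.25) -> float:
    """One-sided p-value for an observed %ID against a fixed baseline."""
    matches = round(percent_identity / 100 * length)
    return binom_upper_tail(matches, length, baseline)
```

Under the 25% null, a 2000 bp fragment has an expected 500 matches with a standard deviation of about 19, so identities even a few percentage points above 25% already give very small p-values.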

To strengthen the comparison beyond the 2-mer Markov baseline, we evaluated additional baselines using only training-species mtDNA: (i) k-mer language models (k = 3–6, order 2–5 Markov) sampling 2000 bp sequences, (ii) position-specific scoring matrices (PSSM, akin to profile-HMM) from LOSO MSAs (MAFFT-aligned training species), sampling sequences and reporting best/mean %ID over 50 samples, (iii) per-position consensus (majority vote) from LOSO MSAs, and (iv) nearest-neighbor copy (2000 bp window from the training species with highest full-mtDNA identity to the held-out species, selected via random fixed seed). For each baseline, we computed Smith–Waterman %ID vs. the LOSO ancestral sequence, with Monte Carlo p-values (p_obs = Pr[baseline %ID ≥ observed]) for k-mer models (up to 2000 samples). We post-processed results to add lift (%ID observed minus baseline mean) and z-scores for k = 5 and k = 6, significance flags (p_obs ≤ 0.05), FDR q-values (Benjamini–Hochberg), and aggregate metrics (median lift, mean z, Stouffer’s Z with one-sided p).
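The post-processing quantities named above (lift, z-score, Stouffer's Z, and Benjamini–Hochberg q-values) follow standard definitions and can be sketched as follows. Function names are hypothetical, and the use of the sample standard deviation for the z-score is an assumption, since the text does not state which estimator was used.

```python
import math
from statistics import mean, stdev


def lift_and_z(observed: float, baseline_samples):
    """Lift = observed minus baseline mean; z = lift / baseline SD."""
    mu = mean(baseline_samples)
    sd = stdev(baseline_samples)  # assumption: sample standard deviation
    lift = observed - mu
    return lift, (lift / sd if sd > 0 else math.nan)


def stouffer(z_scores):
    """Combine per-species z-scores; return (Z, one-sided upper-tail p)."""
    z = sum(z_scores) / math.sqrt(len(z_scores))
    return z, 0.5 * math.erfc(z / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q
```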

2.5.8. Definition of Sequence Identity Metrics

To comprehensively evaluate the predicted fragments, this study employs three distinct percent identity (%ID) metrics, each suited for a specific analytical purpose. Readers should note that the values generated by these metrics are calculated differently and are not directly comparable.

Ungapped, Equal-Length Identity: Used for initial comparisons against the ground-truth sequence and the all-species ancestral nodes. This is calculated as (Matches/SequenceLength) × 100. It provides a straightforward measure of nucleotide similarity without considering insertions or deletions.
Gap-Excluded Smith-Waterman Identity: Used in the fair LOSO validation against the re-calculated ancestors. This is calculated as (Matches/(Matches + Mismatches)) × 100. This metric focuses on the accuracy of the aligned regions only, ignoring gaps, which is useful for assessing the conservation of homologous blocks.
Gap-Inclusive Smith-Waterman Identity: Used for the DNA-mimicking null simulations. This is calculated as (Matches/TotalAlignmentColumns) × 100. By including gap columns in the denominator, this metric penalizes for insertions and deletions, providing the most stringent assessment of overall alignment quality. This was chosen for the null comparison because it holistically accounts for the structure that the Smith-Waterman algorithm optimizes.

The three identity formulations address different biological and algorithmic questions, so each is paired with a specific comparison. Ungapped, equal-length %ID quantifies pure substitutional agreement at a fixed locus and is alignment-free; it avoids artifacts from variable alignment lengths and is appropriate when the intended span length is fixed or when sequences have been trimmed to equal length. Gap-excluded SW %ID isolates conservation within confidently aligned blocks by removing gaps from the denominator; this emphasizes base substitutions while reducing sensitivity to lineage-specific indel processes or to local-alignment boundary effects, which is desirable in LOSO validation against re-estimated ancestors. Gap-inclusive SW %ID measures overall alignment fidelity, penalizing insertions/deletions by counting all alignment columns; this more stringent score is well suited for null comparisons, where gappy alignments could otherwise inflate match rates. Because denominators differ (span length vs. matched + mismatched bases vs. total alignment columns), the absolute values are not directly comparable across metrics; conclusions should be drawn within a metric, chosen to match the scientific question and the expected indel behavior of the comparison.
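To make the three denominators concrete, the sketch below computes each metric from plain strings. It assumes the local alignment is already available as two gapped rows of equal length with "-" as the gap symbol; the function names are illustrative and not taken from the study's code.

```python
def ungapped_identity(a, b):
    """Equal-length, position-wise identity: matches / sequence length x 100."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)


def sw_identities(aligned_a, aligned_b, gap="-"):
    """Return (gap-excluded, gap-inclusive) %ID for two gapped alignment rows."""
    matches = mismatches = gaps = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    gap_excluded = 100 * matches / (matches + mismatches)
    gap_inclusive = 100 * matches / (matches + mismatches + gaps)
    return gap_excluded, gap_inclusive
```

For the toy alignment rows "ACG-TA" and "ACGTTT" (4 matches, 1 mismatch, 1 gap column), the gap-excluded identity is 80.0% while the gap-inclusive identity is about 66.7%, showing why values from different metrics should not be compared directly.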

2.6. Robustness to Starting Position and Region Stratification

To verify that model performance was not an artifact of specific genomic coordinates or regional biases, we conducted two complementary robustness evaluations. First, positional invariance was assessed via randomized circular shifting of predicted 2000 bp fragments to ensure that results were not dependent on an arbitrary window start position. Specifically, each predicted fragment was circularly shifted (N = 10 shifts per species) and re-aligned to the corresponding LOSO ancestral sequence using Smith–Waterman local alignment. Comparable alignment statistics across shifts indicated that the observed non-random similarity was robust to starting-position selection.
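The circular-shift step can be sketched as below. The helper names, the fixed seed, and the choice of distinct non-zero offsets are assumptions made for illustration; the re-alignment of each shifted fragment to the LOSO ancestral sequence is not shown.

```python
import random


def circular_shift(seq, offset):
    """Rotate a sequence so that position `offset` becomes the new start."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def random_shifts(seq, n_shifts=10, seed=0):
    """Return (offset, shifted sequence) pairs for distinct random offsets."""
    rng = random.Random(seed)
    offsets = rng.sample(range(1, len(seq)), n_shifts)
    return [(o, circular_shift(seq, o)) for o in offsets]
```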

Second, a regional stratification analysis was performed to assess whether model performance was confined to particular mitochondrial regions. Using GenBank annotations retrieved via Entrez, base-wise masks were constructed for the control region (D-loop) and coding regions. These masks were projected onto the raw mtDNA sequences via MAFFT pairwise alignments, and predicted fragment hits were identified using Bio.Align.PairwiseAligner, with regions assigned based on a ≥70% overlap criterion. This analysis confirmed that conditional prediction performance was not restricted to a single genomic region but extended across both regulatory and protein-coding portions of the mitochondrial genome.
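The ≥70% overlap rule can be illustrated with base-wise boolean masks. In this sketch the masks are assumed to be already projected onto the genome coordinates of the hit; the function name and the "mixed" fallback label are hypothetical, while region_label, frac_dloop, and frac_coding follow the column names reported in Section 3.6.

```python
import numpy as np


def label_region(dloop_mask, coding_mask, hit_start, hit_end, threshold=0.70):
    """Assign a region label from the fraction of the hit covered by each mask."""
    d = np.asarray(dloop_mask, dtype=bool)[hit_start:hit_end]
    c = np.asarray(coding_mask, dtype=bool)[hit_start:hit_end]
    frac_dloop, frac_coding = float(d.mean()), float(c.mean())
    if frac_dloop >= threshold:
        region_label = "D-loop"
    elif frac_coding >= threshold:
        region_label = "coding"
    else:
        region_label = "mixed"  # hypothetical label for hits below both cutoffs
    return region_label, frac_dloop, frac_coding
```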

2.7. Two-Flank Masked-Span Imputation

Two-flank masked-span imputation of mitochondrial DNA was evaluated under a leave-one-species-out (LOSO) protocol across 21 vertebrate species (one FASTA per species). This task constitutes the most stringent and methodologically critical evaluation in the study, as it requires fully free-running reconstruction of long mtDNA spans in the absence of ground-truth conditioning. Sequences were upper-cased, U → T, non-ACGT characters replaced with A, and genomes treated as circular for slicing. For each fold, one species was held out for evaluation and models were trained on all remaining species.
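The preprocessing rules stated above (upper-casing, U → T, replacement of non-ACGT characters with A, and circular slicing) can be sketched in a few lines. The function names are illustrative only.

```python
def normalize_sequence(raw):
    """Upper-case, map U to T, and replace any non-ACGT character with A."""
    seq = raw.upper().replace("U", "T")
    return "".join(base if base in "ACGT" else "A" for base in seq)


def circular_slice(seq, start, length):
    """Extract `length` bases starting at `start`, wrapping around the genome end."""
    n = len(seq)
    return "".join(seq[(start + i) % n] for i in range(length))
```

For example, `normalize_sequence("acgunRy-")` gives "ACGTAAAA", and `circular_slice("ACGTAC", 4, 5)` wraps past the end of the sequence to give "ACACG".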

Two compact TensorFlow/Keras next-base predictors were used: a forward model (F) that predicts left → right and a reverse model (R) trained on reverse-complemented windows so that, at test time, it predicts right → left from the right flank; R outputs were reverse-complemented back to forward orientation. Both models shared the architecture Conv1D(48, k = 7, ReLU) → Conv1D (48, k = 5, dilation = 2, ReLU) → BiLSTM (48, return_sequences) → Dropout (0.25) → Dense (4, softmax), optimized with categorical cross-entropy (label smoothing 0.05) and Adam (1 × 10⁻³; clipnorm 1.0).

Training employed on-the-fly random 400-nt windows (left context 399 nt) via tf.data, with denoising (5% random base flips confined to the left context) and context dropout (retaining a random trailing tail of 200–399 nt and zero-padding the remainder) to mitigate exposure bias. The “FAST” budget per fold was batch size 128, 6 epochs, 80 steps per epoch (validation 12 steps), early stopping (patience 3) and ReduceLROnPlateau (patience 2, factor 0.5); seeds were fixed for Python (v 3.13.5)/NumPy (v 2.3.2)/TensorFlow (v 2.20.0-rc0) and CPU multithreading was enabled.
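The two augmentations can be sketched in NumPy as follows. This is a simplified illustration rather than the tf.data pipeline used in the study: the function names are hypothetical, flips are assumed to be applied independently per base at a 5% rate, and bases are assumed to be encoded as integers 0–3 before one-hot encoding.

```python
import numpy as np


def denoise_flip(context_idx, rate=0.05, rng=None):
    """Replace a random subset of context bases (indices 0..3) with a different base."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = np.array(context_idx, copy=True)
    flip = rng.random(out.size) < rate
    # Adding 1..3 modulo 4 guarantees that a flipped base actually changes.
    out[flip] = (out[flip] + rng.integers(1, 4, size=int(flip.sum()))) % 4
    return out


def context_dropout(one_hot, min_keep=200, rng=None):
    """Keep a random trailing tail of the context and zero-pad the leading part."""
    if rng is None:
        rng = np.random.default_rng(0)
    length = one_hot.shape[0]
    keep = int(rng.integers(min_keep, length + 1))  # 200..399 for a 399-nt context
    out = one_hot.copy()
    out[: length - keep] = 0.0
    return out
```

In a training loop, a single shared generator would be passed as `rng` so that each window receives different noise; the fixed default seed here only keeps the sketch reproducible.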

In the held-out genome, three fixed, deterministic anchor positions defined masked regions, and 500 bp spans were imputed. The F reconstruction was generated by free-running from the left flank; the R reconstruction was generated by free-running in reverse-complement space from the right flank and reverse-complemented back; a per-base consensus (CONS) selected between F and R by higher predicted probability (ties to F).
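The consensus rule can be sketched as below. It assumes the R reconstruction and its per-base probabilities have already been mapped back to forward orientation, and that each model reports the probability of its own chosen base; the function names are illustrative.

```python
import numpy as np

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string over the ACGT alphabet."""
    return seq.translate(_COMPLEMENT)[::-1]


def consensus(f_seq, f_prob, r_seq, r_prob):
    """Per base, take the call with the higher predicted probability; ties go to F."""
    take_r = np.asarray(r_prob) > np.asarray(f_prob)
    return "".join(r if use_r else f for f, r, use_r in zip(f_seq, r_seq, take_r))
```

Using a strict inequality for `take_r` is what sends ties to the forward model.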

Two baselines were evaluated: Nearest-Neighbor (NN), which selected the training species with maximum 3-mer cosine similarity computed over the full mtDNA and copied the 500 bp span from that genome at the matched circular start; and Flank-Copy (FlankCopy), which tiled the left or right flank, reporting the higher-identity result. The primary metric was equal-length percent identity (character-wise identity without gaps); for detailed inspection only (not used for headline comparisons or the 5-column summary), global Needleman–Wunsch percent identity was also computed with match +2, mismatch −1, gap-open −5, and gap-extend −1. For each species, results were summarized as the median across the three masked spans in a compact table.
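The two baselines and the primary metric can be sketched as follows. The function names are hypothetical, and the tiling shown here (repeating the flank from its first base) is an assumption, since the exact tiling direction is not specified in the text.

```python
from itertools import product

import numpy as np

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_profile(seq):
    """Count vector over all 64 3-mers of a sequence."""
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - 2):
        j = INDEX.get(seq[i:i + 3])
        if j is not None:
            v[j] += 1
    return v


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest_neighbor(target, training):
    """Name of the training genome with the highest 3-mer cosine similarity."""
    t = kmer_profile(target)
    return max(training, key=lambda name: cosine(t, kmer_profile(training[name])))


def flank_copy(flank, span_len):
    """Tile a flank end to end until it fills the masked span."""
    reps = -(-span_len // len(flank))  # ceiling division
    return (flank * reps)[:span_len]


def percent_identity(a, b):
    """Equal-length, character-wise identity without gaps."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)
```

The NN span would then be copied from the selected genome at the matched circular start, and FlankCopy would be scored for both flanks with the higher `percent_identity` reported.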

3. Results

Section 3 reports results from evaluating a compact CNN–BiLSTM framework for context-conditioned reconstruction of fixed-length mtDNA fragments under a leave-one-species-out (LOSO) design. Section 3.1 summarizes next-nucleotide performance (accuracy, macro-F1) and equal-length identity to the corresponding 2000 bp region of each held-out genome. Section 3.2 provides biological validation via Smith–Waterman alignment to maximum-likelihood ancestral nodes and BLASTn species checks. Section 3.3 calibrates identities against DNA-mimicking nulls. Section 3.4 compares predictions to a 2-mer Markov baseline. Section 3.5 extends baseline analyses to consensus-of-training, profile/PSSM, and nearest-neighbor copy models. Section 3.6 assesses robustness to starting position and regional stratification (D-loop vs. coding). Section 3.7 evaluates sensitivity to alignment engine, tree inference, and ancestral-state reconstruction choices. Section 3.8 presents two-flank masked-span imputation as the primary leakage-free evaluation, providing the most stringent test of free-running conditional reconstruction. Taken together, the results indicate that, under leave-one-species-out conditioning, the CNN–BiLSTM framework is likely to recover mtDNA fragments that retain detectable evolutionary patterns within the intended scope of this proof of concept. Throughout the Results, each percent identity metric is reported and interpreted exclusively within the specific evaluation context for which it is defined, and numerical comparisons are made only within the same identity formulation.

3.1. Model Performance on Next-Nucleotide Prediction

Under teacher-forced conditioning, the CNN-BiLSTM model demonstrated consistently high predictive performance across all 21 species in LOSO evaluation. In this sliding-window next-nucleotide prediction task, accuracy remained above 99.4% for all species, with macro-averaged precision, recall, and F1 scores closely matching overall accuracy (Table 2). The percent identity (%ID) of reconstructed sequences to their ground-truth sequences over the first 2000 bp ranged from 99.55% to 99.75%, with correspondingly low mismatch counts (5–9 per 2000 bp segment). These results indicate that the model effectively captures short-range sequence dependencies and reproduces species-specific nucleotide patterns with minimal error.

3.2. Biological Validation of Predicted Sequences

To assess whether the deep learning–predicted mtDNA fragments exhibited non-random similarity to reference mitochondrial sequences, each 2000 bp fragment generated under the LOSO scheme was subjected to a two-step in silico validation pipeline. First, predicted sequences were locally aligned to their corresponding MEGA X–inferred ancestral nodes using the Smith–Waterman algorithm (Table 3). Local percent identity ranged from 63.73% (Caiman crocodilus) to 67.31% (Accipiter gentilis), substantially exceeding both the ~26% identity observed for a 2-mer Markov baseline and the ~25% random baseline (p < 0.001; Section 2.5.6). Mismatch counts (654–725) are reported relative to the best local alignment, not full-length genome comparison. These results indicate that the CNN-BiLSTM preserved detectable non-random evolutionary similarity in short mtDNA fragments. As an additional biological plausibility check, the predicted 2000 bp mitochondrial fragments were queried against the NCBI database using BLASTn. Identification was repeated under a fixed megablast configuration, with the search space restricted to vertebrate mitochondrial sequences to focus the evaluation on relevant evolutionary targets. Across a representative vertebrate panel (encompassing birds, crocodylians, additional reptile lineages, and an amphibian), all predicted fragments returned the expected species as the top hit (21/21 cases; Supplementary Table S1). Searches were performed against the core_nt database (updated 31 August 2025) using a word size of 28, an E-value cutoff of 0.05, match/mismatch scores of 1/−2, gap costs of 0/2.5, and low-complexity filtering (DUST; mask for lookup table only, filter string L;m;). Top hits typically exhibited 94–100% query coverage and 97.29–100.00% BLAST-reported pairwise identity, with E-values ranging from approximately 1 × 10⁻⁹⁸ to 2 × 10⁻⁹⁶ and maximum scores between 368 and 375, consistent with standard mitogenome alignments. 
Representative examples include Accipiter gentilis (99% coverage, 97.29% identity), Alligator mississippiensis (99% coverage, 100.00% identity), and Gekko gecko (99% coverage, 100.00% identity). Where informative, secondary hits corresponded to the closest non-target references. In specific cases, such as Gallus gallus, both primary and secondary hits represented intra-specific database entries. Lower apparent coverage values (e.g., 26%) occurred when the BLAST algorithm prioritized statistically superior alignments to partial mitochondrial regions—such as isolated D-loop entries—over alignments to complete mitogenomes. Complete results are provided in Supplementary Table S2. These BLASTn results are reported solely as sanity checks confirming the biological plausibility and database consistency of the predicted fragments under conditional prediction; they are not interpreted as evidence of phylogenetic inference or broader evolutionary generalization.

Table 3 reports ungapped percent identity (%ID) and mismatch counts for equal-length (2000 bp) sequence comparisons, as described in Section 2.5.3. In the initial run, gap-related metrics such as alignment length, number of gaps, and gap-open events were not calculated, as the analysis focused on direct nucleotide similarity without considering gaps. Consequently, these metrics were omitted from Table 3 by design. Table 4 reports identities between the predicted ~2 kb fragments and the LOSO ancestral sequences reconstructed after removing the test species from the alignment and tree. Using a gap-excluded Smith–Waterman identity (matches divided by base–base aligned pairs), local identities span 58.04% (Gekko gecko) to 88.95% (Dromaius novaehollandiae), with a mean of 63.82% across the 21 species. Although the identity definition here differs from the all-species snapshot in Table 3 (which uses ungapped, equal-length position-wise identity), the central tendency is similar (≈63.73–67.31%), indicating that the observed similarity is not dependent on inclusion of the test taxon in the ancestral reconstruction. The elevated Dromaius value reflects SW aligning a shorter, unusually low-divergence core (1186 base–base columns), which raises gap-excluded %ID relative to species with longer aligned cores. Minor per-species deviations are expected from (i) changes in tree context/branch lengths when one taxon is removed, (ii) SW’s local trimming (which can raise or lower %ID relative to equal-length comparisons), and (iii) small differences in homologous coverage near indel-rich regions. Overall, concordance between LOSO and all-species identities supports the interpretation that the predicted fragments retain non-random, phylogenetically consistent information under a fair, leakage-free ancestral reconstruction.

Sensitivity analyses indicated that toolchain choice did not alter the qualitative conclusions. Replacing MAFFT with MUSCLE in the LOSO pipeline and replacing NJ with ML on the reduced sets yielded identities within a narrow band relative to the primary configuration, with no change to the 21/21 BLAST top-hit assignments. Minor per-species fluctuations are consistent with expected differences in gap placement and branch-length estimation rather than systematic bias.

Note on definitions. Table 4 uses gap-excluded SW %ID and reports the corresponding SW base–base alignment length. Table 3 reports ungapped, equal-length identities. Because the denominators differ, absolute values are not directly comparable; SW local trimming can increase or decrease %ID relative to equal-length metrics. (Table 4 also lists MSA columns and MSA gap columns for the LOSO multiple alignment; these are not used to compute SW %ID).

3.3. Results for the DNA-Mimicking Null Simulations

Table 5 summarizes, for each LOSO fold, the observed Smith–Waterman identities of the predicted fragments to their LOSO ancestors (50.18–53.50%) alongside two DNA-mimicking backgrounds. For the k-mer language-model nulls (order-(k − 1), k ≈ 5–6), the null distributions are tightly concentrated, with means of approximately 50.4–50.6% (σ ≈ 0.5%). Consequently, raw excesses of +0.5 to +3.1 percentage points correspond to non-trivial tail probabilities; per-species empirical p-values ranged up to 0.17, with Benjamini–Hochberg–adjusted q-values ≤ 0.188. Although multiple-testing correction attenuates formal significance for some species, effect sizes are positive in nearly all cases, and the direction of deviation is consistent with enrichment beyond composition- and context-matched chance levels.

For the real-DNA background (length-matched windows sampled from non-homologous vertebrate mtDNA), inference is based on empirical percentiles and confidence intervals derived from the Monte Carlo null. Across species, empirical tail probabilities reach the minimum attainable value under the discrete null (p ≈ 1/(N + 1); ≈ 5 × 10⁻⁴ when N = 2000), resulting in strong saturation effects. Under these conditions, percentile-based reporting provides a more informative and interpretable summary of deviation from the null than multiple-testing correction. Accordingly, we report empirical percentiles and confidence intervals as the primary inferential measures for the real-DNA background.

In LOSO runs, SW alignment length reflects the number of alignment columns, i.e., L = M + X + G, where M is the number of matches, X the number of mismatches, and G the number of gap columns. When the LOSO ancestral sequence contains insertions relative to the ~2 kb prediction, SW introduces gaps on the query side so those extra ancestral bases can be aligned, which increases G and therefore L beyond ~2000 bp even though the number of non-gap query letters aligned never exceeds ~2000. Because our DNA-mimicking nulls (k-mer LMs and real-DNA windows) are aligned with the same SW parameters and the same definition of L, any length inflation from gap handling is present in both observation and null. As a direct consequence, the SW percent identity %ID_SW = M/L is lower than an ungapped, position-wise identity (the denominator includes G), yet comparisons to the null remain fair: z-scores and percentiles measure excess identity beyond what SW would achieve by chance on composition- and context-matched DNA. Put simply, the difference between the ~2 kb query length and the larger SW alignment length arises from gap columns introduced by SW, not from extra predicted bases; and because our simulation nulls use the identical alignment and scoring, this effect is built into the null and does not inflate significance.

3.4. Baseline Comparison

To assess whether the CNN-BiLSTM model outperforms trivial sequence models, a 2-mer (first-order Markov) baseline was applied using a LOSO scheme. For each excluded species, a 2000 bp fragment was generated from the 2-mer model trained on the remaining species and compared to the closest MEGA X ancestral node. Baseline fragments showed only ~26–27% identity to the corresponding ancestral sequences, consistent with near-random performance. Results are shown in Table 6.

By contrast, CNN–BiLSTM predictions achieved 63.7–67.3 percent identity to the same ancestral nodes, approximately doubling the performance of the 2-mer Markov baseline. These preliminary results indicate that the model captures structured sequence regularities shared across related mitochondrial genomes within the limited scope of this proof-of-concept study. However, because low-order Markov models are known to be weak comparators, this contrast alone should not be interpreted as strong evidence of generalization and instead motivated the additional, more stringent analyses reported below.

3.5. Expanded Baseline Comparisons

Supplementary Table S3 contrasts the model’s LOSO identities to a suite of stronger baselines computed on the training taxa only. As expected, profile-derived references that reuse training information provide high-performance comparative baselines: the consensus-of-train achieves 83.85–100% identity to the LOSO ancestor, PSSM (best sample) reaches 50.00–73.07%, and the nearest-neighbor (copy) baseline attains 53.83–85.91%. These baselines are treated as strong comparative methods; performance is calibrated against composition-controlled nulls. For k-mer language-model mimics, the null means are tightly centered at ~50.4–50.6%. Relative to these nulls, the observed identities show positive, consistent enrichment with median raw lift (obs − μ) of 1.16 percentage points (k = 5) and 1.14 pp (k = 6), mean z-scores of 2.35 (k = 5) and 2.33 (k = 6) across species, and a cross-species Stouffer’s Z = 10.76 (two-sided p ≈ 3.4 × 10⁻²⁷) for both k values. After Benjamini–Hochberg correction, 14/21 species are significant at q ≤ 0.05 for k = 5 (13/21 for k = 6). In aggregate, the evidence indicates that the model’s fragments are significantly above what realistic composition-matched chance would yield, while profile/consensus/nearest-neighbor baselines, by design, remain higher.

3.6. Robustness to Starting Position and Region Stratification Results

Table 7 reports the unshifted baseline per species. SW identity to the LOSO ancestor (obs_%ID_vs_ASR) spans 50.18–53.50% across the 21 taxa, and the corresponding SW alignment lengths (obs_aln_len) range 2049–2221 bp. Alignment length exceeds ~2 kb because SW counts gap columns. Table 7 also carries the region annotation for each 2000 bp window (region_label, frac_dloop, frac_coding) and the mapped local-block coordinates (hit_start, hit_end). Four windows are labeled D-loop with high control-region occupancy (Accipiter gentilis frac_dloop 1.000; Anas platyrhynchos 0.946; Gallus gallus 1.000; Xenopus laevis 1.000). The remaining windows are coding-dominated, with only minor control-region overlap in a few cases (for example, Crocodylus porosus frac_dloop 0.005; Caiman crocodilus 0.001). On this unshifted baseline, D-loop cases show 50.97–53.50% identity (from Gallus gallus to Accipiter gentilis), while coding cases show 50.18–53.33% (for example, Gekko gecko 50.18%, Chelonia mydas 53.33%). D-loop alignment lengths are 2105–2166 bp (from Xenopus laevis to Anas platyrhynchos), and coding lengths are 2049–2221 bp (for example, Dromaius novaehollandiae 2049 bp, Struthio camelus 2221 bp).

Table 8 provides the reference accessions and the genomic-context labels and fractions that underpin Table 7 (accession, region_label, frac_dloop, frac_coding) together with the start–end coordinates of the top-scoring local block. The D-loop windows align over extended control-region tracts, for example, Xenopus laevis 1–1205 and Accipiter gentilis 1–1109, consistent with their frac_dloop values.

Table 9 summarizes the randomized-start experiment (10 circular offsets per species) and reports the unshifted identity as obs_%ID in a ×10 scale (for example, 535.002 corresponds to 53.500%). On that same scale, the mean across shifts (shift_mean_%ID) ranges 497.757–538.685 (that is, 49.7757–53.8685%), the standard deviation across shifts (shift_sd_%ID) is 0.2157–0.7944 (that is, 0.0216–0.0794 percentage points), and the within-species range from shift_min_%ID to shift_max_%ID corresponds to 0.0656–2.3410 percentage points overall (shift_range_%ID 0.6560–23.410 in table units). Examples of small variability include Sphenodon punctatus (shift_range_%ID 0.6560) and Melopsittacus undulatus (0.9034), whereas larger yet still modest variability is seen in Meleagris gallopavo (23.410), Dromaius novaehollandiae (21.716), and Anas platyrhynchos (20.585) in table units, i.e., ≤2.341 percentage points on the original scale.

Table 10 lists the full per-offset results for the same experiment, including the circular offset used (shift_offset) and the corresponding identity (shift_%ID, ×10 scale) for each of the 10 shifts per species. All values fall within the narrow bands summarized in Table 9. For both Table 9 and Table 10 the SW engine is parasail, and scoring and masking settings match those used elsewhere.

Collectively, Tables 7–10 show that within-species variation in SW identity across randomized start positions is small and that coding versus D-loop windows exhibit closely similar identity and alignment-length distributions. These results indicate that the conclusions do not depend on the exact window boundary or on regional context under the constant SW configuration applied here. Identities computed with parasail are not directly comparable to those computed with pairwise2 because of engine and scoring differences.

3.7. Sensitivity Analysis

For 21 species, identities of predicted fragments to LOSO (MAFFT + NJ + Fitch) ancestors were similar to those to MEGA (MUSCLE + ML Tamura–Nei) ancestors (Table 11). The per-species change, Δ%ID = LOSO − MEGA, had a median of +0.743 percentage points (range −2.610 to +2.264 pp; 16/21 positive). Smith–Waterman alignment lengths were comparable between arms (~3.0–3.1 kb), and alignment scores tracked identity. These observations indicate that estimates are robust to substituting MUSCLE/ML with MAFFT/NJ/Fitch for ancestor reconstruction when scoring is otherwise identical. Absolute identity values in this table are not directly comparable to earlier parasail-based reports because Biopython PairwiseAligner was used here; the interpretation is based on within-table comparisons.

3.8. LOSO Two-Flank Imputation

As the most stringent evaluation in this study, the two-flank masked-span imputation task was assessed across 21 LOSO folds (500 bp spans; three anchors per species). The consensus reconstruction exceeded the Nearest-Neighbor (NN) baseline in 17/21 species (81%), with an across-species median Δ(CONS–NN) of +3.2 percentage points (IQR ≈ 4.4 pp) and a two-sided sign-test p-value of approximately 0.007, indicating a statistically consistent improvement over NN under this training budget. In comparison with the Flank-Copy baseline, the consensus reconstruction was higher in 13/21 species (62%), with a median Δ(CONS–Flank) of +2.2 percentage points (IQR ≈ 5.2 pp); however, this difference did not reach statistical consistency (sign-test p ≈ 0.38), indicating broadly comparable performance between the two approaches rather than uniform dominance. The largest gains relative to NN were observed in Boa constrictor (+13.8 pp), Sphenodon punctatus (+9.6 pp), Accipiter gentilis (+7.2 pp), Anas platyrhynchos (+6.0 pp), and Crocodylus porosus and Nerodia sipedon (+5.4 pp). Conversely, a small subset of avian species underperformed NN (e.g., Aptenodytes forsteri −5.8 pp, Struthio camelus −5.4 pp, Dromaius novaehollandiae −4.4 pp), consistent with strong nearest-neighbor signal when closely related training genomes are available (Table 12).
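The sign-test p-values quoted above can be reproduced from the win counts alone with an exact two-sided binomial test at p = 0.5. The sketch assumes no tied species (all 21 differences non-zero); the helper name is illustrative.

```python
from scipy.stats import binomtest


def sign_test(n_better, n_total):
    """Two-sided exact sign test on the number of species where CONS wins."""
    return binomtest(n_better, n=n_total, p=0.5, alternative="two-sided").pvalue


print(round(sign_test(17, 21), 3))  # CONS vs. NN        -> 0.007
print(round(sign_test(13, 21), 2))  # CONS vs. FlankCopy -> 0.38
```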

Overall, two-flank masked-span imputation demonstrates a statistically consistent advantage over NN while achieving performance broadly comparable to Flank-Copy, suggesting that the model captures mtDNA regularities beyond simple compositional matching without implying systematic superiority over homology-tiling baselines. As expected, baselines that explicitly reuse homologous sequence information provide strong reference points for masked-span recovery; accordingly, we interpret the two-flank results as demonstrating a consistent advantage over nearest-neighbor heuristics and performance comparable to simple homology reuse, rather than a claim of dominance over alignment-driven methods.

4. Discussion

This study presents a purely in silico proof of concept exploring whether deep learning models can learn non-random evolutionary sequence regularities from mitochondrial DNA (mtDNA) sequences under controlled conditioning. By combining a CNN–BiLSTM next-nucleotide prediction framework with a leave-one-species-out (LOSO) evaluation scheme, the study demonstrates that the network can accurately predict short mtDNA fragments for species excluded from training in a context-dependent next-nucleotide framework, generating predictions that retain non-random similarity to biologically meaningful reference sequences. The model achieved greater than 99 percent nucleotide-level accuracy and near-perfect macro-averaged F1 scores across 21 vertebrate species, reflecting a strong capacity to learn local sequence dependencies. This behavior is plausibly supported by the bidirectional LSTM layers, which capture context in both directions, and by overlapping 200 bp windows that preserve local sequence continuity. High predictive accuracy in a next-nucleotide framework is expected for relatively conserved mitochondrial genomes and is therefore interpreted here as a diagnostic prerequisite rather than a standalone indicator of generalization. Each LOSO iteration required approximately 60–90 min of training on a standard workstation (HP ProBook 450 G9, Intel Core i5-1235U, 16 GB RAM), totaling roughly 50 h for the full 21-species cycle. This computational efficiency reflects the compact model architecture and modest dataset size, rendering the approach feasible for a small-scale proof of concept. At the same time, retraining the model for each LOSO fold is not scalable to substantially larger or more diverse datasets; future extensions could instead employ multi-species training, transfer learning, or k-fold cross-validation to improve scalability. 
Biological plausibility was assessed using a combination of complementary analyses, including MUSCLE alignments, MEGA X ancestral node reconstruction, Smith–Waterman local alignments, BLASTn searches, and shuffled-sequence null models. Predicted fragments exhibited 63.73–67.31 percent local identity to reconstructed ancestral nodes, substantially exceeding the approximately 25 percent identity expected under random base distributions (p much less than 0.001). These values also exceeded those obtained with a first-order Markov (2-mer) baseline, which achieved only 26–27 percent identity, indicating that the CNN–BiLSTM predictions capture sequence structure beyond simple k-mer statistics. Higher identities (approximately 67 percent) were observed for Accipiter gentilis and Xenopus laevis, whereas lower identities (approximately 63 percent) were observed in some avian taxa. These variations are reported descriptively and may reflect lineage-specific substitution dynamics, while remaining dependent on alignment quality and the assumptions inherent to maximum-likelihood ancestral reconstruction in MEGA X.

Additional ancestral sequence reconstructions performed with MAFFT yielded local identities ranging from 58.04 percent (Gekko gecko) to 88.95 percent (Dromaius novaehollandiae), with a mean of 63.82 percent across species. These values are comparable to those obtained using the original all-species ancestral reconstruction (63.73–67.31 percent), indicating that predicted-to-ancestor similarities remain stable even when the reference ASR excludes the held-out species. Replacing MUSCLE with MAFFT and maximum-likelihood trees with neighbor-joining plus Fitch parsimony altered predicted-to-ancestor identities by a median of +0.743 percentage points (range −2.610 to +2.264 percentage points, 16 of 21 higher), suggesting that the reported observations are not sensitive to reasonable alignment or tree/ASR choices.

Expanded baseline analyses further contextualized performance. Relative to composition-matched nulls (k-mer language models and real-DNA windows), observed percent identities typically fell at or above the 98th–99th percentile, with a small number around the 96.5th–97.5th percentile. Median lift values were 1.16 for k = 5 and 1.14 for k = 6, with mean z-scores of 2.35 (k = 5) and 2.33 (k = 6). A meta-analytic Stouffer’s Z of 10.76 (p approximately 3.4 × 10⁻²⁷) was observed across species, with 14 of 21 species significant at q ≤ 0.05 for k = 5 and 13 of 21 for k = 6. Profile/consensus and nearest-neighbor baselines, which explicitly exploit homology and alignment-derived information unavailable to the generative model, consistently yielded higher absolute identities overall (consensus 83.85–100 percent; PSSM best 50.00–73.07 percent; nearest neighbor 53.83–85.91 percent). These methods therefore serve as strong homology-informed reference points rather than direct performance targets, and results are interpreted relative to the null models that control for compositional and short-range sequence effects. Identities to LOSO ancestral sequences were stable across 10 random starts per species, with median within-species ranges of approximately 1–2 percentage points and median standard deviations below 0.5 percentage points. Results were also consistent across genomic regions, with windows overlapping coding regions versus the D-loop showing comparable identities and alignment lengths, and BLAST top-hit assignments remaining correct in 21 of 21 cases in both strata. Together, these controls indicate that the reported observations do not depend on window placement or genomic region. BLASTn searches further confirmed that in all 21 cases, the top hit corresponded to the correct species, serving as a biological plausibility check rather than evidence of phylogenetic inference. 
For example, a predicted Caiman crocodilus fragment aligned to “Caiman crocodilus, mitochondrion, complete genome” (NC_002744.2) with 99.51 percent identity and 100 percent query coverage. However, a targeted search against the C. crocodilus cytochrome b gene did not yield a significant match, indicating that while the model preserves broad mtDNA sequence structure, it does not necessarily predict specific functional loci without explicit guidance.
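As a rough consistency check on the combined statistic, Stouffer's method sums per-species z-scores and divides by the square root of the number of species. The sketch below is illustrative only: it assumes equal weights and a one-sided test, neither of which is stated in the text, and it substitutes the reported mean z-score (2.35 for k = 5) for the per-species values, which are not listed here. Under those assumptions it should land close to, but not exactly on, the reported Z of 10.76.

```python
import math

def stouffer_z(z_scores):
    """Combine z-scores with equal weights (unweighted Stouffer's method)."""
    if not z_scores:
        raise ValueError("need at least one z-score")
    return sum(z_scores) / math.sqrt(len(z_scores))

def upper_tail_p(z):
    """One-sided upper-tail p-value of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Assumption: per-species z-scores are not available, so the reported mean
# (2.35 for k = 5) stands in for each of the 21 species.
z_combined = stouffer_z([2.35] * 21)
print(f"combined Z = {z_combined:.2f}")
print(f"one-sided p = {upper_tail_p(z_combined):.1e}")
```

Because the combined statistic grows with the square root of the number of species, a modest but consistent per-species effect can yield a very large meta-analytic Z.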

Two-flank masked-span imputation under LOSO provided a leakage-free evaluation of flank-conditioned reconstruction in unseen species. The observed gains over the nearest-neighbor baseline, together with performance comparable to the flank-copy baseline, constitute the core proof of concept that a compact CNN–BiLSTM architecture can support conditional mtDNA span completion beyond simple flank-tiling heuristics. This approach is intended as a complement to, not a replacement for, traditional phylogenetic and paleogenomic tools. Maximum-likelihood and Bayesian ancestral reconstruction methods, such as PAML and BEAST, remain the gold standard due to their explicit evolutionary modeling and reliance on high-quality alignments. In contrast, the present deep learning framework operates directly on raw nucleotide sequences and offers a data-driven exploratory pathway for conditional sequence completion.
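The masked-span setup can be illustrated with a minimal sketch. It assumes a hypothetical flank width of 200 bp and a hypothetical form of the flank-copy baseline (tiling the left flank across the gap). The study's actual baseline definitions, flank sizes, and identity formulation may differ, and the toy sequence is synthetic rather than real mtDNA.

```python
import random

def mask_span(seq, start, span_len, flank_len):
    """Split seq into (left flank, hidden span, right flank)."""
    end = start + span_len
    if start - flank_len < 0 or end + flank_len > len(seq):
        raise ValueError("span plus flanks must fit inside the sequence")
    return seq[start - flank_len:start], seq[start:end], seq[end:end + flank_len]

def flank_copy_baseline(left_flank, span_len):
    """Hypothetical flank-copy baseline: tile the left flank across the gap."""
    repeats = span_len // len(left_flank) + 1
    return (left_flank * repeats)[:span_len]

def percent_identity(pred, truth):
    """Ungapped, position-wise percent identity of two equal-length strings."""
    if len(pred) != len(truth):
        raise ValueError("sequences must have equal length")
    matches = sum(p == t for p, t in zip(pred, truth))
    return 100.0 * matches / len(truth)

# Synthetic stand-in for a held-out mitochondrial genome (not real data).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(3000))

left, hidden, right = mask_span(genome, start=1000, span_len=500, flank_len=200)
baseline = flank_copy_baseline(left, span_len=len(hidden))
pid = percent_identity(baseline, hidden)
print(f"flank-copy identity on random DNA: {pid:.1f}%")
```

A model reconstruction would be scored against `hidden` with the same function, so that the model and the baselines share one identity definition.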

While multiple identity metrics are reported for different analytical purposes, conclusions in each Results subsection are drawn within the corresponding percent identity definition, and differences across sections reflect distinct analytical questions rather than directly comparable identity scales.

Although the predicted fragments are short, they could serve as alignment anchors or exploratory guide sequences in highly fragmented mtDNA datasets, potentially bridging AI-generated sequences with classical phylogenetic workflows. Overall, this work should be regarded as an exploratory computational proof of concept, providing preliminary evidence that deep learning can identify non-random sequence patterns in mtDNA that are consistent with evolutionary structure, without performing phylogenetic inference. The specific constraints of this study, including short fragment length, limited dataset size, and the absence of ancient DNA damage modeling, are discussed in Section 4.1 (Limitations). Potential directions for addressing these issues and extending the approach are outlined in Section 4.2 (Suggestions for Future Work), where larger and more taxonomically diverse datasets, simulated or empirical ancient DNA, longer predicted sequences, and functional annotation could help evolve this framework into a practical complement to established evolutionary analysis pipelines.

4.1. Limitations

Several limitations must be acknowledged. The dataset is small (21 species) and biased toward archosaurs, which increases the risk of clade-specific overfitting and limits generalizability. The predicted fragment length (2000 bp) is short compared to complete mitochondrial genomes (16–20 kb), constraining immediate value for de novo reconstruction. The study does not incorporate aDNA damage modeling, such as cytosine deamination or realistic fragmentation, which is critical for practical paleogenomic applications. From a computational standpoint, LOSO training scales poorly; for example, extending to 50 species could require ~96 h on comparable hardware. This underscores the need for more scalable strategies such as k-fold cross-validation or transfer learning. In addition, the use of overlapping 200 bp windows with a 50 bp stride may introduce redundancy between training and test inputs, potentially inflating reported accuracy (>99%). Biological validation is subject to further caveats. Ancestral references were reconstructed from multiple-sequence alignments (MUSCLE/MEGA X ML for the all-species snapshot; MAFFT/NJ/Fitch for LOSO), and absolute identity values depend on alignment and tree/ASR choices and are therefore meaningful only within the specific identity formulation used in each analysis. Misalignments, particularly among divergent outgroups such as Xenopus laevis, or model limitations could introduce error into the inferred nodes, potentially affecting the reported identity values. Finally, the predicted fragments were not assessed for functional content. For example, a targeted BLASTn search for the Caiman crocodilus cytochrome b gene did not yield a significant match, indicating that predicted fragments may not consistently capture coding regions or other functionally conserved loci. This limits biological interpretation, since functional conservation is a key aspect of mitochondrial evolution.
In summary, this work should be regarded as a computational proof of concept. The findings suggest that deep learning can potentially recover non-random sequence patterns consistent with phylogenetic relationships, but they should not be interpreted as benchmarks for ancestral reconstruction. The model has not been tested on empirical aDNA, does not simulate damage patterns, and has not been applied to phylogenetic placement. Its relevance to paleogenomics therefore remains hypothetical until validated with simulated or empirical data.
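The window-overlap concern raised above can be made concrete with a short sketch of the windowing scheme. The 200 bp window and 50 bp stride come from the text; the function names and the 2000 bp toy length are illustrative, and the study's actual preprocessing may differ in detail.

```python
def window_starts(seq_len, window=200, stride=50):
    """Start offsets of all full-length windows over a sequence."""
    if seq_len < window:
        return []
    return list(range(0, seq_len - window + 1, stride))

def shared_bases(start_a, start_b, window=200):
    """Number of positions covered by both windows."""
    return max(0, window - abs(start_a - start_b))

starts = window_starts(2000)
overlap = shared_bases(starts[0], starts[1])
print(f"{len(starts)} windows; adjacent windows share {overlap} of 200 bp")
```

With these settings each interior base is covered by up to four windows, which is why token-level accuracy computed over overlapping windows is best read as an optimistic estimate.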

4.2. Suggestions for Future Work

This PoC is intentionally scoped and demonstrates a novel, context-conditioned reconstruction framework for mitochondrial DNA under in-species flanks. As a preliminary study, it delivers clear feasibility signals with transparent reporting of assumptions and controls, and it can serve as a compact, reusable template for subsequent paleogenomic modeling efforts. Building on this foundation, several directions naturally extend the work while maintaining transparency about current scope. Broadening taxonomic coverage to include mammals, fish, and other vertebrates would allow assessment of generalizability across more divergent lineages. Incorporating realistic ancient-DNA damage patterns (e.g., fragmentation and cytosine deamination) would enable more direct evaluation in paleogenomic contexts. Generating longer predictions toward full mitochondrial reconstructions could proceed via iterative sequence reconstruction or architectures capable of long-range dependencies (e.g., transformers). Adding functional annotation of predicted fragments, covering coding genes, tRNAs, and conserved motifs, would provide biological context beyond sequence identity. Computational scalability could be enhanced with k-fold cross-validation and transfer learning, supporting application to larger and more diverse datasets; exploration of advanced deep-learning architectures may further capture subtle evolutionary sequence regularities. Design choices in this PoC are documented to aid interpretability. The model targets archosaur-specific mitochondrial patterns; accordingly, leave-one-clade-out analysis was not included in the present scope. To assess learning beyond this clade, future work includes leave-one-clade-out experiments (e.g., training on crocodilians to predict birds) and the addition of non-archosaur outgroups (e.g., squamates, mammals). 
The PoC evaluates context-conditioned reconstruction and does not claim de novo or cross-species generation; the two-flank masked-span test is included as the primary leakage-free evaluation, providing a stringent free-running assessment without excessive error accumulation. Because overlapping windows can overstate token-level accuracy, the >99% next-base figure is presented as an optimistic high-end estimate, with blocked or non-overlapping tests designated for future work. For sequence comparison, sensitive Smith–Waterman penalties tuned for short fragments were used consistently across baselines and nulls; robustness to more standard gap costs and global alignment is future work. Given small, discrete nulls, emphasis is placed on empirical percentiles and effect sizes rather than asymptotic p-values; expanded permutations, larger k-mer samples, and a pre-registered alpha are future work. Biological checks (BLAST hits and simple gene-context inspection) are positioned as qualitative validation checks appropriate to a PoC; comprehensive phylogenetic placement and functional analyses are future work. The N = 21, archosaur-leaning dataset reflects a focused feasibility setting; broader taxonomic coverage and cross-clade hold-outs to assess generalization are future work. Prediction length was fixed by design in this proof of concept rather than treated as a tunable variable. Teacher-forced fragment assembly was evaluated at 2000 bp to demonstrate stable, position-resolved next-base prediction under full conditioning, whereas leakage-free masked-span imputation was evaluated at 500 bp to provide a stringent free-running test without excessive error accumulation. Systematic sweeps over prediction length were not performed in this study. Based on known behavior of autoregressive sequence models, longer free-running generations are expected to accumulate errors, whereas teacher-forced assembly is substantially less sensitive to length. 
Accordingly, the two-flank masked-span results are interpreted as a conservative estimate of generalization performance. Quantifying the relationship between prediction length and reconstruction accuracy (e.g., 250 bp, 500 bp, 1000 bp, and longer spans) is designated as future work. Importantly, none of these extensions are required to support the present feasibility claims but are outlined to clarify the boundaries between this proof of concept and future methodological development. Future work may consider incorporating additional alignment-informed or homology-based imputation baselines in order to further refine the contextual interpretation of two-flank masked-span performance. Such baselines would enable a more granular comparison against methods that explicitly exploit positional homology, complementing the present analysis, which is focused on demonstrating statistically consistent gains over nearest-neighbor heuristics and performance comparable to simple homology reuse within a controlled proof-of-concept framework. An important direction for future work is the evaluation of fully closed-loop autoregressive generation over longer genomic spans. Unlike the present study, which focuses on context-conditioned reconstruction anchored by known flanking sequence, unconstrained free-running generation is subject to error accumulation and span-length-dependent degradation as predicted bases are recursively fed back as input. Assessing reconstruction fidelity as a function of generated span length and evolutionary divergence would provide a more complete characterization of generative limits but lies beyond the scope of this proof-of-concept, which is intentionally restricted to conditional sequence completion.

5. Conclusions

This study presents a focused computational proof of concept demonstrating that deep learning models can learn non-random sequence regularities in mitochondrial genomes under controlled, context-conditioned prediction. Without relying on predefined substitution models, the proposed framework shows that neural sequence models can perform conditional mtDNA fragment completion for species excluded from training, producing biologically plausible sequences that exceed simple homology-based and stochastic baselines. Under controlled and leakage-free evaluation using two-flank masked-span imputation, the consensus reconstruction outperformed a nearest-neighbor baseline in 17 out of 21 species (81%), providing quantitative support for transferable sequence regularities beyond trivial sequence reuse.

Rather than replacing classical phylogenetic or evolutionary inference methods, this work illustrates how modern sequence modeling can serve as an exploratory and complementary tool for studying mitochondrial DNA under partial-information scenarios. The results indicate that deep learning can generate mtDNA fragments that retain non-random evolutionary similarity under leakage-free evaluation, supporting further investigation into conditional reconstruction tasks without claiming phylogenetic placement or ancestral inference. Beyond evolutionary genomics, the underlying approach may have longer-term biomedical relevance. Models capable of reconstructing short mtDNA fragments from partial context could eventually support low-coverage clinical mitochondrial sequencing workflows, where recovering uncertain or missing positions may improve variant calling. Similar principles may be applicable to forensic or archaeological mtDNA samples that are highly degraded or incomplete. In addition, identifying conserved sequence regularities may, in the longer term, contribute to studies of evolutionary constraints overlapping with mtDNA loci implicated in metabolic or neurodegenerative disorders.

Although these applications lie beyond the scope of the present proof of concept, they highlight how AI-assisted, context-conditioned reconstruction could ultimately complement established mitochondrial genomics pipelines in both research and applied settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7010027/s1, Table S1. Species and NCBI Accession Numbers used in this Study. Table S2. BLASTN Results. Table S3. Stronger baselines computed on the training taxa only.

Author Contributions

Conceptualization, D.A.; methodology, D.A.; software, D.A.; validation, D.A., D.C., P.A.A., D.T.G., S.A.K., E.I.A., and I.K.K.; formal analysis, D.A.; investigation, D.A.; resources, D.A.; data curation, D.A.; writing—original draft preparation, D.A.; writing—review and editing, D.A. and P.A.A.; visualization, D.A.; supervision, D.A.; project administration, D.A.; funding acquisition, D.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All mtDNA sequences were retrieved from the NCBI Nucleotide database. No new sequence data were generated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Allentoft, M.E.; Collins, M.; Harker, D.; Haile, J.; Oskam, C.L.; Hale, M.L.; Campos, P.F.; Samaniego, J.A.; Gilbert, M.T.P.; Willerslev, E.; et al. The half-life of DNA in bone: Measuring decay kinetics in 158 dated fossils. Proc. R. Soc. B Biol. Sci. 2012, 279, 4724–4733.
Recalibrating EQUUS Evolution Using the Genome Sequence of an Early Middle Pleistocene Horse. Available online: https://www.researchgate.net/publication/242333094_Recalibrating_Equus_evolution_using_the_genome_sequence_of_an_early_Middle_Pleistocene_horse (accessed on 6 August 2025).
Meyer, M.; Fu, Q.; Aximu-Petri, A.; Glocke, I.; Nickel, B.; Arsuaga, J.-L.; Martínez, I.; Gracia, A.; de Castro, J.M.B.; Carbonell, E.; et al. A mitochondrial genome sequence of a hominin from Sima de los Huesos. Nature 2014, 505, 403–406.
van der Valk, T.; Pečnerová, P.; Díez-Del-Molino, D.; Bergström, A.; Oppenheimer, J.; Hartmann, S.; Xenikoudakis, G.; Thomas, J.A.; Dehasque, M.; Sağlıcan, E.; et al. Million-year-old DNA sheds light on the genomic history of mammoths. Nature 2021, 591, 265–269.
Yang, Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol. Biol. Evol. 2007, 24, 1586–1591.
Bouckaert, R.; Vaughan, T.G.; Barido-Sottani, J.; Duchêne, S.; Fourment, M.; Gavryushkina, A.; Heled, J.; Jones, G.; Kühnert, D.; De Maio, N.; et al. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 2019, 15, e1006650.
Zeng, H.; Edwards, M.D.; Liu, G.; Gifford, D.K. Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics 2016, 32, i121–i127.
Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831–838.
Avsec, Ž.; Agarwal, V.; Visentin, D.; Ledsam, J.R.; Grabska-Barwinska, A.; Taylor, K.R.; Assael, Y.; Jumper, J.; Kohli, P.; Kelley, D.R. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 2021, 18, 1196–1203.
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
Singh, J.; Hanson, J.; Paliwal, K.; Zhou, Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat. Commun. 2019, 10, 5407.
Generating and Designing DNA with Deep Generative Models. ResearchGate. Available online: https://www.researchgate.net/publication/321902509_Generating_and_designing_DNA_with_deep_generative_models (accessed on 8 August 2025).
Kenneweg, P.; Dandinasivara, R.; Luo, X.; Hammer, B.; Schönhuth, A. Generating synthetic genotypes using diffusion models. Bioinformatics 2025, 41, i484–i492.
De Leonardis, M.; Pagnani, A.; Barrat-Charlaix, P. Reconstruction of Ancestral Protein Sequences Using Autoregressive Generative Models. Mol. Biol. Evol. 2025, 42, msaf070.
Matthews, D.S.; Spence, M.A.; Mater, A.C.; Nichols, J.; Pulsford, S.B.; Sandhu, M.; Kaczmarski, J.A.; Miton, C.M.; Tokuriki, N.; Jackson, C.J. Leveraging ancestral sequence reconstruction for protein representation learning. Nat. Mach. Intell. 2024, 6, 1542–1555.
Quang, D.; Xie, X. DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016, 44, e107.
Nesterenko, L.; Blassel, L.; Veber, P.; Boussau, B.; Jacob, L. Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks. Mol. Biol. Evol. 2025, 42, msaf051.
Wang, Z.; Sun, J.; Gao, Y.; Xue, Y.; Zhang, Y.; Li, K.; Zhang, W.; Zhang, C.; Zu, J.; Zhang, L. Fusang: A framework for phylogenetic tree inference via deep learning. Nucleic Acids Res. 2023, 51, 10909–10923.
Yang, J.; Wu, C.; Du, B.; Zhang, L. Enhanced Multiscale Feature Fusion Network for HSI Classification. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 10328–10347.
Wang, D.; Hu, M.; Jin, Y.; Miao, Y.; Yang, J.; Xu, Y.; Qin, X.; Ma, J.; Sun, L.; Li, C.; et al. HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6427–6444.
Paradiso, C.; Gratton, P.; Trucchi, E.; López-Delgado, J.; Gargano, M.; Garizio, L.; Carr, I.M.; Colosimo, G.; Sevilla, C.; Welch, M.E.; et al. Genomic insights into the biogeography and evolution of Galápagos iguanas. Mol. Phylogenetics Evol. 2025, 204, 108294.
Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 2015, 12, 931–934.

Figure 1. Analysis pipeline.

Figure 2. Maximum Likelihood phylogenetic tree of the 21 vertebrate mitochondrial genomes used in this study. The tree was reconstructed in MEGA X v12 using the Tamura–Nei substitution model from the MUSCLE-aligned complete mtDNA sequences. Numbers in parentheses correspond to species identifiers from Supplementary Table S1, followed by their GenBank accession numbers. The “NC” codes refer to the accession numbers listed in Supplementary Table S1. Letters “A” denote internal ancestral nodes inferred by MEGA X; “C” indicates consensus sequences at ambiguous nodes; “AC” reflects positions with multiple equally likely base calls; and “–” indicates missing data at specific ancestral positions. These reconstructed ancestral sequences served as reference nodes for evaluating the phylogenetic plausibility of CNN-BiLSTM predictions via local alignment. The scale bar (0.20) represents substitutions per site.

Table 1. Species.

Birds (Aves)	Crocodilians (Crocodylia)	Non-Archosaur Reptiles (Outgroups)
Gallus gallus (chicken)	Crocodylus porosus (saltwater crocodile)	Chelonia mydas (green sea turtle)
Struthio camelus (ostrich)	Alligator mississippiensis (American alligator)	Pelodiscus sinensis (Chinese softshell turtle)
Archilochus colubris (ruby-throated hummingbird)	Caiman crocodilus (spectacled caiman)	Varanus salvator (water monitor)
Accipiter gentilis (northern goshawk)		Gekko gecko (tokay gecko)
Dromaius novaehollandiae (emu)		Boa constrictor (boa constrictor)
Meleagris gallopavo (wild turkey)		Nerodia sipedon (northern water snake)
Anas platyrhynchos (mallard)		Sphenodon punctatus (tuatara)
Columba livia (rock dove)		Single amphibian outgroup
Aptenodytes forsteri (emperor penguin)		Xenopus laevis (African clawed frog)
Melopsittacus undulatus (budgerigar)

Table 2. Model performance on next nucleotide prediction under LOSO evaluation.

Species	%Accuracy	%Precision	%Recall	%F1	%ID	Mismatches
Accipiter gentilis	99.65	99.70	99.67	99.68	99.65	7
Alligator mississippiensis	99.60	99.65	99.60	99.62	99.60	8
Anas platyrhynchos	99.65	99.69	99.67	99.68	99.65	7
Aptenodytes forsteri	99.70	99.71	99.69	99.70	99.70	6
Archilochus colubris	99.60	99.66	99.58	99.62	99.60	8
Boa constrictor	99.65	99.66	99.54	99.60	99.65	7
Caiman crocodilus	99.60	99.60	99.59	99.59	99.60	8
Chelonia mydas	99.65	99.70	99.60	99.65	99.65	7
Columba livia	99.55	99.60	99.56	99.58	99.55	9
Crocodylus porosus	99.65	99.66	99.61	99.63	99.65	7
Dromaius novaehollandiae	99.65	99.71	99.64	99.67	99.65	7
Gallus gallus	99.60	99.62	99.61	99.62	99.60	8
Gekko gecko	99.60	99.66	99.57	99.61	99.60	8
Meleagris gallopavo	99.60	99.66	99.57	99.61	99.60	8
Melopsittacus undulatus	99.70	99.68	99.68	99.68	99.70	6
Nerodia sipedon	99.70	99.70	99.74	99.72	99.70	6
Pelodiscus sinensis	99.60	99.60	99.64	99.62	99.60	8
Sphenodon punctatus	99.75	99.79	99.78	99.79	99.75	5
Struthio camelus	99.65	99.67	99.61	99.64	99.65	7
Varanus salvator	99.65	99.64	99.63	99.63	99.65	7
Xenopus laevis	99.75	99.75	99.74	99.75	99.75	5

Note: Model performance on sliding-window next-nucleotide prediction under LOSO evaluation. Metrics include overall accuracy, macro-averaged precision, recall, and F1 score, as well as percent identity (%ID) computed over a total of 2000 bp reconstructed in a teacher-forced, position-resolved manner from overlapping windows. Each reconstructed position is compared to the corresponding ground-truth mtDNA position of the left-out species, rather than to a single contiguous reconstructed locus. The corresponding mismatch count is also reported.

Table 3. Biological Validation Metrics for Predicted 2000 bp Fragments.

Species	Predicted Length (bp)	% Identity	Mismatches
Accipiter gentilis	2000	67.31	654
Xenopus laevis	2000	66.98	660
Sphenodon punctatus	2000	65.90	682
Melopsittacus undulatus	2000	65.06	699
Chelonia mydas	2000	64.85	703
Varanus salvator	2000	64.76	705
Pelodiscus sinensis	2000	64.74	705
Boa constrictor	2000	64.69	706
Nerodia sipedon	2000	64.58	708
Meleagris gallopavo	2000	64.51	710
Anas platyrhynchos	2000	64.43	711
Struthio camelus	2000	64.40	712
Dromaius novaehollandiae	2000	64.19	716
Gallus gallus	2000	64.15	717
Crocodylus porosus	2000	64.06	719
Archilochus colubris	2000	63.98	720
Alligator mississippiensis	2000	63.92	722
Aptenodytes forsteri	2000	63.89	722
Gekko gecko	2000	63.87	723
Columba livia	2000	63.86	723
Caiman crocodilus	2000	63.73	725

Table 4. Ancestral Sequence Reconstruction MAFFT.

Species	pred_len	MSA Columns	Identity	SW Alignment (Base–Base Pairs)	Mismatches	MSA Gap Columns	SW gap_opens	Score
Accipiter gentilis	2000	27,298	0.613193	1895	733	25,508	102	917.00
Alligator mississippiensis	2000	29,719	0.586746	1856	767	28,007	107	720.00
Anas platyrhynchos	2000	29,701	0.593453	1894	770	27,913	113	762.00
Aptenodytes forsteri	2000	29,393	0.701721	1569	468	28,255	92	812.00
Archilochus colubris	2000	30,112	0.787385	1411	300	29,290	79	856.00
Boa constrictor	2000	29,375	0.610521	1882	733	27,611	113	792.00
Caiman crocodilus	2000	29,593	0.595745	1786	722	28,021	97	720.00
Chelonia mydas	2000	31,672	0.617834	1884	720	29,904	113	869.00
Columba livia	2000	29,890	0.590393	1936	793	28,018	105	805.00
Crocodylus porosus	2000	29,738	0.596234	1912	772	27,914	123	728.00
Dromaius novaehollandiae	2000	29,926	0.889545	1186	131	29,554	67	850.00
Gallus gallus	2000	29,915	0.587595	1838	758	28,239	109	688.00
Gekko gecko	2000	30,059	0.580376	1916	804	28,227	111	691.00
Meleagris gallopavo	2000	29,762	0.697360	1553	470	28,656	85	834.00
Melopsittacus undulatus	2000	29,350	0.605306	1847	729	27,656	100	847.00
Nerodia sipedon	2000	29,458	0.586207	1856	768	27,746	105	751.00
Pelodiscus sinensis	2000	30,059	0.651532	1762	614	28,535	107	856.00
Sphenodon punctatus	2000	29,168	0.597790	1810	728	27,548	93	796.00
Struthio camelus	2000	29,880	0.711608	1654	477	28,572	96	934.00
Varanus salvator	2000	29,612	0.593174	1846	751	27,920	98	756.00
Xenopus laevis	2000	29,867	0.608275	1861	729	28,145	103	876.00
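
The Identity column in Table 4 appears to equal the matched fraction of aligned base–base pairs, that is, (pairs − mismatches) / pairs with gap positions excluded. This reading is inferred from the tabulated values rather than taken from a stated definition, and the short check below recomputes three rows under that assumption.

```python
def identity_from_counts(base_pairs, mismatches):
    """Fraction of aligned base-base pairs that match (gaps excluded)."""
    if base_pairs <= 0:
        raise ValueError("base_pairs must be positive")
    return (base_pairs - mismatches) / base_pairs

# (species, SW base-base pairs, mismatches, tabulated identity) from Table 4
rows = [
    ("Accipiter gentilis", 1895, 733, 0.613193),
    ("Dromaius novaehollandiae", 1186, 131, 0.889545),
    ("Gekko gecko", 1916, 804, 0.580376),
]
for species, pairs, mism, reported in rows:
    computed = identity_from_counts(pairs, mism)
    print(f"{species}: computed {computed:.6f}, tabulated {reported:.6f}")
```

Under this definition, a high identity over a short aligned region (for example, 1186 pairs for Dromaius novaehollandiae) is not directly comparable to a lower identity over a longer one, so alignment length should be read alongside the identity value.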

Table 5. DNA-mimicking null simulations results.

Species	obs_%ID	obs_aln_len	klm_mu	klm_sd	klm_p	klm_q	klm_CI_low99.9	klm_CI_up99.9	klm_n	real_mu	real_sd	real_percentile	real_CI_low99.9	real_CI_up99.9
Accipiter gentilis	53.50	2157	50.47	0.52	<0.001	<0.001	0.000000	0.045572	200	68.04	7.98	0.965000	0.899219	0.988399
Alligator mississippiensis	51.20	2127	50.46	0.51	0.072000	0.088941	0.050585	0.101513	1000	65.83	6.91	1.000000	0.954428	1.000000
Anas platyrhynchos	51.89	2166	50.44	0.54	0.005000	0.011667	0.000755	0.032329	400	68.63	8.13	0.985000	0.928812	0.996983
Aptenodytes forsteri	51.51	2120	50.48	0.54	0.034000	0.051000	0.023527	0.048902	2000	69.53	8.10	0.990000	0.936851	0.998489
Archilochus colubris	51.30	2160	50.44	0.53	0.057500	0.075469	0.043418	0.075788	2000	67.84	7.80	0.990000	0.936851	0.998489
Boa constrictor	51.75	2201	50.49	0.52	0.007500	0.015750	0.001507	0.036461	400	66.57	6.93	0.985000	0.928812	0.996983
Caiman crocodilus	50.89	2089	50.39	0.47	0.170000	0.187895	0.103452	0.266625	200	66.20	7.11	1.000000	0.954428	1.000000
Chelonia mydas	53.33	2162	50.44	0.50	<0.001	<0.001	0.000000	0.045572	200	67.22	7.78	0.985000	0.928812	0.996983
Columba livia	51.67	2212	50.45	0.54	0.018333	0.032083	0.007470	0.044289	600	69.26	8.40	1.000000	0.954428	1.000000
Crocodylus porosus	51.59	2206	50.34	0.51	0.010000	0.019091	0.002415	0.040435	400	66.06	8.04	0.995000	0.945320	0.999564
Dromaius novaehollandiae	51.39	2049	50.43	0.54	0.038500	0.053900	0.027250	0.054136	2000	68.56	8.21	1.000000	0.954428	1.000000
Gallus gallus	50.97	2109	50.46	0.52	0.145000	0.169167	0.084292	0.238064	200	69.33	7.95	1.000000	0.954428	1.000000
Gekko gecko	50.18	2198	50.50	0.51	0.720000	0.720000	0.613601	0.806347	200	65.15	6.55	0.995000	0.945320	0.999564
Meleagris gallopavo	51.90	2079	50.49	0.51	0.002500	0.006562	0.000218	0.027982	400	68.37	7.93	0.995000	0.945320	0.999564
Melopsittacus undulatus	52.86	2115	50.37	0.47	<0.001	<0.001	0.000000	0.045572	200	67.38	8.52	0.980000	0.921089	0.995162
Nerodia sipedon	51.81	2077	50.59	0.47	<0.001	<0.001	0.000000	0.045572	200	67.67	6.85	0.995000	0.945320	0.999564
Pelodiscus sinensis	53.20	2175	50.61	0.55	<0.001	<0.001	0.000000	0.045572	200	67.29	6.83	0.980000	0.921089	0.995162
Sphenodon punctatus	51.75	2085	50.57	0.51	0.020000	0.032308	0.008451	0.046589	600	68.05	7.17	1.000000	0.954428	1.000000
Struthio camelus	52.99	2221	50.49	0.54	<0.001	<0.001	0.000000	0.045572	200	67.37	8.36	0.975000	0.913612	0.993095
Varanus salvator	50.84	2140	50.62	0.54	0.370000	0.388500	0.272687	0.479161	200	69.56	7.35	1.000000	0.954428	1.000000
Xenopus laevis	53.35	2105	50.40	0.54	<0.001	<0.001	0.000000	0.045572	200	69.39	8.94	0.980000	0.921089	0.995162

Note. Empirical tail probabilities are computed as $p = (r + 1)/(N + 1)$, where $r$ is the number of null samples with identity greater than or equal to the observed value and $N$ is the number of Monte Carlo null samples (up to 2000). Values shown as <0.001 indicate attainment of the minimum possible empirical p-value under this discrete null. For the k-mer language-model nulls, Benjamini–Hochberg correction is applied across species and both p- and q-values are reported. For the real-DNA background, inference is based on empirical percentiles and 99.9% confidence intervals, which indicate the position of the observed identity within the null distribution.
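
The empirical p-value formula and the Benjamini–Hochberg step-up adjustment described in the note can be sketched as follows. This is a generic textbook implementation with made-up example counts, not the authors' code, and the function names are illustrative.

```python
def empirical_p(r, n):
    """Empirical upper-tail p-value: (r + 1) / (N + 1)."""
    if r < 0 or n < r:
        raise ValueError("require 0 <= r <= n")
    return (r + 1) / (n + 1)

def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical counts (r exceedances out of n null samples), for illustration only.
counts = [(0, 200), (3, 400), (71, 1000), (143, 200)]
p_vals = [empirical_p(r, n) for r, n in counts]
for p, q in zip(p_vals, benjamini_hochberg(p_vals)):
    print(f"p = {p:.4f}  q = {q:.4f}")
```

With the maximum of N = 2000 null samples, the smallest attainable value is 1/2001 (about 0.0005), which is consistent with the <0.001 entries marking the floor of the discrete null.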

Table 6. First Baseline Comparison.

Species	CNN-BiLSTM %ID vs. Ground-Truth	Markov 2-mer %ID vs. Ground-Truth	CNN-BiLSTM %ID vs. Ancestral Node	Markov 2-mer %ID vs. Ancestral Node	Predicted Length (bp)	Ancestor Length (bp)
Accipiter gentilis	99.65	65.29	67.31	26.72	2000	12,484
Alligator mississippiensis	99.60	65.30	63.92	26.98	2000	13,631
Anas platyrhynchos	99.65	64.90	64.43	26.05	2000	13,652
Aptenodytes forsteri	99.70	64.92	63.89	27.26	2000	13,652
Archilochus colubris	99.60	64.87	63.98	27.09	2000	13,649
Boa constrictor	99.65	65.17	64.69	26.81	2000	13,590
Caiman crocodilus	99.60	64.93	63.73	27.53	2000	13,635
Chelonia mydas	99.65	65.38	64.85	26.78	2000	13,641
Columba livia	99.55	65.20	63.86	26.21	2000	13,651
Crocodylus porosus	99.65	65.30	64.06	27.03	2000	13,636
Dromaius novaehollandiae	99.65	65.29	64.19	26.19	2000	13,651
Gallus gallus	99.60	64.84	64.15	27.09	2000	13,648
Gekko gecko	99.60	66.27	63.87	27.77	2000	13,548
Meleagris gallopavo	99.60	65.41	64.51	26.98	2000	13,651
Melopsittacus undulatus	99.70	65.05	65.06	27.20	2000	13,652
Nerodia sipedon	99.70	65.18	64.58	26.99	2000	13,546
Pelodiscus sinensis	99.60	64.81	64.74	26.69	2000	13,638
Sphenodon punctatus	99.75	65.16	65.90	27.15	2000	11,162
Struthio camelus	99.65	64.87	64.40	26.77	2000	13,652
Varanus salvator	99.65	65.00	64.76	26.95	2000	13,594
Xenopus laevis	99.75	64.64	66.98	26.53	2000	13,603

Table 7. Starting-position robustness: unshifted baseline per species.

Species	pred_len	obs_%ID_vs_ASR	obs_aln_len	region_label	frac_dloop	frac_coding	hit_start	hit_end	%ID_vs_speciesRAW
Accipiter gentilis	2000	53.50	2157	DLOOP	1.000	0.000	1	1109	55.03
Alligator mississippiensis	2000	51.20	2127	CODING	0.000	0.999	1	1099	55.05
Anas platyrhynchos	2000	51.89	2166	DLOOP	0.946	0.054	1	1109	55.03
Aptenodytes forsteri	2000	51.51	2120	CODING	0.000	0.990	1	1099	55.15
Archilochus colubris	2000	51.30	2160	CODING	0.000	1.000	1	1099	55.05
Boa constrictor	2000	51.75	2201	CODING	0.000	1.000	1	1109	55.10
Caiman crocodilus	2000	50.89	2089	CODING	0.001	0.999	1	1109	55.13
Chelonia mydas	2000	53.33	2162	CODING	0.000	1.000	1	1099	55.10
Columba livia	2000	51.67	2212	CODING	0.000	1.000	1	1099	55.10
Crocodylus porosus	2000	51.59	2206	CODING	0.005	0.995	1	1108	55.15
Dromaius novaehollandiae	2000	51.39	2049	CODING	0.000	1.000	1	1100	55.18
Gallus gallus	2000	50.97	2109	DLOOP	1.000	0.000	1	1110	55.15
Gekko gecko	2000	50.18	2198	CODING	0.000	0.997	1	1107	55.03
Meleagris gallopavo	2000	51.90	2079	CODING	0.000	1.000	1	1105	55.20
Melopsittacus undulatus	2000	52.86	2115	CODING	0.000	1.000	1	1100	55.18
Nerodia sipedon	2000	51.81	2077	CODING	0.000	0.965	1	1109	54.98
Pelodiscus sinensis	2000	53.20	2175	CODING	0.000	1.000	1	1100	55.13
Sphenodon punctatus	2000	51.75	2085	CODING	0.000	1.000	1	1111	55.22
Struthio camelus	2000	52.99	2221	CODING	0.000	1.000	1	1100	55.03
Varanus salvator	2000	50.84	2140	CODING	0.000	1.000	1	1100	55.08
Xenopus laevis	2000	53.35	2105	DLOOP	1.000	0.000	1	1205	55.95

Table 8. Genomic context and local-block coordinates.

Species	Accession	hit_start	hit_end	region_label	frac_dloop	frac_coding
Accipiter gentilis	NC_011818.1	1	1109	DLOOP	1.000	0.000
Alligator mississippiensis	NC_001922.1	1	1099	CODING	0.000	0.999
Anas platyrhynchos	NC_009684.1	1	1109	DLOOP	0.946	0.054
Aptenodytes forsteri	NC_027938.1	1	1099	CODING	0.000	0.990
Archilochus colubris	NC_010094.1	1	1099	CODING	0.000	1.000
Boa constrictor	NC_007398.1	1	1109	CODING	0.000	1.000
Caiman crocodilus	NC_002744.2	1	1109	CODING	0.001	0.999
Chelonia mydas	NC_000886.1	1	1099	CODING	0.000	1.000
Columba livia	NC_013978.1	1	1099	CODING	0.000	1.000
Crocodylus porosus	NC_008143.1	1	1108	CODING	0.005	0.995
Dromaius novaehollandiae	NC_002784.1	1	1100	CODING	0.000	1.000
Gallus gallus	NC_001323.1	1	1110	DLOOP	1.000	0.000
Gekko gecko	NC_007627.1	1	1107	CODING	0.000	0.997
Meleagris gallopavo	NC_010195.2	1	1105	CODING	0.000	1.000
Melopsittacus undulatus	NC_009134.1	1	1100	CODING	0.000	1.000
Nerodia sipedon	NC_015793.1	1	1109	CODING	0.000	0.965
Pelodiscus sinensis	NC_068236.1	1	1100	CODING	0.000	1.000
Sphenodon punctatus	NC_004815.1	1	1111	CODING	0.000	1.000
Struthio camelus	NC_002785.1	1	1100	CODING	0.000	1.000
Varanus salvator	NC_010974.1	1	1100	CODING	0.000	1.000
Xenopus laevis	NC_001573.1	1	1205	DLOOP	1.000	0.000
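The relationship between the region_label column and the two fraction columns can be illustrated with a short sketch. The majority rule used here is an assumption that happens to reproduce the labels in Table 8; the exact labeling criterion is not stated in this section, and the function name `label_region` is hypothetical.

```python
def label_region(frac_dloop: float, frac_coding: float) -> str:
    """Assign a region label by majority overlap (assumed rule, not confirmed by the paper)."""
    return "DLOOP" if frac_dloop > frac_coding else "CODING"

# (frac_dloop, frac_coding) pairs copied from Table 8 for four species.
rows = {
    "Accipiter gentilis": (1.000, 0.000),
    "Anas platyrhynchos": (0.946, 0.054),
    "Crocodylus porosus": (0.005, 0.995),
    "Nerodia sipedon": (0.000, 0.965),
}

labels = {species: label_region(*fracs) for species, fracs in rows.items()}
for species, label in labels.items():
    print(f"{species}: {label}")
```

Note that the two fractions need not sum to one (for example, Nerodia sipedon), so the sketch compares them directly rather than thresholding a single column.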

Table 9. Starting-position robustness: across-shift summary.

Species	pred_len	obs_%ID (×10)	shift_mean_%ID (×10)	shift_sd_%ID (×10)	shift_min_%ID (×10)	shift_max_%ID (×10)	shift_range_%ID (×10)	n_shifts
Accipiter gentilis	2000	535.002	527.450	5.612	520.055	536.563	16.507	10
Alligator mississippiensis	2000	511.989	505.175	2.728	501.892	511.080	9.188	10
Anas platyrhynchos	2000	518.929	516.238	7.002	502.793	523.378	20.585	10
Aptenodytes forsteri	2000	515.094	517.512	5.896	507.531	525.797	18.267	10
Archilochus colubris	2000	512.963	515.607	3.139	511.510	519.667	8.157	10
Boa constrictor	2000	517.492	518.680	5.296	511.855	526.392	14.537	10
Caiman crocodilus	2000	508.856	502.821	4.685	496.070	508.310	12.240	10
Chelonia mydas	2000	533.302	531.647	2.796	526.437	534.411	7.974	10
Columba livia	2000	516.727	517.629	4.275	508.024	521.800	13.776	10
Crocodylus porosus	2000	515.866	523.659	7.944	512.749	532.449	19.700	10
Dromaius novaehollandiae	2000	513.909	524.185	7.652	513.896	535.611	21.716	10
Gallus gallus	2000	509.720	497.757	6.165	488.899	506.548	17.649	10
Gekko gecko	2000	501.820	506.012	2.900	501.590	511.683	10.093	10
Meleagris gallopavo	2000	519.000	529.260	6.319	515.691	539.101	23.410	10
Melopsittacus undulatus	2000	528.605	529.779	3.075	523.585	532.619	9.034	10
Nerodia sipedon	2000	518.055	517.793	2.972	513.858	521.574	7.716	10
Pelodiscus sinensis	2000	531.954	529.123	4.390	521.159	534.138	12.979	10
Sphenodon punctatus	2000	517.506	519.960	2.157	517.092	523.652	6.560	10
Struthio camelus	2000	529.941	526.857	3.905	521.024	532.680	11.656	10
Varanus salvator	2000	508.411	514.951	3.618	507.914	519.327	11.413	10
Xenopus laevis	2000	533.492	538.685	5.428	530.784	546.494	15.711	10

Table 10. Starting-position robustness: per-offset identities.

Species	shift_index	shift_offset	shift_%ID (×10)
Accipiter gentilis	0	1265	526.603
Accipiter gentilis	1	1895	536.563
Accipiter gentilis	2	1092	527.548
Accipiter gentilis	3	1453	520.092
Accipiter gentilis	4	749	528.311
Accipiter gentilis	5	1169	523.701
Accipiter gentilis	6	1199	535.974
Accipiter gentilis	7	1498	520.055
Accipiter gentilis	8	339	529.078
Accipiter gentilis	9	1587	526.577
Alligator mississippiensis	0	1688	511.080
Alligator mississippiensis	1	681	501.892
Alligator mississippiensis	2	785	503.817
Alligator mississippiensis	3	1300	502.557
Alligator mississippiensis	4	1617	507.069
Alligator mississippiensis	5	736	504.556
Alligator mississippiensis	6	1887	505.254
Alligator mississippiensis	7	630	503.700
Alligator mississippiensis	8	801	504.274
Alligator mississippiensis	9	1427	507.554
Anas platyrhynchos	0	419	514.023
Anas platyrhynchos	1	1559	519.095
Anas platyrhynchos	2	1344	502.793
Anas platyrhynchos	3	1769	523.378
Anas platyrhynchos	4	1868	521.938
Anas platyrhynchos	5	747	517.593
Anas platyrhynchos	6	223	519.159
Anas platyrhynchos	7	871	517.734
Anas platyrhynchos	8	1680	521.533
Anas platyrhynchos	9	1300	505.131
Aptenodytes forsteri	0	816	519.511
Aptenodytes forsteri	1	1843	511.993
Aptenodytes forsteri	2	133	518.325
Aptenodytes forsteri	3	1035	525.797
Aptenodytes forsteri	4	1431	509.451
Aptenodytes forsteri	5	1368	521.292
Aptenodytes forsteri	6	712	518.571
Aptenodytes forsteri	7	1819	507.531
Aptenodytes forsteri	8	826	521.327
Aptenodytes forsteri	9	826	521.327
Archilochus colubris	0	1641	519.667
Archilochus colubris	1	1510	511.510
Archilochus colubris	2	1638	519.667
Archilochus colubris	3	636	512.459
Archilochus colubris	4	1359	517.383
Archilochus colubris	5	36	514.479
Archilochus colubris	6	1704	516.129
Archilochus colubris	7	723	514.880
Archilochus colubris	8	1217	518.332
Archilochus colubris	9	242	511.561
Boa constrictor	0	943	524.793
Boa constrictor	1	504	521.615
Boa constrictor	2	1961	519.067
Boa constrictor	3	350	526.392
Boa constrictor	4	1629	513.702
Boa constrictor	5	1744	520.626
Boa constrictor	6	642	522.233
Boa constrictor	7	1391	511.855
Boa constrictor	8	1414	512.905
Boa constrictor	9	1887	513.612
Caiman crocodilus	0	1874	501.415
Caiman crocodilus	1	1238	496.070
Caiman crocodilus	2	936	508.310
Caiman crocodilus	3	1235	496.070
Caiman crocodilus	4	542	507.435
Caiman crocodilus	5	1646	506.824
Caiman crocodilus	6	411	504.753
Caiman crocodilus	7	1121	497.925
Caiman crocodilus	8	137	503.646
Caiman crocodilus	9	1307	505.758
Chelonia mydas	0	1076	534.202
Chelonia mydas	1	1985	534.411
Chelonia mydas	2	1267	529.813
Chelonia mydas	3	1986	534.411
Chelonia mydas	4	1182	526.437
Chelonia mydas	5	1584	529.086
Chelonia mydas	6	89	530.790
Chelonia mydas	7	147	532.689
Chelonia mydas	8	834	530.220
Chelonia mydas	9	1971	534.411
Columba livia	0	1337	519.157
Columba livia	1	1847	518.312
Columba livia	2	1269	518.934
Columba livia	3	110	515.525
Columba livia	4	1383	508.024
Columba livia	5	35	521.252
Columba livia	6	1300	517.774
Columba livia	7	403	521.800
Columba livia	8	403	521.800
Columba livia	9	214	513.715
Crocodylus porosus	0	1918	518.194
Crocodylus porosus	1	1250	530.835
Crocodylus porosus	2	683	525.234
Crocodylus porosus	3	885	525.224
Crocodylus porosus	4	1171	532.398
Crocodylus porosus	5	1618	513.953
Crocodylus porosus	6	468	530.303
Crocodylus porosus	7	1125	532.449
Crocodylus porosus	8	93	515.250
Crocodylus porosus	9	1589	512.749
Dromaius novaehollandiae	0	1137	521.458
Dromaius novaehollandiae	1	1487	520.935
Dromaius novaehollandiae	2	1322	523.902
Dromaius novaehollandiae	3	629	535.611
Dromaius novaehollandiae	4	1737	518.221
Dromaius novaehollandiae	5	1825	516.325
Dromaius novaehollandiae	6	437	534.253
Dromaius novaehollandiae	7	1252	524.479
Dromaius novaehollandiae	8	1995	513.896
Dromaius novaehollandiae	9	370	532.768
Gallus gallus	0	1083	497.530
Gallus gallus	1	1078	497.530
Gallus gallus	2	1082	497.530
Gallus gallus	3	378	502.770
Gallus gallus	4	1587	502.719
Gallus gallus	5	1357	488.919
Gallus gallus	6	103	506.548
Gallus gallus	7	1528	492.169
Gallus gallus	8	1013	502.956
Gallus gallus	9	1441	488.899
Gekko gecko	0	1508	511.683
Gekko gecko	1	1195	503.461
Gekko gecko	2	932	505.564
Gekko gecko	3	1993	501.590
Gekko gecko	4	1840	506.132
Gekko gecko	5	50	506.996
Gekko gecko	6	1648	508.204
Gekko gecko	7	1069	503.076
Gekko gecko	8	998	507.685
Gekko gecko	9	1110	505.731
Meleagris gallopavo	0	1540	527.725
Meleagris gallopavo	1	1339	536.316
Meleagris gallopavo	2	1492	530.769
Meleagris gallopavo	3	1910	526.676
Meleagris gallopavo	4	722	515.691
Meleagris gallopavo	5	1504	531.114
Meleagris gallopavo	6	1143	525.912
Meleagris gallopavo	7	1635	530.732
Meleagris gallopavo	8	1165	528.565
Meleagris gallopavo	9	1320	539.101
Melopsittacus undulatus	0	526	532.619
Melopsittacus undulatus	1	536	531.875
Melopsittacus undulatus	2	136	531.966
Melopsittacus undulatus	3	1451	527.700
Melopsittacus undulatus	4	267	529.661
Melopsittacus undulatus	5	1941	525.968
Melopsittacus undulatus	6	120	531.966
Melopsittacus undulatus	7	1850	523.585
Melopsittacus undulatus	8	348	530.317
Melopsittacus undulatus	9	647	532.131
Nerodia sipedon	0	1679	516.651
Nerodia sipedon	1	160	520.665
Nerodia sipedon	2	392	515.539
Nerodia sipedon	3	900	516.729
Nerodia sipedon	4	1951	513.988
Nerodia sipedon	5	1106	513.858
Nerodia sipedon	6	1707	517.241
Nerodia sipedon	7	518	520.299
Nerodia sipedon	8	1797	521.384
Nerodia sipedon	9	109	521.574
Pelodiscus sinensis	0	1135	531.063
Pelodiscus sinensis	1	307	521.159
Pelodiscus sinensis	2	874	523.062
Pelodiscus sinensis	3	715	525.599
Pelodiscus sinensis	4	1241	532.934
Pelodiscus sinensis	5	1468	530.658
Pelodiscus sinensis	6	1804	534.138
Pelodiscus sinensis	7	1856	530.575
Pelodiscus sinensis	8	1906	532.715
Pelodiscus sinensis	9	87	529.328
Sphenodon punctatus	0	880	520.561
Sphenodon punctatus	1	588	517.466
Sphenodon punctatus	2	1517	519.542
Sphenodon punctatus	3	1989	517.092
Sphenodon punctatus	4	318	521.236
Sphenodon punctatus	5	114	518.776
Sphenodon punctatus	6	1298	518.818
Sphenodon punctatus	7	944	519.535
Sphenodon punctatus	8	711	523.652
Sphenodon punctatus	9	779	522.919
Struthio camelus	0	140	528.578
Struthio camelus	1	426	532.680
Struthio camelus	2	721	524.816
Struthio camelus	3	1345	522.769
Struthio camelus	4	1386	521.024
Struthio camelus	5	1318	526.145
Struthio camelus	6	1230	524.966
Struthio camelus	7	539	529.412
Struthio camelus	8	1469	525.564
Struthio camelus	9	437	532.619
Varanus salvator	0	392	517.098
Varanus salvator	1	278	514.967
Varanus salvator	2	991	515.306
Varanus salvator	3	802	518.451
Varanus salvator	4	1255	507.914
Varanus salvator	5	879	513.182
Varanus salvator	6	1132	511.153
Varanus salvator	7	517	519.327
Varanus salvator	8	327	513.590
Varanus salvator	9	453	518.519
Xenopus laevis	0	1456	545.098
Xenopus laevis	1	1725	539.915
Xenopus laevis	2	683	540.246
Xenopus laevis	3	804	546.494
Xenopus laevis	4	1278	530.784
Xenopus laevis	5	398	543.160
Xenopus laevis	6	1642	537.827
Xenopus laevis	7	1562	537.234
Xenopus laevis	8	1961	535.305
Xenopus laevis	9	1279	530.784

Note: Percent identity columns are reported at ×10 scale (e.g., 535.002 corresponds to 53.5002%).
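As a worked check on how the per-offset values relate to the across-shift summary, the following minimal sketch recomputes the summary statistics for Accipiter gentilis from its ten Table 10 entries, converting from the ×10 scale to percent first. The function name `summarize_shifts` is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption; it is the variant that agrees with the tabulated SD for this species.

```python
from statistics import mean, stdev

# Per-offset identities for Accipiter gentilis, copied from Table 10 (x10 scale).
accipiter_x10 = [526.603, 536.563, 527.548, 520.092, 528.311,
                 523.701, 535.974, 520.055, 529.078, 526.577]

def summarize_shifts(values_x10):
    """Return across-shift summary statistics in percent (input is at x10 scale)."""
    pct = [v / 10.0 for v in values_x10]
    return {
        "mean": mean(pct),
        "sd": stdev(pct),            # sample SD (n - 1); assumed, not stated in the paper
        "min": min(pct),
        "max": max(pct),
        "range": max(pct) - min(pct),
        "n": len(pct),
    }

summary = summarize_shifts(accipiter_x10)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The recomputed mean, minimum, maximum, and range can be compared directly with the Accipiter gentilis row of Table 9 once the scale factor is taken into account.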

Table 11. Results of the sensitivity analysis for alignment and tree/ASR choices.

Species_Key	mega_pid	mega_len	mega_score	loso_pid	loso_len	loso_score	delta_pid_pp
Accipiter gentilis	43.46	3104	1315.5	42.238	3092	1295.5	−1.222
Alligator mississippiensis	41.106	3092	1248.5	41.145	3128	1273.0	0.038
Anas platyrhynchos	42.741	3065	1278.5	43.055	3038	1279.5	0.314
Aptenodytes forsteri	42.16	3074	1265.0	43.828	2973	1285.0	1.668
Archilochus colubris	42.166	3019	1265.5	42.848	3006	1272.5	0.681
Boa constrictor	42.034	3107	1292.0	42.778	3053	1288.0	0.743
Caiman crocodilus	42.775	2955	1248.5	42.453	3001	1259.0	−0.322
Chelonia mydas	42.631	3094	1299.0	43.902	2968	1287.0	1.271
Columba livia	41.608	3134	1271.5	43.307	2988	1278.0	1.698
Crocodylus porosus	41.754	3056	1260.0	42.742	2983	1271.0	0.988
Dromaius novaehollandiae	43.34	2958	1273.5	44.166	2957	1297.0	0.826
Gallus gallus	42.145	3030	1252.5	41.934	3081	1268.0	−0.211
Gekko gecko	43.148	2948	1254.0	41.643	3129	1272.0	−1.505
Meleagris gallopavo	41.863	3060	1260.5	42.28	3070	1277.5	0.417
Melopsittacus undulatus	41.68	3131	1286.0	43.394	3035	1294.0	1.714
Nerodia sipedon	41.672	3086	1265.5	42.998	3042	1287.5	1.326
Pelodiscus sinensis	42.396	3038	1274.5	43.595	3060	1301.0	1.198
Sphenodon punctatus	44.534	2964	1282.5	41.924	3108	1279.0	−2.61
Struthio camelus	42.031	3043	1265.0	44.295	2971	1296.5	2.264
Varanus salvator	42.768	3042	1279.0	42.797	3103	1296.0	0.029
Xenopus laevis	43.662	3037	1310.0	44.477	3051	1352.5	0.816
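The delta_pid_pp column appears to be the LOSO percent identity minus the MEGA percent identity, expressed in percentage points. This reading is inferred from the tabulated values rather than stated in this section, and the helper name `delta_pp` is hypothetical. A minimal sketch for four rows:

```python
# (mega_pid, loso_pid) pairs copied from Table 11 for four species.
table11 = {
    "Accipiter gentilis": (43.46, 42.238),
    "Aptenodytes forsteri": (42.16, 43.828),
    "Sphenodon punctatus": (44.534, 41.924),
    "Struthio camelus": (42.031, 44.295),
}

def delta_pp(mega_pid: float, loso_pid: float) -> float:
    """Difference in percent identity (LOSO minus MEGA), in percentage points."""
    return round(loso_pid - mega_pid, 3)

deltas = {species: delta_pp(mega, loso) for species, (mega, loso) in table11.items()}
for species, delta in deltas.items():
    print(f"{species}: {delta:+.3f} pp")
```

A positive value indicates that the LOSO-based reconstruction yielded a higher identity than the MEGA-based one for that species.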

Table 12. Summary of per-species median %ID (Equal-length) for consensus two-flank imputation and baselines.

Species	C_EqualLen_%ID_median	NN_EqualLen_%ID_median	FlankCopy_EqualLen_%ID_median	MaskReps
Accipiter gentilis	34.6	27.4	29.4	3
Alligator mississippiensis	28.2	26.2	26.2	3
Anas platyrhynchos	33.6	27.6	31.0	3
Aptenodytes forsteri	25.8	31.6	30.8	3
Archilochus colubris	26.6	25.2	27.0	3
Boa constrictor	38.2	24.4	33.6	3
Caiman crocodilus	29.0	23.8	27.8	3
Chelonia mydas	28.6	23.8	26.4	3
Columba livia	27.0	25.6	30.2	3
Crocodylus porosus	33.6	28.2	30.6	3
Dromaius novaehollandiae	26.6	31.0	27.2	3
Gallus gallus	31.8	28.6	34.2	3
Gekko gecko	25.6	24.6	30.2	3
Meleagris gallopavo	31.0	30.2	27.0	3
Melopsittacus undulatus	29.6	27.8	30.6	3
Nerodia sipedon	39.8	34.4	34.2	3
Pelodiscus sinensis	29.6	30.2	24.8	3
Sphenodon punctatus	32.2	22.6	28.0	3
Struthio camelus	25.0	30.4	22.8	3
Varanus salvator	27.2	23.8	30.4	3
Xenopus laevis	31.4	28.0	26.0	3
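One simple way to read Table 12 is to count, per baseline, the species for which the consensus model's median identity exceeds that baseline's median. The sketch below does this with the values copied from the table; the function name `count_wins` is hypothetical, and a strict greater-than comparison is an assumed convention for what counts as outperforming.

```python
# (model, nearest-neighbor, flank-copy) median equal-length %ID, copied from Table 12.
table12 = {
    "Accipiter gentilis": (34.6, 27.4, 29.4),
    "Alligator mississippiensis": (28.2, 26.2, 26.2),
    "Anas platyrhynchos": (33.6, 27.6, 31.0),
    "Aptenodytes forsteri": (25.8, 31.6, 30.8),
    "Archilochus colubris": (26.6, 25.2, 27.0),
    "Boa constrictor": (38.2, 24.4, 33.6),
    "Caiman crocodilus": (29.0, 23.8, 27.8),
    "Chelonia mydas": (28.6, 23.8, 26.4),
    "Columba livia": (27.0, 25.6, 30.2),
    "Crocodylus porosus": (33.6, 28.2, 30.6),
    "Dromaius novaehollandiae": (26.6, 31.0, 27.2),
    "Gallus gallus": (31.8, 28.6, 34.2),
    "Gekko gecko": (25.6, 24.6, 30.2),
    "Meleagris gallopavo": (31.0, 30.2, 27.0),
    "Melopsittacus undulatus": (29.6, 27.8, 30.6),
    "Nerodia sipedon": (39.8, 34.4, 34.2),
    "Pelodiscus sinensis": (29.6, 30.2, 24.8),
    "Sphenodon punctatus": (32.2, 22.6, 28.0),
    "Struthio camelus": (25.0, 30.4, 22.8),
    "Varanus salvator": (27.2, 23.8, 30.4),
    "Xenopus laevis": (31.4, 28.0, 26.0),
}

def count_wins(rows, baseline_index):
    """Count species where the model median strictly exceeds the chosen baseline."""
    return sum(1 for vals in rows.values() if vals[0] > vals[baseline_index])

wins_vs_nn = count_wins(table12, 1)      # column 1: nearest-neighbor baseline
wins_vs_flank = count_wins(table12, 2)   # column 2: flank-copy baseline
print(f"Model > NN: {wins_vs_nn}/{len(table12)}")
print(f"Model > flank-copy: {wins_vs_flank}/{len(table12)}")
```

Because each median is based on only three mask replicates per species, such counts are descriptive and should not be read as a significance test.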

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Angelakis, D.; Cavouras, D.; Glotsos, D.T.; Kostopoulos, S.A.; Athanasiadis, E.I.; Kalatzis, I.K.; Asvestas, P.A. In Silico Proof of Concept: Conditional Deep Learning-Based Prediction of Short Mitochondrial DNA Fragments in Archosaurs. AI 2026, 7, 27. https://doi.org/10.3390/ai7010027

