Article

Lightweight and Accurate Table Recognition via Improved SLANet with Multi-Phase Training Strategy

1 School of Computer Science and Technology, Hainan University, Haikou 570228, China
2 Haikou Key Laboratory of Deep Learning and Big Data Application Technology, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 25; https://doi.org/10.3390/math14010025
Submission received: 29 September 2025 / Revised: 3 December 2025 / Accepted: 17 December 2025 / Published: 21 December 2025

Abstract

Tables, as an efficient form of structured data representation, are widely applied across domains. However, traditional manual processing methods are inadequate in the big data era, and existing table recognition models, such as SLANet, still face performance limitations. To address these issues, this paper proposes an improved SLANet framework. First, the original H-Swish activation is replaced with the Mish function to enhance feature representation. Second, an end-of-sequence (EOS) termination mechanism is introduced to reduce computational redundancy during inference. Third, a three-phase training strategy is designed to achieve progressive performance improvements. Experimental evaluation on the PubTabNet benchmark demonstrates that the improved SLANet achieves 77.25% accuracy with an average inference time of 774 ms, outperforming the baseline and most mainstream algorithms while retaining lightweight efficiency. The proposed algorithm also achieves a TEDS score of 96.67%, substantially surpassing the SLANet baseline and remaining competitive with far heavier state-of-the-art methods. The code will be released upon acceptance.

1. Introduction

Tables serve as a fundamental medium for representing structured information across diverse domains. In the era of large-scale digitization, automated extraction of tabular structures has become increasingly important for information retrieval, data mining, and intelligent document processing. Unlike plain text or figures, tables contain hierarchical and spatial relationships that must be reconstructed accurately to preserve semantics [1]. Tables are particularly valued for their high information density, structured organization, and intuitive readability. They play a vital role in domains such as market analysis, scientific research, education management, and financial auditing [2]. Consequently, intelligent table recognition has become a critical branch within the broader field of document intelligence, attracting extensive research attention and demonstrating substantial application potential [3].
In everyday scenarios, individuals often rely on manual methods to process tables. While feasible for small-scale tasks, manual processing is inefficient, error-prone, and unsuitable for large-scale document workflows. Early heuristic-based algorithms relied on handcrafted rules and geometric features, but these methods often failed under diverse layouts, borderless tables, or spanning cells.
With the rapid advancement of deep learning, table structure recognition has shifted toward data-driven encoder-decoder architectures and image-to-markup generation models. These models learn to convert table images directly into structured representations such as HTML or LaTeX, enabling end-to-end recognition with high expressiveness [4]. In the specific domain of document analysis, the recently proposed TABLET [5] utilizes encoder-only Transformers within a split-and-merge framework, achieving state-of-the-art performance on complex, densely populated tables. However, despite its superior accuracy, the substantial model size and computational overhead of such heavy Transformer-based architectures pose significant challenges for deployment on resource-constrained edge devices (e.g., mobile terminals or embedded systems). This limitation underscores the critical need for lightweight solutions that can balance high performance with inference efficiency, which is the primary focus of this study.
SLANet [6], proposed within the PP-StructureV2 framework, represents a lightweight and efficient baseline for image-to-sequence table structure recognition. Although SLANet achieves promising performance, it still suffers from three main limitations: (1) its backbone relies on the H-Swish activation function, which limits gradient smoothness and generalization; (2) its fixed-length decoding leads to substantial computational redundancy; (3) its single-stage training scheme does not fully exploit the model’s learning potential.
While recent Transformer-based models such as TABLET achieve high accuracy, they are often too heavy for deployment on resource-constrained edge devices (e.g., mobile terminals). Conversely, existing lightweight models, such as the original SLANet, suffer from performance bottlenecks. Our work aims to bridge this gap by enhancing the structural reasoning capability of the lightweight SLANet without compromising its inference efficiency. In light of these issues, we propose an improved SLANet framework that integrates an advanced activation function, introduces an efficient end-of-sequence (EOS) stopping criterion during inference, and adopts a multi-phase training strategy to optimize model performance.
The main contributions of this paper are summarized below:
  • Activation Function Optimization. We replace H-Swish with the smoother and fully differentiable Mish activation function, improving feature representation and gradient flow in the PP-LCNet backbone.
  • Inference Acceleration via EOS Termination. We introduce an end-of-sequence (EOS) token that enables early stopping during decoding, eliminating redundant computation caused by SLANet’s fixed maximum sequence length.
  • Progressive Multi-Phase Training Strategy. We design a progressive three-phase training scheme (coarse learning, fine-tuning, and high-resolution refinement) to stabilize convergence and enhance structural sequence modeling. This staged curriculum provides a more effective optimization path than a single fixed learning rate.
Overall, this research advances lightweight table structure recognition by optimizing a strong baseline model and proposing a robust multi-module pipeline. The improvements achieved in both accuracy and efficiency highlight the practical significance of the proposed approach, contributing to the broader progress of document intelligence technologies.

2. Related Work

The development of table structure recognition can generally be divided into three stages: early manual processing, which was labor-intensive and error-prone; a subsequent heuristic-based stage, in which handcrafted rules and traditional image processing techniques were used; and, more recently, the deep learning era, where data-driven methods have become dominant. In what follows, we review the representative works of heuristic-based and deep learning-based methods.

2.1. Heuristic-Based Table Structure Recognition

Early intelligent table recognition relied primarily on heuristic rules, where conventional image processing techniques were designed according to handcrafted features. Typically, these approaches applied digital image processing methods such as morphological operations and binarization, followed by post-processing steps including edge detection and texture extraction.
A number of pioneering studies shaped this research direction. Itonori [7] exploited the two-dimensional layout of tables, applying connected component analysis to extract text blocks, and used block alignment operations to determine the coordinates and row-column positions of cells, laying an important foundation for later studies. Rahgozar et al. [8] proposed a clustering-based approach that incorporated cell spacing into row and column clustering, ultimately reconstructing the table structure via intersection analysis. Hirayama [9] designed a geometric analysis method for line-based tables, leveraging the parallel and orthogonal properties of ruling lines to infer row-column layouts, while introducing dynamic programming to match logical relationships among content blocks. Zuyev [10] further advanced visual-feature-based recognition by analyzing row lines, column lines, and whitespace regions to achieve cell segmentation. The T-Recs system by Kieninger [11] applied heuristics to word bounding boxes, combining clustering with column decomposition, and its successor T-Recs++ [12] demonstrated improved recognition performance. From a structural representation perspective, Wang et al. [13] innovatively modeled tables with tree structures and proposed a parameter optimization algorithm, whereas Ishitani et al. [14] represented tables with DOM trees and employed feature extraction and cell classification to handle irregular tables, thereby improving recognition in complex scenarios.
Despite their contributions, heuristic-based methods suffer from intrinsic limitations. Since the rules are specifically designed for certain table features, such algorithms lack robustness and generalization capacity, making them unsuitable for diverse and complex table layouts. When applied to sophisticated cases such as multi-level nested tables, merged cells, or line-free layouts, rule-based systems often require intricate rule sets, which substantially reduce recognition accuracy. As a result, purely heuristic methods have difficulty achieving stable and reliable performance in real-world applications.

2.2. Deep Learning-Based Table Structure Recognition

With the rapid development of artificial intelligence, deep learning techniques have demonstrated remarkable success across various computer vision domains. For instance, in the field of precision agriculture, Wang et al. [15] provided a comprehensive review of high-throughput phenotyping trends, highlighting the efficacy of deep learning in processing complex unstructured visual data. Furthermore, innovative architectures such as RA-CottNet [16] have shown that integrating advanced attention mechanisms and dynamic convolutions can significantly enhance recognition performance in complex environments. Similarly, recent advances in remote sensing have enabled high-quality interactive segmentation for moving objects [17], while novel subspace and gaze estimation strategies have been proposed to improve attribute localization in zero-shot learning tasks [18,19]. Inspired by these advancements in broad visual perception tasks, researchers in document analysis have also increasingly turned to data-driven approaches.
The earliest intelligent methods for table recognition relied heavily on manually designed features and predefined rules [20]. With the emergence of large-scale annotated datasets such as PubTabNet [2], TableBank [21], and WTW [22], the field has shifted decisively toward learning-based methods. These methods have achieved remarkable success [23] and can be broadly classified into three categories: (i) vision-based detection and segmentation methods, (ii) logical location prediction methods, and (iii) image-to-sequence generation methods.
(a) Vision-Based Detection and Segmentation. These approaches primarily leverage the visual appearance of tables to identify rows and columns. The typical pipeline involves detecting row and column separators, segmenting basic cells, and then merging them via rule-based or learning-based techniques to reconstruct the table structure [24].
Representative works include Schreiber et al. [25], who developed DeepDeSRT based on a fully convolutional network (FCN) [26] for semantic segmentation of rows and columns. To address the challenge of small row spacing, they introduced an image-stretching preprocessing strategy that significantly improved row segmentation accuracy. Siddiqui et al. [27] treated table recognition as a semantic segmentation problem, designing an encoder-decoder style network that produced pixel-level classifications. Paliwal et al. [28] proposed TableNet, which jointly addressed table detection and structure recognition within a unified semantic segmentation framework, extracting multi-scale features from different backbone layers and applying heuristic rules for final reconstruction. Khan et al. [29] observed that traditional convolutional networks were limited in capturing row-column arrangements, and introduced bidirectional recurrent neural networks to enhance row and column separator detection at the pixel level, thereby improving performance on complex layouts. Tensmeyer et al. [24] extended this line of work by refining segmentation networks and designing a merging module to intelligently handle spanning cells. While these methods perform well on regular tables, they often struggle with irregular, line-free, or heavily distorted layouts due to limited generalization ability.
(b) Logical Location Prediction. This category represents each cell by its logical coordinates—specifically, the starting and ending row and column indices—thereby encoding both local positions and global structure.
Xue et al. [30] pioneered this approach with Res2TIM, though the performance was limited. Later, Xue et al. [31] introduced TGRNet, incorporating graph convolutional networks (GCNs) [32] to aggregate multimodal features and using ordered node classification, which significantly improved accuracy on simple tables but remained less effective on complex layouts. LORE [33] employed cascaded multi-head attention (MHA) [34] to directly regress logical coordinates, achieving promising results but overemphasizing global information at the expense of local context. Li et al. [35] proposed GFTE, which integrated visual, positional, and graph relational features through GCNs, offering a balanced treatment of local and global information. Despite these advances, challenges remain in modeling highly complex layouts, efficiently fusing multimodal features, and reducing computational overhead. Future research directions may include hybrid architectures, dynamic computation optimization, and cross-modal representation learning.
(c) Image-to-Sequence Generation. In this paradigm, table recognition is cast as a sequence prediction task, where a model directly generates serialized markup (e.g., HTML or LaTeX) from table images. The availability of datasets with markup annotations, such as PubTabNet [2], TableBank [21], and Table2LaTeX [36], has fueled rapid growth in this area.
Deng et al. adapted the IM2LATEX model [37], which combines CNN-based visual feature extraction with an attention-enhanced long short-term memory (LSTM) network to generate LaTeX sequences. Zhong et al. [2] created PubTabNet and proposed EDD, an encoder-dual-decoder architecture where a CNN encoder extracts features, and two RNN decoders, respectively, generate structural tokens and textual content. Ye et al. [38] advanced this direction with TableMaster, which extends the Master model [39] for text recognition by adding a cell detection branch to the decoder, enabling end-to-end joint optimization of HTML generation and cell localization. These models demonstrate superior performance by directly producing editable structured documents, eliminating complex post-processing, and achieving higher accuracy and robustness compared to previous paradigms. Nevertheless, they face challenges such as the difficulty of modeling long sequences, dependency on large-scale annotated data, and susceptibility to noisy labels during training. Future research may focus on weakly supervised or semi-supervised learning to reduce annotation costs and mitigate label noise, as well as on designing lightweight architectures for efficient deployment.
To more clearly and intuitively compare different types of table structure recognition methods, we present the main advantages and key limitations of several different methods in tabular form, as shown in Table 1.

3. Methodology

3.1. Baseline SLANet

In recent years, deep learning-based approaches have achieved remarkable progress in table structure recognition. Within the PP-StructureV2 framework, the PaddlePaddle team proposed SLANet [6], a highly efficient table structure recognition algorithm based on an image-to-sequence generation paradigm. As an end-to-end architecture, SLANet is capable of directly outputting HTML code that represents the table structure together with the corresponding cell coordinates. This design also enables seamless conversion into Excel format. Compared with TableRec-RARE [40], an earlier model introduced in PP-Structure, SLANet incorporates significant upgrades in both network architecture and loss function design. The overall architecture of SLANet, illustrated in Figure 1, follows a classical three-stage design comprising a backbone network, a feature fusion neck, and a decoder head.
As a lightweight model, SLANet demonstrates excellent inference efficiency, particularly on Intel CPU platforms. Its backbone network has been specifically optimized for this hardware, resulting in inference speeds that outperform most state-of-the-art table recognition methods. In terms of recognition accuracy, SLANet has already surpassed advanced models such as EDD and LGPMA [41], thereby exhibiting considerable competitiveness. Nevertheless, with the increasing demand for intelligent table processing across industries, SLANet still faces new challenges regarding recognition precision and generalization capability.
In this study, we conduct an in-depth analysis of the SLANet architecture, systematically identifying its current limitations and proposing corresponding optimization strategies. Our objective is to preserve its efficiency in inference while simultaneously enhancing its structural recognition accuracy.
The backbone network of SLANet is built upon PP-LCNet [42], a lightweight and CPU-friendly architecture. In computer vision tasks, the performance of the backbone network plays a decisive role in determining the final accuracy and efficiency of the entire model. Although a variety of lightweight backbones have been introduced in recent years, few architectures have been specifically optimized for Intel CPUs. This limitation prevents existing models from achieving optimal inference efficiency on such hardware platforms. To address this gap, the PaddlePaddle team designed PP-LCNet, an efficient backbone network tailored to the hardware characteristics of Intel CPUs and its oneDNN acceleration library. As shown in Figure 2, comparative experiments demonstrate that, relative to other state-of-the-art lightweight models, PP-LCNet achieves significantly higher accuracy while maintaining the same inference speed, ultimately surpassing existing optimal methods.
The architecture of PP-LCNet, illustrated in Figure 3, is designed according to four key optimization strategies:
  • employing more effective activation functions,
  • selectively inserting squeeze-and-excitation (SE) modules,
  • introducing larger convolution kernels at appropriate positions, and
  • utilizing a larger 1 × 1 convolution layer after the global average pooling layer.
By adhering to these experimentally validated strategies, PP-LCNet improves model accuracy with negligible additional latency. As shown in Figure 3, PP-LCNet consists of 13 depthwise separable convolution blocks of varying kernel sizes. Depthwise separable convolutions maintain effective feature extraction capability while significantly reducing parameter count and computational cost. Between convolutional layers, PP-LCNet applies the H-Swish activation function, an efficient variant of Swish. H-Swish removes the exponential operation inherent in Swish, offering faster computation with minimal accuracy degradation. Furthermore, PP-LCNet selectively integrates SE modules within its convolution blocks. Originally proposed in SENet [43], SE modules introduce channel attention mechanisms that effectively enhance model precision.
As the backbone of SLANet, PP-LCNet receives preprocessed input data and performs deep feature extraction. In the SLANet architecture, the fully connected layers following the PP-LCNet convolution blocks are removed. Instead, after successive convolutional operations, the network outputs four feature maps with spatial resolutions reduced to 1/4, 1/8, 1/16, and 1/32 of the input image size. These hierarchical features serve as the foundation for subsequent feature fusion and decoding stages.
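To make the block design concrete, the following is a minimal PaddlePaddle sketch of a depthwise separable convolution block with an optional SE module, in the spirit of PP-LCNet; the channel counts, SE reduction ratio, and layer ordering are illustrative assumptions rather than the exact PP-LCNet configuration.

```python
import paddle
import paddle.nn as nn

class SEModule(nn.Layer):
    """Squeeze-and-excitation channel attention, as introduced in SENet [43]."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2D(1)
        self.fc = nn.Sequential(
            nn.Conv2D(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2D(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.avg_pool(x))  # channel-wise reweighting

class DepthwiseSeparableBlock(nn.Layer):
    """Depthwise conv + pointwise conv with H-Swish activations and optional SE."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, use_se=False):
        super().__init__()
        self.dw = nn.Conv2D(in_ch, in_ch, kernel_size, stride,
                            padding=kernel_size // 2, groups=in_ch)  # depthwise
        self.bn1 = nn.BatchNorm2D(in_ch)
        self.se = SEModule(in_ch) if use_se else None
        self.pw = nn.Conv2D(in_ch, out_ch, 1)  # pointwise (1x1) projection
        self.bn2 = nn.BatchNorm2D(out_ch)
        self.act = nn.Hardswish()

    def forward(self, x):
        x = self.act(self.bn1(self.dw(x)))
        if self.se is not None:
            x = self.se(x)
        return self.act(self.bn2(self.pw(x)))

# e.g., a stride-2 block with a 5x5 depthwise kernel and SE enabled:
block = DepthwiseSeparableBlock(16, 32, kernel_size=5, stride=2, use_se=True)
y = block(paddle.randn([1, 16, 64, 64]))  # -> shape [1, 32, 32, 32]
```

Stacking such blocks with varying kernel sizes and strides, and tapping the outputs at strides 4, 8, 16, and 32, yields the hierarchical feature maps described above.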
While efficient, SLANet is limited by its use of the H-Swish activation function, fixed-length decoding, and a single-stage training regime. These limitations are addressed in the following subsections.

3.2. Activation Function Optimization

In SLANet, the H-Swish activation function is primarily adopted. The mathematical formulation of H-Swish is presented in Equation (1), while its component function ReLU6 [43] is defined in Equation (2).
$\mathrm{HSwish}(x) = x \cdot \dfrac{\mathrm{ReLU6}(x + 3)}{6}$ (1)
$\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$ (2)
H-Swish is derived from the Swish function by replacing the Sigmoid operation with ReLU6, thereby eliminating the costly exponential computation and substituting it with a simple piecewise function. This modification constitutes one of the key reasons for the relatively fast inference speed of SLANet. However, such simplification inevitably imposes certain limitations on model accuracy and generalization capability. To address this issue, we analyze three popular activation functions, Swish, H-Swish, and Mish, which are closely related, with the goal of identifying an alternative that improves accuracy without significantly sacrificing computational efficiency.
Swish can be regarded as the predecessor of H-Swish, and its mathematical form is given in Equation (3), with Sigmoid defined in Equation (4). Due to the inclusion of exponential operations, Swish is computationally more expensive than H-Swish.
$\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(\beta x)$ (3)
$\mathrm{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$ (4)
To intuitively compare the two functions, we plotted their function curves and gradients, as illustrated in Figure 4.
From the functional plots (left subfigure of Figure 4), it is evident that the values of Swish and H-Swish are close, with similar overall trends. However, the gradient plots (right subfigure) reveal significant differences, which visually highlight the performance gap between the two. Both Swish and H-Swish are non-saturating activation functions with a lower bound but no upper bound, which prevents the problem of gradient saturation. Furthermore, they suppress some negative values, thereby providing a regularization effect that helps mitigate overfitting. Nevertheless, Swish exhibits a smoother behavior near the lower bound, while H-Swish introduces a hard zero boundary similar to ReLU, where the gradient abruptly drops to zero after a non-differentiable point. This property allows Swish to maintain better gradient flow. Additionally, Swish is a smooth curve that is continuous and differentiable everywhere, whereas H-Swish is piecewise and contains two non-differentiable points. Consequently, Swish offers better learning capacity and generalization ability, making training easier.
The Mish activation function, inspired by Swish, is defined in Equation (5), with the Tanh function given in Equation (6). Due to the inclusion of both exponential and logarithmic operations, Mish is computationally more complex than Swish.
$\mathrm{Mish}(x) = x \cdot \tanh(\ln(1 + e^{x}))$ (5)
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (6)
Figure 5 illustrates a comparison between Mish and Swish. Mish shares many characteristics with Swish: both are unbounded in the positive domain, bounded in the negative domain, non-monotonic, smooth, and continuous, while also retaining small negative weights. However, experimental results reported by Diganta Misra show that Mish consistently outperforms Swish in terms of stability, accuracy, and generalization. This improvement is likely attributable to the smoother curve of Mish across nearly all points, which facilitates deeper and more effective information propagation.
In summary, unlike H-Swish, which introduces a hard cutoff at zero (creating a “dead ReLU” effect where gradients vanish), Mish is continuously differentiable and allows a small amount of negative gradient to flow. This property is particularly critical for lightweight networks like SLANet, as it prevents information loss during the feature extraction of fine-grained table lines, theoretically leading to better optimization landscapes.
Each of the three activation functions exhibits distinct advantages and trade-offs. In terms of computational efficiency, H-Swish surpasses both Swish and Mish. With respect to accuracy and stability, Mish outperforms Swish and H-Swish. Given these characteristics, Mish appears to provide the most balanced performance overall. Therefore, in this study, we attempt to replace H-Swish with Mish in SLANet, aiming to enhance accuracy while retaining as much of the inference efficiency as possible.
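For reference, the following is a small NumPy sketch of Equations (1)–(6) that can be used to reproduce the curves and numerical gradients shown in Figures 4 and 5.

```python
import numpy as np

def relu6(x):
    # ReLU6(x) = min(max(0, x), 6), Equation (2)
    return np.minimum(np.maximum(0.0, x), 6.0)

def h_swish(x):
    # H-Swish(x) = x * ReLU6(x + 3) / 6, Equation (1)
    return x * relu6(x + 3.0) / 6.0

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x), Equations (3) and (4)
    return x / (1.0 + np.exp(-beta * x))

def mish(x):
    # Mish(x) = x * tanh(ln(1 + e^x)) = x * tanh(softplus(x)), Equation (5)
    # (use a numerically stable softplus for large |x| in practice)
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.linspace(-6.0, 6.0, 1201)
for f in (h_swish, swish, mish):
    grad = np.gradient(f(x), x)  # numerical gradient, as plotted in Figures 4 and 5
    print(f"{f.__name__}: f(0)={f(0.0):.4f}, f'(0)~{grad[600]:.4f}")
```

Plotting these functions and their gradients directly reproduces the qualitative differences discussed above: H-Swish is piecewise with two non-differentiable points, while Swish and Mish are smooth everywhere.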

3.3. EOS-Based Inference Optimization

During the inference stage of the SLANet model, the decoding process originally relies on a fixed step length. Although this step length can be adjusted via configuration, such a design inevitably leads to computational redundancy and consequently slows down inference speed. Specifically, since table structure tokens form sequences of variable length, the decoding step length must be set sufficiently large to accommodate most cases. However, for the majority of tables, the actual required step length is shorter than the fixed one, resulting in redundant decoding. In such cases, even after the model has already generated the complete table structure, the remaining decoding steps are still executed, thereby wasting computational resources and extending inference time.
To overcome this limitation and enhance both the efficiency and speed of SLANet during inference, we introduce the End-of-Sequence (EOS) termination strategy, a commonly used stopping condition in natural language processing and sequence generation tasks. EOS is a special token that signifies the end of a sequence. In our approach, an EOS token is appended to the end of the HTML tags representing the table structure. During decoding, the model sequentially generates the HTML tags of the table structure along with their corresponding cell coordinates. Once the EOS token is produced, the decoding process immediately terminates, and no further redundant steps are executed.
The implementation of the EOS termination strategy is presented in Algorithm 1, where decoding stops early once all samples in the batch predict the EOS token. For comparison, the original decoding loop without EOS termination is shown in Algorithm 2.
Algorithm 1. Decoding loop with EOS termination strategy
Input:
  • Feature map F: the visual features extracted by the backbone.
  • Max length L_max: the maximum allowed sequence length.
  • Previous token y_prev: initialized with the start-of-sequence token.
  • Hidden state h: the initial hidden state.
  • EOS: the end-of-sequence token identifier.
Output:
  • Structure sequence S: the predicted sequence of HTML structural tags.
  • Location sequence P: the predicted coordinates for table cells.
1. Initialize S ← [ ], P ← [ ], History ← [ ]
2. For t = 1 to L_max do:
3.   // Decode step
4.   h, prob_struct, prob_loc ← Decoder(y_prev, F, h)
5.   // Select the token with the highest probability
6.   y_curr ← argmax(prob_struct)
7.   // Store predictions
8.   Append prob_struct to S
9.   Append prob_loc to P
10.  Append y_curr to History
11.  // EOS termination check (optimization)
12.  If EOS exists in History for all samples in the batch then:
13.    Break loop
14.  // Update input for the next time step
15.  y_prev ← y_curr
16. End For
17. Return S, P
Algorithm 2. Original decoding loop in SLANet
Input:
  • Feature map F: the visual features extracted by the backbone.
  • Max length L_max: the maximum allowed sequence length.
  • Previous token y_prev: initialized with the start-of-sequence token.
  • Hidden state h: the initial hidden state.
Output:
  • Structure sequence S: the predicted sequence of HTML structural tags.
  • Location sequence P: the predicted coordinates for table cells.
1. Initialize S ← [ ], P ← [ ]
2. For t = 1 to L_max do:
3.   // Decode step
4.   h, prob_struct, prob_loc ← Decoder(y_prev, F, h)
5.   // Select the token with the highest probability
6.   y_curr ← argmax(prob_struct)
7.   // Store predictions regardless of completion
8.   Append prob_struct to S
9.   Append prob_loc to P
10.  // Update input for the next time step
11.  y_prev ← y_curr
12. End For
13. Return S, P
By adopting the EOS termination strategy, the inference process becomes more efficient, redundant computations are effectively reduced, and the overall decoding speed of the SLANet model is significantly accelerated.
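To illustrate Algorithm 1 in executable form, below is a minimal NumPy sketch of greedy decoding with batch-wide EOS termination; decoder_step, the token ids, and all shapes are hypothetical stand-ins for the actual SLANet decoder head, not its real API.

```python
import numpy as np

EOS_ID = 1  # hypothetical id of the end-of-sequence token

def decode_with_eos(decoder_step, feat, h0, sos_id=0, max_len=500):
    """Greedy decoding with batch-wide EOS early termination (Algorithm 1).

    decoder_step(y_prev, feat, h) -> (h, prob_struct, prob_loc) stands in
    for one step of the SLANet decoder head; all shapes are batch-first.
    """
    batch = feat.shape[0]
    y_prev = np.full(batch, sos_id, dtype=np.int64)
    h = h0
    structs, locs = [], []
    finished = np.zeros(batch, dtype=bool)  # per-sample EOS flags
    for _ in range(max_len):
        h, prob_struct, prob_loc = decoder_step(y_prev, feat, h)
        y_curr = prob_struct.argmax(axis=-1)
        structs.append(prob_struct)
        locs.append(prob_loc)
        finished |= (y_curr == EOS_ID)
        if finished.all():  # every sample has emitted EOS: stop decoding early
            break
        y_prev = y_curr
    return structs, locs

# Tiny demo with a dummy step function that always predicts EOS.
def dummy_step(y_prev, feat, h):
    probs = np.zeros((feat.shape[0], 5))
    probs[:, EOS_ID] = 1.0
    return h, probs, np.zeros((feat.shape[0], 4))

S, P = decode_with_eos(dummy_step, np.zeros((2, 8)), h0=None)
print(len(S))  # 1: the loop stops after the first step instead of running 500
```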

3.4. Progressive Multi-Phase Training Strategy

Training lightweight networks requires a delicate balance between convergence speed and generalization. While learning rate schedules such as Linear Warmup and Cosine Decay are standard practices in deep learning, applying them directly in a single stage often fails to fully exploit the potential of compact backbones like PP-LCNet. To address this, we propose a structure-aware progressive training curriculum divided into three phases. The novelty lies in the staged integration of resolution scaling and parameter freezing, rather than the decay function itself. The necessity of this multi-phase design is empirically validated in our Ablation Study, where it demonstrates clear performance gains over standard single-stage training.
Due to the significant variation in data characteristics and model architectures across different tasks, identifying the optimal hyperparameter configuration is a non-trivial challenge. Researchers often rely on a combination of empirical knowledge, theoretical analysis, and extensive experimentation to iteratively adjust these parameters. Although this tuning process is time-consuming, it is an indispensable step toward enhancing model performance.
To further optimize the performance of the SLANet model, we employed a three-stage training strategy. This strategy progressively adjusts the training parameters in a staged manner, enabling stepwise improvement of learning outcomes, faster convergence, and enhanced overall performance of the model:
Stage One (Coarse Learning): The model is trained with a relatively high learning rate (0.001) to ensure rapid convergence. At this stage, the network acquires an initial understanding of data features, thereby accelerating the training process.
Stage Two (Fine-tuning): The model obtained from Stage One is used as a pretrained baseline. The learning rate is reduced to 0.0001, allowing for finer adjustment of parameters and preventing oscillations around the optimal solution caused by an excessively high learning rate.
Stage Three (High-Resolution Refinement): The best-performing model from Stage Two is reloaded as the pretrained model. The learning rate is kept constant, while the input resolution in data preprocessing is increased to 512, thereby enhancing the model’s representation capacity and further refining its performance. Training continues until convergence, yielding the final optimized SLANet model.
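The three stages can be summarized as the following schedule; this is an illustrative sketch whose field names are hypothetical (they do not correspond to PaddleOCR's actual configuration schema), with the learning rates and resolutions taken from the description above.

```python
# Illustrative schedule for the three-phase strategy of Section 3.4.
TRAINING_PHASES = [
    {"name": "coarse",          "learning_rate": 1e-3, "input_resolution": 488},
    {"name": "fine_tune",       "learning_rate": 1e-4, "input_resolution": 488},
    {"name": "high_res_refine", "learning_rate": 1e-4, "input_resolution": 512},
]

def run_curriculum(train_fn):
    """Chain the phases: each phase warm-starts from the previous best checkpoint."""
    ckpt = None  # Stage One trains from scratch
    for phase in TRAINING_PHASES:
        ckpt = train_fn(lr=phase["learning_rate"],
                        resolution=phase["input_resolution"],
                        init_from=ckpt)  # reload the best model from the prior stage
    return ckpt
```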

4. Experiments

4.1. Experimental Environment

The experiments were conducted on a workstation running Ubuntu 22.04 as the operating system. The hardware platform was equipped with an NVIDIA RTX 4090D GPU with 24 GB of memory, which provided sufficient computational capacity for large-scale deep learning tasks. The software environment was implemented in Python 3.8.20, with the deep learning framework PaddlePaddle 3.0.0rc1 serving as the core backend. In addition, PaddleOCR version 2.10.0 was employed to support optical character recognition tasks integrated within the experimental pipeline.

4.2. Evaluation Metrics

In this study, we adopted four evaluation metrics, Accuracy, Inference Speed, TEDS (Tree-Edit-Distance-based Similarity), and TEDS-Struct, to comprehensively evaluate the performance of the proposed model.
Accuracy directly reflects the recognition performance of the model. It is computed as the proportion of correctly predicted samples to the total number of samples. In the context of table structure recognition, a prediction is considered correct only if all tokens are predicted accurately; any error in token prediction renders the entire sample incorrect.
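This all-or-nothing criterion can be stated in a few lines of Python:

```python
def sequence_accuracy(pred_seqs, gt_seqs):
    """All-or-nothing accuracy: a sample counts as correct only if its
    entire token sequence matches the ground truth exactly."""
    correct = sum(1 for p, t in zip(pred_seqs, gt_seqs) if p == t)
    return correct / len(gt_seqs)

# One of the two samples below has a single wrong token, so accuracy is 0.5.
preds = [["<tr>", "<td>", "</td>", "</tr>"], ["<tr>", "<td>", "</td>"]]
gts   = [["<tr>", "<td>", "</td>", "</tr>"], ["<tr>", "<td>", "</tr>"]]
print(sequence_accuracy(preds, gts))  # 0.5
```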
For applications with stringent real-time requirements, such as mobile face recognition, fingerprint identification, or real-time translation, the inference speed of the model is of paramount importance. Previous studies often employed FLOPs or parameter counts as proxies for computational efficiency. However, in practical deployment, inference speed is the most critical metric, as it directly impacts response latency and user experience. In this study, inference speed was explicitly treated as a key evaluation criterion, since SLANet was originally designed to optimize inference efficiency on Intel CPU platforms. Accordingly, all reported inference speeds refer to experiments conducted on Intel CPUs with the oneDNN acceleration library enabled. oneDNN (formerly MKL-DNN) is an Intel deep learning acceleration library that leverages low-level instruction optimizations and computational graph enhancements to significantly improve both training and inference efficiency.
TEDS [2], introduced by Zhong et al. in 2019, is one of the most important metrics for evaluating table recognition performance. Its core idea is to represent tables as tree structures and measure the similarity between predicted and ground truth tables using tree edit distance, defined as the minimum number of edit operations (insertion, deletion, or substitution of nodes) required to transform one tree into another. By jointly considering structural similarity and textual accuracy, TEDS provides a holistic evaluation of table recognition. The formula is shown in Equation (7), where T p r e and T t r u e denote the predicted and ground truth table trees, respectively, TED(·) is the tree edit distance, and |T| is the number of nodes in the tree.
$\mathrm{TEDS} = 1 - \dfrac{\mathrm{TED}(T_{pre}, T_{true})}{\max(|T_{pre}|, |T_{true}|)}$ (7)
TEDS-Struct [41] was later proposed in LGPMA as a structural evaluation metric for table recognition. While similar to TEDS in calculation, TEDS-Struct disregards textual content and focuses exclusively on structural matching. Specifically, during the computation of tree edit distance, differences in content nodes are ignored, and no content-based edit distance is calculated in substitution operations. This makes TEDS-Struct a more precise measure of structural reasoning performance, as it isolates structural prediction quality from OCR accuracy.
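As a sketch of how Equation (7) is applied, assuming tree_edit_distance and tree_size helpers (the official PubTabNet evaluation builds them on the APTED tree-edit-distance algorithm over parsed HTML trees):

```python
def teds(t_pred, t_true, tree_edit_distance, tree_size):
    """TEDS per Equation (7): 1 - TED(T_pred, T_true) / max(|T_pred|, |T_true|).

    tree_edit_distance(a, b) and tree_size(t) are assumed helpers supplied
    by the caller; they are not defined here.
    """
    denom = max(tree_size(t_pred), tree_size(t_true))
    return 1.0 - tree_edit_distance(t_pred, t_true) / denom
```

TEDS-Struct is then obtained by stripping the textual content from cell nodes before computing the distance, so that only structural edits are counted.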

4.3. Dataset

The experiments in this study were conducted on the PubTabNet dataset [2], one of the largest publicly available datasets for table recognition. PubTabNet contains over 500,000 table images extracted from scientific literature, split into training, validation, and test sets. The dataset provides rich annotations for the training and validation subsets, including both cell-level positional information and HTML-based logical structures, which specify cell coordinates as well as complete table markup.
As illustrated in Figure 6, the dataset annotations adopt a JSON format, where the core node “HTML” contains two child nodes: “cells”, which provides the textual content and coordinates of all table cells, and “structure”, which encodes the table structure in HTML sequence form. This design allows for seamless support of end-to-end table recognition tasks, enabling direct conversion from table images to structured representations.
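The annotation layout described above can be read with a short loader such as the following sketch; the key names follow the public PubTabNet release, but should be verified against the specific version downloaded, as the schema may differ between releases.

```python
import json

def load_pubtabnet(jsonl_path, split="train"):
    """Yield (filename, structure_tokens, cells) from PubTabNet annotations.

    Assumes one JSON object per line with 'filename', 'split', and an
    'html' node holding 'structure' (token sequence) and 'cells'
    (per-cell text and coordinates), as described above.
    """
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            ann = json.loads(line)
            if ann.get("split") != split:
                continue
            html = ann["html"]
            tokens = html["structure"]["tokens"]   # HTML structure sequence
            cells = html["cells"]                  # per-cell content + coordinates
            yield ann["filename"], tokens, cells
```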
In our experiments, the training set was used for model training, while the validation set was employed for performance evaluation during development. Finally, the test set was used for inference, and the predicted results were visualized for qualitative analysis. Representative examples from the PubTabNet dataset are shown in Figure 7, illustrating its diversity and suitability for benchmarking table recognition models.

4.4. Experimental Results and Ablation Study

To empirically validate the performance of different activation functions and to evaluate their impact on SLANet, we conducted a series of comparative experiments. In these experiments, the network architecture, hyperparameter settings, and training strategies were kept strictly identical, with the activation function being the only variable replaced. This rigorous control ensured that any observed differences in performance could be attributed solely to the choice of activation function, thereby providing reliable evidence for selecting the optimal activation function for SLANet in terms of both accuracy and computational efficiency.
The results are summarized in Table 2, which presents the recognition accuracy and the average inference time per sample for each activation function. We report the mean ± standard deviation across ten runs and perform a paired t-test (N = 10) between Mish and H-Swish; the resulting p-value of 0.0057 provides strong statistical evidence that our method significantly outperforms the baseline.
The experimental findings reveal that the Mish activation function demonstrates a clear advantage in improving model accuracy. However, in terms of computational efficiency, Mish incurs an inference time increase of approximately 11% compared with H-Swish and about 5% compared with Swish. These results are consistent with the theoretical analysis of the nonlinear characteristics and computational complexity of the activation functions, indicating that Mish achieves accuracy gains at the cost of a modest and acceptable increase in computational overhead. Overall, the results confirm the effectiveness of incorporating Mish into SLANet as part of the model improvement strategy.
To further evaluate the performance of the improved SLANet model in the table structure recognition task, we compared it with several state-of-the-art algorithms on the PubTabNet dataset. The comparison was conducted across three key dimensions: recognition accuracy, inference time, and model size. The experimental results are presented in Table 3.
As shown in Table 3, the improved SLANet achieves an accuracy improvement of about 1.2 percentage points over the baseline SLANet while maintaining almost the same inference speed and model size, demonstrating that the optimization enhances recognition performance at negligible additional computational and memory cost. While TABLET achieves higher accuracy and TEDS scores due to its heavy Transformer-based architecture, it suffers from a significantly larger model size compared with our lightweight model. These results reinforce the paper's motivation: to provide a deployable solution that balances efficiency and accuracy, rather than pursuing absolute state-of-the-art performance at the cost of computational heaviness. The increase in accuracy to 77.25% satisfies our objective of enhancing feature representation, while the inference time of 774 ms confirms that the EOS strategy preserved the model's lightweight efficiency.
To provide finer-grained evaluation beyond accuracy and TEDS, we also calculated the token-level F1-score, which measures the precision and recall of the generated HTML tags. Our improved model achieves an F1-score of 98.12%, compared with 97.45% for the baseline, further validating the robustness of the structure prediction.
To verify the effectiveness of the proposed improvements, we conducted a series of progressive ablation experiments on the PubTabNet dataset. Using the baseline SLANet model as the starting point, we sequentially introduced the following modifications: (1) replacing the H-Swish activation function with Mish, (2) adding the EOS stopping strategy during decoding, and (3) applying the three-stage training strategy. The experimental results are summarized in Table 4. Here, Mish indicates the use of the Mish activation function, EOS denotes the adoption of the EOS-based decoding termination strategy, and T_Stage refers to the application of the three-stage training strategy.
As shown in Table 4, each optimization strategy contributes positively to model performance in different aspects, thereby confirming their effectiveness. Specifically, replacing H-Swish with the Mish activation function significantly improves recognition accuracy. However, due to the higher computational complexity of Mish, inference speed decreases compared to the baseline. The introduction of the EOS stopping strategy mitigates unnecessary decoding steps, effectively accelerating inference and compensating for the slowdown introduced by Mish. Finally, the three-stage training strategy further boosts recognition accuracy by better exploiting the model’s learning capacity. Although this strategy slightly increases inference time, mainly because the input resolution was raised from 488 to 512, thereby increasing computational load, the trade-off is acceptable given the corresponding improvement in accuracy.

4.5. Qualitative Results

To further evaluate and visually demonstrate the effectiveness of the proposed model in table structure recognition, we applied the model trained on PubTabNet to the PubTabNet test set. Figure 8 presents an example of the inference results, including the predicted HTML structure sequence and the coordinates of table cells. Each cell is represented by its diagonal vertex coordinates, defined by the top-left and bottom-right corners of the cell region.
Based on the inference output, the predicted HTML structure sequence was rendered as an HTML table, while the predicted cell coordinates were overlaid as bounding boxes on the table image to more clearly illustrate the recognition performance. As shown in Figure 9, which displays the original table image alongside the visualization of the predicted results, the improved SLANet model accurately reconstructed the table structure.
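The overlay step itself is straightforward; below is a minimal Pillow sketch, assuming boxes are given as (x1, y1, x2, y2) diagonal vertices as described above, with illustrative function and argument names.

```python
from PIL import Image, ImageDraw

def draw_cell_boxes(image_path, boxes, out_path, color=(255, 0, 0)):
    """Overlay predicted cell boxes, given as (x1, y1, x2, y2) diagonal
    vertices (top-left and bottom-right corners), on the table image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for x1, y1, x2, y2 in boxes:
        draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
    img.save(out_path)

# e.g., draw_cell_boxes("table.png", [(10, 12, 120, 40)], "table_vis.png")
```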

4.6. Limitation and Discussion

Regarding the improved model, although it achieved promising results on the PubTabNet dataset, this dataset exhibits distinctive characteristics, as most tables are borderless, contain few lines, and are predominantly in English. These features indicate that the model’s performance on other types of datasets remains to be validated, highlighting a strong dependency on the data. While numerous mature table datasets, such as FinTabNet, currently exist, collecting annotated tables for specific domains still incurs high costs and may introduce noise that adversely affects model performance. Future research could explore weakly supervised learning approaches to reduce reliance on fine-grained annotations and enhance the model’s generalization capability.
Compared with state-of-the-art models in the field, the improved model still shows some performance gaps, particularly in terms of recognition accuracy. Future work could investigate the integration of Graph Neural Networks (GNNs) or Transformer-based architectures to further strengthen feature extraction and decoding capabilities, thereby boosting overall model performance. Large Language Models with vision extensions (e.g., GPT-4V, LLaMA) show potential for table structure reasoning, but their high computational cost and difficulty in deterministic sequence generation make them less suitable for lightweight deployment scenarios. Evaluating this trade-off is an important future direction.
With respect to the constructed table recognition algorithm, it is currently capable of recognizing conventional tables containing only textual content. However, in real-world scenarios, tables often include diverse data modalities such as text, images, and formulas. Therefore, future research should focus on multi-modal table recognition, aiming to enhance the algorithm’s practical applicability and contribute to more precise analysis and efficient management supported by information technology.

5. Conclusions

We presented an improved version of SLANet that enhances its accuracy and inference efficiency while maintaining lightweight deployment feasibility. Our contributions include introducing the Mish activation function, enabling EOS-based early termination, and designing a progressive multi-phase training curriculum. These strategies not only preserved the model’s high inference efficiency but also enhanced recognition accuracy. Experiments on PubTabNet verify consistent improvements over the baseline across accuracy and structural similarity metrics. While the model still trails larger transformer-based methods in absolute performance, our approach achieves a favorable balance between effectiveness and computational cost. Future work will expand datasets, incorporate transformer-based modules, and explore multi-modal table understanding tasks.

Author Contributions

Conceptualization, L.M.; methodology, L.M. and K.D.; software, Y.X. and K.D.; validation, Y.X., X.X. and J.S.; formal analysis, L.M. and K.D.; investigation, Y.X. and K.D.; resources, K.D. and J.S.; data curation, Y.X. and K.D.; writing—original draft preparation, L.M., Y.X. and K.D.; writing—review and editing, L.M., X.X.; visualization, K.D., J.S.; supervision, X.X.; project administration, X.X.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 62362023 and Hainan Province Science and Technology Special Fund (Grant Nos. ZDYF2025GXJS184 and ZDYF2024GXJS313).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kasem, M.S.; Mahmoud, M.; Yagoub, B.; Senussi, M.F.; Abdalla, M.; Kang, H.S. HTTD: A Hierarchical Transformer for Accurate Table Detection in Document Images. Mathematics 2025, 13, 266. [Google Scholar] [CrossRef]
  2. Zhong, X.; ShafieiBavani, E.; Jimeno Yepes, A. Image-based table recognition: Data, model, and evaluation. In Proceedings of the European Conference on Computer Vision 2020, Glasgow, UK, 23–28 August 2020; pp. 564–580. [Google Scholar]
  3. Fernandes, J.; Xiao, B.; Simsek, M.; Kantarci, B.; Khan, S.; Alkheir, A.A. TableStrRec: Framework for table structure recognition in data sheet images. Int. J. Doc. Anal. Recognit. 2024, 27, 127–145. [Google Scholar] [CrossRef]
  4. Kasem, M.; Abdallah, A.; Berendeyev, A.; Elkady, E.; Mahmoud, M.; Abdalla, M.; Taj-Eddin, I. Deep Learning for Table Detection and Structure Recognition: A Survey. ACM Comput. Surv. 2024, 56, 1–41. [Google Scholar] [CrossRef]
  5. Hou, Q.; Wang, J. TABLET: Table Structure Recognition using Encoder-only Transformers. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR) 2025, Wuhan, China, 16–21 September 2025. [Google Scholar]
  6. Li, C.; Guo, R.; Zhou, J.; An, M.; Du, Y.; Zhu, L.; Liu, Y.; Hu, X.; Yu, D. Pp-structurev2: A stronger document analysis system. arXiv 2022, arXiv:2210.05391. [Google Scholar]
  7. Itonori, K. Table structure recognition based on textblock arrangement and ruled line position. In Proceedings of the 2nd International Conference on Document Analysis and Recognition, Tsukuba, Japan, 20–22 October 1993; pp. 765–768. [Google Scholar]
  8. Rahgozar, M.A.; Fan, Z.; Rainero, E.V. Tabular document recognition. In Document Recognition; SPIE: Bellingham, WA, USA, 1994; Volume 2181, pp. 87–96. [Google Scholar]
  9. Hirayama, Y. A method for table structure analysis using DP matching. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 2, pp. 583–586. [Google Scholar]
  10. Zuyev, K. Table image segmentation. In Proceedings of the Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, 18–20 August 1997; Volume 2, pp. 705–708. [Google Scholar]
  11. Kieninger, T.G. Table structure recognition based on robust block segmentation. In Document Recognition V; SPIE: Bellingham, WA, USA, 1998; Volume 3305, pp. 22–32. [Google Scholar]
  12. Kieninger, T.; Dengel, A. Applying the T-RECS table recognition system to the business letter domain. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, WA, USA, 10–13 September 2001; pp. 518–522. [Google Scholar]
  13. Wang, Y.; Phillips, I.T.; Haralick, R.M. Table structure understanding and its performance evaluation. Pattern Recognit. 2004, 37, 1479–1497. [Google Scholar] [CrossRef]
  14. Ishitani, Y.; Fume, K.; Sumita, K. Table structure analysis based on cell classification and cell modification for xml document transformation. In Proceedings of the Eighth International Conference on Document Analysis and Recognition, Seoul, Republic of Korea, 31 August–1 September 2005; pp. 1247–1252. [Google Scholar]
  15. Wang, R.-F.; Qu, H.-R.; Su, W.-H. From sensors to insights: Technological trends in image-based high-throughput plant phenotyping. Smart Agric. Technol. 2025, 12, 101257. [Google Scholar] [CrossRef]
  16. Wang, R.-F.; Qin, Y.-M.; Zhao, Y.-Y.; Xu, M.; Schardong, I.B.; Cui, K. RA-CottNet: A Real-Time High-Precision Deep Learning Model for Cotton Boll and Flower Recognition. AI 2025, 6, 235. [Google Scholar] [CrossRef]
  17. Shan, Z.; Liu, Y.; Zhou, L.; Yan, C.; Wang, H.; Xie, X. ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  18. Zhou, L.; Liu, Y.; Bai, X.; Li, N.; Yu, X.; Zhou, J.; Hancock, E.R. Attribute subspaces for zero-shot learning. Pattern Recognit. 2023, 144, 109869. [Google Scholar] [CrossRef]
  19. Liu, Y.; Zhou, L.; Bai, X.; Huang, Y.; Gu, L.; Zhou, J.; Harada, T. Goal-Oriented Gaze Estimation for Zero-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3794–3803. [Google Scholar]
  20. Tupaj, S.; Shi, Z.; Chang, C.H.; Alam, H. Extracting Tabular Information from Text Files; EECS Department, Tufts University: Medford, MA, USA, 1996. [Google Scholar]
  21. Li, M.; Cui, L.; Huang, S.; Wei, F.; Zhou, M.; Li, Z. Tablebank: Table benchmark for image-based table detection and recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 1918–1925. [Google Scholar]
  22. Long, R.; Wang, W.; Xue, N.; Gao, F.; Yang, Z.; Wang, Y.; Xia, G.S. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 944–952. [Google Scholar]
  23. Hashmi, K.A.; Liwicki, M.; Stricker, D.; Afzal, M.A.; Afzal, M.A.; Afzal, M.Z. Current status and performance analysis of table recognition in document images with deep neural networks. IEEE Access 2021, 9, 87663–87685. [Google Scholar] [CrossRef]
  24. Tensmeyer, C.; Morariu, V.I.; Price, B.; Cohen, S.; Martinez, T. Deep splitting and merging for table structure decomposition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 114–121. [Google Scholar]
  25. Schreiber, S.; Agne, S.; Wolf, I.; Dengel, A.; Ahmed, S. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, Japan, 9–12 November 2017; Volume 1, pp. 1162–1167. [Google Scholar]
  26. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  27. Siddiqui, S.A.; Fateh, I.A.; Rizvi, S.T.R.; Dengel, A.; Ahmed, S. Deeptabstr: Deep learning based table structure recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 1403–1409. [Google Scholar]
  28. Paliwal, S.S.; Vishwanath, D.; Rahul, R.; Sharma, M.; Vig, L. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 128–133. [Google Scholar]
  29. Khan, S.A.; Khalid, S.M.D.; Shahzad, M.A.; Shafait, F. Table structure extraction with bi-directional gated recurrent unit networks. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 1366–1371. [Google Scholar]
  30. Xue, W.; Li, Q.; Tao, D. Res2tim: Reconstruct syntactic structures from table images. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 749–755. [Google Scholar]
  31. Xue, W.; Yu, B.; Wang, W.; Tao, D.; Li, Q. Tgrnet: A table graph reconstruction network for table structure recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1295–1304. [Google Scholar]
  32. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  33. Xing, H.; Gao, F.; Long, R.; Bu, J.; Zheng, Q.; Li, L.; Yao, C.; Yu, Z. LORE: Logical location regression network for table structure recognition. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2992–3000. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  35. Li, Y.; Huang, Z.; Yan, J.; Zhou, Y.; Ye, F.; Liu, X. GFTE: Graph-based financial table extraction. In Proceedings of the International Conference on Pattern Recognition, Virtual Event, 10–15 January 2021; pp. 644–658. [Google Scholar]
  36. Deng, Y.; Rosenberg, D.; Mann, G. Challenges in end-to-end neural scientific table recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 894–901. [Google Scholar]
  37. Deng, Y.; Kanervisto, A.; Ling, J.; Rush, A.M. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 980–989. [Google Scholar]
  38. Ye, J.; Qi, X.; He, Y.; Chen, Y.; Gu, D.; Gao, P.; Xiao, R. Pingan-vcgroup’s solution for icdar 2021 competition on scientific literature parsing task b: Table recognition to html. arXiv 2021, arXiv:2105.01848. [Google Scholar]
  39. Lu, N.; Yu, W.; Qi, X.; Chen, Y.; Gong, P.; Xiao, R.; Bai, X. Master: Multi-aspect non-local network for scene text recognition. Pattern Recognit. 2021, 117, 107980. [Google Scholar] [CrossRef]
  40. PaddlePaddle. PaddleOCR: PP-Structure Table. Available online: https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.5/ppstructure/table (accessed on 9 May 2025).
  41. Qiao, L.; Li, Z.; Cheng, Z.; Zhang, P.; Pu, S.; Niu, Y.; Ren, W.; Tan, W.; Wu, F. Lgpma: Complicated table structure recognition with local and global pyramid mask alignment. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; pp. 99–114. [Google Scholar]
  42. Cui, C.; Gao, T.; Wei, S.; Du, Y.; Guo, R.; Dong, S.; Lu, B.; Zhou, Y.; Lv, X.; Liu, Q.; et al. PP-LCNet: A lightweight CPU convolutional neural network. arXiv 2021, arXiv:2109.15099. [Google Scholar] [CrossRef]
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
Figure 1. Architecture of SLANet.
Figure 2. Comparison of PP-LCNet and other advanced models.
Figure 3. Architecture of PP-LCNet.
Figure 4. Comparison between Swish and H-Swish.
Figure 5. Comparison between Mish and Swish.
Figure 6. Dataset annotation examples.
Figure 7. PubTabNet dataset examples.
Figure 8. Example of table structure recognition inference results.
Figure 9. Visualization of the predicted results of the improved SLANet.
Table 1. Summary of limitations in existing table structure recognition approaches.
Approach | Representative Methods | Main Advantages | Key Limitations
Heuristic-based | T-Recs [11], Wang et al. [13] | No training data needed; interpretable. | Lacks robustness; fails on complex/irregular layouts; requires handcrafted rules.
Detection/Segmentation | DeepDeSRT [25], TableNet [28] | Visual intuition; good for row/column extraction. | Post-processing is complex; struggles with spanning cells and empty cells.
Logical Location | LORE [33], TGRNet [31] | Direct coordinate regression. | High computational cost for graph construction; difficulty fusing multimodal features.
Image-to-Sequence | SLANet [6], TABLET [5] | End-to-end; outputs editable code (HTML). | Heavy models (e.g., TABLET) are slow; light models (e.g., SLANet) lack accuracy.
Table 2. Comparative results of different activation functions.
Activation Function | Accuracy (%) | Inference Time (ms)
H-Swish | 75.99 ± 0.49 | 766
Swish | 76.52 ± 0.53 | 810
Mish | 77.04 ± 0.69 | 850
Table 3. Performance comparison of advanced algorithms on the PubTabNet dataset.
Algorithm | Accuracy (%) | Inference Time (ms) | TEDS (%) | TEDS-Struct (%) | Model Size (M)
LGPMA | 65.74 | 1654 | 94.70 | 96.70 | 177
TableRec-RARE | 71.73 | 779 | 93.88 | 95.96 | 6.8
SLANet | 75.99 | 766 | 95.74 | 97.01 | 9.2
TABLET | 89.22 | 1521 | 96.79 | 97.67 | 45.5
Ours | 77.25 | 774 | 96.67 | 97.83 | 9.6
Table 4. Results of the ablation study on the PubTabNet dataset.
Strategy | Accuracy (%) | Inference Time (ms) | Model Size (M)
SLANet | 75.99 | 766 | 9.2
+Mish | 76.98 | 850 | 9.6
+EOS | 77.01 | 745 | 9.6
+T_Stage | 77.25 | 774 | 9.6