VisFormers: Combining Vision and Transformers for Enhanced Complex Document Classification

Abstract: Complex documents contain text, figures, tables, and other elements. Classifying scanned copies of different categories of complex documents, such as memos, newspapers, and letters, is essential for rapid digitization. However, this task is challenging because most scanned complex documents look similar: they share similar page and letter colors, similar paper textures, and very few contrasting features. Several attempts have been made in the state of the art to classify complex documents; however, only a few of these works have addressed the classification of complex documents with similar features, and among these, the performance leaves room for improvement. To overcome this, this paper presents a method that uses an optical character reader to extract the text. It proposes a multi-headed model that combines vision-based transfer learning and natural-language-based Transformers within the same network, allowing simultaneous training with different inputs and different optimizers in specific parts of the network. A subset of the Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset containing 16 different document classes was used to evaluate performance. The proposed multi-headed VisFormers network classified the documents with up to 94.2% accuracy, while a regular natural-language-processing-based Transformer network achieved 83%, and vision-based VGG19 transfer learning achieved only up to 90% accuracy. Deploying the model can help sort scanned copies of various documents into different categories.


Introduction
Document classification is a vital process in the field of information management and retrieval [1]. It involves categorizing documents into predefined classes or categories based on their content, enabling the efficient organization, search, and analysis of vast amounts of textual data. The digital revolution and the explosion of online material have significantly increased the importance of this process in recent years [2]. Document classification is a potent technique for streamlining information access, as information overload is a persistent problem. The urgency of this work is evident as the volume of digital information continues to explode. It has been reported that over 2.5 quintillion bytes of data are generated every day, and by 2025, it is estimated that global data will reach 163 zettabytes [3]. Most of these data lack organization, making it hard to determine what is essential. Our work attempts to sort and organize these data, facilitating better and quicker decision making using modern technologies.
Complex documents contain a mix of text, images, labels, tables, and various other elements, which presents a formidable challenge in document classification. These documents can include memos, newspapers, letters, and more, and effectively categorizing them is critical for the rapid digitization of diverse content [4]. The difficulty arises from the fact that these scanned images look very similar, as they share standard page colors, text characteristics, and paper textures. Having so few contrasting characteristics to classify on further increases the challenge.
Optical character recognition (OCR) and natural language processing (NLP) are two transformative technologies that have revolutionized the way we interact with and extract information from textual content [5]. OCR is a technology that permits the conversion of printed or handwritten text into machine-encoded text, as its name suggests. It scans text-containing physical documents or photos, analyzes the data, and translates the characters into a digital format that computers can understand [6]. On the other hand, NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language. It empowers computers to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant [7]. In addition to facilitating human-computer interaction, NLP is essential for data analysis because it enables computers to interpret and extract information from massive amounts of unstructured text data.
Transfer vision is the term used to describe a machine learning model's capacity to transfer knowledge from one task or domain to another, notably in computer vision [8]. This idea is crucial to increasing the adaptability and versatility of AI systems. Transfer vision uses previously trained models and their learned features to effectively complete new, related tasks [9]. For instance, a model initially trained to identify objects in photographs can be fine-tuned to identify particular species of animals in wildlife images. Compared to training a model from scratch for every new task, this technique saves time and computational resources [10]. A wide range of applications, such as driverless vehicles, medical imaging, and security monitoring, are significantly impacted by transfer vision.
In the realm of enhanced complex document classification, several powerful algorithms play pivotal roles, each exhibiting distinct characteristics. InceptionV3 (IncV3) employs a sophisticated architecture with multiple parallel pathways, enabling it to capture intricate details in diverse document structures. VGG16, renowned for its simplicity and effectiveness, utilizes a deep convolutional network to discern hierarchical features, making it adept at recognizing complex document patterns. ResNet50, known for its deep residual learning, excels in handling intricate relationships between document elements, enhancing accuracy [11]. MobileNet, designed for efficiency on mobile devices, balances accuracy and speed in document classification tasks. Lastly, the Transformer model, renowned for its attention mechanism, captures long-range dependencies within documents, offering a unique approach to understanding contextual relationships [12]. Collectively, these algorithms contribute to the advancement of enhanced complex document classification by addressing various challenges and nuances inherent in diverse document structures.
Commercially, advanced technologies such as intelligent recognition framework (IRF), enterprise resource planning (ERP), and customer relationship management (CRM) systems play a crucial role in managing document circulation and optimizing workflow efficiency. These systems analyze and process various types of data, including scanned documents, to create organized databases. While ERP and CRM systems excel in managing structured data and streamlining workflow processes, they often face limitations when handling unstructured data, such as printed complex documents. These systems may struggle to accurately classify and process documents containing handwritten annotations, diverse elements like figures and tables, or varying layouts [13]. According to various reports, organizations manage an average of 300,000 documents per employee per year, and knowledge workers spend about 20% of their time searching for information across different documents and systems [14]. This large number of documents, often in printed form and possibly with handwritten notes, can make it difficult for advanced systems to understand them fully [15]. As a result, document circulation and information management efficiency may be compromised, leading to delays and errors in workflow operations. The need for a specialized document classification model becomes apparent given the prevalence of complex printed documents in various industries [16].
The state-of-the-art works perform complex document classification, but a major challenge is that all documents have similar color gradients and textures [17]. This challenge limits the performance of the state-of-the-art models. Furthermore, only a few attempts have been made to use OCR for this purpose [18], and these also need better classification performance, given that many documents contain similar kinds of written text. Likewise, the document classes can be difficult to separate based on their spatial information alone. Therefore, a challenge exists: "Can we develop a highly efficient model that can process both vision and language properties of the complex document simultaneously to provide faster and more accurate classification than the state of the art?". This paper attempts to answer this question, and the primary contributions of the work are:

• To develop a fast and efficient complex document classification model.
• To combine computer vision and natural language processing networks in a single model for complex document classification.
• To facilitate different types of inputs and enable different optimizers in different parts of the network during training.
• To benchmark the performance against the state-of-the-art methods using the standard complex document classification dataset RVL-CDIP.
The paper's structure is as follows: In Section 2, recent efforts in document classification and related works are summarized. Section 3 elaborates on the steps taken in experimentation. Section 4 presents results for the proposed vision Transformer. Section 5 focuses on results obtained from individual vision and NLP methods. Finally, Section 6 summarizes the key findings and concludes the research.

Related Works
Many researchers have worked on classifying documents using convolutional neural networks (CNNs) and transfer learning [19]. The study by Abdullah et al. [20] addresses Arabic document classification using machine learning. A CNN with a character-level model attained the greatest accuracy of 98% across the authors' experiments with various algorithms, outperforming earlier techniques and proving its usefulness, particularly in social media environments. Shuo et al. [21] address the growing need for efficient technical document classification in technology organizations by highlighting the availability of multimodal information in various papers. In their study, they provide TechDoc, a multimodal deep-learning architecture that blends relationships between text, images, and documents. This strategy provides scalable solutions for document management in tech firms, outperforming unimodal systems and current benchmarks.
Various other researchers have focused on organizing documents through optical character recognition (OCR). Srinivasa Rao et al. [22] address the need for more progress in Telugu OCR systems, despite the widespread use of the language in India. They introduce the ITP-STTR model, leveraging deep learning for improved Telugu text recognition. With challenging character combinations and a dearth of recent OCR advancements, their research fills a critical gap in Telugu OCR development, yielding superior results. In addressing the urgent problem of road safety and car theft in India, Anuj et al. [23] strongly emphasize the significance of automatic number plate detection (ANPD). They use TensorFlow for training and testing on a variety of datasets and detect plates with an accuracy of 85%.
Certain researchers emphasize the utilization of natural language processing (NLP) and Transformers for document classification. Irfan et al. [24] address employers' challenges in selecting suitable job applicants from a large pool of resumes. Their study introduces an NLP and machine learning (ML)-based resume classification system (RCS) to automate and expedite the categorization process. Multiple ML algorithms and NLP techniques are evaluated, with support vector machine (SVM) classifiers achieving over 96% accuracy in multi-class resume classification. Aroush et al. [25] use deep learning and pre-trained language models to handle the complex and time-consuming task of patent classification. Using datasets like USPTO-2M and M-patent, their work investigates the fine-tuning of models like BERT, XLNet, RoBERTa, and ELECTRA to achieve cutting-edge performance on multi-label patent categorization. Iqra et al. [26] address the task of multi-label emotion classification in short social media posts. Using transfer learning, they investigate the utilization of single- and multiple-attention mechanisms in LSTM and Transformer networks (such as XLNet, DistilBERT, and RoBERTa). Their technique exceeds current benchmarks, with the RoBERTa-MA model, in particular, obtaining 62.4% accuracy on the English SemEval-2018 E-c dataset.
Another group of studies delves into the notable progressions in Transformer-based models for natural language processing (NLP) and document classification. Yang et al. [27] propose a novel model, D2BFormer, utilizing the Vision Transformer framework to address the critical task of degraded document binarization. This end-to-end trainable model introduces a dual-branched encoding feature fusion module, effectively combining components from both Vision Transformer and deep convolutional neural networks, leading to improved binarization quality and reduced computational complexity. Rahali et al. [28] review Transformer-based (TB) models in NLP, emphasizing their expressiveness through self-attention mechanisms. The paper categorizes TB models, compares their architectures, and discusses limitations, offering insights to boost innovation in NLP applications and AI-powered products. Pilicita et al. [29] investigate the utility of five BERT-based pre-trained models in classifying mobile educational applications. Leveraging a dataset enriched with descriptions and categories from the Google Play Store, the study demonstrates the effectiveness of these models, achieving accuracy rates ranging from 76% to 81%.
There are various models designed for tasks related to documents, demonstrating how Transformer architectures can be used effectively in various areas. Nasi et al. [30] present PharmKE, a knowledge extraction platform for pharmaceutical texts. Leveraging transfer learning, the platform achieves a 96% F1-score in named entity recognition tasks, outperforming fine-tuned BERT and BioBERT models. The open-source modular architecture promotes reproducibility, accessibility, and potential integration into mobile systems, empowering patients with relevant medication information. In the study by Meshrif et al. [31], the authors address the challenge of classifying Arabic tweets by developing ARABERT4TWC, a BERT-based text classification model. The paper highlights the model's effectiveness in achieving high classification results across various datasets, outperforming other deep learning and conventional techniques. The proposed model showcases the potential of transfer learning with BERT for automating the classification of Arabic tweets. Tang et al. [32] introduce Universal Document Processing (UDOP), a document AI model that unifies text, image, and layout modalities, achieving high-quality document editing and content customization. Leveraging a vision-text-layout Transformer, UDOP sets the state of the art in eight document AI tasks, showcasing its versatility in the field.
Table 1 highlights several significant contributions that have been made in the state of the art across diverse domains. The various contributions include Arabic text classification, where machine learning models, particularly CNNs with character-level models, exhibit superior performance [20]. Nonetheless, this research could benefit from a more comprehensive discussion of dataset size and scope. In the realm of tech document classification, the TechDoc architecture has proven its efficiency, enhancing categorization and scalability, particularly for tech companies [21]. However, its domain-specific focus limits broader applicability. Similarly, digitizing handwritten Devanagari text with CNN-based DHTR models preserves ancient knowledge but is confined to the Devanagari script [22]. Indian number plate detection using CNN-based models shows promise in enhancing road safety and theft prevention but necessitates scalability and diverse datasets [23]. Resume classification with NLP and ML algorithms improves accuracy and automation but requires further exploration of scalability and broader applications [24]. Patent document classification, enhanced by transfer learning, shows improved performance [25], yet it remains confined to patent documents and classification tasks. Lastly, digitizing handwritten Devanagari script effectively preserves it, but broader applications warrant further study [33]. Therefore, a knowledge gap exists throughout the literature for creating an optimized, accurate, combined vision and Transformer model for classifying scanned copies of various documents into different categories.

Methodology
To carry out this experiment, we followed the process outlined in Figure 1. This process involved combining two main steps. In the first step, we classified the document images using transfer learning for vision tasks. In the second step, we used OCR (text extraction) and Transformer models to classify the text. After that, we combined vision and OCR to create a strong network for classification.

Data Collection
The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset [34] is a substantial collection of grayscale images comprising 400,000 samples, organized into 16 distinct document categories. The categories (letter, form, email, handwritten, advertisement, scientific report, scientific publication, specification, file folder, news article, budget, invoice, presentation, questionnaire, resume, and memo) are numerically labeled from 0 to 15. The dataset is partitioned into subsets, with 320,000 images allocated for training, 40,000 for validation, and an additional 40,000 for testing. Notably, the images have been resized so that their largest dimension does not exceed 1000 pixels, ensuring uniformity in data representation. This dataset provides a robust foundation for developing and evaluating document classification and information processing models.
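The label assignment and split sizes described above can be written down as a small lookup structure. The following is a minimal sketch in Python; the constant and function names (`RVL_CDIP_CLASSES`, `LABEL_TO_ID`, `split_fractions`) are illustrative and do not come from the dataset's distribution files.

```python
# Class names in the order listed in the text; IDs 0-15 follow that order.
RVL_CDIP_CLASSES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]
LABEL_TO_ID = {name: idx for idx, name in enumerate(RVL_CDIP_CLASSES)}

# Split sizes quoted in the text.
SPLITS = {"train": 320_000, "validation": 40_000, "test": 40_000}


def split_fractions(splits):
    """Return the share of the full dataset held by each split."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}
```

With these numbers, the split is 80% training, 10% validation, and 10% testing.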

Optical Character Recognition
In the context of our proposed model, optical character recognition (OCR) is applied to all distinct documents. Preprocessing first performs noise reduction, contrast enhancement, and resizing to ensure optimal recognition accuracy. Following preprocessing, a deep learning model identifies characters, words, the spatial position of the characters, and even complex layouts within the image [35]. Once the text is extracted from the image, postprocessing techniques such as spell-checking and formatting correction are applied to enhance the accuracy of the recognized text. The final machine-readable text is generated, which is later converted to vectors.
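As an illustration of the formatting-correction part of postprocessing, the sketch below cleans raw OCR output using only the Python standard library. The function name `clean_ocr_text` and the specific rules are assumptions made for this example; the paper does not specify its postprocessing rules, and spell-checking is not shown.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Normalize raw OCR output before tokenization (illustrative rules only)."""
    # Rejoin words hyphenated across a line break: "classifica-\ntion" -> "classification".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Drop control characters such as form feeds (tab and newline are kept here).
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

One caveat of the hyphen rule is that it also joins legitimately hyphenated compounds that happen to break at a line end, so a real pipeline would probably check the joined word against a dictionary.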

Text Preprocessing
Text preprocessing is a vital step in natural language processing tasks, enhancing the quality and efficiency of text analysis. Techniques like stemming and lemmatization are initially applied to the text, standardizing words to their root forms and reducing complexity [36]. A vocabulary size of 10,000 words was used for text processing. Additionally, common stopwords are removed to focus on meaningful content. This list can contain 200 common words. The processed text is then tokenized, breaking it into individual words or tokens. An average token count per document of 300 was used. To prepare the text for machine learning, it is converted into numerical form using count vectorization, where each word is represented as a vector with its frequency [37]. The Adam optimizer was used to update the weights of this portion of the network.
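The stopword removal, tokenization, and count vectorization steps can be sketched with the Python standard library as follows. This is a simplified illustration and not the authors' implementation: stemming and lemmatization are omitted, the stopword set is a tiny stand-in for the roughly 200-word list mentioned above, and all function names are hypothetical.

```python
from collections import Counter
import re

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text, split it on non-letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


def build_vocabulary(documents, max_size=10_000):
    """Map the max_size most frequent corpus tokens to vector indices."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(max_size))}


def count_vectorize(text, vocabulary):
    """Return a vector whose i-th entry is the frequency of vocabulary word i."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:  # out-of-vocabulary tokens are ignored
            vector[idx] += 1
    return vector
```

Capping the vocabulary keeps the vector length fixed regardless of how many distinct words OCR produces, which matters because OCR errors tend to create many rare, misspelled tokens.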

Image Preprocessing
All distinct document images are initially standardized to a fixed size, such as 224 × 224 pixels, through resizing [38]. Following this, various image augmentation techniques are applied, including altering parameters like rotation, zoom, and brightness to create multiple versions of the same image, which helps reduce overfitting and improve model generalization [39]. Once the images have been standardized and augmented, they are converted into numerical vectors. For image classification, a typical neural network architecture includes an input layer with an image size of 224 × 224 pixels, convolutional layers with 32 to 256 filters, max-pooling layers, fully connected layers with 128 to 1024 nodes, dropout layers with a dropout rate of 0.2 to 0.5, ReLU activation functions, and an output layer of 16 nodes with a softmax activation function that yields the probabilities of the 16 distinct classes [40]. The RMSprop optimizer was used to update network weights during training.
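To make the resizing and augmentation steps concrete, the sketch below operates on a grayscale image stored as a list of pixel rows. It is a dependency-free illustration under simplifying assumptions: nearest-neighbour interpolation is used for resizing, only brightness augmentation is shown (rotation and zoom are omitted), and the function names are hypothetical. In practice an image library would be used instead.

```python
def resize_nearest(image, out_h=224, out_w=224):
    """Nearest-neighbour resize of a 2-D grayscale image (list of rows)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def adjust_brightness(image, factor):
    """Scale pixel intensities and clip them to the valid 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def to_unit_range(image):
    """Convert 0-255 intensities to floats in [0, 1] for the network input."""
    return [[p / 255.0 for p in row] for row in image]
```

Applying `adjust_brightness` with several factors to the same resized image yields multiple training variants of one document, which is the purpose of augmentation described above.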

Proposed Multi-Headed Vision-Transformer Network
The proposed VisFormers model is created by combining the vision and Transformer networks. The network is summarized in Figure 2. The preprocessed vectorized language data are used as the input to the Transformer model with up to 200 features per sample. The output of this network is a 16-node vector, with each node corresponding to a particular category of document. The vision network, on the other hand, uses the pre-trained VGG-19 model with ImageNet weights. This ensures that the spatial and visual features of the documents are also utilized while training the model. The VGG-19 was intended to create a reduced feature map that can complement the Transformer head's decision making. The parameters of this network were not updated during training, to retain the transfer learning properties. Images of 224 × 224 pixels were used as the input to the network, whose output was then flattened and passed through a dense layer of 64 nodes [38]. The target of the vision network was a reshaped and resized version of the same input. The vectors obtained from both the vision and the Transformer's output layers were then concatenated and passed through a series of layers until the final output layer of 16 nodes was reached. The Adam optimizer with categorical cross-entropy loss was used to train both the tail of the network and the Transformer part of the network. The tail of the vision part of the network was optimized with a root mean squared propagation (RMSprop) optimizer using mean squared error loss [41]. The proposed network, therefore, has a total of three objective functions. The first one is the Transformer head, which is used to process the textual information; the second one is the VGG-19 head, which is used to process the visual information; and the third one is the tail of the network, which combines the information received from the Transformer and VGG-19 heads.
Let the text vector be given by the tensor
$$\sum_i a_i \vec{x}_i.$$
Here, $a_i$ indicates the intensity of the vector, and $\vec{x}_i$ represents the $i$th component. According to Einstein's summation notation, this can be written as
$$a_i \vec{x}_i.$$
This is fed into the input layer of a neural network $n_1$ given by
$$n_1 = \sigma_1\left(w \, a_i \vec{x}_i\right).$$
Here, $\sigma_1$ indicates the rectified linear unit activation function, and $w$ represents the weight matrix. Following this, a Transformer block is introduced. Let $\tau_1$ be the target variable for the text network, and $V$ represent the values in the word vector.
From this, global average pooling was used to extract the necessary information. For a feature map of spatial size $H \times W$, this is given by
$$\frac{1}{HW} \sum_{x=1}^{H} \sum_{y=1}^{W} f_k(x, y),$$
where $f_k(x, y)$ denotes the feature map, and $k$ represents the depth of the map. Let $n_1$ be the neural network updated according to these steps and $d$ represent a dropout layer. The neural network is then updated by applying $d$ to $n_1$. This part of the neural network is optimized by the Adam optimizer. On the other hand, the image vector can be represented by a tensor in which $a$ represents the intensity of the pixel ranging from 0 to 255, $\vec{x}_m$ represents the position of the pixel, and $i$ indicates the three pixel planes, namely red, green, and blue.
Let $n_2$ represent the structure of the VGG-19 network [42]. This part of the network is optimized with RMSprop. The outputs of $n_1$ and $n_2$ are then concatenated, followed by a dense and a dropout layer. This completes our objective function. This part of the network is updated using RMSprop with the mean squared error loss function.
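The idea of updating different parts of one network with different optimizers can be illustrated on a toy problem. The sketch below implements the standard Adam and RMSprop update rules for a single scalar weight and applies each to its own "part" of a model. The two quadratic losses, the initial weights, and the learning rates are assumptions chosen only to make the example runnable; they are unrelated to the losses and hyperparameters used in the paper.

```python
import math


def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds m, v, and step t."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def rmsprop_step(param, grad, state, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter; state holds the running average."""
    state["sq"] = rho * state["sq"] + (1 - rho) * grad * grad
    return param - lr * grad / (math.sqrt(state["sq"]) + eps)


# Two toy "parts of the network", each a single weight with its own loss.
w_text, w_vision = 5.0, -3.0  # hypothetical initial weights
adam_state = {"m": 0.0, "v": 0.0, "t": 0}
rms_state = {"sq": 0.0}

for _ in range(2000):
    grad_text = 2 * (w_text - 1.0)      # gradient of (w_text - 1)^2
    grad_vision = 2 * (w_vision - 2.0)  # gradient of (w_vision - 2)^2
    w_text = adam_step(w_text, grad_text, adam_state, lr=0.05)
    w_vision = rmsprop_step(w_vision, grad_vision, rms_state, lr=0.05)
```

Each optimizer keeps its own state and sees only the gradients of its own parameters, which is the property that lets the text and vision parts of a multi-headed network be trained side by side with different update rules.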

Comparative Analysis with Transfer Learning and Natural Language Processing
To compare performance, we used state-of-the-art methods like transfer learning and NLP [43]. Document images are initially resized to a fixed 224 × 224 pixel size and are then converted to numerical vectors. Then, we employ a diverse set of popular transfer learning networks for our document image classification task, including InceptionV3 (IncV3), VGG16, ResNet50, VGG19, and MobileNet. For the NLP, we extract text from images using optical character recognition. Then, text preprocessing is performed, which involves stemming, lemmatization, and removing stop words [44]. The processed text is tokenized and converted into vectors using count vectorization. The vectors are then passed to a Transformer for classification. Then, a comparative study between transfer learning and NLP is carried out using various classification metrics. Subsequently, we integrate both transfer learning and NLP to create a robust hybrid network for classification, harnessing the strengths of both approaches to enhance our overall classification capabilities.
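The integration step can be pictured as a late fusion of the two heads: their output vectors are concatenated and passed through further dense layers ending in a 16-way softmax. The following forward-pass sketch uses random stand-in values for the head outputs and untrained random weights, so it shows only the data flow and the tensor sizes. The hidden width of 32 and all names are assumptions; the paper does not list the sizes of the fusion layers.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_out, n_in, rng):
    """Hypothetical untrained weights, used only to make the sketch runnable."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
text_head_out = [rng.random() for _ in range(16)]    # stand-in for the Transformer head output
vision_head_out = [rng.random() for _ in range(64)]  # stand-in for the 64-node vision features

fused = text_head_out + vision_head_out              # concatenation: 16 + 64 = 80 values
w1, b1 = random_layer(32, len(fused), rng)           # hidden width 32 is an assumption
w2, b2 = random_layer(16, 32, rng)
probabilities = softmax(dense(relu(dense(fused, w1, b1)), w2, b2))
predicted_class = max(range(16), key=probabilities.__getitem__)
```

Because the fusion layers see both vectors at once, they can learn to rely on the text evidence when pages look alike and on the visual evidence when the OCR text is sparse or noisy.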

Results
We evaluated our Vision Transformer model using heatmap visualization and document classification performance. The results of our experiments are shown in the tables below and are further discussed in this section.

Heatmap Visualization for the Proposed Vision Transformer
The Grad-CAM (gradient-weighted class activation mapping) heatmap visualization, shown in Figure 3 for our proposed Vision Transformer applied to 16 different kinds of document images, offers valuable insights into the model's decision-making process. Grad-CAM aids in both model interpretability and identification of the salient features for precise categorization by emphasizing the regions of interest in these images, enabling us to grasp which parts are crucial for classification. The striking red regions in the heatmap indicate the focal points that the transfer learning models rely on for classification. These regions encapsulate the most discriminative features crucial for distinguishing the different document categories. Essentially, they serve as the model's attention zones, highlighting text patterns, shapes, or other visual cues pivotal for accurate classification. The areas the algorithm considers less significant for its categorization choice are represented by the regions of other colors, which are unremarkable in the heatmap. These areas could include background noise, irrelevant text, or less informative visual components. The model concentrates on the essential components of the image by ignoring these less important regions during classification, improving its capacity to make accurate document category predictions. By distinguishing between crucial and less relevant regions, Grad-CAM shows where the model bases its decisions, which helps in verifying and improving its classification performance.
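For readers unfamiliar with how such a heatmap is computed, the sketch below follows the standard Grad-CAM recipe: each feature map of the last convolutional layer is weighted by the spatial average of the class-score gradient with respect to that map, the weighted maps are summed, and a ReLU keeps only positive evidence. It is a dependency-free illustration on nested lists; the function name and the final rescaling to [0, 1] are choices made for this example, and obtaining the feature maps and gradients from a trained network is not shown.

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from K feature maps and the class-score gradients.

    feature_maps, gradients: lists of K maps, each a list of H rows of W floats.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight alpha_k: spatial average of the gradient for map k.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum of the maps, followed by ReLU to keep positive evidence only.
    heatmap = [
        [max(0.0, sum(a * fmap[r][c] for a, fmap in zip(alphas, feature_maps)))
         for c in range(width)]
        for r in range(height)
    ]
    # Rescale to [0, 1] so the map can be rendered with a colour scale.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap
```

The resulting low-resolution map is normally upsampled to the input size and overlaid on the document image, with the highest values rendered in red.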

Document Classification Performance for the Proposed Vision Transformer
The proposed Vision Transformer, a hybrid model combining NLP Transformer and transfer learning CNN, performs strongly in document classification tasks, as demonstrated in Table 2. The model shows an outstanding accuracy of 94.2%. Regarding computational efficiency, the model requires 951 s (approximately 15.85 min) for training, which, while time consuming, is acceptable given the strong performance. With a test duration of 287 ms, the testing phase is significantly quicker, making it suitable for real-time or nearly real-time document classification applications. Compared to individual models, the Vision Transformer exhibits superior performance, surpassing both the standalone NLP Transformer and transfer learning models in accuracy and other classification metrics.

Discussion
In this section, we delve into the document classification performance for Transformers with optical character recognition and natural language processing and the document classification performance for vision-based transfer learning. Additionally, we provide a comparative analysis with the state-of-the-art methods, offering insights into the effectiveness of our approach.

Document Classification Performance for Transformers with Optical Character Recognition and Natural Language Processing
The performance of Transformers in document classification, aided by optical character recognition (OCR) and natural language processing (NLP), is quite promising, as demonstrated in Table 3. The accuracy of this model is 0.83, meaning that it correctly categorizes 83% of the dataset's documents. The precision score of 0.87 means that about 87% of the model's class assignments are correct, a measure of its ability to avoid false positives. The recall score is likewise strong at 0.85, meaning that about 85% of the documents belonging to a class are retrieved. The F1 score, which balances precision and recall, is 0.85, demonstrating the model's robustness. In terms of efficiency, the training time is 58 s, which suggests that the model is relatively quick to train. This can be a crucial consideration in real-world applications, especially when dealing with large datasets or the need for frequent model updates. The test time is also quite reasonable at 64 ms, implying that the model can make rapid predictions once trained. These results underscore the effectiveness of combining OCR and NLP with Transformers for document classification tasks. With high accuracy, precision, and recall, this model demonstrates its potential in accurately categorizing documents.
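The relationship between the three metrics can be checked with a few lines of Python. The sketch below uses the standard single-class definitions; the confusion counts in the usage example are hypothetical and are not taken from Table 3.

```python
def harmonic_mean(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(true_positive, false_positive, false_negative):
    """Precision, recall, and F1 for a single class from its confusion counts."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return precision, recall, harmonic_mean(precision, recall)


# Hypothetical counts for one class: 87 correct assignments, 13 wrong
# assignments, and 15 documents of the class that were missed.
p, r, f1 = precision_recall_f1(87, 13, 15)
```

For a single class, `harmonic_mean(0.87, 0.85)` evaluates to roughly 0.86. The reported F1 of 0.85 is presumably averaged over the 16 classes, in which case it need not equal the harmonic mean of the averaged precision and recall.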

Document Classification Performance for Vision-Based Transfer Learning
Table 4 presents a comprehensive analysis of the document classification performance achieved through vision-based transfer learning using five different pre-trained convolutional neural network (CNN) models: InceptionV3 (IncV3), VGG16, ResNet50, VGG19, and MobileNet. VGG16 and VGG19 lead with accuracy scores of 0.89 and 0.90, respectively. This means that these models classify documents into the correct categories 89% and 90% of the time. Because of their high accuracy, they are likely to be good at differentiating between document classes and are a good fit for applications where accuracy is crucial. Precision measures the model's capacity for avoiding false positives. With a precision score of 0.91, VGG19 outperforms the other models in this respect: about 91% of its class assignments are correct. This characteristic is essential in situations where incorrect classifications can have serious repercussions. VGG16 exhibits the highest recall at 0.83, indicating its proficiency in capturing true positives while minimizing false negatives. A high recall is vital when missing relevant documents can be detrimental. The F1 score, which balances precision and recall, also showcases VGG19's superiority with a score of 0.88, highlighting its well-rounded performance. Beyond classification accuracy and precision, the practicality of these models depends on their training and testing times. In this regard, MobileNet stands out as the most efficient model, with a training time of 779 s and a rapid testing time of 112 ms. This makes MobileNet particularly suitable for real-time applications and scenarios with limited computational resources. VGG16, although accurate, requires more training time (802 s) and testing time (231 ms). The other models fall in between in terms of computational demands. In summary, VGG16 and VGG19 offer high accuracy and precision, making them well suited to applications demanding precise categorization.

Comparison with the State of the Art
There are several advantages of our proposed VisFormers model relative to the state-of-the-art approaches. The works in [45,46] achieve an accuracy of less than 90% and require 900-1200 s or more for training using primary image classification neural networks. In contrast, our proposed model combines the transfer learning image classification model with Transformers and obtains an accuracy of over 94%. In [47], the authors used only OCR and classified the extracted text using Transformers, achieving an accuracy of less than 80%, which is very low compared to our proposed model. The authors of [48,49] used both vision and NLP for classification, but the accuracy they achieved was less than 90%, and the training time was also very high. In contrast, our model obtained an accuracy of over 94% and a training time of 951 s, surpassing all other models. In [29], the authors used BERT-based pre-trained models for classifying documents. Their model achieved an accuracy of 81%, which is very low compared to our proposed vision Transformer. The D2BFormer model [27,28] excels in degraded document binarization. One limitation is the potential sensitivity to variations in document types. Our work surpasses this limitation, achieving superior performance across diverse document datasets. Universal Document Processing [30,32] excels in document-related tasks; one limitation is its potential computational intensity during training. However, our model surpasses this challenge, achieving superior performance in terms of computational efficiency and overall accuracy. In [31], the effectiveness of the ARABERT4TWC model in classifying Arabic tweets is showcased. One limitation lies in the lack of exploration into domain and cross-domain pre-training for BERT. Our work surpasses this limitation by demonstrating superior performance in tokenized vectors and learning rate fine-tuning across three datasets. Table 5 shows a comprehensive summary of the performance comparisons. In summary, the results indicate that our model outperforms existing approaches in accuracy and efficiency, positioning it as a valuable contribution to the state-of-the-art visual document classification domain.

Table 5. Performance comparison with the state of the art.

Work                   Vision   NLP   Accuracy   Training time (s)
[…]                    ✓        X     <85%       >1200
Larson et al. [47]     X        ✓     <80%       >500
Kanchi et al. [48]     ✓        ✓     <90%       >1500
Bakkali et al. [49]    ✓        ✓     <92%       >1000
Proposed VisFormers    ✓        ✓     >94%       <=951

Limitations of the Proposed VisFormers Model
Despite its success in achieving a high accuracy of 94.2% on the RVL-CDIP dataset, the proposed VisFormers model has certain limitations. The model's performance may be sensitive to variations in document characteristics not well-represented in the training set, potentially leading to misclassifications in real-world scenarios. Additionally, the current implementation focuses on a fixed set of 16 document classes, limiting its adaptability to diverse or evolving document categories. Furthermore, the model's effectiveness may vary when applied to documents with highly unconventional layouts or visual elements not encountered during training.

Conclusions
Complex document classification from scanned images is challenging but essential for rapid digitization. Given the prevalence of printed complex documents in various industries, the need for a specialized document classification model is apparent. Our proposed model addresses this gap by using optical character recognition (OCR) to extract text and employing a multi-headed VisFormers network that trains vision-based transfer learning and natural-language-based Transformers simultaneously. By combining these approaches, our model accurately classifies complex documents, filling the void left by traditional ERP and CRM systems. Integrating our model into existing software solutions enhances their capability to manage and process diverse document types effectively, ultimately improving organizational workflow efficiency and information management. In the proposed model, the vision network receives the document image as input, and the Transformer network receives the vectorized OCR text of the same image. Different parts of the network therefore receive different inputs and are trained with different loss functions and different optimizers. The proposed VisFormers network is tested against a standard complex document classification dataset called RVL-CDIP, and it outperforms the state-of-the-art works by achieving an accuracy score of 94.2%.
The model is tested on 16 document classes, a number that can be increased in future work. The model can be deployed at several workplaces to digitize paper documents.
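The two-branch training idea summarized above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the `LinearHead` classifiers are hypothetical stand-ins for the VGG-19 and Transformer branches, the toy feature vectors replace real image and OCR inputs, the choice of plain cross-entropy with SGD for one branch and label-smoothed cross-entropy with momentum SGD for the other is an assumption made only to show "different losses, different optimizers", and averaging the two probability vectors is an assumed fusion rule.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, n_classes, smoothing=0.0):
    # Target distribution; smoothing > 0 gives label-smoothed targets.
    off = smoothing / n_classes
    return [1.0 - smoothing + off if c == label else off for c in range(n_classes)]

class LinearHead:
    """Softmax classifier standing in for one branch (hypothetical stand-in)."""
    def __init__(self, n_in, n_classes):
        self.w = [[0.0] * n_in for _ in range(n_classes)]

    def probs(self, x):
        return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in self.w])

    def grad(self, x, target):
        # Cross-entropy gradient w.r.t. the weights: outer product (p - target) x^T.
        p = self.probs(x)
        return [[(pc - tc) * xi for xi in x] for pc, tc in zip(p, target)]

class SGD:
    def __init__(self, lr):
        self.lr = lr

    def step(self, head, grad):
        for row, grow in zip(head.w, grad):
            for j, g in enumerate(grow):
                row[j] -= self.lr * g

class MomentumSGD:
    def __init__(self, lr, beta, head):
        self.lr, self.beta = lr, beta
        self.v = [[0.0] * len(row) for row in head.w]  # velocity buffer

    def step(self, head, grad):
        for row, vrow, grow in zip(head.w, self.v, grad):
            for j, g in enumerate(grow):
                vrow[j] = self.beta * vrow[j] + g
                row[j] -= self.lr * vrow[j]

def make_toy_data(n=40, seed=0):
    # Each sample: (image features, text features, label); the last entry is a bias term.
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        image_feat = [sign + rng.gauss(0, 0.3), rng.gauss(0, 0.3), 1.0]
        text_feat = [0.8 * sign + rng.gauss(0, 0.3), 1.0]
        data.append((image_feat, text_feat, label))
    return data

def train(data, n_classes=2, epochs=20):
    vision, text = LinearHead(3, n_classes), LinearHead(2, n_classes)
    vision_opt = SGD(lr=0.1)                               # optimizer for the vision branch
    text_opt = MomentumSGD(lr=0.02, beta=0.9, head=text)   # optimizer for the text branch
    for _ in range(epochs):
        for image_feat, text_feat, label in data:
            # Each branch sees its own input, loss target, and optimizer.
            vision_opt.step(vision, vision.grad(image_feat, one_hot(label, n_classes)))
            text_opt.step(text, text.grad(text_feat, one_hot(label, n_classes, smoothing=0.1)))
    return vision, text

def predict(vision, text, image_feat, text_feat):
    # Assumed fusion rule: average the class probabilities of the two branches.
    fused = [(a + b) / 2 for a, b in zip(vision.probs(image_feat), text.probs(text_feat))]
    return max(range(len(fused)), key=fused.__getitem__)

data = make_toy_data()
vision, text = train(data)
accuracy = sum(predict(vision, text, im, tx) == y for im, tx, y in data) / len(data)
print(f"fused training accuracy on toy data: {accuracy:.2f}")
```

In a deep learning framework, the same pattern would typically be expressed by giving each branch's parameters to its own optimizer; the toy accuracy printed here says nothing about the results reported for RVL-CDIP.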

Figure 1. The framework workflow includes two key steps: document image classification using transfer learning, followed by OCR and Transformer-based text classification, ultimately integrating vision and OCR for robust classification.

Figure 2. The VisFormers model combines a Transformer and a pre-trained VGG-19 network for document classification with specific architecture details.

Figure 3. The Grad-CAM heatmap visualization highlights critical regions in document image classification. The reddish color indicates a higher probability density of having decision-making features.

Table 1. A summary of the current state of the art in document classification, including an overview of recent relevant studies and their associated limitations.

Table 2. Performance overview for the proposed Vision Transformer model, showcasing key metrics and execution times for evaluation.

Table 3. Performance metrics for the state-of-the-art OCR and NLP-enhanced document classification model.

Table 4. Document classification performance using different state-of-the-art pre-trained CNN transfer learning models.

Table 5. Performance of the proposed Vision Transformer model compared to other state-of-the-art models.