MDPI - Publisher of Open Access Journals

34 pages, 36077 KB

Open Access Article

Modular Multi-Attribute Vehicle Analysis by Color, License Plate, Make and Sub-Model Using YOLO and OCR: A Benchmark Across YOLO Versions

by Cristian Japhet Islas-Yañez, Viridiana Hernández-Herrera and Moisés Márquez-Olivera

Sensors 2026, 26(9), 2785; https://doi.org/10.3390/s26092785 - 29 Apr 2026

Viewed by 937

Abstract

We present a modular multi-attribute vehicle analysis pipeline that integrates YOLO-based models and an OCR engine into a single workflow. The system detects vehicles, classifies color, recognizes make and sub-model, detects license plates, and extracts plate characters to generate a structured vehicle record. Vehicle detection is reported with standard metrics (precision, recall, and mAP@0.5), while license plate detection is reported at IoU = 0.3 to reflect the small-object nature of plates and downstream OCR usability. Among the evaluated versions, YOLOv8 provides the most balanced overall performance across modules, while maintaining real-time-equivalent throughput of approximately 18–22 FPS for the full pipeline on recorded traffic videos, depending on scene complexity. We emphasize module-level evaluation and runtime benchmarking; instance-level end-to-end identification across unique vehicles is defined as future work once track-based ground truth becomes available. Full article

(This article belongs to the Topic Deep Visual Recognition: Methods, and Applications)
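
The abstract above reports license plate detection at IoU = 0.3 rather than the usual 0.5. The following minimal sketch shows why a looser threshold can matter for small objects. The box coordinates and the helper names `iou` and `is_match` are hypothetical and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """A prediction counts as a true positive when IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


# Hypothetical 40x12 px plate and a prediction shifted by 8 px right, 3 px down.
truth = (100, 50, 140, 62)
pred = (108, 53, 148, 65)

print(round(iou(pred, truth), 3))
print(is_match(pred, truth, 0.3), is_match(pred, truth, 0.5))
```

For a box this small, a shift of a few pixels already lowers the IoU noticeably, so the same prediction can pass at 0.3 and fail at 0.5.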

31 pages, 2433 KB

Open Access Article

Quality vs. Populism in Short-Video Political Communication: A Multimodal Study of TikTok

by Alicia Rodas-Coloma, Marcos Cabezas-González, Sonia Casillas-Martín and Pedro Nevado-Batalla Moreno

Journal. Media 2026, 7(1), 46; https://doi.org/10.3390/journalmedia7010046 - 25 Feb 2026

Viewed by 1451

Abstract

The article examines how framing and actor identity structure attention in short-video politics using a country-level corpus from Ecuador. It assembles 4612 public TikTok videos from official accounts and politically salient hashtags, extracts multimodal text via automatic speech recognition and on-screen OCR, and constructs two continuous indices: a quality index (programmatic, efficacy-oriented content) and a populism index (antagonistic, people-versus-elite cues). Engagement is modeled as a fractional response (binomial GLM with logit link), with robustness checks using OLS on logit(ER) and Poisson counts with an offset for log(plays + 1). Models include affect (positive sentiment and anger), hour/day controls, and actor fixed effects (leader, creator, institution, party, and media). The indices display construct validity: quality aligns with positive/joyful tone and populism with anger. Net of controls, populism is positively and consistently associated with engagement across estimators; quality is small and often null or negative. Effects are heterogeneous: leaders gain under both frames, creators primarily under populism, and media modestly under populism, while institutions face penalties under both, and parties show limited returns. Monthly series reveal event-linked intensification of populism, and hashtag networks are modular, mapping onto institutional, partisan, and creator ecosystems. A design analysis identifies a non-populist pathway—benefit-first micro-explanations, concise captions, targeted hashtags, and joyful/efficacy affect—that raises engagement without antagonism. The study contributes a reproducible, open-source pipeline for survey-free, multimodal framing measurement and clarifies how persona × frame interactions and meso-level discursive structure jointly organize attention in short-video politics. Full article

25 pages, 1558 KB

Open Access Article

Towards Scalable Monitoring: An Interpretable Multimodal Framework for Migration Content Detection on TikTok Under Data Scarcity

by Dimitrios Taranis, Gerasimos Razis and Ioannis Anagnostopoulos

Electronics 2026, 15(4), 850; https://doi.org/10.3390/electronics15040850 - 17 Feb 2026

Viewed by 694

Abstract

Short-form video platforms such as TikTok (TikTok Pte. Ltd., Singapore) host large volumes of user-generated, often ephemeral, content related to irregular migration, where relevant cues are distributed across visual scenes, on-screen text, and multilingual captions. Automatically identifying migration-related videos is challenging due to this multimodal complexity and the scarcity of labeled data in sensitive domains. This paper presents an interpretable multimodal classification framework designed for deployment under data-scarce conditions. We extract features from platform metadata, automated video analysis (Google Cloud Video Intelligence), and Optical Character Recognition (OCR) text, and compare text-only, OCR-only, and vision-only baselines against a multimodal fusion approach using Logistic Regression, Random Forest, and XGBoost. In this pilot study, multimodal fusion consistently improves class separation over single-modality models, achieving an F1-score of 0.92 for the migration-related class under stratified cross-validation. Given the limited sample size, these results are interpreted as evidence of feature separability rather than definitive generalization. Feature importance and SHAP analyses identify OCR-derived keywords, maritime cues, and regional indicators as the most influential predictors. To assess robustness under data scarcity, we apply SMOTE to synthetically expand the training set to 500 samples and evaluate performance on a small held-out set of real videos, observing stable results that further support feature-level robustness. Finally, we demonstrate scalability by constructing a weakly labeled corpus of 600 videos using the identified multimodal cues, highlighting the suitability of the proposed feature set for weakly supervised monitoring at scale. Overall, this work serves as a methodological blueprint for building interpretable multimodal monitoring pipelines in sensitive, low-resource settings. Full article

(This article belongs to the Special Issue Multimodal Learning for Multimedia Content Analysis and Understanding)
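
The abstract above mentions using SMOTE to expand the training set synthetically. As a rough illustration of the core idea only, the sketch below interpolates between a minority-class sample and one of its nearest minority-class neighbours. The feature vectors, the function name `smote`, and the parameter defaults are assumptions made for this example. A real study would more likely rely on a maintained implementation such as the one in the imbalanced-learn library.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # A random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical two-dimensional feature vectors for the minority class.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real samples, the method adds no information from outside the observed feature range. This is consistent with the abstract's caution about interpreting the results as feature separability rather than generalization.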

42 pages, 2797 KB

Open Access Review

Decoding Technical Diagrams: A Survey of AI Methods for Image Content Extraction and Understanding

by Nick Bray, Michael Hempel, Matthew Boeding and Hamid Sharif

Information 2026, 17(2), 165; https://doi.org/10.3390/info17020165 - 6 Feb 2026

Viewed by 2999

Abstract

With artificial intelligence (AI) rapidly increasing in popularity and presence in everyday life, new applications utilizing AI are being explored across virtually all domains, from banking and healthcare to cybersecurity and generative AI for image, voice, and video content creation. With that trend comes an inherent need for increased AI capabilities. One cornerstone of AI applications is the ability of generative AI to consume documents and utilize their content to answer questions, generate new content, correlate it with other data sources, and more. No longer constrained to text alone, we now leverage multimodal AI models to help us understand visual elements within documents, such as images, tables, figures, and charts. Within this realm, capabilities have expanded exponentially from traditional Optical Character Recognition (OCR) approaches towards increasingly complex AI models for visual content analysis and understanding. Modern approaches, especially those leveraging AI, now focus on interpreting more complex diagrams such as flowcharts, block diagrams, Unified Modeling Language (UML) diagrams, electrical schematics, and timing diagrams. These diagram types combine text, symbols, and structured layout, making them challenging to parse and comprehend using conventional techniques. This paper presents a historical analysis and comprehensive survey of scientific literature exploring this domain of visual understanding of complex technical illustrations and diagrams. We explore the use of deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based architectures. These models, along with OCR, enable the extraction of both textual and structural information from visually complex sources. Despite these advancements, numerous challenges remain. These range from hallucinations, where the content extraction system produces outputs not grounded in the source image and thereby causes misinterpretations, to a lack of contextual understanding of diagrammatic elements, such as arrows, grouping, and spatial hierarchy. This survey focuses on five key diagram types: flowcharts, block diagrams, UML diagrams, electrical schematics, and timing diagrams. It evaluates the effectiveness, limitations, and practical solutions, both traditional and AI-driven, that aim to enable the extraction of accurate and meaningful information from complex diagrams in a way that is trustworthy and suitable for real-world, high-accuracy AI applications. This survey reveals that virtually all approaches struggle with accurately extracting technical diagram information. It also illustrates a path forward. Pursuing research to further improve their accuracy is crucial for supporting and enabling various applications, including complex document question answering and Retrieval Augmented Generation (RAG), document-driven AI agents, accessibility applications, and automation. Full article

(This article belongs to the Special Issue Intelligent Image Processing by Deep Learning, 2nd Edition)

37 pages, 2905 KB

Open Access Article

A Slide Annotation System with Multimodal Analysis for Video Presentation Review

by Amma Liesvarastranta Haz, Komang Candra Brata, Nobuo Funabiki, Htoo Htoo Sandi Kyaw, Evianita Dewi Fajrianti and Sritrusta Sukaridhoto

Algorithms 2026, 19(2), 110; https://doi.org/10.3390/a19020110 - 1 Feb 2026

Viewed by 1080

Abstract

With the rapid growth of online presentations, there has been an increasing need for efficient review of recorded materials. In typical presentations, speakers verbally elaborate on each slide, providing details not captured in the slides themselves. Automatically extracting and embedding these verbal explanations at their corresponding slide locations can greatly enhance the review process for audiences. This paper presents a Slide Annotation System that employs a robust hybrid two-stage detector to identify slide boundaries, extracts slide text through Optical Character Recognition (OCR), transcribes narration, and employs a multimodal Large Language Model (LLM) to generate concise, context-aware annotations that are added to their corresponding slide locations. For evaluations, the technical performance was validated on five recorded presentations, while the user experience was assessed by 37 participants. The results showed that the system achieved a macro-average $F_{1}$ score of 0.879 (SD = 0.024, 95% CI [0.849, 0.909]) for slide segmentation and 90.0% accuracy (95% CI [74.4%, 96.5%]) for annotation alignment. Subjective evaluations revealed high annotation validity and usefulness as rated by presenters, and a high System Usability Scale (SUS) score of 80.5 (SD = 6.7, 95% CI [78.3, 82.7]). Qualitative feedback further confirmed that the system effectively streamlined the review process, enabling users to locate key information more efficiently than standard video playback. These findings demonstrate the strong potential of the proposed system as an effective automated annotation system. Full article

(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
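
The abstract above reports 90.0% annotation-alignment accuracy with a 95% confidence interval of [74.4%, 96.5%]. It does not state how the interval was computed or how many items were scored. The sketch below shows one common way to obtain an asymmetric interval for a proportion, the Wilson score interval. The counts (27 correct out of 30) are purely illustrative assumptions. They give bounds close to the reported ones, which does not establish that the authors used this method or this sample size.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts chosen only so that the point estimate is 90.0%.
low, high = wilson_interval(27, 30)
print(f"{low:.3f} {high:.3f}")
```

Unlike the simple normal approximation, the Wilson interval is not symmetric around the point estimate and stays inside [0, 1], which is useful when accuracy is high and the sample is small.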

20 pages, 2119 KB

Open Access Article

Intelligent Logistics Sorting Technology Based on PaddleOCR and SMITE Parameter Tuning

by Zhaokun Yang, Yue Li, Lizhi Sun, Yufeng Qiu, Licun Fang, Zibin Hu and Shouna Guo

Appl. Sci. 2026, 16(2), 767; https://doi.org/10.3390/app16020767 - 12 Jan 2026

Viewed by 1498

Abstract

To address the current reliance on manual labor in traditional logistics sorting operations, which leads to low sorting efficiency and high operational costs, this study presents the design of an unmanned logistics vehicle based on the Robot Operating System (ROS). To overcome bounding-box loss issues commonly encountered by mainstream video-stream image segmentation algorithms under complex conditions, the novel SMITE video image segmentation algorithm is employed to accurately extract key regions of mail items while eliminating interference. Extracted logistics information is mapped to corresponding grid points within a map constructed using Simultaneous Localization and Mapping (SLAM). The system performs global path planning with the A* heuristic graph search algorithm to determine the optimal route, autonomously navigates to the target location, and completes the sorting task via a robotic arm, while local path planning is managed using the Dijkstra algorithm. Experimental results demonstrate that the SMITE video image segmentation algorithm maintains stable and accurate segmentation under complex conditions, including object appearance variations, illumination changes, and viewpoint shifts. The PaddleOCR text recognition algorithm achieves an average recognition accuracy exceeding 98.5%, significantly outperforming traditional methods. Through the analysis of existing technologies and the design of a novel parcel-grasping control system, the feasibility of the proposed system is validated in real-world environments. Full article
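
The abstract above names A* as the global path planner over a SLAM-built map. As a minimal sketch of that search on an occupancy grid, the example below uses 4-connected moves with unit cost and a Manhattan-distance heuristic. The grid, the function name `astar`, and these modelling choices are assumptions for illustration and do not describe the authors' ROS implementation.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked) cells."""
    def heuristic(cell):
        # Manhattan distance never overestimates on a unit-cost 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # Stale heap entry; a cheaper route was already found.
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    priority = new_cost + heuristic((nr, nc))
                    heapq.heappush(open_heap, (priority, new_cost, (nr, nc)))
    return None  # Goal unreachable.


# Hypothetical 4x4 map: row 1 is a wall with a single gap at column 2.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
path = astar(grid, (0, 0), (3, 0))
print(path)
```

With a zero heuristic the same loop reduces to Dijkstra's algorithm, the method the abstract assigns to local planning.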

15 pages, 1308 KB

Open Access Article

Evolution of Convolutional and Recurrent Artificial Neural Networks in the Context of BIM: Deep Insight and New Tool, Bimetria

by Andrzej Szymon Borkowski, Łukasz Kochański and Konrad Rukat

Infrastructures 2026, 11(1), 6; https://doi.org/10.3390/infrastructures11010006 - 22 Dec 2025

Cited by 1 | Viewed by 893

Abstract

This paper discusses the evolution of convolutional (CNN) and recurrent (RNN) artificial neural networks in applications for Building Information Modeling (BIM). The paper outlines the milestones reached in the last two decades. The article organizes the current state of knowledge and technology in terms of three aspects: (1) computer visualization coupled with BIM models (detection, segmentation, and quality verification in images, videos, and point clouds), (2) sequence and time series modeling (prediction of costs, energy, work progress, risk), and (3) integration of deep learning results with the semantics and topology of Industry Foundation Classes (IFC) models. The paper identifies the most used architectures, typical data pipelines (synthetic data from BIM models, transfer learning, mapping results to IFC elements) and practical limitations: lack of standardized benchmarks, high annotation costs, a domain gap between synthetic and real data, and discontinuous interoperability. We indicate directions for development: combining CNN/RNN with graph models and transformers for wider use of synthetic data and semi-/supervised learning, as well as explainability methods that increase trust in AECOO (Architecture, Engineering, Construction, Owners & Operators) processes. A practical case study presents a new application, Bimetria, which uses a hybrid CNN/OCR (Optical Character Recognition) solution to generate 3D models with estimates based on two-dimensional drawings. A deep review shows that although the importance of attention-based and graph-based architectures is growing, CNNs and RNNs remain an important part of the BIM process, especially in engineering tasks, where, in our experience and in the Bimetria case study, mature convolutional architectures offer a good balance between accuracy, stability and low latency. The paper also raises some fundamental questions to which we are still seeking answers. Thus, the article not only presents the innovative new Bimetria tool but also aims to stimulate discussion about the dynamic development of AI (Artificial Intelligence) in BIM. Full article

(This article belongs to the Special Issue Modern Digital Technologies for the Built Environment of the Future)

19 pages, 4480 KB

Open Access Article

FE-WRNet: Frequency-Enhanced Network for Visible Watermark Removal in Document Images

by Zhengli Chen, Yuwei Zhang, Jielu Yan, Xuekai Wei, Weizhi Xian, Qin Mao, Yi Qin and Tong Gao

Appl. Sci. 2025, 15(22), 12216; https://doi.org/10.3390/app152212216 - 18 Nov 2025

Viewed by 1005

Abstract

In video pipelines, document content in recorded lectures, surveillance footage, and broadcast materials is often overlaid with persistent visible watermarks. Such overlays greatly reduce the readability of document images and interfere with downstream tasks such as optical character recognition (OCR). Despite extensive studies, no prior work has concurrently addressed the diverse text layouts and watermark styles commonly encountered in real-world scenarios. To address this gap, we introduce TextLogo, the first benchmark dataset specifically designed for this comprehensive setting. TextLogo encompasses 2000 training pairs and 200 test pairs, spanning a wide array of text layouts and 30 distinct watermark styles. Building on this foundation, we propose the frequency-enhanced watermark-removal network (FE-WRNet), a generative network that fuses information from the spatial domain and the wavelet domain. Our Fused Wavelet Convolution Mixer (FWCM) effectively captures both the body and the edge components of watermarks, thereby enhancing removal performance. Training is guided by a hybrid loss function (comprising pixel, perceptual, and wavelet-domain objectives) to preserve fine details and edge structures. Moreover, while this work focuses on single-image document watermark removal, the proposed spatial–wavelet fusion and high-frequency-aware loss are directly relevant to video processing tasks (e.g., frame-wise watermark removal and temporal restoration) because watermarks in video often persist across frames and require fidelity-preserving, temporally consistent restoration. Extensive experiments on TextLogo demonstrate that FE-WRNet outperforms the strongest baseline and reduces the perceptual error by 10.6%. The proposed model also generalizes effectively to natural-image watermark datasets. Full article

(This article belongs to the Section Computing and Artificial Intelligence)

29 pages, 5213 KB

Open Access Article

Design and Implementation of a Novel Intelligent Remote Calibration System Based on Edge Intelligence

by Quan Wang, Jiliang Fu, Xia Han, Xiaodong Yin, Jun Zhang, Xin Qi and Xuerui Zhang

Symmetry 2025, 17(9), 1434; https://doi.org/10.3390/sym17091434 - 3 Sep 2025

Cited by 1 | Viewed by 1326

Abstract

Calibration of power equipment has become an essential task in modern power systems. This paper proposes a distributed remote calibration prototype based on a cloud–edge–end architecture by integrating intelligent sensing, Internet of Things (IoT) communication, and edge computing technologies. The prototype employs a high-precision frequency-to-voltage conversion module leveraging satellite signals to address traceability and value transmission challenges in remote calibration, thereby ensuring reliability and stability throughout the process. Additionally, an environmental monitoring module tracks parameters such as temperature, humidity, and electromagnetic interference. Combined with video surveillance and optical character recognition (OCR), this enables intelligent, end-to-end recording and automated data extraction during calibration. Furthermore, a cloud-edge task scheduling algorithm is implemented to offload computational tasks to edge nodes, maximizing resource utilization within the cloud–edge collaborative system and enhancing service quality. The proposed prototype extends existing cloud–edge collaboration frameworks by incorporating calibration instruments and sensing devices into the network, thereby improving the intelligence and accuracy of remote calibration across multiple layers. Furthermore, this approach facilitates synchronized communication and calibration operations across symmetrically deployed remote facilities and reference devices, providing solid technical support to ensure that measurement equipment meets the required precision and performance criteria. Full article

(This article belongs to the Section Computer)

28 pages, 5387 KB

Open Access Article

A Deep Learning Framework of Super Resolution for License Plate Recognition in Surveillance System

by Pei-Fen Tsai, Jia-Yin Shiu and Shyan-Ming Yuan

Mathematics 2025, 13(10), 1673; https://doi.org/10.3390/math13101673 - 20 May 2025

Cited by 2 | Viewed by 6062

Abstract

Recognizing low-resolution license plates from real-world scenes remains a challenging task. While deep learning-based super-resolution methods have been widely applied, most existing datasets rely on artificially degraded images, and common quality metrics poorly correlate with OCR accuracy. We construct a new paired low- and high-resolution license plate dataset from dashcam videos and propose a specialized super-resolution framework for license plate recognition. Only low-resolution images with OCR accuracy ≥5 are used to ensure sufficient feature information for effective perceptual learning. We analyze existing loss functions and introduce two novel perceptual losses—one CNN-based and one Transformer-based. Our approach improves recognition performance, achieving an average OCR accuracy of 85.14%. Full article

(This article belongs to the Section E1: Mathematics and Computer Science)

31 pages, 6157 KB

Open Access Article

A Self-Adaptive Traffic Signal System Integrating Real-Time Vehicle Detection and License Plate Recognition for Enhanced Traffic Management

by Manar Ashkanani, Alanoud AlAjmi, Aeshah Alhayyan, Zahraa Esmael, Mariam AlBedaiwi and Muhammad Nadeem

Inventions 2025, 10(1), 14; https://doi.org/10.3390/inventions10010014 - 5 Feb 2025

Cited by 18 | Viewed by 13120

Abstract

Traffic management systems play a crucial role in smart cities, especially because increasing urban populations lead to higher traffic volumes on roads. This results in increased congestion at intersections, causing delays and traffic violations. This paper proposes an adaptive traffic control and optimization system that dynamically adjusts signal timings in response to real-time traffic situations and volumes by applying machine learning algorithms to images captured through video surveillance cameras. This system is also able to capture the details of vehicles violating signals, which would be helpful for enforcing traffic rules. Benefiting from advancements in computer vision techniques, we deployed a novel real-time object detection model called YOLOv11 in order to detect vehicles and adjust the duration of green signals. Our system used Tesseract OCR for extracting license plate information, thus ensuring robust traffic monitoring and enforcement. A web-based real-time digital twin complemented the system by visualizing traffic volume and signal timings for the monitoring and optimization of traffic flow. Experimental results demonstrated that YOLOv11 achieved a better overall accuracy, namely 95.1%, and efficiency compared to previous models. The proposed solution reduces congestion and improves traffic flow across intersections while offering a scalable and cost-effective approach for smart traffic and lowering greenhouse gas emissions at the same time. Full article

(This article belongs to the Special Issue Advanced Technologies and Artificial Intelligence for Sustainable and Intelligent Transportation Systems)

19 pages, 3800 KB

Open Access Article

Fully Open-Source Meeting Minutes Generation Tool

by Amma Liesvarastranta Haz, Yohanes Yohanie Fridelin Panduman, Nobuo Funabiki, Evianita Dewi Fajrianti and Sritrusta Sukaridhoto

Future Internet 2024, 16(11), 429; https://doi.org/10.3390/fi16110429 - 20 Nov 2024

Cited by 3 | Viewed by 9052

Abstract

With the increasing use of online meetings, there is a growing need for efficient tools that can automatically generate meeting minutes from recorded sessions. Current solutions often rely on proprietary systems, limiting adaptability and flexibility. This paper investigates whether various open-source models and methods such as audio-to-text conversion, summarization, keyword extraction, and optical character recognition (OCR) can be integrated to create a meeting minutes generation tool for recorded video presentations. For this purpose, a series of evaluations are conducted to identify suitable models. Then, the models are integrated into a system that is modular yet accurate. The utilization of an open-source approach ensures that the tool remains accessible and adaptable to the latest innovations, thereby ensuring continuous improvement over time. Furthermore, this approach also benefits organizations and individuals by providing a cost-effective and flexible alternative. This work contributes to creating a modular and easily extensible open-source framework that integrates several advanced technologies and future new models into a cohesive system. The system was evaluated on ten videos created under controlled conditions, which may not fully represent typical online presentation recordings. It showed strong performance in audio-to-text conversion with a low word-error rate. Summarization and keyword extraction were functional but showed room for improvement in terms of precision and relevance, as gathered from the users’ feedback. These results confirm the system’s effectiveness and efficiency in generating usable meeting minutes from recorded presentation videos, with room for improvement in future works. Full article

(This article belongs to the Special Issue Deep Learning and Natural Language Processing II)
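
The abstract above evaluates audio-to-text conversion by word-error rate (WER). As a brief sketch of how that metric is usually defined, the example below computes the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and a hypothesis and divides it by the reference length. The sentences and the function name `word_error_rate` are made up for illustration. The paper's own tokenization and normalization are not given in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word and one inserted word.
reference = "the meeting starts at ten"
hypothesis = "the meeting start at ten today"
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.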

16 pages, 14552 KB

Open Access Article

Application of Binary Image Quality Assessment Methods to Predict the Quality of Optical Character Recognition Results

by Mateusz Kopytek, Piotr Lech and Krzysztof Okarma

Appl. Sci. 2024, 14(22), 10275; https://doi.org/10.3390/app142210275 - 8 Nov 2024

Cited by 1 | Viewed by 1801

Abstract

One of the continuous challenges related to the growing popularity of mobile devices and embedded systems with limited memory and computational power is the development of relatively fast methods for real-time image and video analysis. One such example is Optical Character Recognition (OCR), which is usually too complex for such devices. Considering that images captured by cameras integrated into mobile devices may be acquired in uncontrolled lighting conditions, some quality issues related to non-uniform illumination may affect the image binarization results and further text recognition results. The solution proposed in this paper is related to a significant reduction in the computational burden, preventing the necessity of full text recognition. Conducting only the initial image binarization using various thresholding methods, the computation of the mutual similarities of binarization results is proposed, making it possible to build a simple model of binary image quality for a fast prediction of the OCR results’ quality. The experimental results provided in the paper obtained for the dataset of 1760 images, as well as the additional verification for a larger dataset, confirm the high correlation of the proposed quality model with text recognition results. Full article

(This article belongs to the Section Computing and Artificial Intelligence)
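
The idea summarized in the abstract above (binarize one image with several thresholding methods, then compare the results with each other) can be illustrated with a small sketch. This is not the authors' model: the choice of two thresholds (global mean and Otsu), the pixel-agreement score, and all function names are assumptions made here for illustration only. The intuition is that a cleanly lit text image yields nearly identical binarizations, while uneven illumination makes the methods disagree.

```python
from typing import List

Image = List[List[int]]  # grayscale rows with integer values in 0..255


def mean_threshold(img: Image) -> float:
    """Global threshold set to the mean gray level."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def otsu_threshold(img: Image) -> int:
    """Otsu's method: pick the level that maximizes between-class variance."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]            # pixels at or below level t
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0          # pixels above level t
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img: Image, threshold: float) -> Image:
    """Map pixels above the threshold to 1 and all others to 0."""
    return [[1 if p > threshold else 0 for p in row] for row in img]


def agreement(a: Image, b: Image) -> float:
    """Fraction of pixels on which two binary images agree (hypothetical score)."""
    same = sum(1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb) if pa == pb)
    return same / sum(len(row) for row in a)


def binarization_agreement(img: Image) -> float:
    """Agreement between mean-threshold and Otsu binarizations of one image."""
    return agreement(binarize(img, mean_threshold(img)),
                     binarize(img, otsu_threshold(img)))
```

A score near 1 suggests the image binarizes consistently; how such a score would map to OCR accuracy is exactly what the paper's model addresses and is not reproduced here.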

13 pages, 5724 KB

Open Access Article

Comparative Approach to De-Noising TEMPEST Video Frames

by Alexandru Mădălin Vizitiu, Marius Alexandru Sandu, Lidia Dobrescu, Adrian Focșa and Cristian Constantin Molder

Sensors 2024, 24(19), 6292; https://doi.org/10.3390/s24196292 - 28 Sep 2024

Cited by 1 | Viewed by 1971

Abstract

Analysis of unintended compromising emissions from Video Display Units (VDUs) is an important topic in research communities. This paper examines the feasibility of recovering the information displayed on the monitor from reconstructed video frames. The study holds particular significance for our understanding of security vulnerabilities associated with the electromagnetic radiation of digital displays. Considering the amount of noise that reconstructed TEMPEST video frames have, the work in this paper focuses on two different approaches to de-noising images for efficient optical character recognition. First, an Adaptive Wiener Filter (AWF) with adaptive window size implemented in the spatial domain was tested, and then a Convolutional Neural Network (CNN) with an encoder–decoder structure that follows both classical auto-encoder model architecture and U-Net architecture (auto-encoder with skip connections). These two techniques resulted in an improvement of more than two times on the Structural Similarity Index Metric (SSIM) for AWF and up to four times for the SSIM for the Deep Learning (DL) approach. In addition, to validate the results, the possibility of text recovery from processed noisy frames was studied using a state-of-the-art Tesseract Optical Character Recognition (OCR) engine. The present work aims to bring to attention the security importance of this topic and the non-negligible character of VDU information leakages.

(This article belongs to the Section Sensing and Imaging)

9 pages, 3413 KB

Open Access, Editor’s Choice, Review

Focused Update on Clinical Testing of Otolith Organs

by Stefan C. A. Hegemann, Anand Kumar Bery and Amir Kheradmand

Audiol. Res. 2024, 14(4), 602-610; https://doi.org/10.3390/audiolres14040051 - 2 Jul 2024

Cited by 6 | Viewed by 4470

Abstract

Sensing gravity through the otolith receptors is crucial for bipedal stability and gait. The overall contribution of the otolith organs to eye movements, postural control, and perceptual functions is the basis for clinical testing of otolith function. With such a wide range of contributions, it is important to recognize that the functional outcomes of these tests may vary depending on the specific method employed to stimulate the hair cells. In this article, we review common methods used for clinical evaluation of otolith function and discuss how different aspects of physiology may affect the functional measurements in these tests. We compare the properties and performance of various clinical tests with an emphasis on the newly developed video ocular counter roll (vOCR), measurement of ocular torsion on fundus photography, and subjective visual vertical or horizontal (SVV/SVH) testing.

(This article belongs to the Special Issue The Vestibular System: Physiology and Testing Methods)

Search Results (25)
