Search Results (103)

Search Parameters:
Keywords = open vocabulary

23 pages, 11860 KB  
Article
HG-RSOVSSeg: Hierarchical Guidance Open-Vocabulary Semantic Segmentation Framework of High-Resolution Remote Sensing Images
by Wubiao Huang, Fei Deng, Huchen Li and Jing Yang
Remote Sens. 2026, 18(2), 213; https://doi.org/10.3390/rs18020213 - 9 Jan 2026
Viewed by 203
Abstract
Remote sensing image semantic segmentation (RSISS) aims to assign a correct class label to each pixel in remote sensing images and has wide applications. With the development of artificial intelligence, deep learning-based RSISS has made significant progress. However, existing methods remain focused on predefined semantic classes and require costly retraining when confronted with new classes. To address this limitation, we propose a hierarchical guidance open-vocabulary semantic segmentation framework for remote sensing images (HG-RSOVSSeg), enabling flexible segmentation of arbitrary semantic classes without model retraining. Our framework leverages pretrained text-embedding models to provide common class knowledge and aligns multimodal features through a dual-stream architecture. Specifically, we propose a multimodal feature aggregation module for pixel-level alignment and a hierarchical visual feature decoder guided by text-feature alignment, which progressively refines visual features using language priors, preserving semantic coherence during high-resolution decoding. Extensive experiments on six representative public datasets show that our method achieves the highest mean mIoU across the datasets, establishing state-of-the-art performance in open-vocabulary semantic segmentation of remote sensing images. Full article
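
The core mechanism described here — scoring dense visual features against text embeddings of arbitrary class names — can be sketched as per-pixel cosine-similarity classification. The sketch below is illustrative only: the shapes, class names, and random stand-in features are assumptions, not the HG-RSOVSSeg implementation.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the two streams: D-dim pixel embeddings on an H x W feature map,
# and one text embedding per (arbitrary) class name from a frozen text encoder.
D, H, W = 512, 64, 64
class_names = ["building", "road", "water", "vegetation", "bare soil"]

pixel_feats = F.normalize(torch.randn(D, H, W), dim=0)
text_embeds = F.normalize(torch.randn(len(class_names), D), dim=1)

# Cosine similarity between every pixel and every class prompt; the per-pixel
# argmax yields an open-vocabulary segmentation map without any retraining.
logits = torch.einsum("cd,dhw->chw", text_embeds, pixel_feats)
seg_map = logits.argmax(dim=0)  # (H, W) indices into class_names
print(seg_map.shape)
```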

25 pages, 1456 KB  
Article
AI-Generated Tailor-Made Pedagogical Picture Books: How Close Are We?
by Branislav Bédi, Hakeem Beedar, Belinda Chiera, Cathy Chua, Stéphanie Geneix-Rabault, Vanessa Kreusch, Christèle Maizonniaux, Manny Rayner, Sophie Rendina, Emily Ryan-Cooper, Vladyslav Sukhyi, Ivana Vargova, Sarah Wright, Chunlin Yao and Rina Zviel-Girshin
Educ. Sci. 2025, 15(12), 1704; https://doi.org/10.3390/educsci15121704 - 17 Dec 2025
Viewed by 582
Abstract
Illustrated digital picture books are widely used for second-language reading and vocabulary growth. We ask how close current generative AI (GenAI) tools are to producing such books on demand for specific learners. Using the ChatGPT-based Learning And Reading (C-LARA) platform with GPT-5 for text/annotation and GPT-Image-1 for illustration, we ran three pilot studies. Study 1 used six AI-generated English books glossed into Chinese, French, and Ukrainian and evaluated them using page-level and whole-book Likert questionnaires completed by teachers and students. Study 2 created six English books targeted at low-intermediate East-Asian adults who had recently arrived in Adelaide and gathered student and teacher ratings. Study 3 piloted an individually tailored German mini-course for one anglophone learner, with judgements from the learner and two germanophone teachers. Images and Chinese glossing were consistently strong; French glossing was good but showed issues with gender agreement, register, and naturalness of phrasing; and Ukrainian glossing underperformed, with morphosyntax and idiom errors. Students rated tailored English texts positively, while teachers requested tighter briefs and curricular alignment. The German pilot was engaging and largely usable, with minor image-consistency and cultural-detail issues. We conclude that for well-supported language pairs (in particular, English–Chinese), the workflow is close to classroom/self-study usability, while other language pairs need improved multi-word expression handling and glossing. All resources are reproducible on the open-source platform. We adopt an interdisciplinary stance which combines aspects taken from computer science, linguistics, and language education. Full article

19 pages, 8340 KB  
Article
Open-Vocabulary Multi-Object Tracking Based on Multi-Cue Fusion
by Liangfeng Xu, Jinqi Bai, Lin Nai and Chang Liu
Appl. Sci. 2025, 15(24), 13151; https://doi.org/10.3390/app152413151 - 15 Dec 2025
Viewed by 451
Abstract
Multi-object tracking (MOT) technology integrates multiple fields such as pattern recognition, machine learning, and object detection, demonstrating broad application potential in scenarios like low-altitude logistics delivery, urban security, autonomous driving, and intelligent navigation. However, in open-world scenarios, existing MOT methods often face challenges of imprecise target category identification and insufficient tracking accuracy, especially when dealing with numerous target types affected by occlusion and deformation. To address this, we propose a multi-object tracking strategy based on multi-cue fusion. This strategy combines appearance features and spatial feature information, employing BYTE and weighted Intersection over Union (IoU) modules to handle target association, thereby improving tracking accuracy. Furthermore, to tackle the challenge of large vocabularies in open-world scenarios, we introduce an open-vocabulary prompting strategy. By incorporating diverse sentence structures, emotional elements, and image quality descriptions, the expressiveness of text descriptions is enhanced. Combined with the CLIP model, this strategy significantly improves the recognition capability for novel category targets without requiring model retraining. Experimental results on the public TAO benchmark show that our method yields consistent TETA improvements over existing open-vocabulary trackers, with gains of 10% and 16% on base and novel categories, respectively. The results demonstrate that the proposed framework offers a more robust solution for open-vocabulary multi-object tracking in complex environments. Full article
(This article belongs to the Special Issue AI for Sustainability and Innovation—2nd Edition)
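
The multi-cue association idea — blending a spatial (IoU) term with an appearance-similarity term before solving the assignment — can be illustrated with a minimal sketch. The fusion weight, box format, and toy embeddings below are assumptions for illustration, not the paper's exact BYTE/weighted-IoU pipeline.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, dets, track_emb, det_emb, lam=0.6):
    """Fuse spatial (IoU) and appearance (cosine) cues into one assignment cost."""
    cos = track_emb @ det_emb.T  # appearance embeddings assumed L2-normalized
    cost = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            cost[i, j] = lam * (1 - iou(t, d)) + (1 - lam) * (1 - cos[i, j])
    return linear_sum_assignment(cost)  # matched (track_idx, det_idx) pairs

tracks = np.array([[10, 10, 50, 50], [60, 60, 90, 90]], dtype=float)
dets = np.array([[12, 11, 52, 49], [61, 58, 92, 91]], dtype=float)
emb = np.eye(2)  # toy appearance embeddings
print(associate(tracks, dets, emb, emb))
```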

28 pages, 583 KB  
Article
Multiple Large AI Models’ Consensus for Object Detection—A Survey
by Marcin Iwanowski and Marcin Gahbler
Appl. Sci. 2025, 15(24), 12961; https://doi.org/10.3390/app152412961 - 9 Dec 2025
Viewed by 1314
Abstract
The rapid development of large artificial intelligence (AI) models—large language models (LLMs), multimodal large language models (MLLMs), and vision–language models (VLMs)—has enabled instruction-driven visual understanding, where a single foundation model can recognize and localize arbitrary objects from natural-language prompts. However, predictions from individual models remain inconsistent—LLMs hallucinate nonexistent entities, while VLMs exhibit limited recall and unstable calibration compared to purpose-trained detectors. To address these limitations, a new paradigm termed "multiple large AI models' consensus" has emerged. In this approach, multiple heterogeneous LLMs, MLLMs, or VLMs process a shared visual–textual instruction and generate independent structured outputs (bounding boxes and categories). Their results are then merged through consensus mechanisms. This cooperative inference improves spatial accuracy and semantic correctness, making it particularly suitable for generating high-quality training datasets for fast real-time object detectors. This survey provides a comprehensive overview of multiple large AI models' consensus for object detection. We formalize the concept, review related literature on ensemble reasoning and multimodal perception, and categorize existing methods into four frameworks: prompt-level, reasoning-to-detection, box-level, and hybrid consensus. We further analyze fusion algorithms, evaluation metrics, and benchmark datasets, highlighting their strengths and limitations. Finally, we discuss open challenges—vocabulary alignment, uncertainty calibration, computational efficiency, and bias propagation—and identify emerging trends such as consensus-aware training, structured reasoning, and collaborative perception ecosystems. Full article
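
As a concrete illustration of box-level consensus (one of the four frameworks listed above), the sketch below keeps only boxes that a minimum number of models agree on and averages their coordinates. The IoU threshold, vote count, and greedy grouping are simplifying assumptions rather than any specific surveyed method.

```python
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def box_consensus(per_model_boxes, iou_thr=0.5, min_votes=2):
    """Greedy box-level consensus: a box survives if at least min_votes models
    propose an overlapping box; surviving groups are coordinate-averaged."""
    flat = [(m, np.asarray(b, float))
            for m, boxes in enumerate(per_model_boxes) for b in boxes]
    used, fused = set(), []
    for i, (m_i, b_i) in enumerate(flat):
        if i in used:
            continue
        group = [(m_i, b_i)]
        for j in range(i + 1, len(flat)):
            m_j, b_j = flat[j]
            if j not in used and m_j != m_i and iou(b_i, b_j) >= iou_thr:
                group.append((m_j, b_j)); used.add(j)
        if len({m for m, _ in group}) >= min_votes:
            fused.append(np.mean([b for _, b in group], axis=0))
    return fused

# Three hypothetical models voting on the same image; the lone box is rejected.
models = [[[10, 10, 50, 50]], [[12, 9, 51, 52], [200, 200, 220, 220]], [[11, 11, 49, 50]]]
print(box_consensus(models))
```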

11 pages, 771 KB  
Article
VisPower: Curriculum-Guided Multimodal Alignment for Fine-Grained Anomaly Perception in Power Systems
by Huaguang Yan, Zhenyu Chen, Jianguang Du, Yunfeng Yan and Shuai Zhao
Electronics 2025, 14(23), 4747; https://doi.org/10.3390/electronics14234747 - 2 Dec 2025
Cited by 1 | Viewed by 365
Abstract
Precise perception of subtle anomalies in power equipment—such as insulator cracks, conductor corrosion, or foreign intrusions—is vital for ensuring the reliability of smart grids. However, foundational vision-language models (VLMs) like CLIP exhibit poor domain transfer and fail to capture minute defect semantics. We propose VisPower, a curriculum-guided multimodal alignment framework that progressively enhances fine-grained perception through two training stages: (1) Semantic Grounding, leveraging 100 K long-caption pairs to establish a robust linguistic-visual foundation, and (2) Contrastive Refinement, using 24 K region-level and hard-negative samples to strengthen discrimination among visually similar anomalies. Trained on our curated PowerAnomalyVL dataset, VisPower achieves an 18.4% absolute gain in zero-shot retrieval accuracy and a 16.8% improvement in open-vocabulary defect detection (OV-DD) over strong CLIP baselines. These results demonstrate the effectiveness of curriculum-based multimodal alignment for high-stakes industrial anomaly perception. Full article
(This article belongs to the Section Industrial Electronics)
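
The "Contrastive Refinement" stage rests on a CLIP-style contrastive objective in which hard negatives compete with the true caption for each image region. A minimal sketch, assuming generic paired embeddings and an appended hard-negative text bank; the batch size, temperature, and loss form are illustrative, not VisPower's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_refinement_loss(img, txt, hard_neg_txt, temperature=0.07):
    """InfoNCE over paired image/text embeddings, with extra hard-negative
    captions appended to the text side to sharpen fine-grained discrimination.
    img: (B, D), txt: (B, D), hard_neg_txt: (N, D); all L2-normalized here."""
    img = F.normalize(img, dim=-1)
    all_txt = F.normalize(torch.cat([txt, hard_neg_txt], dim=0), dim=-1)  # (B + N, D)
    logits = img @ all_txt.T / temperature      # (B, B + N)
    targets = torch.arange(img.size(0))         # i-th image matches i-th caption
    return F.cross_entropy(logits, targets)

# Toy batch: 4 image/text pairs plus 2 hard-negative captions
B, N, D = 4, 2, 512
loss = contrastive_refinement_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(N, D))
print(float(loss))
```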

19 pages, 2418 KB  
Article
D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection
by Bintao He, Caixia Yan, Yan Kou, Yinghao Wang, Xin Lv, Haipeng Du and Yugui Xie
Appl. Sci. 2025, 15(23), 12723; https://doi.org/10.3390/app152312723 - 1 Dec 2025
Viewed by 366
Abstract
Continual learning for open-vocabulary object detection aims to enable pretrained vision–language detectors to adapt to diverse specialized domains while preserving their zero-shot generalization capabilities. However, existing methods primarily focus on mitigating catastrophic forgetting, often neglecting the substantial domain shifts commonly encountered in real-world applications. To address this critical oversight, we pioneer Open-Domain Continual Object Detection (OD-COD), a new paradigm that requires detectors to continually adapt across domains with significant stylistic gaps. We propose Disentangled Domain Knowledge-Aided Learning (D-Know) to tackle this challenge. This framework explicitly disentangles domain-general priors from category-specific adaptation, managing them dynamically in a scalable domain knowledge base. Specifically, D-Know first learns domain priors in a self-supervised manner and then leverages these priors to facilitate category-specific adaptation within each domain. To rigorously evaluate this task, we construct OD-CODB, the first dedicated benchmark spanning six domains with substantial visual variations. Extensive experiments demonstrate that D-Know achieves superior performance, surpassing current state-of-the-art methods by an average of 4.2% mAP under open-domain continual settings while maintaining strong zero-shot generalization. Furthermore, experiments under the few-shot setting confirm D-Know’s superior data efficiency. Full article

29 pages, 166274 KB  
Article
Bridging Vision Foundation and Vision–Language Models for Open-Vocabulary Semantic Segmentation of UAV Imagery
by Fan Li, Zhaoxiang Zhang, Xuanbin Wang, Xuan Wang and Yuelei Xu
Remote Sens. 2025, 17(22), 3704; https://doi.org/10.3390/rs17223704 - 13 Nov 2025
Viewed by 1140
Abstract
Open-vocabulary semantic segmentation (OVSS) is of critical importance for unmanned aerial vehicle (UAV) imagery, as UAV scenes are highly dynamic and characterized by diverse, unpredictable object categories. Current OVSS approaches mainly rely on the zero-shot capabilities of vision–language models (VLMs), but their image-level pretraining objectives yield ambiguous spatial relationships and coarse-grained feature representations, resulting in suboptimal performance in UAV scenes. In this work, we propose a novel hybrid framework for OVSS in UAV imagery, named HOSU, which leverages the priors of vision foundation models to unleash the potential of vision–language models in representing complex spatial distributions and capturing fine-grained small-object details in UAV scenes. Specifically, we propose a distribution-aware fine-tuning method that aligns CLIP with DINOv2 across intra- and inter-region feature distributions, enhancing the capacity of CLIP to model complex scene semantics and capture fine-grained details critical for UAV imagery. Meanwhile, we propose a text-guided multi-level regularization mechanism that leverages the text embeddings of CLIP to impose semantic constraints on the visual features, preventing their drift from the original semantic space during fine-tuning and ensuring stable vision–language correspondence. Finally, to address the pervasive occlusion in UAV imagery, we propose a mask-based feature consistency strategy that enables the model to learn stable representations, remaining robust against viewpoint-induced occlusions. Extensive experiments across four training settings on six UAV datasets demonstrate that our approach consistently achieves state-of-the-art performance compared with previous methods, while comprehensive ablation studies and analyses further validate its effectiveness. Full article
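
A toy version of the two ideas named here — pulling fine-tuned CLIP features toward DINOv2 priors while a text-guided term keeps them anchored to the original semantic space — is sketched below at the per-patch level. The actual method operates on intra- and inter-region feature distributions; the per-patch losses, shapes, and weighting are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def hybrid_finetune_loss(clip_patches, dino_patches, text_embeds, frozen_clip_patches, beta=0.1):
    """Illustrative objective: (1) cosine alignment of fine-tuned CLIP patch features
    to frozen DINOv2 patch features, and (2) a text-guided regularizer keeping the
    patch-to-text similarity profile close to that of the original (frozen) CLIP."""
    clip_p = F.normalize(clip_patches, dim=-1)        # (P, D) fine-tuned visual patches
    dino_p = F.normalize(dino_patches, dim=-1)        # (P, D) DINOv2 priors projected to D
    align = (1 - (clip_p * dino_p).sum(-1)).mean()

    old_p = F.normalize(frozen_clip_patches, dim=-1)  # (P, D) pre-fine-tuning CLIP patches
    reg = F.mse_loss(clip_p @ text_embeds.T, old_p @ text_embeds.T)
    return align + beta * reg

P, D, C = 196, 512, 16   # patches, embedding dim, text prompts
loss = hybrid_finetune_loss(torch.randn(P, D), torch.randn(P, D),
                            F.normalize(torch.randn(C, D), dim=-1), torch.randn(P, D))
print(float(loss))
```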

24 pages, 4364 KB  
Article
Determining the Optimal T-Value for the Temperature Scaling Calibration Method Using the Open-Vocabulary Detection Model YOLO-World
by Max Andreas Ingrisch, Rani Marcel Schilling, Ingo Chmielewski and Stefan Twieg
Appl. Sci. 2025, 15(22), 12062; https://doi.org/10.3390/app152212062 - 13 Nov 2025
Cited by 1 | Viewed by 1258
Abstract
Object detection is an important tool in many areas, such as robotics and autonomous driving, where a wide variety of object classes must be detected or interacted with. Models from the field of Open-Vocabulary Detection (OVD) provide a solution here, as they can detect not only base classes but also novel object classes, i.e., classes that were not seen during training. However, one problem with OVD models is their poor calibration, meaning that their predictions are often over- or under-confident. To improve calibration, Temperature Scaling is used in this study. Using YOLO-World, one of the best-performing OVD models, we aim to determine the optimal T-value for this calibration method. To this end, we investigate whether there is a correlation between the logit distribution and the optimal T-value, and how this correlation can be modeled. Finally, the influence of Temperature Scaling on the Expected Calibration Error (ECE) and the mean Average Precision (mAP) is analyzed. The results of this study show that similar logit distributions across different datasets result in the same optimal T-values. This correlation could be modeled best using Kernel Ridge Regression (KRR) and a Support Vector Machine (SVM). In all cases, the ECE could be improved by Temperature Scaling without significantly reducing the mAP. Full article
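
Temperature Scaling itself is a one-parameter transform of the logits, and the ECE measures the confidence–accuracy gap it is meant to close. A minimal sketch of both follows; the toy logits, bin count, and per-sample classification view are assumptions, and the study's exact handling of YOLO-World's detection logits may differ.

```python
import numpy as np

def temperature_scale(logits, T):
    """Softmax with temperature: T > 1 softens (less confident), T < 1 sharpens."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted gap between accuracy and mean confidence per confidence bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy detections: 3-class logits, ground-truth labels, two candidate T-values
logits = np.array([[4.0, 1.0, 0.5], [3.0, 2.8, 0.1], [0.2, 0.1, 2.5]])
labels = np.array([0, 1, 2])
for T in (1.0, 2.0):
    probs = temperature_scale(logits, T)
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    print(T, expected_calibration_error(conf, pred == labels))
```

Because dividing logits by T never changes the argmax, calibration of this form leaves the class ranking untouched, which is consistent with the reported finding that mAP is essentially preserved.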

16 pages, 5440 KB  
Article
Pov9D: Point Cloud-Based Open-Vocabulary 9D Object Pose Estimation
by Tianfu Wang and Hongguang Wang
J. Imaging 2025, 11(11), 380; https://doi.org/10.3390/jimaging11110380 - 28 Oct 2025
Viewed by 786
Abstract
We propose a point cloud-based framework for open-vocabulary object pose estimation, called Pov9D. Existing approaches are predominantly RGB-based and often rely on texture or appearance cues, making them susceptible to pose ambiguities when objects are textureless or lack distinctive visual features. In contrast, Pov9D takes 3D point clouds as input, enabling direct access to geometric structures that are essential for accurate and robust pose estimation, especially in open-vocabulary settings. To bridge the gap between geometric observations and semantic understanding, Pov9D integrates category-level textual descriptions to guide the estimation process. To this end, we introduce a text-conditioned shape prior generator that predicts a normalized object shape from both the observed point cloud and the textual category description. This shape prior provides a consistent geometric reference, facilitating precise prediction of object translation, rotation, and size, even for unseen categories. Extensive experiments on the OO3D-9D benchmark demonstrate that Pov9D achieves state-of-the-art performance, improving Abs IoU@50 by 7.2% and Rel 10° 10 cm by 27.2% over OV9D. Full article
(This article belongs to the Special Issue 3D Image Processing: Progress and Challenges)

20 pages, 644 KB  
Systematic Review
Augmented Reality in English Language Acquisition Among Gifted Learners: A Systematic Scoping Review (2020–2025)
by Nerea Oto-Millera, Silvia Pellicer-Ortín and Juan Carlos Bustamante
Appl. Sci. 2025, 15(21), 11487; https://doi.org/10.3390/app152111487 - 28 Oct 2025
Viewed by 1388
Abstract
Gifted students often display advanced verbal abilities that facilitate second language acquisition; however, when instruction is insufficiently stimulating, they may experience boredom and demotivation. Due to rising interest in immersive technologies such as augmented reality (AR) and limited evidence of their impact on gifted language learners, a systematic scoping review was necessary to synthesise existing research and identify gaps. It examined the impact of AR on both linguistic development and motivational outcomes among gifted learners in ESL/EFL contexts. It was preregistered in the Open Science Framework (OSF) and conducted according to PRISMA-ScR guidelines. Eligible studies included gifted learners in ESL/EFL contexts, published between 2020 and 2025 in English, Spanish, French, or Italian. Exclusion criteria comprised non–peer-reviewed papers and studies unrelated to AR. Searches were conducted in Scopus, Web of Science, ERIC, and Redalyc. A total of 34 studies were included. Findings indicate that AR interventions improve vocabulary, listening, pronunciation, and fluency; writing also benefits, although grammar remains challenging. AR enhances intrinsic motivation, reduces anxiety, and fosters engagement, especially in younger learners. The results suggest that AR can be a valuable tool in EFL/ESL classrooms to support both linguistic development and motivation among gifted students, though sustainable implementation requires overcoming technological and pedagogical barriers. Full article
(This article belongs to the Special Issue ICT in Education, 2nd Edition)

24 pages, 4764 KB  
Article
Mask-Guided Teacher–Student Learning for Open-Vocabulary Object Detection in Remote Sensing Images
by Shuojie Wang, Yu Song, Jiajun Xiang, Yanyan Chen, Ping Zhong and Ruigang Fu
Remote Sens. 2025, 17(19), 3385; https://doi.org/10.3390/rs17193385 - 9 Oct 2025
Viewed by 1502
Abstract
Open-vocabulary object detection in remote sensing aims to detect novel categories not seen during training, which is crucial for practical aerial image analysis applications. While some approaches accomplish this task through large-scale data construction, such methods incur substantial annotation and computational costs. In contrast, we focus on the efficient utilization of limited datasets. However, existing methods such as CastDet struggle with inefficient data utilization and class imbalance in pseudo-label generation for novel categories. We propose an enhanced open-vocabulary detection framework that addresses these limitations through two key innovations. First, we introduce a selective masking strategy that enables direct utilization of partially annotated images by masking base-category regions in the teacher model's inputs. This approach eliminates the need for strict data separation and significantly improves data efficiency. Second, we develop a dynamic frequency-based class weighting scheme that automatically adjusts category weights based on real-time pseudo-label statistics to mitigate class imbalance. Our approach integrates these components into a student–teacher learning framework with RemoteCLIP for novel category classification. Comprehensive experiments demonstrate significant improvements on two benchmark datasets: on VisDroneZSD, we achieve 42.7% overall mAP and a 41.4% harmonic mean, substantially outperforming existing methods; on the DIOR dataset, our method achieves 63.7% overall mAP with a 49.5% harmonic mean. Our framework achieves a more balanced performance between base and novel categories, providing a practical and data-efficient solution for open-vocabulary aerial object detection. Full article
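
Both innovations lend themselves to compact sketches: masking base-category boxes in the teacher's input, and deriving inverse-frequency class weights from running pseudo-label counts. The function names, smoothing constant, and example categories below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from collections import Counter

def mask_base_regions(image, base_boxes):
    """Selective masking: zero out annotated base-category regions in the image fed
    to the teacher, so its pseudo-labels concentrate on unannotated (novel) content."""
    out = image.copy()
    for x1, y1, x2, y2 in base_boxes:
        out[y1:y2, x1:x2] = 0
    return out

def dynamic_class_weights(pseudo_label_counts, smoothing=1.0):
    """Inverse-frequency weights from running pseudo-label statistics, normalized so
    the average weight is 1; rare novel classes are up-weighted automatically."""
    total = sum(pseudo_label_counts.values())
    raw = {c: total / (n + smoothing) for c, n in pseudo_label_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: round(w / mean, 2) for c, w in raw.items()}

img = np.ones((256, 256, 3), dtype=np.uint8)
teacher_input = mask_base_regions(img, [(20, 20, 120, 120)])
counts = Counter({"airplane": 900, "vehicle": 700, "windmill": 40, "helipad": 12})
print(teacher_input.sum(), dynamic_class_weights(counts))
```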

27 pages, 1793 KB  
Article
Parental Language Mixing in Montreal: Rates, Predictors, and Relation to Infants’ Vocabulary Size
by Alexandra Paquette and Krista Byers-Heinlein
Behav. Sci. 2025, 15(10), 1371; https://doi.org/10.3390/bs15101371 - 8 Oct 2025
Viewed by 1019
Abstract
Language mixing is a common feature of bilingual communication, yet its predictors and effects on children’s vocabulary development remain debated. Most research has been conducted in contexts with clear societal and heritage languages, leaving open questions about language mixing in environments with two societal languages. Montreal provides a unique opportunity to examine this question, as both French and English hold societal status, while many families also maintain heritage languages. Using archival data from 398 bilingual children (7–34 months), we looked at French-English bilinguals (representing societal bilingualism) and heritage-language bilinguals within the same sociolinguistic environment. We assessed the prevalence, predictors, and motivations of parental language mixing and its relationship with vocabulary development. Results revealed that mixing was less frequent among French-English bilinguals compared to heritage-language bilinguals in the same city. The direction of mixing differed between groups: French-English bilinguals mixed based on language dominance, while heritage-language bilinguals mixed based on societal language status. Primary motivations included uncertainty about word meanings, lack of suitable translations, and teaching new words. Mixing showed minimal associations with vocabulary size across participants. These findings suggest that parental mixing practices reflect adaptive strategies that vary by sociolinguistic context rather than detrimental influences on early language acquisition. Full article
(This article belongs to the Special Issue Language and Cognitive Development in Bilingual Children)

20 pages, 3847 KB  
Article
Augmented Reality’s Impact on English Vocabulary and Content Acquisition in the CLIL Classroom
by Mar Fernandez-Alcocer and Jose Belda-Medina
Appl. Sci. 2025, 15(19), 10380; https://doi.org/10.3390/app151910380 - 24 Sep 2025
Viewed by 1301
Abstract
This study interrogates whether Augmented Reality (AR) enhances vocabulary and content acquisition within Content and Language Integrated Learning (CLIL), situating the question in the broader debate on how immersive, multimodal technologies shape achievement and engagement. This study’s novelty lies in its direct AR-versus-print comparison in a real CLIL classroom using markerless, smartphone-based technology. Using a mixed-methods, classroom-based experiment, we drew on a convenience sample of 129 secondary students (ages 16–18), assigning them to an AR intervention (n = 64) or a print-based control (n = 65). Both cohorts received parallel instruction covering identical objectives and materials; vocabulary attainment was gauged using matched pretest and post-test measures, while engagement, attitudes, and perceived usefulness were captured through paired pre- and post-surveys and open-ended prompts. Quantitative analyses compared change scores across conditions and were complemented by qualitative summaries of learner comments. Results indicate that exposure to AR exerted a positive influence on learners’ engagement and supported learning processes, with perceptible shifts in students’ views of AR between baseline and post-intervention; nevertheless, effects were heterogeneous across instruments, items, and subgroups, suggesting that benefits accrued in a targeted rather than uniform fashion. Compared to the print-based group, students using AR demonstrated greater gains on visually supported vocabulary and content items, while other items showed no significant differences between groups. We conclude that AR constitutes a promising pedagogical resource for CLIL, capable of scaffolding vocabulary/content development and motivating participation, while the observed variability underscores the need for principled, context-sensitive integration. Future work should specify boundary conditions—such as task type, prior proficiency, cognitive load, and technology familiarity—and employ robust mixed-methods designs to determine for whom, and under which instructional circumstances, AR yields the greatest and most sustainable gains. Full article

21 pages, 3747 KB  
Article
Open-Vocabulary Crack Object Detection Through Attribute-Guided Similarity Probing
by Hyemin Yoon and Sangjin Kim
Appl. Sci. 2025, 15(19), 10350; https://doi.org/10.3390/app151910350 - 24 Sep 2025
Viewed by 1857
Abstract
Timely detection of road surface defects such as cracks and potholes is critical for ensuring traffic safety and reducing infrastructure maintenance costs. While recent advances in image-based deep learning techniques have shown promise for automated road defect detection, existing models remain limited to closed-set detection settings, making it difficult to recognize newly emerging or fine-grained defect types. To address this limitation, we propose an attribute-aware open-vocabulary crack detection (AOVCD) framework, which leverages the alignment capability of pretrained vision–language models to generalize beyond fixed class labels. In this framework, crack types are represented as combinations of visual attributes, enabling semantic grounding between image regions and natural language descriptions. To support this, we extend the existing PPDD dataset with attribute-level annotations and incorporate a multi-label attribute recognition task as an auxiliary objective. Experimental results demonstrate that the proposed AOVCD model outperforms existing baselines. In particular, compared to CLIP-based zero-shot inference, the proposed model achieves approximately a 10-fold improvement in average precision (AP) for novel crack categories. Attribute classification performance—covering geometric, spatial, and textural features—also increases by 40% in balanced accuracy (BACC) and 23% in AP. These results indicate that integrating structured attribute information enhances generalization to previously unseen defect types, especially those involving subtle visual cues. Our study suggests that incorporating attribute-level alignment within a vision–language framework can lead to more adaptive and semantically grounded defect recognition systems. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
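
The attribute-guided probing described here can be sketched as multi-label scoring of a region embedding against attribute text embeddings, with crack types represented as attribute combinations. The attribute vocabulary, type definitions, temperature, and aggregation rule below are hypothetical illustrations, not the AOVCD framework itself.

```python
import torch
import torch.nn.functional as F

# Hypothetical attribute vocabulary and crack types defined as attribute combinations
attributes = ["linear", "branching", "mesh-like", "wide", "hairline", "surrounded by spalling"]
crack_types = {
    "longitudinal crack": ["linear", "hairline"],
    "alligator crack": ["mesh-like", "branching"],
    "pothole": ["wide", "surrounded by spalling"],
}

def score_region(region_emb, attr_embs, temperature=0.07):
    """Multi-label attribute probing: cosine similarity between one region embedding
    and each attribute text embedding, squashed to [0, 1] scores, then aggregated
    into a score per crack type as the mean of its defining attributes."""
    sims = F.normalize(region_emb, dim=-1) @ F.normalize(attr_embs, dim=-1).T
    attr_scores = torch.sigmoid(sims / temperature).squeeze(0)
    return {name: float(torch.stack([attr_scores[attributes.index(a)] for a in req]).mean())
            for name, req in crack_types.items()}

D = 512  # toy region and attribute embeddings stand in for a vision-language model
print(score_region(torch.randn(1, D), torch.randn(len(attributes), D)))
```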

16 pages, 881 KB  
Article
Text-Guided Spatio-Temporal 2D and 3D Data Fusion for Multi-Object Tracking with RegionCLIP
by Youlin Liu, Zainal Rasyid Mahayuddin and Mohammad Faidzul Nasrudin
Appl. Sci. 2025, 15(18), 10112; https://doi.org/10.3390/app151810112 - 16 Sep 2025
Viewed by 1316
Abstract
3D Multi-Object Tracking (3D MOT) is a critical task in autonomous systems, where accurate and robust tracking of multiple objects in dynamic environments is essential. Traditional approaches primarily rely on visual or geometric features, often neglecting the rich semantic information available in textual modalities. In this paper, we propose Text-Guided 3D Multi-Object Tracking (TG3MOT), a novel framework that incorporates Vision-Language Models (VLMs) into the YONTD architecture to improve 3D MOT performance. Our framework leverages RegionCLIP, a multimodal open-vocabulary detector, to achieve fine-grained alignment between image regions and textual concepts, enabling the incorporation of semantic information into the tracking process. To address challenges such as occlusion, blurring, and ambiguous object appearances, we introduce the Target Semantic Matching Module (TSM), which quantifies the uncertainty of semantic alignment and filters out unreliable regions. Additionally, we propose the 3D Feature Exponential Moving Average Module (3D F-EMA) to incorporate temporal information, improving robustness in noisy or occluded scenarios. Furthermore, the Gaussian Confidence Fusion Module (GCF) is introduced to weight historical trajectory confidences based on temporal proximity, enhancing the accuracy of trajectory management. We evaluate our framework on the KITTI dataset and compare it with the YONTD baseline. Extensive experiments demonstrate that although the overall HOTA gain of TG3MOT is modest (+0.64%), our method achieves substantial improvements in association accuracy (+0.83%) and significantly reduces ID switches (−16.7%). These improvements are particularly valuable in real-world autonomous driving scenarios, where maintaining consistent trajectories under occlusion and ambiguous appearances is crucial for downstream tasks such as trajectory prediction and motion planning. The code will be made publicly available. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
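
Two of the modules named here have simple numerical cores: an exponential moving average over a trajectory's features (3D F-EMA) and a Gaussian weighting of historical confidences by temporal distance (GCF). The forms and parameter values below are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def feature_ema(prev_feat, new_feat, alpha=0.9):
    """Feature EMA: smooth a track's feature over time so a single noisy or
    occluded frame cannot overwrite its identity."""
    return alpha * prev_feat + (1 - alpha) * new_feat

def gaussian_confidence_fusion(confidences, frame_ids, current_frame, sigma=3.0):
    """Weight a trajectory's historical detection confidences with a Gaussian over
    temporal distance, so recent frames dominate but history stabilizes the score."""
    conf = np.asarray(confidences, dtype=float)
    dt = current_frame - np.asarray(frame_ids, dtype=float)
    w = np.exp(-dt ** 2 / (2 * sigma ** 2))
    return float((w * conf).sum() / w.sum())

feat = np.zeros(128)
for frame_feat in np.random.randn(5, 128):   # five noisy per-frame features
    feat = feature_ema(feat, frame_feat)
print(gaussian_confidence_fusion([0.9, 0.85, 0.4, 0.88], [10, 11, 12, 13], current_frame=13))
```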
