Search Results (9)

Search Parameters:
Authors = Mohamad Mahmoud Al Rahhal; ORCID = 0000-0002-8105-9746

18 pages, 3629 KiB  
Article
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
by Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci and Farid Melgani
Remote Sens. 2024, 16(9), 1477; https://doi.org/10.3390/rs16091477 - 23 Apr 2024
Cited by 34 | Viewed by 9944
Abstract
In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking potential with a focus on image captioning and visual question answering (VQA). In particular, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), specifically adapted for RS imagery through a low-rank adaptation approach. To evaluate the model performance, we create the RS-instructions dataset, a comprehensive benchmark dataset that integrates four diverse single-task datasets related to captioning and VQA. The experimental results confirm the model’s effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis. Full article
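As a rough illustration of the low-rank adaptation step mentioned in the abstract — a minimal sketch, not the authors' RS-LLaVA code — the following PyTorch snippet wraps a frozen linear layer with a trainable low-rank update; the layer size, rank, and scaling factor are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen pre-trained linear layer and adds a trainable low-rank update (LoRA).
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # the pre-trained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # start as a no-op so training begins from the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096))        # e.g., one projection inside a language-model block (hypothetical size)
out = layer(torch.randn(1, 4096))                # only lora_a and lora_b receive gradients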

16 pages, 3148 KiB  
Article
Vision-Language Models for Zero-Shot Classification of Remote Sensing Images
by Mohamad Mahmoud Al Rahhal, Yakoub Bazi, Hebah Elgibreen and Mansour Zuair
Appl. Sci. 2023, 13(22), 12462; https://doi.org/10.3390/app132212462 - 17 Nov 2023
Cited by 11 | Viewed by 4278
Abstract
Zero-shot classification is challenging because it requires a model to categorize images belonging to classes it has not encountered during training. Previous research in remote sensing (RS) has explored this task by training image-based models on known RS classes and then attempting to predict unfamiliar classes, but the outcomes have proven less than satisfactory. In this paper, we propose an alternative approach that leverages vision-language models (VLMs), which are pre-trained to capture the associations between general computer vision image-text pairs across diverse datasets. Specifically, our investigation focuses on thirteen VLMs derived from Contrastive Language-Image Pre-Training (CLIP/Open-CLIP) with varying levels of parameter complexity. In our experiments, we identify the most suitable prompt for querying the language capabilities of the VLM with RS images. Furthermore, we demonstrate that zero-shot classification, particularly with large CLIP models, achieves higher accuracy than existing RS solutions on three widely recognized RS scene datasets. Full article
(This article belongs to the Special Issue UAV Applications in Environmental Monitoring)
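For readers unfamiliar with the zero-shot protocol the abstract refers to, here is a minimal sketch using the open_clip library; the model variant, prompt template, class names, and image file are placeholder assumptions, not the paper's exact configuration.

import torch
import open_clip
from PIL import Image

# Hypothetical RS scene classes and prompt template (the paper searches for a suitable prompt).
classes = ["airport", "forest", "harbor", "residential area"]
template = "a satellite image of a {}."

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("scene.jpg")).unsqueeze(0)       # any RS scene image (placeholder path)
text = tokenizer([template.format(c) for c in classes])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(classes[probs.argmax().item()])                           # predicted class, with no RS-specific training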

19 pages, 5459 KiB  
Article
A CNN Approach for Emotion Recognition via EEG
by Aseel Mahmoud, Khalid Amin, Mohamad Mahmoud Al Rahhal, Wail S. Elkilani, Mohamed Lamine Mekhalfi and Mina Ibrahim
Symmetry 2023, 15(10), 1822; https://doi.org/10.3390/sym15101822 - 25 Sep 2023
Cited by 20 | Viewed by 4350
Abstract
Emotion recognition via electroencephalography (EEG) has been gaining increasing attention in applications such as human–computer interaction, mental health assessment, and affective computing. However, it poses several challenges, primarily stemming from the complex and noisy nature of EEG signals. Commonly adopted strategies involve feature extraction and machine learning techniques, which often struggle to capture intricate emotional nuances and may require extensive handcrafted feature engineering. To address these limitations, we propose a novel approach utilizing convolutional neural networks (CNNs) for EEG emotion recognition. Unlike traditional methods, our CNN-based approach learns discriminative cues directly from raw EEG signals, bypassing the need for intricate feature engineering. This approach not only simplifies the preprocessing pipeline but also allows for the extraction of more informative features. We achieve state-of-the-art performance on benchmark emotion datasets, namely DEAP and SEED datasets, showcasing the superiority of our approach in capturing subtle emotional cues. In particular, accuracies of 96.32% and 92.54% were achieved on SEED and DEAP datasets, respectively. Further, our pipeline is robust against noise and artefact interference, enhancing its applicability in real-world scenarios. Full article
(This article belongs to the Special Issue Symmetry in Mechanical and Biomedical Mechanical Engineering II)
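A minimal sketch of the general idea of classifying raw EEG windows with a small 1D CNN; the channel count, window length, number of classes, and layer sizes below are illustrative assumptions rather than the paper's architecture.

import torch
import torch.nn as nn

class EEGEmotionCNN(nn.Module):
    # Learns discriminative cues directly from raw EEG windows (channels x time samples).
    def __init__(self, n_channels=32, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3), nn.BatchNorm1d(64), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):                                # x: (batch, channels, samples)
        return self.classifier(self.features(x).squeeze(-1))

logits = EEGEmotionCNN()(torch.randn(8, 32, 512))        # 8 raw EEG windows -> (8, 4) class scores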

16 pages, 2575 KiB  
Article
CapERA: Captioning Events in Aerial Videos
by Laila Bashmal, Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Mansour Zuair and Farid Melgani
Remote Sens. 2023, 15(8), 2139; https://doi.org/10.3390/rs15082139 - 18 Apr 2023
Cited by 6 | Viewed by 2599
Abstract
In this paper, we introduce the CapERA dataset, which upgrades the Event Recognition in Aerial Videos (ERA) dataset to aerial video captioning. The newly proposed dataset aims to advance visual–language-understanding tasks for UAV videos by providing each video with diverse textual descriptions. To build the dataset, 2864 aerial videos are manually annotated with a caption that includes information such as the main event, object, place, action, numbers, and time. More captions are automatically generated from the manual annotation to capture, as much as possible, the variation in describing the same video. Furthermore, we propose a captioning model for the CapERA dataset to provide benchmark results for UAV video captioning. The proposed model is based on the encoder–decoder paradigm with two configurations to encode the video. The first configuration encodes the video frames independently with an image encoder; a temporal attention module is then added on top to capture the temporal dynamics between the features derived from the video frames. In the second configuration, we directly encode the input video using a video encoder that employs factorized space–time attention to capture the dependencies within and between the frames. For generating captions, a language decoder autoregressively produces the captions from the visual tokens. The experimental results under different evaluation criteria show the challenges of generating captions from aerial videos. We expect that the introduction of CapERA will open interesting new research avenues for integrating natural language processing (NLP) with UAV video understanding. Full article
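To illustrate the first encoding configuration described in the abstract, here is a minimal PyTorch sketch of a temporal attention module that pools per-frame features into a single video representation; the dimensions and module design are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    # Scores each frame feature and takes an attention-weighted sum over time.
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats):                   # (batch, n_frames, dim)
        weights = self.score(frame_feats).softmax(dim=1)
        return (weights * frame_feats).sum(dim=1)     # (batch, dim) video-level representation

frame_feats = torch.randn(2, 16, 768)                 # 16 per-frame features from an image encoder (placeholder)
video_token = TemporalAttentionPool()(frame_feats)    # would be fed to a language decoder for captioning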

17 pages, 4030 KiB  
Article
Vision–Language Model for Visual Question Answering in Medical Imagery
by Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Laila Bashmal and Mansour Zuair
Bioengineering 2023, 10(3), 380; https://doi.org/10.3390/bioengineering10030380 - 20 Mar 2023
Cited by 44 | Viewed by 9818
Abstract
In the clinical and healthcare domains, medical images play a critical role. A mature medical visual question answering (VQA) system can improve diagnosis by answering clinical questions presented with a medical image. Despite its enormous potential in the healthcare industry and services, this technology is still in its infancy and is far from practical use. This paper introduces an approach based on a transformer encoder–decoder architecture. Specifically, we extract image features using the vision transformer (ViT) model, and we embed the question using a textual encoder transformer. Then, we concatenate the resulting visual and textual representations and feed them into a multi-modal decoder to generate the answer in an autoregressive way. In the experiments, we validate the proposed model on two medical VQA datasets, VQA-RAD and PathVQA. The model shows promising results compared to existing solutions, yielding closed and open accuracies of 84.99% and 72.97%, respectively, for VQA-RAD, and 83.86% and 62.37%, respectively, for PathVQA. Other metrics, such as the BLEU score, which measures the alignment between the predicted and true answer sentences, are also reported. Full article
(This article belongs to the Section Biosignal Processing)
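A minimal sketch of the fusion step described in the abstract: ViT image tokens and question tokens are concatenated into a joint memory that a transformer decoder attends to while generating the answer. All shapes are placeholders, and the causal masking needed for true autoregressive decoding is omitted for brevity.

import torch
import torch.nn as nn

dim = 768
img_tokens = torch.randn(1, 197, dim)          # visual tokens from a ViT encoder (placeholder values)
txt_tokens = torch.randn(1, 20, dim)           # embedded question tokens from a text encoder
memory = torch.cat([img_tokens, txt_tokens], dim=1)    # joint multi-modal memory

decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

answer_so_far = torch.randn(1, 5, dim)         # embeddings of already-generated answer tokens
hidden = decoder(answer_so_far, memory)        # (1, 5, dim); a linear head would map this to the vocabulary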

17 pages, 3761 KiB  
Article
Contrasting EfficientNet, ViT, and gMLP for COVID-19 Detection in Ultrasound Imagery
by Mohamad Mahmoud Al Rahhal, Yakoub Bazi, Rami M. Jomaa, Mansour Zuair and Farid Melgani
J. Pers. Med. 2022, 12(10), 1707; https://doi.org/10.3390/jpm12101707 - 12 Oct 2022
Cited by 17 | Viewed by 3296
Abstract
A timely diagnosis of coronavirus is critical in order to control the spread of the virus. To aid in this, we propose in this paper a deep learning-based approach for detecting coronavirus patients using ultrasound imagery. We propose to exploit the transfer learning of an EfficientNet model pre-trained on the ImageNet dataset for the classification of ultrasound images of suspected patients. In particular, we contrast the results of EfficientNet-B2 with the results of ViT and gMLP. Then, we show the results of the three models when trained from scratch, i.e., without transfer learning. We view the detection problem from a multiclass classification perspective by classifying images as COVID-19, pneumonia, or normal. In the experiments, we evaluated the models on a publicly available ultrasound dataset. This dataset consists of 261 recordings (202 videos + 59 images) belonging to 216 distinct patients. The best results were obtained using EfficientNet-B2 with transfer learning. In particular, we obtained precision, recall, and F1 scores of 95.84%, 99.88%, and 97.41%, respectively, for detecting the COVID-19 class. EfficientNet-B2 with transfer learning presented an overall accuracy of 96.79%, outperforming gMLP and ViT, which achieved accuracies of 93.03% and 92.82%, respectively. Full article
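The transfer-learning setup can be approximated with the standard torchvision recipe shown below; this is a sketch under the assumption of a three-class head (COVID-19, pneumonia, normal), not the authors' training code.

import torch.nn as nn
from torchvision import models

# ImageNet-pretrained EfficientNet-B2 with its classification head replaced by a 3-class layer.
model = models.efficientnet_b2(weights=models.EfficientNet_B2_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)

# Optionally freeze the backbone and fine-tune only the new head first.
for p in model.features.parameters():
    p.requires_grad = False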

17 pages, 3259 KiB  
Article
COVID-19 Detection in CT/X-ray Imagery Using Vision Transformers
by Mohamad Mahmoud Al Rahhal, Yakoub Bazi, Rami M. Jomaa, Ahmad AlShibli, Naif Alajlan, Mohamed Lamine Mekhalfi and Farid Melgani
J. Pers. Med. 2022, 12(2), 310; https://doi.org/10.3390/jpm12020310 - 18 Feb 2022
Cited by 39 | Viewed by 5466
Abstract
The steady spread of the 2019 Coronavirus disease has brought about human and economic losses, imposing a new lifestyle across the world. On this point, medical imaging tests such as computed tomography (CT) and X-ray have demonstrated a sound screening potential. Deep learning methodologies have evidenced superior image analysis capabilities with respect to prior handcrafted counterparts. In this paper, we propose a novel deep learning framework for Coronavirus detection using CT and X-ray images. In particular, a Vision Transformer architecture is adopted as a backbone in the proposed network, in which a Siamese encoder is utilized. The latter is composed of two branches: one for processing the original image and another for processing an augmented view of the original image. The input images are divided into patches and fed through the encoder. The proposed framework is evaluated on public CT and X-ray datasets. The proposed system confirms its superiority over state-of-the-art methods on CT and X-ray data in terms of accuracy, precision, recall, specificity, and F1 score. Furthermore, the proposed system also exhibits good robustness when a small portion of training data is allocated. Full article
(This article belongs to the Special Issue Recent Advances on Coronavirus Disease 2019 (COVID-19))
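A conceptual sketch of a Siamese ViT encoder in PyTorch, where one shared backbone embeds the original image and its augmented view before a small classification head; the backbone choice, feature fusion, and head are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
from torchvision import models

backbone = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
backbone.heads = nn.Identity()                       # keep the 768-d class-token embedding
classifier = nn.Linear(2 * 768, 2)                   # e.g., COVID-19 vs. non-COVID (hypothetical head)

original = torch.randn(4, 3, 224, 224)
augmented = torch.randn(4, 3, 224, 224)              # stands in for an augmented view of the same images
features = torch.cat([backbone(original), backbone(augmented)], dim=1)   # shared weights = Siamese branches
logits = classifier(features)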

17 pages, 39319 KiB  
Article
UAV Image Multi-Labeling with Data-Efficient Transformers
by Laila Bashmal, Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Haikel Alhichri and Naif Al Ajlan
Appl. Sci. 2021, 11(9), 3974; https://doi.org/10.3390/app11093974 - 27 Apr 2021
Cited by 19 | Viewed by 3678
Abstract
In this paper, we present an approach for the multi-label classification of remote sensing images based on data-efficient transformers. During the training phase, we generated a second view for each image from the training set using data augmentation. Then, both the image and its augmented version were reshaped into sequences of flattened patches and fed to the transformer encoder. The latter extracts a compact feature representation from each image with the help of a self-attention mechanism, which can handle the global dependencies between different regions of the high-resolution aerial image. On top of the encoder, we mounted two classifiers: a token classifier and a distiller classifier. During training, we minimized a global loss consisting of two terms, each corresponding to one of the two classifiers. In the test phase, we took the average of the two classifiers' outputs as the final class labels. Experiments on two datasets acquired over the cities of Trento and Civezzano with a ground resolution of two centimeters demonstrated the effectiveness of the proposed model. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
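The token/distiller two-classifier scheme described above can be sketched as two linear heads over the corresponding encoder tokens, with their losses summed during training and their sigmoid outputs averaged at test time; the label count, dimensions, and loss choice below are assumptions.

import torch
import torch.nn as nn

n_labels, dim = 14, 768
token_feat, dist_feat = torch.randn(2, dim), torch.randn(2, dim)   # class and distillation tokens from the encoder
token_head, dist_head = nn.Linear(dim, n_labels), nn.Linear(dim, n_labels)

targets = torch.randint(0, 2, (2, n_labels)).float()               # multi-label ground truth (placeholder)
bce = nn.BCEWithLogitsLoss()
loss = bce(token_head(token_feat), targets) + bce(dist_head(dist_feat), targets)   # global two-term loss

# Test time: average the two classifiers' predictions.
pred = ((token_head(token_feat).sigmoid() + dist_head(dist_feat).sigmoid()) / 2) > 0.5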

11 pages, 5795 KiB  
Article
Detecting Crop Circles in Google Earth Images with Mask R-CNN and YOLOv3
by Mohamed Lamine Mekhalfi, Carlo Nicolò, Yakoub Bazi, Mohamad Mahmoud Al Rahhal and Eslam Al Maghayreh
Appl. Sci. 2021, 11(5), 2238; https://doi.org/10.3390/app11052238 - 3 Mar 2021
Cited by 9 | Viewed by 5830
Abstract
Automatic detection and counting of crop circles in the desert can be of great use for large-scale farming as it enables easy and timely management of the farming land. However, so far, the literature remains short of relevant contributions in this regard. This paper frames the crop circle detection problem within a deep learning framework. In particular, accounting for their outstanding performance in object detection, we investigate the use of Mask R-CNN (Region-Based Convolutional Neural Networks) and YOLOv3 (You Only Look Once) models for crop circle detection in the desert. In order to quantify the performance, we build a crop circle dataset from images extracted via Google Earth over a desert area in East Oweinat in the South-Western Desert of Egypt. The dataset totals 2511 crop circle samples. With a small training set and a relatively large test set, plausible detection rates were obtained, scoring a precision of 1.0 and a recall of about 0.82 for Mask R-CNN, and a precision of 0.88 and a recall of 0.94 for YOLOv3. Full article
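Adapting a COCO-pretrained Mask R-CNN to a single crop-circle class can be sketched with the standard torchvision fine-tuning recipe below; this stands in for, rather than reproduces, the paper's setup.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2   # background + crop circle
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box head with one sized for our two classes.
in_feat = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)

# Replace the mask head accordingly (256 is the usual hidden size in this recipe).
in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_classes)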
