Article

Deep-Learning-Based Cognitive Assistance Embedded Systems for People with Visual Impairment

by Huu-Huy Ngo 1, Hung Linh Le 2 and Feng-Cheng Lin 3,*

1 Faculty of Information Technology, Thai Nguyen University of Information and Communication Technology, Thai Nguyen 24000, Vietnam
2 Faculty of Engineering and Technology, Thai Nguyen University of Information and Communication Technology, Thai Nguyen 24000, Vietnam
3 Department of Information Engineering and Computer Science, Feng Chia University, Taichung 40724, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 5887; https://doi.org/10.3390/app15115887
Submission received: 20 April 2025 / Revised: 20 May 2025 / Accepted: 20 May 2025 / Published: 23 May 2025
(This article belongs to the Special Issue Improving Healthcare with Artificial Intelligence)

Abstract

For people with vision impairment, various daily tasks, such as independent navigation, information access, and context awareness, may be challenging. Although several smart devices have been developed to assist blind people, most of these devices focus exclusively on navigation assistance and obstacle avoidance. In this study, we developed a portable system for not only obstacle avoidance but also identifying people and their emotions. The core of the developed system is a powerful and portable edge computing device that implements various deep learning algorithms for images captured from a webcam. The user can easily select a function by using a remote control device, and the system vocally reports the results to the user. The developed system has three primary functions: detecting the names and emotions of known people; detecting the age, gender, and emotion of unknown people; and detecting objects. To validate the performance of the developed system, a prototype was constructed and tested. The results reveal that the developed system has high accuracy and responsiveness and is therefore suitable for practical applications as a navigation and social assistive device for people with visual impairment.

1. Introduction

People with visual impairment must overcome various difficulties in daily life; even tasks such as moving around a space can be challenging. Numerous tools and devices have been created to help blind people increase their independence. These tools and devices are typically sensor-based, computer-vision-based, or smartphone-based. Sensor-based approaches [1,2,3,4,5] use data collected from a variety of sensors, including ultrasonic, infrared (IR), laser, and distance sensors, to identify obstructions. Computer-vision-based techniques [6,7,8,9,10] use cameras to capture the environment in front of the user and then apply computer vision algorithms to identify obstacles. Smartphone-based solutions [11,12,13,14] leverage smartphone cameras or sensors to obtain information from the outside world. These data are then processed to detect obstructions [15]. Recent developments in deep learning have greatly improved the performance and practical applicability of image recognition systems in assistive settings. Unlike conventional image processing techniques, deep learning models, especially convolutional neural networks (CNNs), automatically learn multilevel feature representations from raw input data. This capability enables these models to excel in challenging tasks such as object detection, face recognition, and emotion classification. Such characteristics are particularly crucial in dynamic and unpredictable environments, such as the urban settings in which navigation systems for people with visual impairment must operate. For instance, Han et al. [16] incorporated data augmentation methods into deep learning models to improve the detection of architectural features in complex street-view imagery; their research illustrates the adaptability and robustness of deep learning models.
The designs of most devices developed for people with visual impairment are focused primarily on navigation and obstacle avoidance instead of context awareness and object recognition [17,18,19,20]. In addition, these designs often use servers or computers to implement their method, which limits possible navigation areas. Object detection schemes are typically employed to perform object identification in a current scene, and the produced descriptions of the environment are simplistic. To resolve the aforementioned problems, this study developed a deep-learning-based multifunctional embedded mobile system that performs face recognition, object detection, and gender, age, and emotion classifications to assist people with visual impairment in perceiving their environment. These tasks are briefly described as follows.
(1) Face recognition
Face recognition has become a popular research topic in computer vision because this task is critical for numerous applications, including monitoring and security, video surveillance, and human–computer interaction (HCI) systems [21]. Deep learning techniques can successfully perform face recognition; for example, CNN architectures have been used to develop cutting-edge face recognition technologies, such as VGGFace [22], FaceNet [23], and ArcFace [24].
Internet of Things (IoT) technology has also recently been widely applied; however, the potential of IoT-based health-care systems requires further exploration. Some studies have developed cloud- or computer-based facial recognition software for people with visual impairment. Cloud-based systems typically comprise a server and a local unit [25,26]. The local unit collects data such as photos and sends and receives the data to and from a server. The server processes the received data and communicates its findings to the local unit. Cloud-based systems can be portable but typically have long processing times and are only usable with Internet access. Computer-based approaches often involve using a laptop [27,28] to handle all of the operations of a cloud-based system. However, carrying a laptop during movement and navigation may be tiring for people with visual impairment. Thus, computer-based face recognition methods can be improved.
(2) Gender, age, and emotion classifications
Gender classification has applications in HCI, surveillance systems, business, demographic research, and entertainment. For HCIs, robots can identify a user’s gender to offer tailored services. Furthermore, intelligent surveillance systems can be improved by implementing gender classification. Gender classification can also be used to facilitate commercial development by improving market research to support corporate decision-making. Moreover, gender classification facilitates the collection of demographic data in demographic research. In entertainment, gender classification can be used for dynamically modifying and customizing mobile applications (apps) or game content [29]. In medicine, gender classification methods are rarely applied. However, gender classification methods for assisting those with visual impairment have recently received substantial attention and represent a promising area for additional study.
Age classification has also been widely applied. For example, age classification can be used to prevent minors from viewing adult content, visiting adult websites, purchasing alcohol, or purchasing other age-restricted items. Shopping can be tailored to customers on the basis of their age or gender in accordance with customer preferences and expectations ascertained from market trends [30,31]. Age categorization can also be applied to help people with visual impairment.
Facial emotion is a group of powerful, universal, and natural signals for communicating emotional states and intentions. Facial emotion classification has been the subject of numerous studies because of its practical significance in numerous applications, including HCI, robotics, driver sleepiness surveillance, and psychological examinations. Numerous computer-vision-based and machine-learning-based facial emotion recognition systems have been developed. Humans experience six fundamental emotions: rage, disgust, fear, joy, sadness, and surprise. In human emotion studies, the primary emotion classification is based on one of these fundamental emotions. Numerous studies have used emotion classification to assist people with visual impairment [32,33]. However, these studies did not attempt to develop a flexible or practical method. Therefore, an effective system for classifying gender, age, and emotion is required.
(3) Object detection
Object detection is a key task in computer vision because of its wide range of applications in HCI, monitoring and security systems, video surveillance, and self-driving cars [21]. Many deep learning object detection techniques, such as Faster R-CNN [34], SSD [35], YOLOv3 [36], RetinaNet [37], and Mask R-CNN [38], have been proposed.
Several medical studies have used object detection methods to help people with visual impairment navigate freely and perceive their surroundings [39,40,41]. These studies have produced efficient systems by combining a camera and several sensors. For example, Joshi et al. [40] suggested an assistance system that comprises a camera and distance sensor for detecting objects, recognizing words, and estimating the distance between obstacles. However, the systems developed in the aforementioned studies are typically implemented on laptops, which limits their accessibility. Although some systems use the miniature Raspberry Pi platform, such systems have decreased performance. Therefore, a portable, versatile, and effective solution for using object detection algorithms to help those with visual impairment is still required.
Users with visual impairment face multiple challenges when moving around complex environments and interacting with others, mainly because they lack access to relevant visual information. Each component of the proposed system addresses a specific aspect of these challenges. The face recognition feature allows users to recognize known individuals, thereby increasing social interaction and self-esteem. Building on this feature, emotion recognition helps users understand the emotional states of others and thus grasp the social content of interactions. When interactions involve strangers, age and gender identification provides social context that helps users understand their social environment. Object recognition supports the identification of environmental barriers and everyday objects, thus enabling safe and independent mobility. Together, these elements form an integrated cognitive assistance system that supports both social engagement and wayfinding for people with visual impairment.
Therefore, the main goal of this research is to develop a multifunctional deep-learning-based cognitive support system that augments social interaction and environmental perception for visually impaired individuals. The proposed system integrates advanced computer vision capabilities into a small-form-factor embedded platform to enable real-time identification of familiar individuals and their emotions, estimation of the age and gender of unfamiliar individuals, and recognition of everyday objects in the environment. In contrast to traditional methods that rely on simple obstacle avoidance or server-side computation, the proposed system runs on an edge computing device, providing portability, low latency, and independence from network connectivity. By tackling these mobility and context awareness issues, this research seeks to improve users’ autonomy, safety, and potential for social interaction in everyday environments.
The remainder of this paper is organized as follows. Section 2 discusses recent relevant research and technologies. Section 3 describes the proposed system’s architecture. The system implementation and developed prototype are detailed in Section 4. Section 5 presents the experimental findings. Finally, the conclusions of this study and suggestions for future research are provided in Section 6.

2. Related Work

2.1. Face Recognition

Face recognition is a key application of computer vision. The most advanced face recognition techniques, such as VGGFace [22], FaceNet [23], and ArcFace [24], have been made possible by the recent advancements in deep learning and CNN architectures. Researchers have also investigated the use of IoT technologies and facial recognition in health care. For example, a real-time facial recognition system for assisting people with visual impairment was suggested by Aza et al. [41]. Their system uses the local binary pattern histogram (LBPH) algorithm to identify human faces in videos captured by a smartphone. However, the system can only identify one human face in each frame, and the LBPH algorithm requires the input images to be converted into the binary or grayscale format.
Cloud-based techniques have enabled considerable improvements in system performance. A mobile face recognition system for helping people with visual impairments was presented by Chaudhry and Chandra [25]. This system was implemented on a smartphone that connected to a server. The primary system operations, such as facial detection and recognition, were implemented by the smartphone. However, the identification and enrolling of a new person was performed by the server. The Cascade classifier and the LBPH algorithm were used to find and identify faces. Chen et al. [26] proposed a smart wearable cloud-based system that can identify text, objects, and human faces to aid people with visual impairment in effectively perceiving the world. This system contains two key cooperative processing components: a cloud server and a local unit. The local unit captures input photographs, communicates with the server, and provides feedback. The cloud server primarily performs image processing and sends its results to the local unit. However, the aforementioned system can only be used in locations with Internet access because of its reliance on the cloud server.
Computer-based face recognition methods have high flexibility and practical applicability. Mocanu et al. [27] introduced a real-time system (DEEP-SEE FACE) to facilitate cognition, interactions, and communication for people with visual impairment. This system combines computer vision algorithms and deep CNNs to enable those with visual impairment to detect, follow, and identify numerous people in front of them. A laptop serves as the system’s processing unit and handles all tasks, including the collection of input images, the processing of the images, and the exporting of the results. However, those with visual impairment face difficulty in moving with the device over time because of its weight. Neto et al. [28] suggested a wearable system for face recognition to aid people with visual impairment. They used a Kinect sensor, which contains an infrared (IR) emitter and a red–green–blue (RGB) camera, to collect RGB-Depth images. They then combined several methods, including the histogram of oriented gradients (HOG), principal component analysis, and k-nearest neighbor, to identify faces. However, the system cannot be used outdoors because it requires Kinect’s IR sensor.

2.2. Gender, Age, and Emotion Classifications

Gender classification can be performed by evaluating a person’s ear, fingerprint, iris, voice, or face. Face-based gender classification is the most common gender classification method. Arriaga et al. proposed a real-time fully connected CNN called mini-Xception for gender and emotion classifications [42]. Their model includes convolutional layers, the rectified linear unit (ReLU) activation function, a residual depth-wise separable convolution structure, the softmax activation function, and an average pooling layer. The residual depth-wise separable convolution structure contains two layers: a depth-wise layer and a point-wise layer. Liew et al. proposed a CNN model for face-based gender classification [43]. This model has a simple architecture that comprises only three convolutional layers and one output layer. Cross-correlation was used instead of convolution to reduce the computational load. Yang et al. [44] proposed a compact soft stagewise regression network (SSR-Net) for age categorization; they later extended SSR-Net to perform gender classification [45]. The improved model has two heterogeneous streams, each of which comprises 3 × 3 convolutional layers, batch normalization layers, nonlinear activation layers, and 2 × 2 pooling layers. Dhomne et al. [46] demonstrated a CNN model for face-based gender classification based on the VGGNet [47] architecture. This model can automatically extract features from images without using the HOG or support vector machine method because it uses a deep CNN. Khan et al. [48] suggested a system for automatic gender categorization. They first trained a segmentation model by using a conditional random field. The model then divided a facial image into six parts: eyes, hair, mouth, nose, skin, and background. A probability map was subsequently produced for each of the six classes by using a probabilistic classification strategy. These probability maps were then used for gender classification.
Numerous studies have investigated age classification. Age and gender classifications or age and emotion classifications are frequently performed together in studies because these classifications are based on facial characteristics. Levi and Hassner [49] presented a deep CNN model with an architecture inspired by AlexNet for identifying age and gender [50]; however, this model only has five main layers: three convolutional layers and two fully connected layers. It also has max-pooling, normalization, and dropout layers; uses the ReLU activation function; and includes filters. A face-based model for age and gender classifications was presented by Agbo-Ajala and Viriri [51]. The network architecture of this model, which is also based on AlexNet, includes convolutional, ReLU activation, batch normalization, max-pooling, dropout, and fully connected layers. Nam et al. [52] demonstrated a sophisticated CNN-based approach for age estimation. To create high-resolution face images from low-resolution images, a conditional generative adversarial network was used to produce images as the model’s input, and VGGNet [47] was then used for age estimation. Liao et al. [53] developed a CNN model for estimating age according to the divide-and-rule strategy. A robust feature extraction technique involving the use of a deep CNN based on the GoogLeNet [54] architecture was used to extract face features, and a divide-and-rule learning approach was proposed for age estimation. Zhang et al. [55] presented a novel residual network based on ResNet [56] for age and gender classifications. They suggested two strategies for improving age estimation performance by observing the features of each age group.
Face emotion classification is used in numerous real-world contexts and has been extensively discussed in the literature by many authors. Hu et al. [57] proposed a new deep CNN model called the supervised scoring ensemble model for emotion recognition. An auxiliary block is the foundation of this model’s architecture. In the main CNN, three supervised blocks are added to supervise the shallow, intermediate, and deep layers; this method considerably increases the model’s accuracy. Moreover, a scoring connection layer was developed to combine the probabilities output by the supervised blocks. To classify emotions, Cai et al. [58] proposed a novel island loss for a CNN model. This model’s architecture was developed using the loss layer as a foundation. To increase the pairwise distances between various class centers, the island loss was calculated at the feature extraction layer. Thus, the island loss enhanced the differences between different classes while reducing the variances within each class. The high accuracy of the aforementioned CNN model was attributed to the island loss. For categorizing emotions, Bargal et al. [59] introduced a novel CNN model with a network ensemble architecture comprising three CNN models: VGG13, VGG16, and ResNet. The learned features from these networks are combined to create a single feature vector that describes the input image. This feature vector is then encoded and normalized for emotion classification. Zhang et al. [60] suggested an evolutionary spatial–temporal network based on multitask networks for identifying facial expressions. They proposed a multisignal CNN (MSCNN) and part-based hierarchical bidirectional recurrent neural network (PHRNN) for effectively extracting temporal and spatial characteristics, respectively. The inclusion of the PHRNN and MSCNN substantially improves the performance of the model proposed in [60]. Liu et al. [61] created an AU-aware deep network for recognizing facial expressions. Their model architecture is based on cascaded networks, in which various modules for various tasks are sequentially combined to create a deeper network. The network proposed in [61] contains three sequential modules. In the first module, convolutional and max-pooling layers are used to create an overcomplete representation. In the second module, an AU-aware receptive field layer explores subsets of the overcomplete representation and simulates particular areas. In the final module, hierarchical features for face expression identification are learned using multilayer restricted Boltzmann machines.

2.3. Object Detection

Deep-learning-based systems for object identification have shown considerable versatility in a variety of applications, including health care, construction, robotics, and smart city design. In assistive technologies, especially medical applications, they enable independent navigation for people with visual impairment by extending obstacle detection capabilities and augmenting situational awareness [40,62,63]. In construction, they improve structural evaluation and regulatory compliance, as demonstrated by UAV-based rebar detection [64] and monitoring systems for heavy equipment [65]. Recent studies have broadened their use in robotic perception and artificial intelligence integration, with working models proposed for LiDAR-based sensing [66], road scene segmentation [67], and real-time instance segmentation [68]. Furthermore, semantic segmentation techniques have been used to assess urban vegetation and monitor ecological indicators for urban planning in smart city scenarios [69], highlighting the versatility and usability of these methods in both assistive and industrial applications.
Object identification has various applications, including in smart health-care systems. Many tools and systems for object detection have been proposed for helping people with visual impairment communicate and participate in society. Tian et al. [70] suggested a computer-vision-based solution to help blind people navigate. Their system, which performs object detection and word recognition, was implemented on a portable computer and could detect objects that are critical for wayfinding, such as doors, elevators, and cabinets. To help people with visual impairment identify the detected objects more precisely, text recognition was applied to identify any textual information linked to the objects. Ko and Kim [71] created a smartphone-based wayfinding system for object detection to aid people with visual impairment. This system enables users to determine their location and then navigate to a given destination in an unknown interior setting. However, the system obtains wayfinding data from QR codes; thus, it can only be used in buildings in which wayfinding QR codes are installed, which limits its applicability.
Several other systems have leveraged sensors and deep learning methods to enhance object detection effectiveness. Mekhalfi et al. [7] demonstrated a ground-breaking prototype navigation system comprising a camera, an inertial measurement unit, and laser sensors as hardware. The system also contains two main software modules, namely a guidance module and a recognition module, for handling navigation and object recognition, respectively. Long et al. [62] created a framework for assisting people with visual impairment in identifying and avoiding obstacles. This system uses a millimeter-wave (MMW) radar and an RGB-Depth sensor for data collection. Objects are detected using the SSD and Mask R-CNN networks, and obstacle depth information is acquired using the MeanShift technique. Moreover, MMW radar data can be used for simultaneously estimating the positions and velocities of multiple detected targets, thereby substantially improving the system’s performance. An embedded system presented by Khade and Dandawate [63] for users with visual impairment can track obstacles by using the object detection technique. The system is based on Raspberry Pi and is therefore wearable, lightweight, and small. Joshi et al. [40] presented an assistive artificial intelligence device to help people with visual impairment in better understanding their environment. This device uses a distance-measuring sensor to detect obstacles and employs YOLOv3 [36] to detect objects. Information regarding obstacles is transmitted to the user as audio. To achieve assisted navigation, Tapu et al. [72] proposed an automatic cognition system based on computer vision algorithms and a deep CNN running on a laptop. This system uses YOLOv1 [73] to detect objects and alerts the user through a headphone when it detects obstacles.

2.4. Smart Health Care

IoT-based technologies have received substantial attention in health care. Several studies have created sensor-based, computer-vision-based, or smartphone-based devices and systems for assisting people with visual impairment. Sensor-based approaches detect obstacles by using data from sensors, such as distance, IR, laser, and ultrasonic sensors. Katzschmann et al. [2] proposed a navigational aid called Array of Lidars and Vibrotactile Units, which has two components: a sensor belt and a haptic strap. The sensor belt has an array of distance sensors for determining the distance from the user to surrounding obstructions, and the haptic strap provides corresponding haptic feedback to facilitate safe movement for people with visual impairment. Nada et al. [3] created an inexpensive, portable, and easy-to-use smart stick navigational aid comprising a pair of IR sensors that can detect obstructions up to a distance of 2 m and provide a spoken warning message. The device has low power requirements and is highly responsive. Capi [1] developed an intelligent robotic navigational aid that comprises a laptop computer and other components, including a camera, speaker, and laser rangefinder. This system has two modes: an assisting mode and a guiding mode. In the assisting mode, the system detects obstacles; in the guiding mode, the robot follows predrawn floor lines to reach its destination. Kumar et al. [4] constructed a smart cane that uses ultrasonic sensors and fuzzy logic to enhance the user’s decision-making ability in complex environments. The system combines data from multiple sensors to identify obstacles and provides corresponding audio feedback. Sipos et al. [5] developed a framework for an intelligent aid tool for visually impaired individuals that uses distance sensors in conjunction with real-time data evaluation to achieve navigation. Initial test results showed acceptable responsiveness and usability in real-life environments, most prominently indoors.
In computer-vision-based approaches, a camera captures a scene to enable computer-vision-based algorithms to recognize obstacles. Kang et al. [6] proposed a deformable grid obstacle detection method in which an initially standard grid deforms according to the movement of objects in a scene. Objects with which the user might collide are identified on the basis of the deformation degree. This scheme enhances the accuracy of the aforementioned system. Yang et al. [8] presented a deep-learning-based semantic segmentation framework for helping people with visual impairment perceive their surroundings and avoid obstacles, such as obstructions, pedestrians, and vehicles. They created a system based on the aforementioned framework that notifies the user about traversable areas, walkways, stairs, and water hazards. Ashiq et al. [9] proposed a CNN-based framework for object recognition and tracking with a special focus on applications for people with visual impairments. The proposed approach enables real-time recognition and continuous monitoring of dynamic objects in changing environments, thus improving the awareness and safety of the user. Li et al. [10] developed a wearable cognitive aid system combining a range of sensory modalities such as visual, inertial, and ambient sensing and artificially augmented guidance and decision-making algorithms. The system provides real-time feedback and situational awareness and improves mobility both indoors and outdoors.
Smartphone-based systems leverage smartphone processors for data collection, data processing, and decision-making. Tanveer et al. [12] demonstrated a smartphone system that enables the use of voice commands to make phone calls and to assist users in determining the locations of obstacles. Moreover, the system tracks the user’s location by using global positioning system data and stores these data on a server; an Android app can then display and track the user’s location. Cheraghi et al. [11] created a wayfinding system (GuideBeacon) in which Bluetooth beacons were installed in a region of interest, namely a room, and smartphones communicated with the beacons to facilitate navigation. They reported that people with visual impairment could navigate quickly and effectively when using the GuideBeacon system.

3. Multifunctional Embedded System

3.1. System Architecture

Figure 1 presents an overview of the proposed system. This system is intended to improve the environmental awareness of people with visual impairment and thus has three primary functions: face recognition and emotion classification, age and gender classifications, and object detection. The main module of the system is an embedded NVIDIA Jetson AGX Xavier system with several peripheral devices, namely a webcam, speaker, and Bluetooth audio transmitter adapter. These devices perform image collection, image processing, and system control. The user first selects a function. The system’s webcam records the current scene, and the system then processes the captured image to perform the selected function. In the first function, namely person detection, the system uses face and emotion recognition to output the names and emotions of detected individuals. If a stranger is identified, the user can use the second function, namely stranger identification, to reveal the stranger’s gender and age in addition to their emotion. The third function, namely object detection, is used to recognize and report the type and quantity of each object in the image. All system outputs are spoken to the user through a speaker.

3.2. Function Selection

3.2.1. Remote Controller

System function selection can be performed using voice control, computer vision, or a remote control. A remote control was selected for the proposed system because it is a simple and reliable control tool. The remote control is effectively a keyboard that sends control signals to the central processing unit (Jetson AGX Xavier).
A Python program was developed to identify the codes sent by each key on the remote control (Table 1). A Logitech remote control (Figure 2) was selected because of its user-friendliness and responsiveness. The function key codes were determined using a Python (v3.6.5) program; function keys 1, 2, and 3 had key codes of 85, 86, and 46, respectively. These codes were mapped to the corresponding functions (person detection, stranger identification, and object detection) on the Jetson AGX Xavier. When the Jetson system receives a control signal, it performs the corresponding function.
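The paper does not state which input library reads these key codes; as one possible realization on a Linux-based Jetson board, the sketch below uses the evdev interface, with a hypothetical device path and the key-to-function mapping taken from Table 1.

```python
# Minimal sketch (assumption): reading key codes from the remote control
# receiver through the Linux evdev interface and mapping them to functions.
from evdev import InputDevice, ecodes

# Hypothetical device path; the actual path depends on the USB receiver.
REMOTE_DEVICE = "/dev/input/event2"

# Key codes reported by function keys 1, 2, and 3 (see Table 1).
KEY_TO_FUNCTION = {85: "person_detection",
                   86: "stranger_identification",
                   46: "object_detection"}

def wait_for_function_key():
    """Block until a mapped key is pressed and return the selected function."""
    device = InputDevice(REMOTE_DEVICE)
    for event in device.read_loop():
        # EV_KEY events with value 1 correspond to key-press (key-down) actions.
        if event.type == ecodes.EV_KEY and event.value == 1 and event.code in KEY_TO_FUNCTION:
            return KEY_TO_FUNCTION[event.code]

if __name__ == "__main__":
    print("Selected function:", wait_for_function_key())
```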

3.2.2. Function Selection Process

The selection process, shown in Figure 3, begins automatically after the Jetson AGX Xavier system has been activated and initialized. An audio alert informs the user when the system is ready to receive input. The user can then select a particular function by pressing a remote control button: Function 1 identifies the faces and emotions of known individuals, Function 2 identifies the gender, age, and emotional state of unknown individuals, and Function 3 performs object detection.
After a function has been selected, the system captures the current scene with the webcam and applies the corresponding deep learning models. In Function 1, if a person is identified as a stranger, the user is prompted to press a different key to activate Function 2 for further analysis. The output of each function is presented to the user as real-time audio feedback. After a function completes, the user can select another function or power off the system. The design accommodates usability, flexibility, and smooth transitions among functions in real-world applications.
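The control flow described above can be summarized as a simple dispatch loop. The sketch below is illustrative only: the per-function handlers are placeholders for the pipelines described in Sections 3.3, 3.4, and 3.5, and pyttsx3 is used as an example offline text-to-speech engine (the paper does not state which speech synthesis library the system uses).

```python
# Minimal sketch of the function selection loop. The handler functions are
# placeholders for the actual recognition pipelines, and pyttsx3 is an assumed
# text-to-speech engine, not necessarily the one used in the paper.
import cv2
import pyttsx3

def wait_for_function_key():
    # Placeholder for the remote control reader (see the evdev sketch above);
    # here, selection is simulated with keyboard input on stdin.
    mapping = {"1": "person_detection", "2": "stranger_identification", "3": "object_detection"}
    while True:
        key = input("Press 1, 2, or 3: ").strip()
        if key in mapping:
            return mapping[key]

# Placeholders for the per-function pipelines described in later sections.
def run_person_detection(frame):        return "Person detection result"
def run_stranger_identification(frame): return "Stranger identification result"
def run_object_detection(frame):        return "Object detection result"

def main():
    engine = pyttsx3.init()
    camera = cv2.VideoCapture(0)          # webcam attached to the Jetson system
    handlers = {"person_detection": run_person_detection,
                "stranger_identification": run_stranger_identification,
                "object_detection": run_object_detection}

    engine.say("System ready")            # audio alert after initialization
    engine.runAndWait()

    while True:
        function_name = wait_for_function_key()
        ok, frame = camera.read()          # capture the current scene
        if not ok:
            continue
        result = handlers[function_name](frame)
        engine.say(result)                 # report the result as audio feedback
        engine.runAndWait()

if __name__ == "__main__":
    main()
```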

3.3. Face Recognition Model

An overview of the face recognition model is presented in Figure 4. Data were first collected and preprocessed as described in Section 4.3. To select a face recognition model for the person detection task, the performance of three face recognition models, namely VGGFace, FaceNet, and ArcFace, was compared in terms of their efficiency.
All models had favorable face recognition performance; however, the FaceNet model had excellent accuracy and approximately half the calculation time of the VGGFace and ArcFace models [74]. Therefore, the FaceNet model was selected for use in the proposed system. FaceNet is an end-to-end learning architecture proposed by Schroff et al. [23]. It extracts crucial facial features from images by using the Inception-ResNet-v1 [75] architecture (Figure 5) to create a vector known as an embedding [76], thereby effectively encoding images as feature vectors. Embeddings of the same individual lie close together in the feature space, whereas embeddings of different people are far apart. The use of embeddings greatly simplifies face recognition.
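To illustrate how such embeddings are used for recognition, the sketch below performs nearest-neighbor matching by Euclidean distance; facenet_embed() is a placeholder for the actual FaceNet forward pass, and the embedding size and distance threshold are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch: nearest-neighbor face identification on FaceNet embeddings.
# facenet_embed() is a placeholder for the actual FaceNet forward pass, and the
# embedding size and distance threshold are illustrative assumptions.
import numpy as np

EMBEDDING_DIM = 128        # typical FaceNet embedding size (assumption)
UNKNOWN_THRESHOLD = 1.0    # distance above which a face is treated as a stranger (assumption)

def facenet_embed(face_image):
    """Placeholder for the FaceNet model; returns an L2-normalized embedding."""
    vec = np.random.rand(EMBEDDING_DIM)            # stand-in for model inference
    return vec / np.linalg.norm(vec)

def identify(face_image, gallery):
    """gallery: dict mapping an enrolled person's name to a stored embedding."""
    query = facenet_embed(face_image)
    # Euclidean distance between the query and every enrolled identity.
    distances = {name: np.linalg.norm(query - emb) for name, emb in gallery.items()}
    name, dist = min(distances.items(), key=lambda item: item[1])
    return name if dist < UNKNOWN_THRESHOLD else "stranger"

# Usage: enroll known people once, then identify faces cropped from webcam frames.
gallery = {"person_A": facenet_embed(None), "person_B": facenet_embed(None)}
print(identify(None, gallery))
```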

3.4. Gender, Age, and Emotion Classification Models

The stranger identification function requires a longer processing time than does the person identification function because the stranger identification function requires gender and age classifications (Figure 6). First, an input image is captured, and gender, age, and emotion classifications are performed separately by corresponding pretrained models. The final result reported to the user is produced by combining the results of these models.
The SSR-Net model [44] was used for gender classification. A straightforward CNN model suggested by Levi and Hassner [49] was used for age classification, and this model categorizes each face into one of eight age categories: 0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 47–53, and 60+ years. Finally, a real-time CNN model suggested by Arriaga et al. [42] was used to classify emotion; this model can recognize seven basic emotions, namely angry, disgusted, fearful, happy, sad, surprised, and neutral. The aforementioned three models were integrated into a single framework to perform the stranger identification function.
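A minimal sketch of how the three classifier outputs could be merged into a single spoken description is given below; the predict_* functions are placeholders standing in for the SSR-Net, age, and emotion models named above, not their actual implementations.

```python
# Minimal sketch of the stranger identification pipeline: three pretrained
# classifiers (gender, age group, emotion) are applied to the same face crop and
# their outputs are merged into one spoken description. The predict_* functions
# are placeholders standing in for the SSR-Net, age, and emotion models.
AGE_GROUPS = ["0-2", "4-6", "8-13", "15-20", "25-32", "38-43", "47-53", "60+"]
EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprised", "neutral"]

def predict_gender(face):  return "female"       # placeholder for the gender model
def predict_age(face):     return AGE_GROUPS[4]  # placeholder for the age model
def predict_emotion(face): return EMOTIONS[6]    # placeholder for the emotion model

def describe_stranger(face):
    """Combine the three model outputs into one audio-friendly sentence."""
    gender = predict_gender(face)
    age_group = predict_age(face)
    emotion = predict_emotion(face)
    return f"A {gender} person, about {age_group} years old, appearing {emotion}."

print(describe_stranger(face=None))
```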

3.5. Object Detection Model

An overview of the object detection function of the proposed system is illustrated in Figure 7. First, an input image is captured, and a pretrained object detection model classifies and counts objects. Various object detection models were evaluated in terms of their accuracy and processing time (efficiency). The SSD, YOLOv3, and Mask R-CNN models are three of the most effective object detection models. The YOLOv3 and Mask R-CNN models have high accuracy; however, the YOLOv3 model requires approximately half the processing time of the Mask R-CNN model [21,77]. Therefore, YOLOv3 has higher efficiency than does the Mask R-CNN model and was selected for the proposed system. Moreover, a dynamic object ordering table was included to sort the results of the object detection model.
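The paper does not specify the inference framework; as one common option, YOLOv3 can be run through OpenCV’s DNN module, and per-class counts can be derived from the surviving detections, as sketched below (the configuration, weight, and class-name file paths are placeholders).

```python
# Minimal sketch: running YOLOv3 through OpenCV's DNN module and counting the
# detected objects per class. All file paths are placeholders (assumptions).
from collections import Counter

import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
class_names = open("coco.names").read().splitlines()

def detect_and_count(image, conf_threshold=0.5, nms_threshold=0.4):
    """Return a Counter mapping class name -> number of detections."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    boxes, confidences, class_ids = [], [], []
    for output in outputs:
        for detection in output:
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_threshold:
                cx, cy, bw, bh = detection[0:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)
                class_ids.append(class_id)

    # Non-maximum suppression removes overlapping duplicate boxes.
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    return Counter(class_names[class_ids[i]] for i in np.array(keep).flatten())

# Example usage:
# counts = detect_and_count(cv2.imread("scene.jpg"))
# print(counts)  # e.g., Counter({'person': 3, 'car': 1})
```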

3.6. Arrangement of Result Description

The object ordering table can help the user quickly receive key information if the object detection model returns numerous results. The object detection model was trained using the common objects in context (COCO) data set [78], which comprises natural images depicting the everyday world. The majority of the objects in these images are tagged and segmented. The COCO data set has 91 object categories [78], which are further organized into supercategories, namely person and accessory, animal, vehicle, outdoor object, sports, food, kitchenware, furniture, appliance, electronics, and indoor object. The object ordering table was based on the supercategories and varied in accordance with the user’s surroundings. For example, if the user is outside, the top 10 objects are person, bicycle, motorcycle, car, bus, truck, train, traffic light, stop sign, and bench.
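One simple way to realize such an ordering table is to assign each class a priority rank and sort the detection counts accordingly before generating the spoken report; the sketch below uses the outdoor priority list quoted above and is illustrative rather than the exact implementation.

```python
# Minimal sketch: ordering detection results by an environment-specific
# priority table before generating the spoken report.
OUTDOOR_PRIORITY = ["person", "bicycle", "motorcycle", "car", "bus", "truck",
                    "train", "traffic light", "stop sign", "bench"]

def order_results(counts, priority=OUTDOOR_PRIORITY):
    """counts: mapping of class name -> number of detections (e.g., a Counter)."""
    rank = {name: i for i, name in enumerate(priority)}
    # Prioritized classes come first, in table order; remaining classes follow,
    # sorted by how often they were detected.
    ordered = sorted(counts.items(),
                     key=lambda item: (rank.get(item[0], len(priority)), -item[1]))
    return ", ".join(f"{count} {name}" for name, count in ordered)

print(order_results({"bench": 2, "person": 3, "potted plant": 5, "car": 1}))
# -> "3 person, 1 car, 2 bench, 5 potted plant"
```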

4. System Implementation and Prototype

4.1. Devices Used in System Implementation

The proposed system contains several devices (Figure 8). The main module is a compact, lightweight backpack containing a power bank and an NVIDIA Jetson AGX Xavier. The power bank can provide AC power to the Jetson system for at least 3 h. A remote control was connected to the Jetson system through a receiver plugged into one of the system’s USB ports. Images were captured using a Logitech C920 webcam (Logitech, San Jose, CA, USA) that was directly connected to a USB port on the Jetson system and attached to the front of the backpack to ensure that the user’s view and webcam view were similar. Audio output was achieved by connecting the Jetson system’s HDMI port to a VGA-and-audio adapter, which was then connected to a Bluetooth audio transmitter. This setup enabled linking a Bluetooth headphone to the Jetson system for delivering spoken output. Table 2 details the technical specifications of the aforementioned devices.
The system’s deep learning models were trained on high-performance computing hardware with dedicated GPUs. The software components were written in Python within the Anaconda development environment to support a modular architecture and scalable execution. The trained models were then deployed on the NVIDIA Jetson AGX Xavier embedded AI platform, which provides the efficient inference required for edge computing applications and ensures deployability and real-time execution.

4.2. Initialization Program of the Embedded System

At system boot, the control programs run automatically so that the system is ready for use. The user can simply power the system on, wait for the system-ready alert, and immediately begin controlling the system. The automatic initialization procedure has three steps, which are described in the following text.
(1) Terminal autolaunch
Startup Applications is a built-in tool of the Ubuntu operating system that allows users to manage the applications and services that run automatically when the system boots, thereby streamlining the boot process and improving overall system performance. The terminal program was added to the startup programs list (Figure 9) by clicking the “add” button and entering a name, the command gnome-terminal, and a comment.
(2) Shell script
An executable file containing a list of shell commands is known as a shell script. A shell script titled “auto_code.sh” was created to change the working directory, activate the Python virtual environment, and run the primary control program (Figure 10). The auto_code file contains the following information:
  • Lines 6–9 implement the change from the current directory to the folder containing the main control program.
  • Lines 12 and 13 activate a Python virtual environment called “opencv_cuda.”
  • Line 17 performs the main control program called “multifunctional.py.”
(3) Configure .bashrc
In the Ubuntu operating system, the .bashrc file is a script that is executed whenever a new terminal window is opened in a shell session. It configures the behavior of the default Bash shell, including environment variables, aliases, shell options, and functions, and it can also execute a shell script whenever a new terminal is opened. Therefore, the path of the shell script “auto_code.sh” was added to the “.bashrc” file so that the script is executed automatically (Figure 11). A simplified sketch of both files is given below.
Figure 10. Content of the shell script file.
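The actual script contents are shown in Figures 10 and 11; the sketch below indicates what they might look like, with all paths being assumptions (the line numbers referenced above refer to the real file and do not match this simplified version).

```bash
#!/bin/bash
# Simplified sketch of auto_code.sh (the real file appears in Figure 10; its
# line numbering does not correspond to this sketch). All paths are assumptions.

# Change from the current directory to the folder containing the main program.
cd /home/jetson/multifunctional_system

# Activate the Python virtual environment named "opencv_cuda".
source ~/.virtualenvs/opencv_cuda/bin/activate

# Run the main control program.
python multifunctional.py

# The line appended to ~/.bashrc (Figure 11) would then be something like:
#   bash /home/jetson/multifunctional_system/auto_code.sh
```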

4.3. Data Set Collection

A face data set [74] with 26,000 images divided into 20 classes corresponding to 20 well-known Taiwanese actors or actresses was selected. Data were collected from selected videos and Google Images. After preprocessing by removing blurry images, a multitask cascaded CNN was used to align the faces in these images. The resulting training and testing data sets contained 20,000 and 6000 images, respectively. The gender and emotion classification functions were trained and tested on these face data sets.
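The paper does not name the specific implementation of the multitask cascaded CNN; the open-source mtcnn package provides one, and the sketch below uses it for a simplified detect-and-crop step standing in for the full alignment procedure.

```python
# Minimal sketch (assumption): detecting and cropping faces with the open-source
# `mtcnn` package as a simplified stand-in for the paper's alignment pipeline.
import cv2
from mtcnn import MTCNN

detector = MTCNN()

def extract_face(image_bgr, size=(160, 160)):
    """Return the largest detected face, resized for the recognition model."""
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    detections = detector.detect_faces(image_rgb)
    if not detections:
        return None
    # Keep the detection with the largest bounding box area.
    x, y, w, h = max(detections, key=lambda d: d["box"][2] * d["box"][3])["box"]
    x, y = max(x, 0), max(y, 0)
    return cv2.resize(image_rgb[y:y + h, x:x + w], size)

# Example usage: face = extract_face(cv2.imread("member_001.jpg"))
```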
The age classification function was evaluated using a face age data set [79]. Face images of people aged 2–80 years are included in this data set. Face images were only selected from eight age groups (0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 47–53, and 60+ years) to evaluate the classification results precisely. The testing data set comprised 1800 face images randomly selected from the face age data set.
Our lab data set, which comprises images of the lab members, was used to test the face recognition function. This data set contains 10,000 images divided into 10 classes, each of which represents a person. The training and testing data sets contained 7000 and 3000 images, respectively.
Moreover, the proposed system was used on several routes at Feng Chia University (Figure 12). The input and output images were archived during the system implementation to analyze and assess the proposed system. A total of 500 input images constituted the collected data set, which was used to test all system functions other than face recognition.

5. Experimental Results

5.1. Analysis Results for Face Recognition

The face recognition function was evaluated using our lab data set comprising images of the lab members. The model’s performance was evaluated using precision, recall, and F1 score as the primary metrics. These metrics assess classification accuracy by accounting for both false positives and false negatives and are thus highly applicable to assistive technology for the visually impaired, where accurate and reliable detection strongly affects the usability and safety of the system. Table 3 presents the results obtained in this study. The precision, recall, and F1 score values were extremely high; in particular, recall reached the maximum value of 1 for all classes. Even the lowest precision and F1 score values were 0.96 and 0.98, respectively. Therefore, the face recognition function had excellent performance.
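For reference, these metrics are defined in terms of true positives (TP), false positives (FP), and false negatives (FN) as follows:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```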

5.2. Gender Classification Results

The gender classification function, namely the SSR-Net model [44], was evaluated on the testing face data set containing 6000 images belonging to 20 classes (Section 4.3) [74]. The relevant results are presented in Figure 13.
The confusion matrix displayed in Figure 14 presents the gender classification results. The true positive rates (TPRs) for the male and female classes were 96.88% and 96.30%, respectively. Moreover, the corresponding false positive and false negative rates were low for both classes. These findings confirm that the gender classification function had exceptional performance.

5.3. Analysis Results of Age Classification

The straightforward CNN model suggested by Levi and Hassner [49] was applied to the face age testing data set [79] described in Section 4.3. In this study, we excluded the 0–2 and 4–6 age groups from the evaluation, as individuals in these age ranges are not socially independent and thus not considered relevant in practical scenarios for assistive systems designed for visually impaired users. Six age groups were assessed: 8–13, 15–20, 25–32, 38–43, 47–53, and 60+ years. Figure 15 depicts the relevant confusion matrix. The 25–32 age group demonstrated the highest TPR. Some misclassifications across related age groups could be caused by subjective visual cues such as makeup, facial expressions, or variations in lighting and image quality. These features illustrate the challenges in estimating precise ages in practical settings.
Table 4 presents the precision, recall, and F1 score. According to this table, the age groups of 25–32 and 60+ years exhibited the highest precision of 0.75, whereas the age group of 8–13 years exhibited the highest recall and F1 score of 0.96 and 0.82, respectively. The average values of precision, recall, and F1 score were 0.71, 0.67, and 0.67, respectively.

5.4. Emotion Classification Results

The real-time CNN model suggested by Arriaga et al. [42] for classifying seven emotions was evaluated on the testing face data set [74] described in Section 4.3, and the corresponding results are presented in Figure 16. The most frequently predicted emotion was neutral (49.09%), followed by happy (28.81%) and angry (9.02%). Disgusted was the least predicted emotion (0.13%).

5.5. Analysis Results of Object Detection

The YOLOv3 object detection model was tested on several routes at Feng Chia University. The top 10 most frequently detected object classes were person, car, potted plant, backpack, truck, bench, handbag, motorcycle, umbrella, and fire hydrant (Figure 17). The most frequently detected object class was person (2124 detections), followed by potted plant (726 detections) and car (621 detections). The three least frequently detected object classes were motorcycle, umbrella, and fire hydrant.
The three most commonly detected classes were used for evaluation because the number of detected objects was insufficient in the other classes. Figure 18 displays the model’s precision and recall. For all three object classes, the precision values were consistently high. By contrast, the recall values differed between the classes. The recall was high for person and car but relatively low for potted plant. Figure 19 and Figure 20 present sample images for object detection, which reveal that the proposed system can achieve excellent results in object detection. The results of this system are reasonably accurate and can provide sufficient information to people with visual impairment in the examined environment.

5.6. Processing Time of System Functions

Processing time also affects system performance; a system should achieve a favorable trade-off between accuracy and responsiveness. The average processing time of each system function is presented in Table 5. Function 1 is person detection and involves face and emotion classifications, Function 2 is stranger identification and involves person detection with gender and age classifications, and Function 3 is object detection. A frame rate of 4.70 fps was achieved for object detection; thus, Function 3 required the least processing time. Functions 1 and 2 achieved similar frame rates of 1.88 and 1.65 fps, respectively. These processing times are acceptable because the functions are not continuously invoked.

6. Conclusions

In this paper, we propose a multifunctional embedded system based on deep learning techniques. The major aim of the device is to aid visually impaired individuals in recognizing their environment. The system provides cognitive support by recognizing familiar people and their emotions, estimating the age, gender, and emotional state of unknown people, and detecting surrounding objects. These features enable visually impaired individuals to move safely through their surroundings while understanding the environment, thus promoting independence and confidence.
The system was implemented on the NVIDIA Jetson AGX Xavier board, which provides impressive computational efficiency, low latency, and improved portability. The experiments demonstrated superior performance in face recognition, gender classification, and object detection and satisfactory performance in emotion and age estimation. Moreover, by incorporating a remote control for function selection and voice output, the system provides easy and intuitive interaction.
Nevertheless, certain limitations should be mentioned. The text-to-speech output of the current system is English only, which may restrict usability for non-English speakers. In addition, as with most vision-based systems, the performance of the proposed method is influenced by environmental conditions such as low illumination or partial occlusion. The frame rate of the face-related functions is moderate, which is sufficient for assistive purposes but could be further improved with additional optimization.
Future studies will tackle these challenges by adding multilingual support, improving performance under low-light conditions (for example, by exploiting infrared or depth sensors), and accelerating model inference using techniques such as ONNX conversion and TensorRT. We also intend to broaden the face recognition data to include other ethnic groups, facial shapes, and lighting environments so that the system becomes more robust, equitable, and usable in real-world environments. Field trials with visually impaired participants are planned to test the system in naturalistic scenarios and to drive developments guided by end-user needs.
In conclusion, the system presented herein represents a substantial advance in assistive technology because it merges real-time embedded processing, a wide range of deep learning capabilities, and an easy-to-use interface. With further refinement, it has the potential to significantly improve the independence, confidence, and overall quality of life of people with visual impairment.

Author Contributions

H.-H.N. and F.-C.L. conceived and designed the experiments; H.-H.N. and H.L.L. performed the experiments; F.-C.L. and H.L.L. analyzed the data and contributed to the materials and analysis tools; H.-H.N. and H.L.L. wrote the manuscript. Finally, F.-C.L. and H.L.L. gave relevant materials and valuable comments for this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Education and Training, Vietnam (Grant Number: B2023-TNA-21).

Institutional Review Board Statement

Ethical review and approval were waived for this study because all participants involved in the study were members of the research team and co-authors of the manuscript. The study did not involve any external subjects or sensitive personal data.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Capi, G. Development of a new robotic system for assisting and guiding visually impaired people. In Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO), Guangzhou, China, 11–14 December 2012; pp. 229–234. [Google Scholar]
  2. Katzschmann, R.K.; Araki, B.; Rus, D. Safe local navigation for visually impaired users with a time-of-flight and haptic feedback device. IEEE Trans. Neural Syst. Rehabil. Eng. 2018, 26, 583–593. [Google Scholar] [CrossRef] [PubMed]
  3. Nada, A.A.; Fakhr, M.A.; Seddik, A.F. Assistive infrared sensor based smart stick for blind people. In Proceedings of the Science and Information Conference (SAI), London, UK, 28–30 July 2015; pp. 1149–1154. [Google Scholar]
  4. Kumar, D.; Sudha, K.; Gaur, A.; Sharma, A.; Vandana; Sharma, S. Assistive ultrasonic sensor based smart blind stick using fuzzy logic. J. Inf. Optim. Sci. 2022, 43, 233–237. [Google Scholar] [CrossRef]
  5. Șipoș, E.; Ciuciu, C.; Ivanciu, L. Sensor-based prototype of a smart assistant for visually impaired people—Preliminary results. Sensors 2022, 22, 4271. [Google Scholar] [CrossRef] [PubMed]
  6. Kang, M.C.; Chae, S.H.; Sun, J.Y.; Yoo, J.W.; Ko, S.J. A novel obstacle detection method based on deformable grid for the visually impaired. IEEE Trans. Consum. Electron. 2015, 61, 376–383. [Google Scholar] [CrossRef]
  7. Mekhalfi, M.L.; Melgani, F.; Zeggada, A.; Natale, F.G.B.D.; Salem, M.A.M.; Khamis, A. Recovering the sight to blind people in indoor environments with smart technologies. Expert Syst. Appl. 2016, 46, 129–138. [Google Scholar] [CrossRef]
  8. Yang, K.; Wang, K.; Bergasa, L.M.; Romera, E.; Hu, W.; Sun, D.; Sun, J.; Cheng, R.; Chen, T.; López, E. Unifying terrain awareness for the visually impaired through real-time semantic segmentation. Sensors 2018, 18, 1506. [Google Scholar] [CrossRef]
  9. Ashiq, F.; Asif, M.; Ahmad, M.B.; Zafar, S.; Masood, K.; Mahmood, T.; Mahmood, M.T.; Lee, I.H. CNN-based object recognition and tracking system to assist visually impaired people. IEEE Access 2022, 10, 14819–14834. [Google Scholar] [CrossRef]
  10. Li, G.; Xu, J.; Li, Z.; Chen, C.; Kan, Z. Sensing and navigation of wearable assistance cognitive systems for the visually impaired. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 122–133. [Google Scholar] [CrossRef]
  11. Cheraghi, S.A.; Namboodiri, V.; Walker, L. Guidebeacon: Beacon-based indoor wayfinding for the blind, visually impaired, and disoriented. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom), Kona, HI, USA, 13–17 March 2017; pp. 121–130. [Google Scholar]
  12. Tanveer, M.S.R.; Hashem, M.M.A.; Hossain, M.K. Android assistant eyemate for blind and blind tracker. In Proceedings of the 18th International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, 21–23 December 2015; pp. 266–271. [Google Scholar]
  13. Khan, A.; Khusro, S. An insight into smartphone-based assistive solutions for visually impaired and blind people: Issues, challenges and opportunities. Univers. Access Inf. Soc. 2021, 20, 265–298. [Google Scholar] [CrossRef]
  14. Ullah, M.; Khusro, S.; Khan, M.; Alam, I.; Khan, I.; Niazi, B. Smartphone-based cognitive assistance of blind people in room recognition and awareness. Mob. Inf. Syst. 2022, 2022, 6068584. [Google Scholar] [CrossRef]
  15. Islam, M.M.; Sadi, M.S.; Zamli, K.Z.; Ahmed, M.M. Developing walking assistants for visually impaired people: A review. IEEE Sens. J. 2019, 19, 2814–2828. [Google Scholar] [CrossRef]
  16. Han, J.; Kim, J.; Kim, S.; Wang, S. Effectiveness of image augmentation techniques on detection of building characteristics from street view images using deep learning. J. Constr. Eng. Manag. 2024, 150, 04024129. [Google Scholar] [CrossRef]
  17. Angin, P.; Bhargava, B.K. Real-time mobile-cloud computing for context-aware blind navigation. Int. J. Next Gener. Comput. 2011, 2, 405–414. [Google Scholar]
  18. Trabelsi, R.; Jabri, I.; Melgani, F.; Smach, F.; Conci, N.; Bouallegue, A. Indoor object recognition in RGBD images with complex-valued neural networks for visually-impaired people. Neurocomputing 2019, 330, 94–103. [Google Scholar] [CrossRef]
  19. Chaudary, B.; Pohjolainen, S.; Aziz, S.; Arhippainen, L.; Pulli, P. Teleguidance-based remote navigation assistance for visually impaired and blind people—Usability and user experience. Virtual Real. 2021, 27, 141–158. [Google Scholar] [CrossRef]
  20. Lo Valvo, A.; Croce, D.; Garlisi, D.; Giuliano, F.; Giarré, L.; Tinnirello, I. A navigation and augmented reality system for visually impaired people. Sensors 2021, 21, 3061. [Google Scholar] [CrossRef]
  21. Ngo, H.H.; Lin, F.C.; Sehn, Y.T.; Tu, M.; Dow, C.R. A room monitoring system using deep learning and perspective correction techniques. Appl. Sci. 2020, 10, 4423. [Google Scholar] [CrossRef]
  22. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–10 September 2015; pp. 1–12. [Google Scholar]
  23. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  24. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4690–4699. [Google Scholar]
  25. Chaudhry, S.; Chandra, R. Design of a mobile face recognition system for visually impaired persons. arXiv 2015, arXiv:1502.00756. [Google Scholar]
  26. Chen, S.; Yao, D.; Cao, H.; Shen, C. A novel approach to wearable image recognition systems to aid visually impaired people. Appl. Sci. 2019, 9, 3350. [Google Scholar] [CrossRef]
  27. Mocanu, B.; Tapu, R.; Zaharia, T. DEEP-SEE FACE: A mobile face recognition system dedicated to visually impaired people. IEEE Access 2018, 6, 51975–51985. [Google Scholar] [CrossRef]
  28. Neto, L.B.; Grijalva, F.; Maike, V.R.M.L.; Martini, L.C.; Florencio, D.; Baranauskas, M.C.C.; Rocha, A.; Goldenstein, S. A Kinect-Based Wearable Face Recognition System to Aid Visually Impaired Users. IEEE Trans. Hum.-Mach. Syst. 2017, 47, 52–64. [Google Scholar] [CrossRef]
  29. Lin, F.; Wu, Y.; Zhuang, Y.; Long, X.; Xu, W. Human gender classification: A review. Int. J. Biom. 2016, 8, 275–300. [Google Scholar] [CrossRef]
  30. Agbo-Ajala, O.; Viriri, S. Deep learning approach for facial age classification: A survey of the state-of-the-art. Artif. Intell. Rev. 2020, 54, 179–213. [Google Scholar] [CrossRef]
  31. Punyani, P.; Gupta, R.; Kumar, A. Neural networks for facial age estimation: A survey on recent advances. Artif. Intell. Rev. 2020, 53, 3299–3347. [Google Scholar] [CrossRef]
  32. Ashok, A.; John, J. Facial expression recognition system for visually impaired. In Proceedings of the International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI), Coimbatore, Tamil Nadu, India, 7–8 August 2018; pp. 244–250. [Google Scholar]
  33. Das, S. A novel emotion recognition model for the visually impaired. In Proceedings of the IEEE 5th International Conference for Convergence in Technology (I2CT), Pune, India, 29–31 March 2019; pp. 1–6. [Google Scholar]
  34. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  35. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. arXiv 2016, arXiv:1512.02325. [Google Scholar]
  36. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  37. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  38. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870. [Google Scholar]
  39. Islam, M.T.; Ahmad, M.; Bappy, A.S. Microprocessor-based smart blind glass system for visually impaired people. In Proceedings of the International Joint Conference on Computational Intelligence, Dhaka, Bangladesh, 28–29 December 2018; pp. 151–161. [Google Scholar]
  40. Joshi, R.C.; Yadav, S.; Dutta, M.K.; Travieso-Gonzalez, C.M. Efficient multi-object detection and smart navigation using artificial intelligence for visually impaired people. Entropy 2020, 22, 941. [Google Scholar] [CrossRef]
  41. Aza, V.; Indrabayu; Areni, I.S. Face recognition using local binary pattern histogram for visually impaired people. In Proceedings of the International Seminar on Application for Technology of Information and Communication (iSemantic), Semarang, Indonesia, 21–22 September 2019; pp. 241–245. [Google Scholar]
  42. Arriaga, O.; Ploger, P.G.; Valdenegro, M. Real-time convolutional neural networks for emotion and gender classification. arXiv 2017, arXiv:1710.07557. [Google Scholar]
  43. Liew, S.S.; Hani, M.K.; Radzi, S.A.; Bakhteri, R. Gender classification: A convolutional neural network approach. Turk. J. Electr. Eng. Comput. Sci. 2016, 24, 1248–1264. [Google Scholar] [CrossRef]
  44. Yang, T.Y.; Huang, Y.H.; Lin, Y.Y.; Hsiu, P.C.; Chuang, Y.Y. SSR-Net: A compact soft stagewise regression network for age estimation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 1078–1084. [Google Scholar]
  45. SSR-Net. GitHub. 2018. Available online: https://github.com/shamangary/SSR-Net (accessed on 15 October 2022).
  46. Dhomne, A.; Kumar, R.; Bhan, V. Gender recognition through face using deep learning. Procedia Comput. Sci. 2018, 132, 2–10. [Google Scholar] [CrossRef]
  47. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  48. Khan, K.; Attique, M.; Syed, I.; Gul, A. Automatic gender classification through face segmentation. Symmetry 2019, 11, 770. [Google Scholar] [CrossRef]
  49. Levi, G.; Hassner, T. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 34–42. [Google Scholar]
  50. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  51. Agbo-Ajala, O.; Viriri, S. Face-based age and gender classification using deep learning model. In Proceedings of the Image and Video Technology, Sydney, NSW, Australia, 18–22 November 2019; pp. 125–137. [Google Scholar]
  52. Nam, S.H.; Kim, Y.H.; Truong, N.Q.; Choi, J.; Park, K.R. Age estimation by super-resolution reconstruction based on adversarial networks. IEEE Access 2020, 8, 17103–17120. [Google Scholar] [CrossRef]
  53. Liao, H.; Yan, Y.; Dai, W.; Fan, P. Age estimation of face images based on CNN and divide-and-rule strategy. Math. Probl. Eng. 2018, 2018, 1712686. [Google Scholar] [CrossRef]
  54. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  55. Zhang, K.; Gao, C.; Guo, L.; Sun, M.; Yuan, X.; Han, T.X.; Zhao, Z.; Li, B. Age group and gender estimation in the wild with deep RoR architecture. IEEE Access 2017, 5, 22492–22503. [Google Scholar] [CrossRef]
  56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  57. Hu, P.; Cai, D.; Wang, S.; Yao, A.; Chen, Y. Learning supervised scoring ensemble for emotion recognition in the wild. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, New York, NY, USA, 13–17 November 2017; pp. 553–560. [Google Scholar]
  58. Cai, J.; Meng, Z.; Khan, A.S.; Li, Z.; O’Reilly, J.; Tong, Y. Island loss for learning discriminative features in facial expression recognition. In Proceedings of the 13th IEEE International Conference on Automatic Face Gesture Recognition, Xi’an, China, 15–19 May 2018; pp. 302–309. [Google Scholar]
  59. Bargal, S.A.; Barsoum, E.; Ferrer, C.C.; Zhang, C. Emotion recognition in the wild from videos using images. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 433–436. [Google Scholar]
  60. Zhang, K.; Huang, Y.; Du, Y.; Wang, L. Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Trans. Image Process. 2017, 26, 4193–4203. [Google Scholar] [CrossRef]
  61. Liu, M.; Li, S.; Shan, S.; Chen, X. AU-aware deep networks for facial expression recognition. In Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, Shanghai, China, 22–26 April 2013; pp. 1–6. [Google Scholar]
  62. Long, N.; Wang, K.; Cheng, R.; Hu, W.; Yang, K. Unifying obstacle detection, recognition, and fusion based on millimeter wave radar and rgb-depth sensors for the visually impaired. Rev. Sci. Instrum. 2019, 90, 044102. [Google Scholar] [CrossRef]
  63. Khade, S.; Dandawate, Y.H. Hardware implementation of obstacle detection for assisting visually impaired people in an unfamiliar environment by using raspberry pi. In Proceedings of the Smart Trends in Information Technology and Computer Communications, Jaipur, Rajasthan, India, 6–7 August 2016; pp. 889–895. [Google Scholar]
  64. Wang, S.; Kim, M.; Hae, H.; Cao, M.; Kim, J. The development of a rebar-counting model for reinforced concrete columns: Using an unmanned aerial vehicle and deep-learning approach. J. Constr. Eng. Manag. 2023, 149, 04023111. [Google Scholar] [CrossRef]
  65. Eum, I.; Kim, J.; Wang, S.; Kim, J. Heavy equipment detection on construction sites using you only look once (YOLO-version 10) with transformer architectures. Appl. Sci. 2025, 15, 2320. [Google Scholar] [CrossRef]
  66. Yu, X.; Salimpour, S.; Queralta, J.P.; Westerlund, T. General-purpose deep learning detection and segmentation models for images from a lidar-based camera sensor. Sensors 2023, 23, 2936. [Google Scholar] [CrossRef]
  67. Alokasi, H.; Ahmad, M.B. Deep learning-based frameworks for semantic segmentation of road scenes. Electronics 2022, 11, 1884. [Google Scholar] [CrossRef]
  68. Jung, S.; Heo, H.; Park, S.; Jung, S.U.; Lee, K. Benchmarking deep learning models for instance segmentation. Appl. Sci. 2022, 12, 8856. [Google Scholar] [CrossRef]
  69. Lee, D.H.; Park, H.Y.; Lee, J. A review on recent deep learning-based semantic segmentation for urban greenness measurement. Sensors 2024, 24, 2245. [Google Scholar] [CrossRef]
  70. Tian, Y.; Yang, X.; Yi, C.; Arditi, A. Toward a computer vision-based wayfinding aid for blind persons to access unfamiliar indoor environments. Mach. Vis. Appl. 2013, 24, 521–535. [Google Scholar] [CrossRef]
  71. Ko, E.; Kim, E.Y. A vision-based wayfinding system for visually impaired people using situation awareness and activity-based instructions. Sensors 2017, 17, 1882. [Google Scholar] [CrossRef]
  72. Tapu, R.; Mocanu, B.; Zaharia, T. Seeing without sight—An automatic cognition system dedicated to blind and visually impaired people. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 1452–1459. [Google Scholar]
  73. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  74. Lin, F.C.; Ngo, H.H.; Dow, C.R. A cloud-based face video retrieval system with deep learning. J. Supercomput. 2020, 76, 8473–8493. [Google Scholar] [CrossRef]
  75. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar]
  76. FaceNet. GitHub. 2016. Available online: https://github.com/davidsandberg/facenet/ (accessed on 9 August 2022).
  77. Dow, C.R.; Ngo, H.H.; Lee, L.H.; Lai, P.Y.; Wang, K.C.; Bui, V.T. A crosswalk pedestrian recognition system by using deep learning and zebra-crossing recognition techniques. Softw. Pract. Exp. 2019, 50, 630–644. [Google Scholar] [CrossRef]
  78. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. arXiv 2015, arXiv:1405.0312. [Google Scholar]
  79. Cheng, J.; Li, Y.; Wang, J.; Yu, L.; Wang, S. Exploiting effective facial patches for robust gender recognition. Tsinghua Sci. Technol. 2019, 24, 333–345. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed system.
Figure 2. Function keys of the Logitech remote control.
Figure 3. Function selection flowchart.
Figure 4. Overview of the face recognition function of the proposed system.
Figure 5. Inception-ResNet-v1 network [75].
Figure 6. Overview of the gender, age, and emotion classification framework.
Figure 7. Overview of the object detection function of the proposed system.
Figure 8. Devices used for the implementation of the proposed system.
Figure 9. Setting preferences for startup applications.
Figure 11. Illustration of the “.bashrc” file.
Figure 12. Testing routes at Feng Chia University.
Figure 13. Gender distribution in the testing data set.
Figure 14. Confusion matrix of the gender classification model.
Figure 15. Age classification confusion matrix.
Figure 16. Results of emotion classification.
Figure 17. Top 10 most frequently detected object classes on the considered routes at Feng Chia University.
Figure 18. Precision and recall of the object detection model.
Figure 19. Object detection results on routes 1–2 at Feng Chia University.
Figure 20. Object detection results for routes 3–6 at Feng Chia University.
Table 1. Program for key code identification.

import cv2
import numpy

print("Please press any key...")
# Blank image used only to open a window that captures keyboard input
image = numpy.zeros([512, 512, 1], dtype=numpy.uint8)
while True:
    cv2.imshow("The input key code testing (press Esc key to exit)", image)
    key_input = cv2.waitKey(1) & 0xFF
    if key_input != 0xFF:  # a key was pressed
        print(key_input)
    if key_input == 27:  # Esc key pressed
        break
cv2.destroyAllWindows()
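The key codes printed by this program can then be mapped to the system's three functions, following the function selection flowchart in Figure 3. The short sketch below is an illustration only: the key codes and function names are hypothetical placeholders, not the actual mapping used by the system, and the real codes reported by the Logitech R400 must first be obtained with the program above.

# Hypothetical mapping from remote-control key codes to system functions.
# The codes 85, 86, and 116 are assumptions for illustration only.
KEY_TO_FUNCTION = {
    85: "recognize known people",      # assumed code for one remote key
    86: "describe unknown people",     # assumed code for another remote key
    116: "detect objects",             # assumed code for a third remote key
}

def dispatch(key_input):
    # Return the selected function name, or None if the key is unassigned.
    return KEY_TO_FUNCTION.get(key_input)

print(dispatch(85))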
Table 2. Technical specifications of the implemented devices.
1. Computer: Intel Core i7 CPU 3.70 GHz, 64-bit Windows operating system, 32 GB RAM, NVIDIA TITAN V GPU.
2. Jetson AGX Xavier: 8-core NVIDIA Carmel 64-bit CPU, 64-bit Ubuntu 18.04 operating system, 512-core NVIDIA Volta GPU with 64 Tensor Cores.
3. Power bank: Enerpad universal power bank; input: DC 24 V/2 A; USB output: DC 5 V/2.4 A; AC output: AC 120 V/60 Hz.
4. Remote controller: Logitech R400 remote controller; wireless operating distance: approx. 10 m; wireless technology: 2.4 GHz.
5. Webcam: Logitech C920 webcam; max resolution: 1080p/30 fps or 720p/30 fps.
6. Headphone: E-books E-EPA184 Bluetooth headphone.
7. Audio transmitter adapter: RX-TX-10 Bluetooth transmitter and receiver adapter; battery: 200 mAh; power voltage: 3.7 V; Bluetooth 4.2.
8. HDMI to VGA/audio adapter: Mini HDMI to VGA video converter HD cable adapter; input: Mini HDMI; output: VGA and audio.
Table 3. Results of the face recognition model of the proposed system.
Class        Precision    Recall    F1 Score
An           1.00         1.00      1.00
Huy          1.00         1.00      1.00
Jason        1.00         1.00      1.00
Lam          0.99         1.00      0.99
Prof. Dow    0.99         1.00      0.99
Prof. Lin    1.00         1.00      1.00
Rich         0.99         1.00      0.99
Thanh        0.99         1.00      0.99
Tung         0.96         1.00      0.98
Wolf         0.99         1.00      0.99
Table 4. Age classification results.
Age Group    8–13    15–20    25–32    38–43    48–53    60+     Average
Precision    0.72    0.68     0.75     0.69     0.64     0.75    0.71
Recall       0.96    0.69     0.78     0.49     0.64     0.43    0.67
F1 Score     0.82    0.68     0.76     0.57     0.64     0.55    0.67
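The per-class precision, recall, and F1 scores in Tables 3 and 4 follow the standard definitions and can be derived from a confusion matrix such as those in Figures 14 and 15. The following minimal sketch, assuming scikit-learn is available and using hypothetical label lists, illustrates the computation; it is not the authors' evaluation script.

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical ground-truth and predicted age-group labels for illustration only
labels = ["8-13", "15-20", "25-32", "38-43", "48-53", "60+"]
y_true = ["8-13", "15-20", "25-32", "25-32", "38-43", "60+"]
y_pred = ["8-13", "25-32", "25-32", "25-32", "48-53", "60+"]

print(confusion_matrix(y_true, y_pred, labels=labels))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, F1={f:.2f}")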
Table 5. Results of average processing time.
Function                                 Function 1    Function 2    Function 3
Processing time (frames per second)      1.88          1.65          4.70
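For reference, the average processing time in frames per second can be obtained by timing the per-frame processing loop. The sketch below is an illustration only: it assumes OpenCV, a connected camera, and a placeholder process_frame callable standing in for each of the three functions, and it is not the measurement code used in the experiments.

import time
import cv2

def measure_fps(process_frame, camera_index=0, num_frames=100):
    # Measure the average number of frames processed per second.
    cap = cv2.VideoCapture(camera_index)
    start = time.time()
    processed = 0
    while processed < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        process_frame(frame)  # placeholder for face recognition or object detection
        processed += 1
    elapsed = time.time() - start
    cap.release()
    return processed / elapsed if elapsed > 0 else 0.0

print(measure_fps(lambda frame: None))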
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
