Deep Learning Reader for Visually Impaired

Recent advances in machine and deep learning algorithms and enhanced computational capabilities have revolutionized healthcare and medicine. Research on assistive technology has benefited from these advances in creating visual substitution for visual impairment. People with visual impairment face several obstacles in reading printed text, which is normally substituted with a pattern-based display known as Braille. Over the past decade, more wearable and embedded assistive devices and solutions have been created to help people with visual impairment read text. However, assistive tools for comprehending the meaning embedded in images or objects are still limited. In this paper, we present a deep learning approach for people with visual impairment that addresses this issue with a voice-based form to represent and illustrate images embedded in printed texts. The proposed system is divided into three phases: collecting input images, extracting features for training the deep learning model, and evaluating performance. The proposed approach leverages deep learning algorithms, namely a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), for extracting salient features, captioning images, and converting written text to speech. The CNN is implemented for detecting features from the printed image and its associated caption. The LSTM network is used as a captioning tool to describe the detected text from images. The identified captions and detected text are converted into a voice message for the user via a Text-To-Speech API. The proposed CNN-LSTM model is investigated using various network architectures, namely GoogleNet, AlexNet, ResNet, SqueezeNet, and VGG16. The empirical results show that the CNN-LSTM based training model with the ResNet architecture achieved the highest image caption prediction accuracy, 83%.


Introduction
Over the past decade, machine learning algorithms and applications have contributed to new advances in the field of assistive technology. Researchers are leveraging such advancements to continuously improve human quality of life, especially for those with disabilities or alarming health conditions [1]. Assistive technology (AT) deploys devices, services, or programs to improve the functional capabilities of people with disabilities [2]. The scope of assistive technology research comprises hearing impairment, visual impairment, and cognitive impairment, among others [3][4][5].
Vision impairment can range from mild, moderate, or severe vision impairment to total blindness. In light of the recent advances in machine learning and deep learning, research studies and new solutions for people with visual impairment have gained more popularity. The main goal is to provide people with visual impairment with visual substitution by creating navigation or orientation solutions. Such solutions can ensure self-independence, confidence, and safety for people with visual impairment in daily tasks [6]. According to estimates, approximately 253 million individuals suffer from visual impairments: 217 million have low-to-high vision impairments, and 36 million are blind. Figures have also shown that, amongst this population, 4.8% are born with visual deficiencies, such as blindness; for 90% of these individuals, their ailments have different causes, including accidents, diabetes, glaucoma, and macular degeneration. The world's population is not only growing but also getting older, meaning more people will lose their sight due to chronic diseases [7]. Such impediments can have knock-on effects; for example, individuals with visual impairments who want an education may need specialized help in the form of a helper or equipment. Learners with visual impairments can now make use of course content in different forms, such as audiotapes, Braille, and magnified material [8]. It is worth noting that these tools read the text rather than the images. Technological advancements have been employed in educational environments to assist people with visual impairment, blind people, and special-needs learners, and these developments, particularly concerning machine learning, are ongoing.
The main objective of visual impairment research is to achieve vision enhancement, vision replacement, or vision substitution, as originally classified by Welsh Richard in 1981 [9]. Vision enhancement involves acquiring signals from a camera, which are processed to produce an output display through a head-mounted device. Vision replacement deals with delivering visual information to the visual cortex of the human brain or to the optic nerve. Vision substitution concentrates on delivering nonvisual output in the form of auditory signals [10,11]. In this paper, we focus on a vision substitution solution that delivers a vocal description of both printed texts and images to people with visual impairment. There are three main areas of concentration in research on people with visual impairment, namely mobility, object detection and recognition, and navigation. In the era of data explosion and information availability, it is imperative to consider means of information access for people with visual impairment, especially for printed information and images [6]. Over the past decades, authors have leveraged state-of-the-art machine learning algorithms to develop solutions supporting each of these areas.
Deep learning has grown in prominence as a field of study that seeks innovative approaches for automating different tasks depending on input data [12][13][14][15][16][17][18]. Deep learning is a type of artificial intelligence technique that can be used for image classification, recognition, virtual assistants, healthcare, authentication systems, natural language processing, fraud detection, and other purposes. This study describes an Intelligent Reader system that employs deep learning techniques to help people with visual impairment read a printed textbook and have its images described. In the proposed technique, a Convolutional Neural Network (CNN) [19] is utilised to extract features from input images, while Long Short-Term Memory (LSTM) [20] is used to describe the visual information in an image. The intelligent learning system generates a voice message comprising the text and graphic information from a printed textbook using the text-to-speech approach. Deep learning-based technologies increase image-related task performance and can help people with visual impairment live better lives. The overall architecture of the proposed solution is shown in Figure 1.
The proposed intelligent reader system reads text using optical character recognition (OCR) and the Google Text-to-Speech (TTS) approach, which converts textual input into voice messages. The CNN-LSTM model is trained on the input images to predict the appropriate caption for an image and sends it to the intelligent reader system. The reader system transmits all data to visually impaired users in the form of audio messages. The proposed approach is divided into three phases: acquiring input images, extracting features for training the deep learning model, and assessing performance. The efficiency of the constructed model is evaluated using different deep learning architectures, including ResNet, AlexNet, GoogleNet, SqueezeNet, and VGG16. The experimental results suggest that the ResNet network design outperforms the other architectures in terms of accuracy.
This paper provides the following contributions. First, it delivers an Electronic Travel Aid (ETA) vision substitution solution for people with visual impairment that includes spatial inputs such as photography or visual content. Although many studies have proposed text-to-speech solutions, this paper utilizes deep learning capabilities to describe images as well as text to a person with visual impairment. Second, it briefs the reader on the most significant deep learning architectures for image recognition, along with the main features of each architecture. Finally, this paper proposes and implements a deep learning architecture utilizing CNN and LSTM algorithms. Content is extracted from text and images with the former algorithm, and captions are predicted with the latter. In recent decades, many researchers have developed assistive devices or systems that read textbooks for people with visual impairment, which helps them enhance their learning skills without the assistance of a tutor. Reading image content is a challenging task for visually impaired students. The proposed system is unique in that it incorporates the intelligence of two deep learning approaches, CNN and LSTM, to assist people with visual impairment in reading a textbook (both text and image content) without the assistance of a human. The proposed approach reads the text content in the book using OCR and then provides an audio message. If any images appear between the texts in the textbook, the system uses the CNN model to extract the features of the image and the LSTM model to generate the captions of the images. Following that, the image captions are translated into voice messages. As a result, visually challenged persons can understand the content of the textbook without ambiguity. The suggested method combines the benefits of OCR, CNN, LSTM, and TTS to read and describe the complete book content through audio/voice messages.
The rest of the paper is structured as follows. Section 2 covers previously proposed solutions for visual impairment. The preliminaries of the various deep learning architectures are explained in Section 3. The empirical findings and model evaluation are detailed in Section 4. We conclude in Section 5 with concluding remarks and future work.

Related Work
Vision impairment is a common disability with different levels of severity. Assistive technology has contributed to providing visual substitution in the form of products, devices, software, or systems [21][22][23]. Visual substitution is an alternative means of capturing visual images, directions, or movements and delivering them in a non-visual manner through audio or Braille [2]. Visual substitution can be divided into three main categories, namely Electronic Travel Aids (ETAs), Electronic Orientation Aids (EOAs), and Position Locator Devices (PLDs) [24]. An overview of each category is given below.

1. Electronic Travel Aids (ETAs): ETAs are devices that translate environmental information normally perceived through vision into a non-visual sensory form. They include sensing inputs such as a camera, Radio Frequency Identification (RFID), Bluetooth, or Near-Field Communication (NFC) to receive environmental inputs, and feedback modalities that deliver information to the user in a non-visual form such as audio, tactile output, or vibrations.

2. Electronic Orientation Aids (EOAs): EOA devices provide a navigation path and identify obstacles for people with visual impairment. The objective of EOA devices is to improve safety and mobility in unfamiliar environments by detecting obstacles and delivering information by means of audio or vibrations [25].

3. Position Locator Devices (PLDs): PLDs determine the precise position of a device using the Global Positioning System (GPS) and Geographic Information System (GIS). These technologies are limited to outdoor use and need to be coupled with other sensors to identify obstacles during navigation.

This paper delivers an ETA vision substitution solution for people with visual impairment. Technological developments, including computer vision, deep learning, and machine learning, are utilized in an autonomous learning system for people with visual impairment.

AT Based on Deep Learning Techniques
Calabrese et al. [26] outlined a low-cost wearable assistive technology (AT) device running on solar power that provides continuous, real-time object recognition to aid individuals with visual impairment. This system comprises three elements: a camera, a system on module (SoM) processing unit, and an ultrasonic sensor. The user wears the camera like a pair of glasses to provide real-time recordings, while the SoM, worn as a belt, processes information from the camera, and the sensor detects objects. Lin et al. [27] proposed a deep learning-based support system to heighten users' ability to perceive their surroundings. This system involves a wearable terminal with an earpiece, an RGBD camera, and an earphone. A CPU supports the deep learning computation, and a smartphone is employed whenever touch-based actions are necessary. The system also provides safe, clear walking directives thanks to the RGBD information and semantic maps. Afif et al. [28] employ a deep convolutional neural network architecture based on "RetinaNet" to create a system that detects indoor objects. Detection performance is assessed with different backbone networks, including AlexNet, GoogleNet, ResNet, SqueezeNet, and VGGNet. The system achieved a mean average precision (mAP) of 84.61%.
To assist people with visual impairments, Tasnim et al. [29] outlined an automatic solution to detect Bangladeshi banknotes by way of a convolutional neural network. For this purpose, a database containing thousands of images of these banknotes was created. The research proved successful: the system was 92% accurate in identifying the notes and could provide written and audio outputs. The researcher in [30] designed a smart glass system for blind people and people with visual impairment using computer vision and deep learning algorithms. This approach includes four distinct modules: low-light image enhancement, object detection, audio feedback, and tactile graphics generation. In the first module, a deep learning approach is used to improve the quality of the dark image, and objects and texts are recognized using an object recognition method. Finally, the text-to-speech module produces an audio output. In this method, the object detection model is trained on 133 different types of sounds. The ExDark data set is used to assess the effectiveness of the approach. Reading books, detecting currency notes, and determining parcel specifics are all challenging tasks for people with visual impairment. Mishra et al. [31] developed ChartVi, an automated chart summarising system that accepts various types of chart images, such as line, pie, and bar charts, and generates a summary. The CNN-VGG16 network model is utilised in this approach to identify the chart image category, and feature extraction techniques are then employed to automatically separate graphical and textual information. An inpainting method removes the grid lines from the chart. Finally, the chart summary is divided into three sections: prime, core, and wrapping. The prime section comprises the fundamental information about the chart, such as the title, axis titles, and range; the core section contains the actual meaning of the chart; and the wrapping section contains the details of multi-series charts. According to the empirical studies, ChartVi achieves 97.09% accuracy in chart type classification, >95% accuracy in textual segmentation, and 98% accuracy in graphical extraction. Developments such as these mean that blind individuals or those with visual impairments can participate in everyday activities.

AT Based on Raspberry Pi
According to Zamir et al. [32], a smart reader system built on the Raspberry Pi can turn text into spoken signals. A camera captures printed text, which is recognized by optical character recognition (OCR). The method creates a system that converts images to audio on the Raspberry Pi single-board computer. In [33], the authors presented the unified descriptor network Dual Desc, which could outperform the NetVLAD architecture in terms of describing images. A wearable device validates the approach on real-world information, and the suggested visual localization employs multimodal images to avoid the issues associated with RGB photos. The author of [34] developed a voice mentor system that reads content such as books, currency notes, and shopping parcels and provides audio output to the user. The Raspberry Pi is deployed in this approach to support the portable camera and deliver audio signals through headphones. Optical character recognition (OCR) is used to extract text from images before the text is transformed into audio. Chauhan et al. [35] use a Raspberry Pi 3B model and ultrasonic sensors to create Ikshana, an intelligent assistive device for vision-impaired users. This device is designed to help people with a variety of daily chores, including character recognition, facial detection, currency denomination identification, and obstacle detection. OCR software is used to extract text from printed books and internet content. The device's design includes a Raspberry Pi 3B model as the computing unit, a Raspberry Pi camera, buttons, and ultrasonic sensors. The headphone acts as a narrative agent, directing the audio output to the user.
A smart electronic assistive device consisting of two gadgets, a pair of glasses and a smart cane, was designed by Flores et al. [36]. The glasses utilise an image processing technique to recognise text, while the smart cane detects obstacles in the walking path by using VL53L0X and ultrasonic sensors. The developed device achieves 100% accuracy in obstacle detection, 98.13% accuracy in text recognition, and 91.33% accuracy in natural scene identification.

AT Based on Internet of Things (IoT)
The author [37] developed an intelligent assistive system based on machine learning and Internet of Things (IoT) to recognise people with visual impairment's acquaintances in their regular activities.The author built the proposed system using three major technologies: machine learning, image processing, and IoT.In this system, the data ingestion layer is used to store input images, while the data analysis layer analyses processed data and evaluates the system's accuracy and efficiency using machine learning.Finally, the application layer builds a mobile app that may be used to detect a new individual whose face samples must be saved in the cloud and that gives haptic response to the person with visual impairment when an acquaintance is detected.The researcher [38] created an IoT-based automatic object identification system that can recognise objects and currency notes in real time.This system employs four kinds of sensors to detect obstructions in the front, left, right, and floor directions.To detect the currency note, the Single Shot Detector (SSD) model using MobileNet and Tensorflow-lite is utilised.There were 365 people with visual impairment evaluated with this technology, and 82% of them thought the cost was acceptable, 13% thought it was moderate, and the remaining 5% thought it was relatively high.The proposed system's overall accuracy in object identification and recognition is 99.31% and 98.43%, respectively.

Image Captioning Techniques
Image captioning technique is utilised in a wide range of applications, including bridge damage detection, remote sensing image captioning, language caption synthesis, and construction.Chun et al. [39] used an image captioning technique to describe the damage state of a bridge.A deep learning model is used in this work to produce descriptive sentences from an image.This method can also detect many types of damage in bridge images and provide a full interpretation of complicated imagery.The real time dataset is created during inspection work on 3118 bridges controlled by Japan's Kanto Regional Development Bureau's MLIT from 2004 to 2018.The developed technique uses the Bilingual Evaluation Understudy (BLEU) score to evaluate the algorithm's performance.The proposed method achieves 69.3% accuracy for accurately generating explanatory phrases that give user-friendly, text-based descriptions of bridge damage in images.The researcher [40] used Meta captioning to develop a remote sensing image captioning system.The Meta characteristics are extracted from two tasks in this approach: remote sensing classification and natural image classification.Because of the scarcity of training dataset, effective remote sensing image captioning is extremely difficult.The Meta features are then employed for remote sensing image captioning.The ResNet network is used to train natural image categorization.To illustrate the efficiency of the Meta captioning framework, three distinct remote sensing captioning datasets were employed in the experimental analysis: Sydney-Captions, the Remote Sensing Image Captioning Dataset, and the University of California Merced dataset.
In [41], an integrated approach for extracting semantic information about items, behaviours, and interactions from construction images with visual links was devised.In this approach, the CNN model is used to extract the prominent features from the entire image, and the Mask R-CNN-based Encoder model is used to forecast the image's description words based on the input features.To train the model, 41,668 images were collected from 174 distinct construction sites and divided into training and validation sets.According to the results of the experimental analysis, the proposed method produces BLEU Scores of 0.61, 0.52, 0.44, and 0.36 for BLEU1, BLEU2, BLEU3, and BLEU4, respectively.Afyouni et al. [42] developed AraCap, an Arabic Image Caption Generation approach that combines an objectbased and image captioning framework.The COCO and Flickr30k datasets are used to assess the method's performance.The proposed method includes the object detection and image captioning processes in a sequential order.Using a similarity score, the proposed approach generates captions that are compared to original captions from public databases.The results show that the similarity scores of the proposed models for Arabic generated captions surpassed the basic captioning technique.The remote sensing captioning model was constructed [43] utilising a Variational Autoencoder and a Reinforcement Learning-based Two-stage Multi-task Learning Model (VRTMM).CNN is used in this method to extract both semantic and spatial characteristics from an image.Then, Reinforcement Learning is used to improve the quality of the generated phrases.To identify the remote sensing image scene, a publicly accessible Remote Sensing Image dataset of 31,500 images and 45 scene classifications was used.The results of the experiments illustrate that the proposed model is successful at remote sensing image captioning and produces a new state-of-the-art outcome.
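Several of the studies above report BLEU scores (BLEU1 to BLEU4). As a rough illustration of what such a score measures, the following minimal sketch computes a sentence-level BLEU against a single reference caption. It is a simplified version (one reference, no smoothing) and not the evaluation code of any cited study; the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Cumulative BLEU-max_n with uniform weights and a brevity penalty.
    Simplification: a single reference and no smoothing, so any zero
    n-gram precision yields a score of 0."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(log_avg)


reference = "a dog runs on the grass".split()
candidate = "a dog runs on the beach".split()
bleu1 = bleu(candidate, reference, max_n=1)  # unigram overlap only
bleu4 = bleu(candidate, reference, max_n=4)  # up to 4-grams
```

Higher-order scores such as BLEU4 are never larger than BLEU1 for the same caption pair here, which matches the decreasing pattern of the scores reported in [41].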
Table 1 illustrates the learning approach employed in recent studies designed for people with visual impairment.
Table 1. Learning approaches employed in recent studies.
Study | Year | Approach | System
Zamir et al. [32] | 2019 | Raspberry Pi | OCR-based text detection system
Calabrese et al. [26] | 2020 | DL | Object detection system
Afif et al. [28] | 2020 | Deep CNN | Indoor object detector system
 | | | System that assists people in determining their perspective of their environment
Figure 2 depicts a summary of the relevant literature. Recently, new innovations in the field of assistive technology have emerged, providing excellent assistance to people with visual impairment in a variety of ways. According to the above literature, 54% of researchers applied deep learning and artificial intelligence approaches to design an assistive device or system. The key advantage of these systems is that they are mobile apps, making them very easy for the user to utilise. The figure also shows that 23% of assistive devices are Raspberry Pi and IoT-based hardware devices. The remaining 23% of researchers analyse image captioning methods using deep learning. The significant applications in the literature discussed above include text recognition, currency note identification, bridge damage detection, language prediction, remote sensing image captioning, and facial recognition. It is extremely difficult for visually challenged persons to comprehend image information presented in textbooks, articles, and online advertisements. To overcome these limitations, the proposed system uses deep learning algorithms to provide image information in the form of audio output.

Preliminaries
Deep learning is a type of machine learning and artificial intelligence (AI) that learns representations directly from data. This approach helps academic researchers gather, assess, and interpret large volumes of information because it streamlines and speeds up the process.
A vast amount of research has been conducted in the field of computer vision in recent decades. Image classification, image segmentation, video tracking, pedestrian identification, and object detection are examples of computer vision applications. One of the most essential computer vision techniques is object detection, which is used to discover and locate objects or obstacles inside an image or video. Object detection approaches draw bounding boxes around the objects of interest in a given image. Several deep learning variations based on artificial neural networks have been employed, such as the Multilayer Perceptron (MLP), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM), where different architectures play an important role in different applications [47].

Multilayer Perceptron (MLP)
The Multilayer Perceptron is a feed-forward artificial neural network that has an input layer, an output layer, and one or more hidden layers [48]. Each perceptron computes a weighted sum of its inputs using its weights and applies an activation function, such as the Rectified Linear Unit (ReLU) [49] or the sigmoid, to produce its prediction. In the fully connected layers of an MLP, every node is connected to all nodes of the next and previous layers. Applications of the multilayer perceptron include speech recognition, pattern recognition, and sentiment analysis.
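The weighted-sum-plus-activation computation can be sketched in a few lines of plain Python. This is only an illustrative forward pass with one hidden layer; the layer sizes, the weight values, and the choice of ReLU for the hidden layer and sigmoid for the output are assumptions made for the example.

```python
import math


def relu(z):
    """Rectified Linear Unit: passes positive values, zeroes the rest."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: every output node sees every input."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input -> hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)


# Illustrative weights: 2 inputs, 2 hidden nodes, 1 output node.
prediction = mlp_forward(
    x=[1.0, 2.0],
    hidden_w=[[1.0, -1.0], [0.5, 0.5]],
    hidden_b=[0.0, 0.0],
    out_w=[[1.0, 0.0]],
    out_b=[0.0],
)
```

In practice the weights are learned by backpropagation rather than set by hand; the sketch covers inference only.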

Convolutional Neural Networks (CNNs)
A CNN [50] is a kind of deep learning neural network that has advanced the classification and recognition of images. CNNs are composed of several different basic layers followed by activation layers. A CNN is made up of three types of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layers form a succession that retrieves low- to high-level features from the input. The pooling layer is responsible for reducing the spatial dimensions of the convolved features. Pooling comes in two kinds: average pooling and maximum pooling. The former returns the average of the values from the part of the image within the kernel's boundaries, while the latter returns the largest value. The fully connected (FC) layer performs classification using the features retrieved by the previous layers and their various filters. FC layers typically use a softmax activation function to calculate the class label scores, yielding for each class a probability ranging from 0 to 1. Figure 3 shows the CNN architecture. Numerous CNN architectures, such as AlexNet [52], VGG16 [53], SqueezeNet [54], ResNet [55], and GoogLeNet [56], have emerged in recent years, with many differences in terms of layer types, hyper-parameters, and so on. The most significant predefined networks are discussed in this article.
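To make the difference between the two pooling kinds and the role of softmax concrete, here is a small plain-Python sketch. The 2 × 2 window with stride 2 and the sample feature map are assumptions chosen for illustration; real CNN libraries operate on multi-channel tensors.

```python
import math


def pool2x2(feature_map, mode="max"):
    """Downsample a 2D feature map with a 2x2 window and stride 2.
    mode="max" keeps the largest value in each window,
    mode="avg" keeps the mean of the four values."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(max(window) if mode == "max" else sum(window) / 4)
        pooled.append(row)
    return pooled


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [0, 0, 1, 1],
    [0, 4, 1, 1],
]
max_pooled = pool2x2(feature_map, "max")   # [[4, 8], [4, 1]]
avg_pooled = pool2x2(feature_map, "avg")   # [[2.5, 6.5], [1.0, 1.0]]
class_probs = softmax([2.0, 1.0, 0.1])
```

Both pooling variants halve each spatial dimension; they differ only in which summary of the window is kept.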

CNN-AlexNet
AlexNet is a pioneering architecture in the field of computer vision. This model takes RGB images with dimensions of 227 × 227 × 3 as input. The number of filters increases with depth, so more features are extracted as the network gets deeper, while the filter size decreases from layer to layer. Softmax is the activation function utilised in the output layer [52].

CNN-VGG16
VGG16 is a typical convolutional neural network (CNN) architecture developed by Karen Simonyan and Andrew Zisserman of the University of Oxford. The architecture's performance was assessed on the ImageNet dataset [57], on which it obtained 92.7 percent top-5 test accuracy in 2014. In comparison to AlexNet, VGG16 replaces large kernel-sized filters with stacks of small 3 × 3 filters. The architecture's input image dimensions are set at 224 × 224 × 3. All of the hidden layers in this network are followed by the ReLU activation function. Finally, the softmax layer serves as the output layer [53].

CNN-GoogLeNet
The primary goal of the Inception architectural models is to use fewer computational resources, with each version altering earlier Inception designs. The initial version of the Inception model is called "GoogLeNet", and it has 22 layers. The network has learnt several feature representations for a variety of images. The network's input dimensions are 224 × 224 × 3. The GoogLeNet architecture differs from prior designs such as AlexNet and VGG16 in that it uses global average pooling to build a deeper architecture. The Rectified Linear Unit (ReLU) is used as the activation function in this architecture's convolutions [56].

CNN-ResNet
Deeper neural networks need more time to train and are harder to optimize. To overcome these shortcomings, Microsoft introduced ResNet, a residual learning framework that eases the training of networks that are far deeper than those previously employed. Instead of having every few stacked layers directly fit a desired underlying mapping, this network design lets those layers fit a residual mapping, which is added back to the layers' input through a shortcut connection [55].
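The residual idea can be sketched as follows: the stacked layers compute a residual F(x), and the block outputs F(x) + x through the shortcut. For brevity this illustration uses two small fully connected layers in place of the convolutional layers used in ResNet itself, and the weights are placeholders; both are assumptions of the sketch.

```python
def relu(vector):
    """Element-wise Rectified Linear Unit."""
    return [max(0.0, v) for v in vector]


def linear(vector, weights):
    """Matrix-vector product (bias omitted for brevity)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Two stacked layers learn the residual F(x); the shortcut
    connection adds the unchanged input back: output = relu(F(x) + x)."""
    residual = linear(relu(linear(x, w1)), w2)          # F(x)
    return relu([f + xi for f, xi in zip(residual, x)])  # F(x) + x


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights F(x) = 0, so the block simply passes x through.
passthrough = residual_block(x, zeros, zeros)
```

The pass-through behaviour with zero weights illustrates why very deep stacks of such blocks remain trainable: a block that has nothing useful to add can fall back to the identity.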

CNN-SqueezeNet
SqueezeNet is a smaller neural network that was created as a more compact alternative to AlexNet. This architecture has 50× fewer parameters than AlexNet and performs 3× quicker. It uses ReLU activation in all squeeze and expand layers [54].
Table 2 describes the 3D depiction of each deep learning network architecture. TensorSpace (https://tensorspace.org/index.html (accessed on 15 October 2022)) is an interactive visualization tool that exposes data connections between network layers.
Long Short-Term Memory (LSTM) [58] networks are Recurrent Neural Networks (RNN) that have the capacity to capture order dependency in sequence prediction scenarios. An RNN extends the feed-forward neural network with an internal memory. In this network, the output of the preceding step acts as an input to the current step: after its generation, the output is copied and fed back into the RNN. During the decision-making process, the network assesses the current input together with the output it acquired from the previous input, which helps identify the order of the images. LSTM networks can be used in different contexts, including activity recognition, grammar learning, handwriting identification, human action detection, picture description, rhythm learning, time series prediction, voice recognition, and video description. Figure 4 illustrates the LSTM architecture. LSTM networks comprise numerous memory blocks, otherwise referred to as cells and illustrated in the image as rectangles. These blocks are responsible for recording information, and this information is modified using one of four gates. LSTMs handle Short-Term Memory (STM) and Long-Term Memory (LTM), while the gates aim to streamline the computation process. The LTM moves to the forget gate, where it loses data that does not serve a purpose; the learn gate makes it possible to take in data from the STM; the remember gate amends the LTM data and brings it up to date; and the use gate forecasts the output of the current event.
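A single LSTM time step can be sketched with scalar states as below. The sketch uses the conventional gate names (forget, input, candidate, output); relating them to the terms above, the cell state plays the role of the LTM, the hidden state that of the STM, the input gate together with the candidate corresponds roughly to the learn gate, the cell-state update to the remember gate, and the output gate to the use gate. That correspondence, the scalar sizes, and the parameter layout are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Squashes any real value into (0, 1); used as a soft switch."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step with scalar input and scalar states.
    params maps a gate name to (input weight, recurrent weight, bias)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x_t + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much old long-term memory to keep
    i = gate("input", sigmoid)         # how much new information to admit
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much memory to expose as output

    c_t = f * c_prev + i * g           # updated long-term (cell) state
    h_t = o * math.tanh(c_t)           # updated short-term (hidden) state
    return h_t, c_t


# Placeholder parameters: every gate has zero weights and zero bias.
params = {name: (0.0, 0.0, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=params)
```

With zero parameters every sigmoid gate is half open, so the step keeps half of the previous cell state and adds nothing new; trained parameters would make these decisions depend on the input.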

The Proposed CNN-LSTM Design
The proposed approach involves feeding the input file into the intelligent reader system, which utilizes an Optical Character Recognition (OCR) tool to read the file's text content and the Google Text-to-Speech (TTS) technique to convert written input into voice responses. When a file contains images, the trained CNN-LSTM model predicts the related captions, which are forwarded to the intelligent reader system. The reader system passes on all data in the form of voice messages. The proposed system is divided into three phases: collecting input images, extracting features for training the deep learning model, and evaluating performance. Such an approach aims to address sequence prediction problems that involve spatial inputs such as photography or visual content. Figure 5 depicts the architecture of the suggested CNN-LSTM model.
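The data flow described above can be summarised in a short sketch. The `PageBlock` type and the `ocr`, `captioner`, and `speak` callables are hypothetical placeholders introduced for this illustration; they stand in for the OCR tool, the trained CNN-LSTM model, and the TTS service, and do not reflect the actual interfaces of the implemented system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageBlock:
    kind: str      # "text" or "image" (hypothetical labels)
    pixels: bytes  # scanned region of the page


def read_aloud(blocks: List[PageBlock],
               ocr: Callable[[bytes], str],
               captioner: Callable[[bytes], str],
               speak: Callable[[str], None]) -> List[str]:
    """Route each block of a page to OCR or to the caption model,
    then hand the resulting sentence to the text-to-speech step."""
    messages = []
    for block in blocks:
        if block.kind == "image":
            # Images are described by the CNN-LSTM caption model.
            message = "Image description: " + captioner(block.pixels)
        else:
            # Printed text is recognised by OCR.
            message = ocr(block.pixels)
        speak(message)
        messages.append(message)
    return messages


# Stand-in components so the sketch runs without any external service.
spoken = []
result = read_aloud(
    blocks=[PageBlock("text", b"Chapter one"), PageBlock("image", b"\x00")],
    ocr=lambda pixels: pixels.decode("ascii"),
    captioner=lambda pixels: "a dog runs on grass",
    speak=spoken.append,
)
```

Passing the three components in as callables keeps the routing logic independent of any particular OCR, captioning, or speech library.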
Phase 1 (Input Image Collection): The input images are collected and preprocessed. In this research, the Flickr8k dataset, which comprises images and associated human descriptions, is utilised for model training.
Phase 2 (Model Training): This phase consists of two main parts, feature extraction and a language prediction model, built with two deep learning techniques: the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). CNNs are specialized deep neural networks used for image classification and recognition. An image is represented in the CNN model as a 2D matrix that can be scaled, translated, and rotated. The CNN model scans the image from top to bottom and left to right, extracting salient features for image categorization. In this network architecture, convolutional layers with 3 × 3 kernels and the ReLU activation function are utilised for feature extraction. To reduce the dimensions of an input picture, a max-pooling layer with 2 × 2 kernels is utilised. The extracted features are passed to the LSTM model, which produces the image caption.

LSTM is a type of Recurrent Neural Network (RNN) that was created to solve sequence prediction problems. The output of the last hidden layer of the CNN (the encoder) is fed as the input to the decoder. Let x_1 be the <START> vector and let the required label y_1 be the first word in the sequence. In the same way, x_2 is the word vector of the first word, and the network is expected to predict the next word. Lastly, x_T is the last word and y_T is the <END> token. The language prediction model is visualized in Figure 6. The language model takes the image pixels I and the input word vectors (x_1, x_2, ..., x_n), and computes the series of hidden states (h_1, h_2, ..., h_n) that produce the outputs (y_1, y_2, ..., y_n). The image feature vector is transmitted only once, as the initial hidden state. As a result, the image vector I, the previous hidden state h_{t-1}, and the current input x_t are used to determine the next hidden state h_t. A softmax layer is applied to the hidden state activation to generate the current output y_t.
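The pairing of decoder inputs and labels described for the language model can be sketched as follows. This is an illustrative helper only: the function name `make_training_pairs`, the whitespace tokenization, and the lowercasing are assumptions, not details taken from this work.

```python
def make_training_pairs(caption, start="<START>", end="<END>"):
    """Pair each decoder input x_t with the label y_t the network should predict.

    x_1 is the start token and y_1 the first word; x_T is the last word
    and y_T the end token.
    """
    tokens = caption.lower().split()  # assumed tokenization: split on whitespace
    inputs = [start] + tokens
    targets = tokens + [end]
    return list(zip(inputs, targets))


pairs = make_training_pairs("A man stands in front of a building")
for step, (x_t, y_t) in enumerate(pairs, start=1):
    print(f"t={step}: x_t={x_t!r} -> y_t={y_t!r}")
```

A caption of T - 1 words therefore yields T training steps, since the end token adds one extra prediction.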
The CNN-LSTM is a deep learning architecture that combines two algorithms: CNN and LSTM. The former extracts the salient features of the input images, and the latter predicts the caption sequences. The developed deep network model is evaluated using various architectures, including ResNet, AlexNet, GoogleNet, SqueezeNet, and VGG16.
Phase 3 (Testing): In phase 3, the trained model is tested using the test dataset. The CNN-LSTM model predicts the caption sequence from the test image. The efficiency of the proposed approach is measured using metrics such as BLEU, precision, recall, and accuracy. Using the Google Text-to-Speech API, the output captions are turned into audio messages. The intelligent reader system based on deep learning enables people with visual impairment to easily understand text as well as the images displayed alongside it.

Dataset Collection
In this research, the Flickr8k dataset [60,61] is employed to train the model. The Flickr8k dataset contains 8092 images, each annotated with 5 sentences using Amazon Mechanical Turk. The annotations on each image allow for progress in automatic image description and grounded language understanding. An accompanying Flickr8k text file contains the image names and captions. For training the deep learning model, the dataset is divided into three parts: 80%, 10%, and 10% for training, validation, and testing, respectively. Table 3 shows sample images with their captions.
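The 80%/10%/10% split can be sketched as below. The function name `split_dataset`, the fixed random seed, the placeholder file names, and the choice to round the training and validation sizes down (leaving the remainder for testing) are assumptions; this work does not state how the split was drawn.

```python
import random


def split_dataset(image_ids, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the image ids and split them into train/validation/test lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains after train and validation
    return train, val, test


# Placeholder names standing in for the 8092 Flickr8k image files.
image_ids = [f"image_{k:04d}.jpg" for k in range(8092)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

Under this rounding convention, 8092 images are divided into 6473 training, 809 validation, and 810 test images. Splitting by image rather than by caption keeps all five captions of an image in the same partition.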

Sample Image Description/Caption
A child in a pink dress is climbing up a set of stairs in an entry way.
A man stands in front of a very tall building.
The white dog is playing in a green field with a yellow toy.

Results and Discussion
The training process involved feeding the dataset, which was the input, into the model. For the parameter analysis, several attributes were compared, including input image size, activation function, developer name and year of creation, and top-5 error rate. Table 4 lists the input parameter values. In this table, the input image size of AlexNet is 227 × 227 × 3, while that of the remaining network architectures is 224 × 224 × 3. The activation function determines the output of a neural network unit; common choices include sigmoid, tanh, ReLU, and softmax. In this article, AlexNet and VGG16 employed the softmax activation function, whereas the others used the ReLU activation. When compared to the other architectures, ResNet has the lowest error rate. In the empirical analysis, the batch size and number of epochs were set to 512 and 200, respectively. The proposed approach takes a document comprising words and images: the LSTM model predicts the caption, and a voice message conveys all relevant data. Figures 8 and 9 demonstrate the output of the proposed deep learning reader.
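For reference, the activation functions named above can be written in a few lines of plain Python. This is a generic sketch of the standard definitions, not code from this work; tanh is available directly as `math.tanh`.

```python
import math


def sigmoid(z):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Pass positive values through and clamp negative values to zero."""
    return max(0.0, z)


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(sigmoid(0.0), math.tanh(0.0), relu(-2.0), probs)
```

ReLU and tanh act on each unit independently, whereas softmax normalizes a whole vector, which is why it is typically placed at the output layer to produce a distribution over classes or vocabulary words.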
BiLingual Evaluation Understudy (BLEU) [62] evaluates the performance of the image captioning system by measuring the n-gram overlap between the reference translation statement and the candidate translation statement under consideration. The BLEU score is computed from the following quantities: m_c^i is the count of i-grams in the candidate that match the reference translation, m_r^i is the count of i-grams in the reference translation, and w_t^i is the total number of i-grams in the candidate translation.
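As an illustration of how these counts combine, the sketch below follows the standard BLEU definition for a single reference: clipped n-gram precision (matched candidate n-grams divided by the total number of candidate n-grams), a geometric mean over n-gram orders, and a brevity penalty. It is an assumption that this matches the exact formulation used in this work; the function names and the example sentences are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def bleu(candidate, reference, max_n=2):
    """Single-reference BLEU with uniform weights over 1..max_n grams."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)  # penalize short candidates
    return brevity_penalty * math.exp(log_mean)


reference = "a man stands in front of a very tall building".split()
candidate = "a man stands in front of a building".split()
print(modified_precision(candidate, reference, 1))
print(modified_precision(candidate, reference, 2))
print(bleu(candidate, reference))
```

In this example every candidate unigram appears in the reference, but one of the seven candidate bigrams ("a building") does not, and the candidate is shorter than the reference, so the score falls below 1.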
A higher BLEU score indicates correspondingly higher performance. Table 6 summarises the performance of the proposed framework and the other image captioning approaches presented in the related work section. Image captioning is used in a variety of applications such as bridge damage detection, remote sensing image captioning, language caption synthesis, construction, and so on. According to the table, the BLEU scores for the construction image captioning method, the remote sensing image captioning method, and Arabic Image Caption Generation were 0.56, 0.77, and 0.81, respectively. For describing the image caption, the proposed method used various pre-trained Convolutional Neural Network (CNN) architectures such as AlexNet, GoogleNet, ResNet, SqueezeNet, and VGG16. The empirical results show that the proposed CNN-ResNet network model achieves a higher BLEU score than the other network models and the existing image captioning approaches.

The efficiency of the proposed CNN-LSTM algorithm is assessed using evaluation metrics such as precision, recall, and accuracy. Precision is the fraction of positive predictions that are correct. Recall is the fraction of actual positive cases that the model predicts correctly. Accuracy is the fraction of all predictions that are correct. To evaluate the correctness of the predicted image captions, they are compared against the captions of the tagged images. In this empirical analysis, a true positive means that the model accurately predicts an image caption that is tagged as positive in the class labels. A true negative means that the model correctly predicts an image caption with a negative tag in the class labels. A false positive is an outcome in which the model predicts the positive class incorrectly. Similarly, a false negative is an outcome in which the model predicts the negative class incorrectly. The ResNet network performs better than the existing architectures. The next step in the methodology employs text-to-speech to vocally narrate the data for people with visual impairment.

The loss rate for the VGG16 network architecture started at 0.75 and ended at 0.2, as shown in Figure 14. The training and validation loss rates appear to be nearly the same. It is also important to note that the training loss decreases smoothly, without fluctuations. The experimental results show that the SqueezeNet network has a very high loss rate, implying that it performs less efficiently than the other networks. Comparing the ResNet and VGG16 networks, ResNet achieves the lower loss rate. Similarly, when compared to the other networks, GoogleNet has an average loss rate. The ResNet design appears to fit the training data better and to predict unseen data better.
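The precision, recall, and accuracy metrics used in this section follow directly from the true/false positive and negative counts. The sketch below uses the standard definitions; the function name `classification_metrics` and the example counts are made up for illustration and are not results from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct among predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # found among actual positives
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0    # correct among all predictions
    return precision, recall, accuracy


# Hypothetical counts, chosen only to exercise the formulas.
precision, recall, accuracy = classification_metrics(tp=80, fp=20, tn=70, fn=30)
print(f"precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```

The guards against empty denominators return 0.0 when a model makes no positive predictions or when there are no positive cases, which avoids a division error on degenerate test sets.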

Conclusions
This work presents a deep learning-based intelligent system to assist individuals with visual impairments. The system takes text and images from coursebooks as input: the CNN extracts the relevant features, and the LSTM describes the visual input. Users receive the data in the form of voice messages produced by the text-to-speech module. The LSTM model is trained with the AlexNet, GoogleNet, ResNet, SqueezeNet, and VGG16 networks. According to the experiments, the LSTM-based training model provides the most suitable image descriptions and predictions.
The intelligent system enables individuals with visual impairments to easily comprehend text and images, although limitations exist, such as the reliance on the Flickr8k dataset for image data. Subsequent studies will utilize transfer learning to refine descriptions of images based on real-time photos and their descriptive content.

Figure 1. The Deep Learning Reader Architecture.
This research employed CNN and LSTM to determine an image's caption: the CNN extracted the features, and an LSTM-trained model generated the caption. After training, the model accepts an image and summarizes its content. The trained model helps capture the information encoded in the image, as illustrated in Figure 7.
Figure 10 demonstrates the evaluation metric values for several architectures such as AlexNet, GoogleNet, SqueezeNet, VGG16, and ResNet.

Figure 10. Prediction Accuracy.
According to the experimental results, ResNet has the highest precision value of 85.45%, while AlexNet has the lowest precision value of 68.26%. In terms of recall, SqueezeNet has the lowest value of 69.26%, while ResNet has the highest value of 83.12%. The ResNet architecture has the highest overall accuracy of 86.74% when compared to the other network architectures. The empirical results indicate that the CNN-LSTM model with the ResNet network architecture outperforms the other architectures in image caption prediction. In this experimental analysis, the model is trained for 200 epochs. The training and validation loss for the AlexNet architecture is depicted in Figure 11. The loss starts at 0.7 and gradually decreases to an average value of 0.2.

Figure 12 demonstrates the training and validation loss for the GoogleNet architecture. According to the GoogleNet training and validation graph, the loss value starts at 0.6 and ends at 0.25, indicating that the network performs less well for image caption prediction.

Figure 13 demonstrates the training and validation loss of the ResNet architecture. According to this figure, the loss rate started at 0.8 and gradually decreased below 0.1 after the 80th epoch. When compared to the other network architectures, ResNet produces better accuracy and a lower loss rate.

Finally, Figure 15 shows the SqueezeNet training and validation graph. In this figure, both curves remain at high values, indicating weaker performance for image caption prediction. The graphs confirm a decrease in loss across the tested models as the number of epochs increases during training and validation, which indicates that the models are learning.

Table 1. Summary of Related Works.

Table 2. Visualization of Deep Learning Architectures.

Table 3. Sample Image and Description.

Table 4. Parameter Values of Proposed Method.
Table 5 reports the 1-gram and 2-gram BLEU scores for the AlexNet, GoogleNet, ResNet, SqueezeNet, and VGG16 networks. The results show that the ResNet network architecture exceeds the other networks' performance levels.

Table 6. Performance Comparison with Existing Approaches.