Image Accessibility for Screen Reader Users: A Systematic Review and a Road Map

Abstract: A number of studies have been conducted to improve the accessibility of images using touchscreen devices for screen reader users. In this study, we conducted a systematic review of 33 papers to get a holistic understanding of existing approaches and to suggest a research road map given identified gaps. As a result, we identified types of images, visual information, input device and feedback modalities that were studied for improving image accessibility using touchscreen devices. Findings also revealed that few studies have examined how the generation of image-related information can be automated. Moreover, we confirmed that the involvement of screen reader users is mostly limited to evaluations, even though input from target users during the design process is particularly important for the development of assistive technologies. Then we introduce two of our recent studies on the accessibility of artwork and comics, AccessArt and AccessComics, respectively. Based on the identified key challenges, we suggest a research agenda for improving image accessibility for screen reader users.


Introduction
According to the World Health Organization, at least 2.2 billion people have a visual impairment, and the number is likely to increase with population growth and aging [1]. For them, understanding visual information is one of the main challenges.
To improve the accessibility of images for people who are blind or have low vision (BLV), a number of studies have been conducted to assess the effectiveness of custom-made tactile versions of images [2][3][4][5][6][7][8]. Cavazos et al. [5], for instance, proposed a 2.5D tactile representation of the artwork where blind users can feel the artwork by touch while listening to audio feedback. Holloway et al. [7] also investigated tactile graphics and 3D models to deliver map information such as the number of entrances, location and direction of certain landmarks. This approach with extra tactile feedback is found to be effective as it can deepen one's spatial understanding of images by touch [9][10][11]. However, it requires additional equipment, which potential users have limited access to (e.g., 3D printer, custom devices). Moreover, tactile representations need to be designed and built for each image, and thus it is not ideal for supporting a number of different images in terms of time and cost.
Meanwhile, others have relied on digital devices that are commercially available (e.g., PC, tablets, smartphones) for conveying image descriptions (also known as alternative text or alt text) on the web in particular [12][13][14]. For instance, Zhong et al. [13] generated alt text for images on the web that are identified as important using crowdsourcing. In addition, Stangl et al. [12] used natural language processing and computer vision techniques to automatically extract visual descriptions (alt text) on online shopping websites for clothes. Unlike tactile approaches, this software-based approach is more scalable, especially with the help of crowds or advanced machine learning techniques. However, listening to a set of verbal descriptions of an image may not be sufficient for understanding the spatial layout of content or objects within the image.
To address the limitations of the two approaches above, researchers have worked on touchscreen-based image accessibility, which enables users to explore different regions of an image by touch and thus gain a better spatial understanding. In this paper, to gain a more holistic perspective of this approach by examining the current state and identifying the challenges to be solved, we conducted a systematic literature review of 33 papers, following PRISMA guidelines [15]. To be specific, our goal is to identify the following: supported image types, provided information, collection and delivery methods of the information, and the involvement of screen reader users during the design and development process.
As a result, we found that research studies on touchscreen-based image accessibility have been mostly focused on maps (e.g., directions, distance), graphs (e.g., graph type, values) and geometric shapes (e.g., shape, size, length) using audio and haptic feedback. Moreover, the review revealed that the majority of them manually generated image-related information or assumed that the information was given. We also confirmed that while most user studies are conducted with participants who are blind or have low vision for user evaluation, only a few studies involved target users during the design process.
In addition, to demonstrate how other types of images can be made accessible using touchscreen devices, we introduce two of our systems: AccessArt [16][17][18] for artwork and AccessComics [19] for digital comics.
Based on the challenges and limitations identified by conducting the systematic review and from our own experience of improving image accessibility for screen reader users, we suggest a road map for future studies in this field of research. The following are the contributions of our work:
• A systematic review of touchscreen-based image accessibility for screen reader users.
• A summary of the systematic review in terms of image type, information type, methods for collecting and delivering information, and the involvement of screen reader users.
• The identification of key challenges and limitations of studying image accessibility for screen reader users using touchscreen devices.
• Recommendations for future research directions.
The rest of the content covers a summary of prior studies on image accessibility and touchscreen accessibility for BLV people (Section 2), followed by a description of how we conducted a systematic review (Section 3), and the results (Section 4), demonstrations of two systems for improving the accessibility of artwork and digital comics (Section 5), discussions on the current limitations and potentials of existing work and suggestions on future work (Section 6), and conclusions (Section 7).

Related Work
Our work is inspired by prior work on image accessibility and touchscreen accessibility for people who are blind or who have low vision.

Image Accessibility
Screen readers cannot describe an image unless its metadata such as alt text are present. To improve the accessibility of images, various solutions have been proposed to provide accurate descriptions for individual images on the web or on mobile devices [12][13][14][20][21][22]. Winters et al. [14], for instance, proposed an auditory display for social media that can automatically detect the overall mood of an image and the gender and emotion of any faces using Microsoft's computer vision and optical character recognition (OCR) APIs. Similarly, Stangl et al. [12] developed computer vision (CV) and natural language processing (NLP) modules to extract information about clothing images on an online shopping mall. To be specific, the CV module automatically generates a description of the entire outfit shown in a product image, while the NLP module is responsible for extracting price, material, and description from the web page. Goncu and Marriott [22], on the other hand, demonstrated the idea of creating accessible images by the general public using a web-based tool. In addition, Morris et al. [20] proposed a mobile interface that provides screen reader users with rich information about visual content prepared using real-time crowdsourcing and friend-sourcing rather than using machine learning techniques. It allowed users to listen to the alt text of a photograph and ask questions using voice input while touching specific regions with their fingers.
While most of the studies for improving the accessibility of images that can be accessed with digital devices tend to focus on how to collect the metadata that can be read out to screen reader users, others investigated how to deliver image-related information with tactile feedback [2,4,5,23,24]. Götzelmann et al. [23], for instance, presented a 3D-printed map to convey geographic information by touch. Others used computational methods to automatically generate tactile representations [3,25]. For example, Rodrigues et al. [25] proposed an algorithm for creating tactile representations of objects presented in an artwork varying in shape and depth. While this is promising, we focused on improving image accessibility on a touchscreen, which is widely adopted in personal devices such as smartphones and tablets, since it does not require additional hardware.

Touchscreen Accessibility
While we chose to focus on touchscreen-based image accessibility, touchscreen devices are innately inaccessible as they require accurate hand-eye coordination [26]. Thus, various studies have been conducted to improve touchscreen accessibility by providing tactile feedback using additional hardware devices [27][28][29]. TouchCam, for example, is a camera-based wearable device worn on a finger that lets users access their personal touchscreen devices by interacting with their own skin surface, which provides extra tactile and proprioceptive feedback. Physical overlays that can be placed on top of a touchscreen were also investigated [6,30]. For instance, TouchPlates [6] allows people with visual impairments to interact with touchscreen devices by placing tactile overlays on top of the touch display. Meanwhile, software-based approaches have been proposed as well, such as supporting touchscreen gestures that can be performed anywhere on the screen [26][31][32][33][34]. BrailleTouch [32] and No-Look Notes [34], for example, proposed software solutions for supporting eyes-free text entry for blind users by using multi-touch gestures. Similarly, smartphones on the market also offer screen reader modules with location-insensitive gestures: iOS's VoiceOver (https://www.apple.com/accessibility/vision/, accessed on 15 April 2021) and Android's TalkBack (https://support.google.com/accessibility/android/answer/6283677?hl=en). These screen readers read out the contents on the screen if focused, and users can navigate different items by directional swipes (i.e., left-to-right and right-to-left swipe gestures) or by exploration-by-touch.
Again, we are interested in how touchscreen devices can be used to improve image accessibility mainly because they are readily available to a large number of end-users including BLV people as they have their own personal devices with touchscreens. In addition, as touchscreen devices offer screen reader functionality, they are accessible.

Method
To identify the road map of future research directions on image accessibility for people with visual impairments using touchscreen devices, we conducted a systematic review following PRISMA guidelines [15]. The process is shown in Figure 1.

Research Questions
We had five specific research questions for this systematic review:
• RQ1. What types of images have been studied for image accessibility?
• RQ2. What types of image-related information have been supported for BLV people?
• RQ3. How has image-related information been collected?
• RQ4. How has image-related information been delivered?
• RQ5. How have BLV people been involved in the design and evaluation process?

Screening
Then, we examined the titles and abstracts of the remaining 53 unique papers and excluded four papers that were not relevant, all of which were conference reviews.

Eligibility
Of the 49 remaining papers, we excluded 12 papers that met the following exclusion criteria:

3. Not a full paper, such as posters or workshop and case study papers (N = 9).
Then, papers were included if and only if the goal of the paper was improving the accessibility of any type of images for people with visual impairments.

Results
As a result of the screening process, 33 papers were considered eligible for the analysis. We summarize the papers mainly in terms of the research questions specified in Section 3.1.

Overview
As shown in Figure

Supported Image Types
The types of images that have been studied in prior works for providing better accessibility for screen reader users (RQ1) are summarized in Table 1. While three papers were designed to support any type of images in general [35][36][37], most of the papers focused on specific types of images. To be specific, approximately half of the papers studied images of maps (N = 10) and graphs (N = 6). Interestingly, while the accessibility of photographs for BLV was largely investigated in terms of web accessibility [21,38], only three out of 33 papers aimed to support photographs using touchscreen devices in particular. In addition, as touchscreen devices themselves have accessibility issues for people with visual impairments, requiring accurate hand-eye coordination [26], four papers focused on improving the accessibility of the touchscreen-based interface itself such as soft buttons [39][40][41] and gestures [42].

Information Type
We have identified the types of information provided to BLV to improve the accessibility of images (RQ2), which are shown in Table 2. The name of the object, mainly the object that is touched, was provided the most (N = 10), followed by the spatial layout of various objects in each image (N = 8). As expected, the types of provided information differ depending on the image type. For instance, direction/orientation, distance, and other geographic information were provided for map images. On the other hand, shape, size, boundary of objects, and length were mostly offered for images related to geometric figures. Scene descriptions (N = 5), textures (N = 4), graph values (N = 2), and texts written in images (N = 2) were also present. Note that only two papers supported color information for photographs [35] and artwork [66]. Other information types include types of graph [53] and weather [24]. Five papers did not specify the information they provided.

Information Preparation
In addition to the types of information supported to improve image accessibility, we examined how image-related information is collected or created (RQ3) and found that most studies have not specified this (see Table 3). This suggests that the aim of many of these studies is "delivering" visual information that is inaccessible to BLV as is, while assuming that the information is given, rather than "retrieving" the information. Meanwhile, close to one-third of the studies seem to manually create the data they need to provide (N = 9) or use metadata such as alternative text (alt text) or textual descriptions that are paired with the images (N = 4). Others relied on automatic approaches to extract visual information of images: image processing, optical character recognition (OCR), and computer vision (N = 5). Finally, two papers proposed a system where the image descriptions are provided by crowdworkers.

Interaction Types
We were also interested in how image-related information is delivered using touchscreen devices (RQ4). We have identified the interaction type in terms of input types and output modalities as follows:
Input types. As shown in Table 4, the major input type is touch, as expected; most of the studies allowed users to explore images by touch with their bare hands (N = 28 out of 33). Moreover, touchscreen gestures were also used as input (N = 5). On the other hand, physical input devices (N = 4 for keyboard, N = 3 for stylus, and N = 2 for mouse) were used in addition to touchscreen devices. While it is known that aiming a camera towards a target direction is difficult for BLV [67], a camera was also used as a type of input, where users were allowed to share image feeds from cameras with others so that they could get information about their surrounding physical objects such as touch panels on a microwave [60,61].
Output modalities. As for output modalities, various types of feedback techniques were used (see Table 5). Approximately half of the studies used a single modality: audio only (N = 14; including both speech and non-speech audio) or vibration only (N = 3). On the other hand, others used multimodal feedback, where the combination of audio and vibration was most frequent (N = 6), followed by audio with tactile feedback (N = 5). The most widely used output was speech feedback that verbally describes images to BLV users using an audio channel, in the same way that a screen reader reads out what is on the screen using text-to-speech (e.g., Apple's VoiceOver). On the other hand, non-speech audio feedback (e.g., sonification) was also used. For instance, different pitches of sound [24,35,36,42] or rhythms [55] were used to convey image-related information. Meanwhile, vibration was as popular as non-speech audio feedback, while some used tactile feedback to convey information. For example, Götzelmann et al. [23] used a 3D-printed tactile map. Zhang et al. [41] also made user interface elements (e.g., buttons, sliders) with a 3D printer to improve the accessibility of touchscreen-based interfaces in general by replacing virtual elements on a touchscreen with physical ones. Moreover, Hausberger et al. [56] proposed an interesting approach using kinesthetic feedback along with friction. Their system dynamically changes the position and the orientation of a touchscreen device in a 3D space for BLV to explore shapes and textures of images on a touchscreen device.
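As a rough illustration of the pitch-based sonification mentioned above, the following minimal Python sketch maps a normalized image value (for example, the brightness under the finger) to a tone frequency. It is not taken from any of the reviewed papers: the function name `value_to_pitch_hz`, the 220 to 880 Hz range, and the exponential mapping are all assumptions chosen for the example.

```python
def value_to_pitch_hz(value: float, low_hz: float = 220.0, high_hz: float = 880.0) -> float:
    """Map a normalized value in [0, 1] to a tone frequency in Hz.

    Hypothetical mapping: the frequency range and the exponential
    interpolation are illustrative assumptions, not a reviewed system.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be within [0, 1]")
    # Exponential interpolation, so equal steps in value give equal musical intervals.
    return low_hz * (high_hz / low_hz) ** value


if __name__ == "__main__":
    # Sample brightness values along a hypothetical finger path.
    for brightness in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"brightness={brightness:.2f} -> {value_to_pitch_hz(brightness):.1f} Hz")
```

An exponential rather than linear mapping is used here because perceived pitch is roughly logarithmic in frequency; a real system would need to validate any such mapping with BLV users.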

Involvement of BLV
Finally, we checked if BLV, the target users, were involved in the system development and evaluation processes; see Table 6. We first examined if user evaluation was conducted regardless of whether target users were involved or not. As a result, we found that all studies but two had tested their system with human subjects. Most of them had a controlled lab study, where metrics related to task performance were collected for evaluation, such as the number of correct responses and completion time. In addition, close to half of the studies had subjective assessments such as ease of use and satisfaction on Likert scales, or open-ended comments about participants' experience after using the systems.
Of the remaining 31 papers, three papers had user studies but with no BLV participants. The other 28 papers evaluated their systems with participants from the target user group. In addition to user evaluation, seven studies used participatory design approaches during their design process. Moreover, some papers had BLV participate in their formative qualitative studies (i.e., survey, interview) at an early stage of their system development to make their ideas concrete.
Table 6. Methodologies used in the studies and BLV's involvement in system design and evaluation. Note that the following three studies conducted user studies with blindfolded sighted participants [54][55][56].
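The counts reported in this subsection can be cross-checked with a few lines of arithmetic. The Python sketch below only restates numbers given in the text; the variable names are ours, and the percentage is derived from those numbers rather than reported in the review.

```python
# Counts taken from the text of this subsection.
TOTAL_PAPERS = 33
PAPERS_WITHOUT_USER_STUDY = 2   # "all studies but two"
STUDIES_WITHOUT_BLV = 3         # user studies with no BLV participants

# Papers that ran any user evaluation.
papers_with_user_study = TOTAL_PAPERS - PAPERS_WITHOUT_USER_STUDY

# Papers whose evaluation included participants from the target group.
papers_with_blv = papers_with_user_study - STUDIES_WITHOUT_BLV

# Derived share of all reviewed papers (not a figure reported in the review).
blv_share = papers_with_blv / TOTAL_PAPERS

if __name__ == "__main__":
    print(papers_with_user_study, papers_with_blv, f"{blv_share:.0%}")
```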

AccessArt and AccessComics
Based on our systematic review results, we have confirmed that various types of images were studied to improve their accessibility for BLV people. However, most of the studies focused on providing knowledge or information based on facts (e.g., maps, graphs) to users rather than offering an improved user experience that BLV users can enjoy and that allows subjective interpretations. Thus, we focused on supporting two types of images in particular that are rarely studied for screen reader users: artwork and comics. Here, we demonstrate how these two types of images can be supported and appreciated with improved accessibility: AccessArt [16][17][18] and AccessComics [19] (see Table 7 as well).
[68][69][70]. However, a number of accessibility issues exist when visiting and navigating inside a museum [71]. While audio guide services are in operation for some exhibition sites [72][73][74][75], it can still be difficult for BLV people to understand the spatial arrangement of objects within each painting. Tactile versions of artwork, on the other hand, allow BLV people to learn the spatial layout of objects in the scene by touch [2][3][4][5]. However, it is not feasible to make these replicas for every exhibited artwork. Thus, we began to design and implement a touchscreen-based artwork exploration tool called AccessArt.
AccessArt Ver1. The very first version of AccessArt is shown in Figure 3; it included four paintings of different genres: landscape, portrait, abstract, and still life [17]. For the object-level labels, we segmented each object and paired it with a description. Then we developed a web application that allowed BLV users to (1) select one of the four paintings they wish to explore and (2) scan objects within each painting by touch, hearing for each object a verbal description with object-level information such as the name, color and position of the object. For example, if a user touches the moon on "The Starry Night", then the system reads out the following: "Moon, shining. Its color is yellow and it's located at the top right corner". Users can either use swipe gestures to go through a list of objects or freely explore objects in a painting by touch to better understand objects' location within an image. In addition, users can also specify objects and attributes they wish to explore using filtering options. Eight participants with visual impairments were recruited for a semi-structured interview study using our prototype and provided positive feedback.
AccessArt Ver2. The major problem with the first version of AccessArt was the object segmentation process, which was not scalable as it was all manually done by a couple of researchers. Thus, we investigated the feasibility of relying on crowdworkers who were not expected to have expertise in art [16]. We used Amazon Mechanical Turk (https://www.mturk.com/) for collecting object-label metadata for eight different paintings from an anonymous crowd. Then we assessed the effectiveness of the descriptions generated by the crowd with nine participants with visual impairments, who were asked to go through the four steps of the Feldman Model of Criticism [76]: description, analysis, interpretation, and judgment. Findings showed that object-level descriptions provided by anonymous crowds were sufficient for supporting BLV's artwork appreciation.
AccessArt Ver3. As a final step, we implemented an online platform (https://artwiki-hci2020.vercel.app/) as shown in Figure 4. It is designed to allow anonymous users to freely volunteer to provide object segmentation and description, inspired by Wikipedia (www.wikipedia.org). While no user evaluation has been conducted with the final version yet, we expect this platform to serve as an accessible online art gallery for BLV people where the metadata are collected and maintained by the crowd to support a greater number of artworks, which can be accessed anywhere using one's personal device.
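The exploration-by-touch interaction of AccessArt Ver1 described above can be sketched as a simple hit test over labeled regions. The following Python sketch is a simplified illustration under stated assumptions, not the AccessArt implementation: the class `ArtObject`, the function `describe_touch`, the use of rectangular bounding boxes in place of object segments, and the sample coordinates and "Sky" entry are all hypothetical. Only the spoken sentence format follows the example given in the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ArtObject:
    """One labeled object in a painting (hypothetical data model)."""
    name: str
    description: str
    color: str
    position: str
    # Assumption: a bounding box in normalized coordinates (0..1, origin at
    # top-left) stands in for the object segment used by the real system.
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def spoken_text(self) -> str:
        # Sentence format copied from the "Moon" example in the text.
        return (f"{self.name}, {self.description}. Its color is {self.color} "
                f"and it's located at the {self.position}")


def describe_touch(objects: Sequence[ArtObject], x: float, y: float) -> Optional[str]:
    """Return the text to read aloud for a touch at (x, y), or None."""
    # Later entries are treated as drawn on top, so search them first.
    for obj in reversed(objects):
        if obj.contains(x, y):
            return obj.spoken_text()
    return None


# Made-up sample data for illustration only.
STARRY_NIGHT = [
    ArtObject("Sky", "swirling", "blue", "top", 0.0, 0.0, 1.0, 0.6),
    ArtObject("Moon", "shining", "yellow", "top right corner", 0.8, 0.0, 1.0, 0.2),
]

if __name__ == "__main__":
    print(describe_touch(STARRY_NIGHT, 0.9, 0.1))
```

In a deployed application the returned string would be handed to the platform's screen reader or text-to-speech engine; a touch outside every labeled region returns `None`, which a real interface might map to silence or a neutral cue.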

AccessComics for Comics Accessibility
Compared to artwork accessibility, fewer studies have been conducted to improve the accessibility of comics. For instance, Ponsard et al. [77] proposed a system for people who have low vision or motor impairments, which can automatically retrieve necessary information (e.g., panel detection and ordering) from images of digital comics and read out the content on a desktop computer controlled with a TV remote. ALCOVE [78] is another web-based digital comic book reader for people who have low vision. The authors conducted a user study with 11 people who have low vision, and most of them preferred their system over the .pdf version of digital comics. Inspired by this study, our system, AccessComics [19], is designed to provide BLV users with an overview (as shown in Figure 5a,b), various reading units (i.e., page, strip, panel), a magnifier, text-to-speech, and autoplay. Moreover, we mapped different voices to different characters to offer a high sense of immersion in addition to improved accessibility, similar to how Wang et al. [79] used a voice synthesis technique that can express various emotional states given scripts. Here, we briefly describe how the system is implemented.

Discussion
Here, we discuss the current state of research on touchscreen-based image accessibility and the gaps to be investigated in the future, based on the findings from the systematic review and our own experience of designing two systems for artwork and comics accessibility.

From Static Images to Dynamic Images
Various types of images displayed on touchscreen devices have been studied in terms of accessibility for over a decade, since 2008. However, all but one [42] have supported still images without motion. Dynamically changing images such as animations and videos (e.g., movies, TV programs, games, video conferences) have rarely been explored in terms of accessibility for touchscreen devices for BLV users. Considering the rapid growth of YouTube [81] and its use for gaining knowledge [82], videos are another type of image (a series of images) that has various accessibility issues. While the area has been explored as well regardless of the medium [83][84][85], it would be interesting to examine how it can be supported for touchscreen devices.

Types of Information Supported for Different Image Types
As prior work has found that BLV wish to get different types of information depending on the context [21], different types of information were provided for different image types. For example, geographic information such as building locations, direction, and distance was offered for map and graph images. On the other hand, shape, size, and line-length information were conveyed to users for geometric objects. However, few studies have examined other types of images, although specific locations or spatial relationships of objects within an image, such as photographs and touchscreen user interfaces, are considered important [20,30]. To identify the types of information that users are interested in for each type of image, adopting recommendation techniques [86,87] can be a solution for providing user-specific content based on users' preferences, interests, and needs.

Limited Room for Subjective Interpretations
The majority of the studies using touchscreen devices have prioritized images that contain useful information (e.g., facts, knowledge) over images, such as artwork, that can be interpreted subjectively and differently from one person to another. Even for artwork images, many studies have focused on delivering encyclopedia-style explanations (i.e., title, artist, painting styles) [2][3][4][5][88]. AccessArt [16] was an exception, examining whether an artwork appreciation system can enable BLV people to make their own judgements and criticism about the artwork they explored. We believe that more investigations are needed to improve the experience of enjoying the content of the images or of making decisions based on subjective judgements for BLV people (e.g., providing a summary of product reviews of others as a reference).

Automatic Retrieval of Metadata of Images for Scalability
Most of the studies identified in our systematic review, all but seven of the 33 papers, either assumed that image-related information is given or had researchers create it manually. However, a number of images on the web do not have alt text, although it is recommended by the Web Content Accessibility Guidelines (WCAG) (https://www.w3.org/TR/WCAG21/). Moreover, it is not feasible for a couple of researchers to generate metadata for individual images. Thus, automatic approaches such as machine learning techniques have been studied [12,13,89,90]. However, since the accuracy of automatically generated descriptions is not as high as that of descriptions produced by humans, we recommend the crowdsourcing approach for generating descriptions [18,20] if precise annotations are needed. Crowdsourcing can also serve as a form of human-AI collaboration for validating auto-generated annotations [13]. Eventually, these data can be used to train machine learning models for implementing a fully automated image description generation system [91][92][93].

Limited Input and Output Modalities of Touchscreen Devices
Unlike other assistive systems that require BLV to physically visit certain locations (e.g., [75,88]) or that require special hardware devices with tactile cues (e.g., [3,4]), touchscreen devices benefit from being portable, where a variety of images can be accessed using a personal device with fewer physical and time constraints. However, the input and output modalities that touchscreen devices can offer are limited, with feedback largely confined to audio and vibration. To provide more intuitive and rich feedback, more in-depth studies should be conducted on how to ease the design of 2.5D or 3D models and how the cost and time for producing tactile representations of images can be reduced. One way to do so is open-sourcing the process, as in Instructables (https://www.instructables.com/), which is an online community where people explore and share instructions for do-it-yourself projects. Meanwhile, we also recommend touchscreen-based approaches to make a larger number of images accessible to a greater number of BLV people.

Limited Involvement of BLV People during Design Process
The findings of our systematic review revealed that most studies had user evaluation of proposed systems with BLV participants after the design and implementation. However, it is important to have target users participate in the design process at an early stage when developing a new technology [94]. A formative study with surveys or semi-structured interviews is recommended to understand the current needs and challenges of BLV people before making design decisions. An iterative participatory design process is also a great way to reflect BLV participants' opinions in the design, especially for users with disabilities [95][96][97].

Conclusions
To gain a complete understanding of existing approaches and identify challenges to be solved as the next step, we conducted a systematic review of 33 papers on touchscreen-based image accessibility for screen reader users. The results revealed that image types other than maps, graphs and geometric shapes, such as artwork and comics, are rarely studied. Furthermore, we found that only about one-third of the papers provide multimodal feedback combining audio and haptics. Moreover, our findings show that ways to collect image descriptions were out of the scope of interest for most studies, suggesting that automatic retrieval of image-related information is one of the bottlenecks for making images accessible on a large scale. Finally, since the majority of studies did not involve people who are blind or have low vision during the system design process, future studies should consider inviting target users early and reflecting their comments when making design decisions.

Conflicts of Interest:
The authors declare no conflict of interest.