Clinical Progress and Optimization of Information Processing in Artificial Visual Prostheses

Visual prostheses, used to help restore functional vision to the visually impaired, convert captured external images into corresponding electrical stimulation patterns that are delivered by implanted microelectrodes to induce phosphenes and, ultimately, visual perception. Detecting and providing useful visual information to the prosthesis wearer under limited artificial vision has been an important concern in the field of visual prostheses. Along with the development of prosthetic device design and stimulus encoding methods, researchers have explored the application of computer vision by simulating visual perception under prosthetic vision. Effective image processing in computer vision is performed to optimize artificial visual information and improve the restoration of important visual functions in implant recipients, allowing them to better meet their daily needs. This paper first reviews the recent clinical implantation of different types of visual prostheses and summarizes the artificial visual perception of implant recipients, focusing especially on its irregularities, such as dropout and distorted phosphenes. Then, the important aspects of computer vision in the optimization of visual information processing are reviewed, and the possibilities and shortcomings of these solutions are discussed. Finally, the development directions and key issues for improving the performance of visual prosthesis devices are summarized.


Introduction
In the 2019 World Health Organization report on global vision [1], it was revealed that at least 2.2 billion people worldwide have various vision problems, and Dr. Tedros added that 65 million people are blind or visually impaired. The main causes of blindness are retinal diseases, both hereditary and acquired. Among these, retinal degenerative diseases such as retinitis pigmentosa (RP) and age-related macular degeneration (AMD) are irreversible causes of blindness.
Loss of vision can cause great difficulties for humans in performing tasks in daily life. People who are visually impaired can accomplish simple braille reading with the help of touch, some simple tasks with the help of hearing, and daily walking and going out with the help of guide dogs, etc. With the development of technology, wearable and implantable visual aid electronic devices have benefited a certain number of visually impaired patients. These devices can restore partial functional vision by using implanted electrodes to induce light perceptions through techniques of artificial vision generation, on the basis of a partially intact visual pathway.
Research on artificial vision generation has been conducted since the 1950s, including the 1956 discovery by the American scientist Tassiker [2] that subretinal implantation of a photosensitive selenium cell could help humans generate light perception; since then, researchers have investigated electrical stimulation to elicit light perception. The safety of one such device was evaluated in 2018 by implantation in four subjects, with increased electrode-retinal distance and stable impedance after the procedure and no side effects. The intracortical ORION prosthesis [59], with 60 electrodes (trial NCT03344848), was approved by the FDA in 2017 for implantation in six patients without photoreceptors, with each implant recipient receiving a 5-year follow-up; data from the relevant trials are not yet publicly available (data source: National Center for Biotechnology Information official website).

Epiretinal Prostheses
The study of the Argus series of retinal prostheses began in 1990 and developed rapidly, gaining attention with the first implantation into patients and long-term trials in 2002 in a total of six patients with end-stage RP. The first-generation device, Argus® I, based on the cochlear implant, was implanted into the patient's eye with a 4 × 4 array of electrodes, and several visual tasks were carried out [6]: locating a white square on a black background, pointing out the path of a white line above a black background, and finding a door in a room. The results of the clinical trial showed that patients could recognize simple geometric shapes with the help of the prosthesis, and one patient showed some improvement in visual perception during the target localization and mobility tests. In addition to white square location tests, Dagnelie et al. [28] had implant recipients perform sock classification (black and white only) and showed that the average success rate for sock classification in subjects with 60 electrodes was around 50%. With the same electrodes, the Humayun group implanted a permanent retinal prosthesis in the eye of a completely blind patient and performed individual letter recognition and word reading tests, showing that the patient was able to correctly recognize individual letters within 6 s to 221 s. During the test, the patient indicated that not all electrodes worked properly [32] and that the phosphene array seen via electrode stimulation had large distortions and some degree of loss of visual information, which may increase the difficulty of character recognition. A schematic representation of the phenomenon of dropout or distorted individual letters during recognition is shown in Figure 1. The same feedback of visual information loss was obtained from other patients implanted with this prototype device under the same circumstances [5,6,33].
With a limited number of implanted electrodes, implant recipients could only perform some simple visual tasks such as letter recognition. Meanwhile, it was found that the patients saw a lower resolution than the number of implanted electrodes would suggest and perceived a distorted phosphene array after surgery. This may have been because some implanted electrodes did not work or because electrodes were implanted in necrotic tissue, as suggested by simulation experiments [34]. The principle of inducing dropout or distorted phosphenes is illustrated in Figure 2.
(A) An individual letter recognition task (dark environment). The display shows the letters in white in Century Gothic font on a black background, and the monitor next to it shows the camera view (V) and array (A) in real time; (B) an illustration of the difference between the electrode activation maps in standard and scrambled modes when the camera is viewing the letter "N". The correspondence between the real position of the phosphenes and the stimulus position on the array was randomized in the scrambled mode (adapted with permission from Ref. [62]; copyright 2011 Royal Australian and New Zealand College of Ophthalmologists).
The second generation of the Argus product, Argus® II, increased the number of implanted electrodes to a 6 × 10 array and added a camera to the device's eyeglasses that captures images, which are processed and coded to transmit stimulus commands to the electrode array, producing corresponding phosphenes. In addition to having more electrodes, and thus theoretically evoking richer visual information, Argus® II was the first visual prosthesis in the world to receive CE mark approval and FDA approval, making it among the most common implants worldwide [35]. Breanne et al. [36] used the second-generation product to test patients for single-letter recognition, where the test letters were a subset of the Sloan alphabet containing O, V, K, and Z. Two subjects were able to complete 27 out of 36 trials correctly. In addition, patients reported seeing a defective array with distorted and dropped-out letter shapes, making it difficult to correctly identify letters. A follow-up survey of 32 patients implanted with the second-generation product showed that the number of truly effective electrodes ranged from 46 to 60 [6]. With a best achievable visual acuity of 20/1260 after implantation, patients were still not able to perform visual tasks smoothly in daily life. Beyeler et al. [37] also used a single-letter recognition test to assess functional visual improvement after implantation of the Argus series in four patients. The results showed that visual acuity after implantation ranged from 20/15,887 to 20/1260, while all patients reported deficits in vision, resulting in dropped-out letter information and increased recognition time.
Other trials tested the effect of Argus® II implantation on motion detection, and clinical trial results showed that half of the patients had an improved ability to detect moving targets [38]. Abhishek et al. [39] measured the electrode-retina gap distance using Cirrus HD-OCT software and obtained clinical trial data at 1 month, 3 months, 6 months, and 1 year postoperatively. The data showed that the distance between the electrode array and the retina affected the patients' ability to complete the square localization task: the greater the distance, the weaker the light sensation produced by the stimulated electrode, while the closer the distance, the stronger the light sensation that could be produced.
The IRIS prosthesis, which has 150 electrodes placed on the inner surface of the retina, was implanted in 2017. Several clinical trials tested visual tasks in 10 patients after implantation [40]; the results showed that the mean error distance was reduced from 8 to 2 in the square localization test and that the mean accuracy for image recognition improved from 45% to about 55%. However, image recognition performance with the device did not reach the passing line (60%), and the device had a short life span. Twenty participants implanted with the epiretinal prosthesis IMI, which consists of platinum microelectrodes [41,42], stated during a follow-up period that they could perceive only a weak sense of light with the prosthesis, which was not sufficient to help with activities of daily living. Retinal detachment occurred in some patients during the 3-month follow-up period.

Subretinal Prostheses
Alpha-AMS is the representative subretinal prosthesis, an implant containing an array of 1500 active microphotodiodes implanted subretinally. Katarina et al. [43] reported the performance of three patients with Alpha-AMS subretinal implants on a 26-letter recognition test in which the patients were able to recognize only five letters, T, V, L, I, and O, when each letter was displayed individually. Some investigators summarized the visual acuity of patients after the implantation of Alpha-AMS [44] and showed that the optimal visual acuity the patients could achieve was 20/546. Zrenner et al. [45] evaluated the recovery of functional visual acuity after implantation: two patients were unable to identify the Landolt C ring or individual letters, and only one patient was able to identify individual letters such as L, T, and Z, taking up to 60 s. Two patients were able to distinguish between different orientations of the letter "U" opening, achieving 73% and 88% correct response rates. Meanwhile, some patients needed more than 40 s to recognize individual letters after implantation, as in [32]. PRIMA [47,49], a prosthetic device with 378 electrodes, was shown to have an implantation life of up to 3 years in animals [50]. In 2018 [51], it was successfully implanted in three subjects. The follow-up comparison showed that the patients with subretinal implantation sites achieved visual acuity between 20/460 and 20/550, whereas the patients with suprachoroidal implantation sites achieved 20/800, indicating that the subretinal implantation site was optimal. However, the optimal visual acuity after implantation was far below normal vision, and patients still had difficulty performing visual tasks in daily life. From the above clinical results, it can be concluded that subretinal and epiretinal prostheses can both elicit a kind of light sensation called a "phosphene".
Comparing the signal processing methods, the epiretinal prosthesis relies on an extra-ocular information processor, while the subretinal prosthesis processes information in the manner closest to that of natural vision. From the perspective of implantation risk, the epiretinal approach is less damaging to the retina, while the small subretinal space poses a greater challenge for electrode encapsulation and design.
Suprachoroidal prostheses are also considered to be a type of retinal prosthesis. Fujikado et al. [52] implanted a developed suprachoroidal prosthesis with 49 electrodes into three patients with RP. For one year after surgery, all patients performed daily tests at home, such as viewing white lines on a black background, square positioning, walking along a line, and differentiating between dishes and chopsticks, to assess the effectiveness of the suprachoroidal prosthesis in improving the patients' functional vision. Two patients with RP reported that the stimulation of the electrodes did not produce corresponding phosphenes in the expected locations [53] and that the viewed phosphene array had distortions. Mathew et al. [56,58] performed square localization (SL) testing and functional visual acuity assessment in four patients with advanced RP and four with advanced AMD after the implantation of the Bionic Eye choroidal implant with 44 electrodes. The results showed that the mean pointing error in SL decreased from 27.7 ± 8.7° to 10.3 ± 3.3° in the four RP patients who used the device, and the mean success rate was lower than 40% in the four AMD patients who performed the search for objects on the table test.

Visual Cortex Prostheses
As early as 1976, research began on the ICVP visual cortex prosthesis project; later, the ICVP used a wireless floating microelectrode array (WFMA) to replace earlier implantation kits that connected electrodes with a large number of wires, reducing the cost and risk of the procedure. Brindley et al. [63] implanted a device in a completely blind patient and performed a word reading test; the best level the patient could achieve was 8.5 characters/min. Fernández et al. [18] implanted a CORTIVIS prosthesis consisting of 96 electrodes in the visual cortex of a totally blind patient for 6 months, during which the patient was given a letter recognition test. The results showed that the patient was able to recognize only five letters out of 26: I, L, C, V, and O. Another CORTIVIS prosthesis [64] produced vision by evoking 100 microelectrodes. Patients described the evoked phosphenes as flickering, colored, or colorless pinpoint stars that dropped out and were distorted during the single-letter recognition test; the irregularities are shown in Figure 3. Chen et al. [65] implanted 1024 electrodes into the visual cortexes of monkeys, which were able to recognize simple shapes or letters when the electrodes were evoked. Because the visual cortex is located in the occipital lobe of the brain, far from the center of the human visual field, surgery there is riskier, and there have been fewer clinical trials of visual cortex prostheses.
Of the several abovementioned prosthetic devices that have entered the clinic or ended that phase, researchers have assessed the functional visual acuity of patients wearing the prosthesis through different visual tasks. The most frequently required tests were single-letter recognition, square positioning, etc., but patients did not perform well on these simple tasks; for example, recognizing single letters took more than 40 s with implants containing 60 electrodes [32] or 1500 diode arrays [45]. Additionally, most of the implant recipients reported that the evoked electrodes produced phosphene points with dropouts and distorted arrays, which still bring inconvenience in achieving daily visual tasks. Researchers in the field of prostheses have therefore started to look for relevant optimization solutions to further improve the limited visual perception.

(B) Immediately after implantation, the induced phosphenes may cause poor perception of objects, such as the letter "E" in the figure. However, appropriate learning and rehabilitation strategies can help to improve the poor perception (adapted from [66]).

Optimization of Information Processing in Visual Prosthetics
It is important to provide better visual perception to the patient, and researchers have looked for factors that influence the visual perception of the implant recipient, such as the material of the electrode and the density of the array [17], the stimulation parameters of the electrode [67-69], the distance between the electrode and the implantation site [70], and others. Seung et al. [71] used a liquid crystal polymer (LCP) to fabricate a smoothly rounded and flexible structured electrode and implanted it into the choroid of rabbit eyes, showing that this electrode was safe and stable and could be effectively used for retinal implants. The Argus II increased the number of electrodes from 16 to 60, providing nearly four times the resolution to the patient and theoretically conveying significantly more visual information than the first generation. The characteristics of the phosphenes are influenced by the electrode stimulation parameters; for example, synchronous pulses affect the brightness and shape of phosphenes [72], producing a higher level of visual perception than asynchronous stimulation, and the degree of influence is also closely related to the configuration and location of the implanted electrodes [73]. Rebecca et al. [74] used electrodes made of activated iridium oxide film (AIROF) to maintain an anodic potential bias during interpulse intervals, which could satisfy the charge level required for neural stimulation and reduce electrode polarization. To avoid phosphenes lasting less than half a second due to the desensitization of retinal ganglion cells, Chenais et al. [75] proposed a more natural stimulation strategy based on the temporal modulation of electrical pulses, which was effectively validated in experimental mice with a duration of 4.2 s.
Hardware upgrades, such as increasing the number of electrodes, are necessary for visual prosthetic devices, but they currently face considerable practical difficulties. Therefore, researchers are looking for image processing strategies that optimize the visual information available at a low resolution so that implant recipients can better understand the artificial vision provided by current prosthesis devices. These optimization strategies mainly use effective computer vision techniques to extract useful information and express it in ways suited to different visual tasks, ultimately providing more useful visual information to the recipients. Depending on the target in the assessment of functional vision with a visual prosthesis, widely studied visual tasks include face recognition, letter recognition, and object recognition.
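As a concrete illustration of simulated prosthetic vision, the sketch below downsamples an image to an electrode grid and renders each electrode as a Gaussian phosphene, with random dropout and spatial jitter standing in for the dropped-out and distorted arrays reported by implant recipients. The grid size, dropout rate, and jitter magnitude are illustrative assumptions, not parameters of any specific device.

```python
import numpy as np

def simulate_phosphenes(image, grid=(10, 10), dropout=0.3, jitter=0.2,
                        out_size=100, sigma=3.0, seed=0):
    """Toy simulated-prosthetic-vision renderer (illustrative assumptions only).

    The image is block-averaged to one brightness per electrode; a fraction of
    electrodes is randomly disabled (dropout) and the surviving phosphene
    centers are jittered (array distortion) before being drawn as Gaussian blobs.
    """
    rng = np.random.default_rng(seed)
    gh, gw = grid
    h, w = image.shape
    # Block-average the image down to one intensity level per electrode.
    img = image[:h - h % gh, :w - w % gw].astype(float)
    levels = img.reshape(gh, img.shape[0] // gh,
                         gw, img.shape[1] // gw).mean(axis=(1, 3))
    levels = levels / (levels.max() + 1e-9)

    alive = rng.random((gh, gw)) >= dropout            # electrode dropout mask
    gy, gx = np.mgrid[0:gh, 0:gw]
    # Regular grid centers plus Gaussian jitter to mimic a distorted array.
    cy = (gy + 0.5) * out_size / gh + rng.normal(0, jitter * out_size / gh, (gh, gw))
    cx = (gx + 0.5) * out_size / gw + rng.normal(0, jitter * out_size / gw, (gh, gw))

    yy, xx = np.mgrid[0:out_size, 0:out_size]
    canvas = np.zeros((out_size, out_size))
    for i in range(gh):
        for j in range(gw):
            if alive[i, j]:
                blob = np.exp(-((yy - cy[i, j]) ** 2 + (xx - cx[i, j]) ** 2)
                              / (2 * sigma ** 2))
                canvas = np.maximum(canvas, levels[i, j] * blob)
    return canvas
```

Sweeping `dropout` and `jitter` in such a simulator is one way simulated-prosthetic-vision studies expose normally sighted subjects to the irregularities described above.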

The Optimization Strategy of Face Recognition
Human beings communicate socially with other people very frequently in daily life, so learning how to improve face recognition through image processing is one of the important directions in prosthesis research. The related studies conducted recognition tasks with either unfamiliar or familiar faces. Boyle et al. [23] designed six processing schemes based on image enlargement and asked subjects to choose the best scheme for face recognition under prosthetic vision. The results showed that image optimization based on a magnification window driven by saliency detection was chosen most often and was thus considered the most effective. Wang et al. [76] proposed three face detection strategies for investigating the appropriate regions for face recognition under artificial prosthetic vision: the first detected faces with the Viola-Jones face detection technique (VJFR) and boxed out face regions; the second extracted face regions according to statistical face ratios (SFR) based on the results of VJFR; the third used a matting face region (MFR) depending on the detections of the previous two methods. Among the three methods, the best recognition accuracy subjects achieved at low resolution was 67.22 ± 14.45%. Meanwhile, the experimental results indicated that hair was important for familiar face recognition at a low resolution.
Interior feature extraction is particularly important in familiar face recognition because interior features (e.g., the glasses, nose, and mouth regions) can help subjects identify familiar faces. Rollend et al. [77] proposed an image enhancement method using efficient local binary pattern (LBP) features to detect faces: when a detected face intersects the implant's field of view, the area around the face is segmented with an ellipse and face contrast is enhanced by histogram equalization, achieving real-time face detection at a low resolution. Moreover, to highlight interior features, Jessica et al. [78] caricatured face images to exaggerate identity information in both familiar and unfamiliar face recognition. Average female and male faces were calculated from the locations of marked face attributes; by exaggerating the distance between attributes of the target face and the average face, such as the lips, a person with thick lips was caricatured so that the lips became even thicker. At a resolution of 40 × 40 and a dropout rate of 30%, the subjects' average face recognition accuracy improved from 55% to 65%, exceeding the passing level. A schematic illustration of the principle is shown in Figure 4; panels A, B, and C in Figure 4 show the results of the face processing. To further reduce the difficulty of face recognition, Zhao et al. [79] proposed a FaceNet-based strategy to transform and replace complex face information with simple Chinese characters (surnames) in real time, resulting in recognition accuracy values of 77.37% and above and providing a possible new direction for improving face recognition in the field of prosthetics. Chang et al. [80] combined Sobel edge detection and contrast enhancement techniques to highlight interior features of familiar faces. The proposed contrast enhancement was a novel histogram equalization technique that adjusted the input histogram by adaptively changing parameters to enhance the image naturally.
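The contrast enhancement step in the approaches above can be illustrated with plain global histogram equalization; the cited studies used adaptive variants, so this is a simplified sketch rather than their exact method.

```python
import numpy as np

def equalize_histogram(gray):
    """Global histogram equalization for an 8-bit grayscale image: a simplified
    stand-in for the adaptive equalization variants used in the cited studies."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # cumulative count at the first occupied level
    # Remap intensities so the output histogram is approximately uniform.
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min + 1e-9) * 255),
                  0, 255).astype(np.uint8)
    return lut[gray]
```

Applied to a low-contrast face region, the equalized image spreads the occupied intensity range over the full 0-255 scale, which makes interior features easier to distinguish after downsampling to phosphene resolution.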
The face images selected in the experiments were all familiar faces for the subjects. The results showed that the subjects' face recognition accuracy reached 27 ± 12.96%, 56.43 ± 17.54%, and 84.05 ± 11.23% for the three resolutions (8 × 8, 12 × 12, and 16 × 16), respectively, while the subjects' average response times for recognizing facial images were 3.21 ± 0.68 s, 2.73 s, and 1.93 ± 0.53 s, respectively. Recently, Xia et al. [81] proposed an F2Pnet for translating faces into pixelated faces, and 14 subjects were recruited for tests of face recognition. The training dataset was AIRS-PFD, and the results showed that mean individual identifiability values were 58% with pixelated faces and 46% with reduced resolution and display degradation (30%).
Methods such as image enhancement and edge detection for interior face features have been shown to help improve face recognition under prosthetic vision, and some methods have been applied in prosthetic devices [82]. However, studies of image optimization algorithms for face recognition under irregular artificial prosthetic vision remain relatively few, and the resolutions used in the studies conducted so far have been high, much higher than the number of electrodes in the more widely implanted devices.

The Optimization Strategy for Character Recognition
Character recognition has likewise received much attention as an important direction in prosthesis research. Early studies focused on the effects of phosphene properties, such as dot size and number, on character recognition [11,83-85]. Some of the work utilized image processing methods; for example, Fu et al. [86] processed images with cropping and segmentation. Considering the presence of dropout phosphenes and array distortions, some researchers have mitigated the adverse effects of such irregularities through image processing. Dai et al. [26] proposed two correction methods, weighted nearest-neighbor search (NNS) and expansion based on image morphology, to improve the recognition of Chinese characters under irregular phosphene arrays. The results demonstrated that the average accuracy after using the correction was more than 80% when the index of array irregularity reached 0.4. Based on this work, Lu et al. [87] optimized the NNS and further proposed a projection method to improve the reading ability of subjects with irregular phosphene arrays; the specific processing flow is shown in Figure 5. In Lu's study, the NNS searched the evoked irregular phosphene array for the nearest phosphene dot, q_k, within a circle centered on the point p_i of the ideal regular phosphene array; a schematic diagram is shown in Figure 6, where the observed q_k replaces p_i to express visual information. Projection refers to superimposing normal characters on a phosphene array of the same size and pixelating the strokes over the viable phosphenes to finally generate the corresponding pixelated character results. Hyun et al. [89] investigated the effects of image presentation methods on character recognition at different stimulus frequencies. If the electrode stimulation frequency is too fast, it may cause the subject to see multiple phosphenes or even a large phosphene directly occupying the entire visual field.
According to the experimental results, the accuracy of Chinese character recognition under both strategies was higher than that before optimization, and the effect was better under the nearest neighbor search optimization strategy. The results also indicated that, for the NNS, the larger the search range, the higher the subjects' recognition accuracy; the accuracy reached or exceeded 69.4 ± 3.4% when the search range reached or exceeded 0.6 times the adjacent phosphene spacing. This is because the NNS method complements the missing features caused by the distortion and dropout of character strokes while preserving the structure of the Chinese characters.
Kiral-Kornek et al. [88] proposed extracting edge orientation information encoded as directional elliptical phosphenes to improve letter recognition performance under prosthetic vision. The results showed that, at a dropout rate of 50%, subjects achieved 65% recognition accuracy using the directional phosphene strategy, significantly higher than the 47% accuracy under the uniform stimulation strategy. Hyun et al. [89] investigated the effects of image presentation methods on character recognition at different stimulus frequencies. If the electrode stimulation frequency is too fast, the subject may see multiple phosphenes or even a single large phosphene occupying the entire visual field. At the two resolutions of 6 × 6 and 10 × 10, Hyun et al. applied two pixelization methods to Korean and English letters: the static pixelization method and the spatiotemporal pixelization (SP) method. In the SP method, the original image was first downsampled with a block-averaging algorithm at four times the spatial resolution of the pixelation, and the block-averaged image was then subsampled into four different low-resolution images. A two-dimensional Gaussian function was convolved with each subsampled image to generate four different phosphene results, which were presented to the subjects at different stimulus frame rates. The spatiotemporal pixelization strategy significantly improved the recognition accuracy from a failing grade to 80%. This sequential stimulation of subsampled images "splits" the stroke structure into four parts and exploits the short-term memory of the human brain to achieve character recognition.
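A minimal sketch of the spatiotemporal pixelization idea, under the assumption of a square grayscale input and with a simple 3 × 3 box filter standing in for the two-dimensional Gaussian, might look like this (frame timing and stimulus encoding are omitted):

```python
import numpy as np

def _smooth(a):
    """3 x 3 box smoothing, a crude stand-in for the 2-D Gaussian."""
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def spatiotemporal_pixelize(img, n):
    """Sketch of the SP idea: block-average img to a 2n x 2n grid
    (four times the n x n phosphene resolution), split it into four
    n x n subimages by 2 x 2 offset subsampling, and smooth each.
    The four frames would be presented to the subject in sequence."""
    h, w = img.shape
    bh, bw = h // (2 * n), w // (2 * n)
    # block-average down to the 2n x 2n intermediate resolution
    avg = img[:bh * 2 * n, :bw * 2 * n].reshape(2 * n, bh, 2 * n, bw).mean(axis=(1, 3))
    # 2 x 2 offset subsampling yields the four low-resolution frames
    return [_smooth(avg[dy::2, dx::2]) for dy in (0, 1) for dx in (0, 1)]
```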
Character recognition is an important part of daily life, and researchers have used computer vision image processing methods to assist visually impaired people with character recognition. Compared with face and object recognition, character/letter recognition is less difficult and commonly does not require high visual acuity. Likewise, in prosthetic vision, characters/letters do not need many phosphenes to convey information, and mostly simple preprocessing methods such as binarization are used. During the clinical trial phase, implant recipients were able to perform character recognition faster or better after a short training period compared with the other visual tasks tested [43]. To reduce the adverse effects of irregular artificial vision on reading or daily written communication, researchers have proposed several array optimization methods, which have been validated in Chinese character recognition under simulated artificial vision. After irregularity optimization under simulated prosthetic vision, subjects achieved recognition rates of more than 80% [26]. Even in the recognition of Chinese characters with complex strokes, there was a large improvement in the recognition rate with optimized correction. Consequently, there have been fewer studies on character recognition and its information optimization methods in recent years. On the other hand, the role of these correction methods for characters of other languages should be further investigated in the future.
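For illustration, a simple binarization-and-downsampling preprocessing step of the kind mentioned above could be sketched as follows; the block-averaging scheme and threshold are assumptions for this sketch, not a published pipeline:

```python
import numpy as np

def char_to_phosphenes(gray, n, thresh=0.5):
    """Binarize a grayscale character image (strokes bright on dark)
    and downsample it to an n x n phosphene grid: a phosphene lights
    up if the mean intensity of its block exceeds the threshold."""
    h, w = gray.shape
    bh, bw = h // n, w // n
    blocks = gray[:bh * n, :bw * n].reshape(n, bh, n, bw).mean(axis=(1, 3))
    return (blocks > thresh).astype(np.uint8)
```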

The Optimization Strategy for Object Recognition
Similar to the studies of face recognition, research on object recognition in this field aims to extract and enhance useful information in the low-resolution artificial visual field to help implant recipients obtain better visual perception and object recognition ability. Li et al. proposed a top-down model for global contrast saliency detection [90], which detects and extracts the most conspicuous objects in the scene in real time by combining color and intensity differences. A set of visual tasks was designed with simulated ideal artificial vision. Subjects were asked to find the target object at a distance of 2 m. The average time to complete the task was around 62 s before using the optimized strategy and around 42 s with it. To better assess the effect of the optimization strategy on daily life, a second eye-hand coordination task was designed in which subjects were asked to find two target objects among the four objects in front of them and complete the corresponding actions. In this task, the mean rate of correctly completed tasks (PC) increased from 62.85 ± 1.54% to 84.72 ± 1.41%, and the mean completion time (CT) decreased from 49.5 ± 3.76 s to 40.73 ± 2.1 s. Taken together, the mean PC and CT of both types of visual tasks verify the effectiveness of the optimization strategy. Meanwhile, the mean head motion of the subjects decreased from 939.31 ± 38.38° to 575.70 ± 38.53°, indicating that the searching scope was significantly reduced and the ability to perceive was improved. A year later, Li et al. [91] used another, bottom-up saliency detection model, graph-based visual saliency (GBVS), combined with edge extraction to help locate foreground regions, which could also aid the recognition of one or two objects of interest at an ideal low resolution.
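As a toy illustration of the global-contrast idea (not Li et al.'s actual model), saliency can be approximated by each pixel's color distance from the mean color of the whole image; the normalization step is an assumption of this sketch:

```python
import numpy as np

def global_contrast_saliency(img):
    """Toy global-contrast saliency: each pixel's saliency is its
    color distance from the mean color of the whole image
    (img: H x W x 3 array of floats in [0, 1])."""
    mean = img.reshape(-1, 3).mean(axis=0)
    sal = np.linalg.norm(img - mean, axis=2)
    return sal / (sal.max() + 1e-8)  # normalized saliency map
```

The resulting map could then be thresholded to extract the most conspicuous region before pixelation.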
Considering the deficits in real artificial vision, Li et al. [92,93] utilized a generative adversarial network (GAN) model, which has had remarkable success in the field of image inpainting, to compensate for the absence of phosphene points. The Pix2pix GAN was used to learn the mapping relationship between RGB images and pixelated results with a generator and a discriminator, generating pixelated results with added phosphene points close to the real ones. The principle of the model is shown in Figure 7, and the calculation is shown in Equation (1).
pixel_reconstructed = M ⊙ y + (1 − M) ⊙ G(Z), (1)

where M denotes the input binary mask; y denotes the input image with dropout phosphene points; G(Z) is the mapping that best represents the dropout parts, with Z being the optimal solution; and ⊙ denotes the Hadamard product (adapted from [93]).
Inputting arbitrary Gaussian noise, z, into the generator, the image features of simulated ideal phosphenes were learned to obtain a mapping, G(z), close to the real generated image. The input to the generator was an image, y, with dropout phosphenes. The Hadamard product of y and the binary mask, M, of the non-dropout part was calculated to obtain the image y ⊙ M of the non-dropout part. Simultaneously, the discriminator determined the difference between the input image, y, and the real image; through difference back-propagation and adversarial training, the generator used the previously learned features to reconstruct the dropout part of y and obtain the optimal result, G(Z), where the optimal solution, Z, was obtained by back-propagating the updated generator parameters while minimizing the global loss. Then, the non-dropout part, y ⊙ M, and the generated dropout part, G(Z), were summed to obtain the final complete image, pixel_reconstructed. Subjects were asked to verbally identify the object appearing on the screen within 10 s at a distance of 35-40 cm from the display. The test results showed that the average recognition accuracy of subjects ranged from 35.0 ± 4.3% to 60.0 ± 6.1%, and the accuracy improved to 80.3 ± 7.7% after using the phosphene point addition optimization strategy. The model decreased the difficulty of the recognition process, and subjects could recognize most of the pixelated objects.
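The final blending step of this inpainting scheme, i.e., summing the preserved non-dropout part with the generated dropout part, can be sketched as a simple mask-based composition (the GAN itself is omitted; the inputs here are placeholder arrays, not the authors' actual model outputs):

```python
import numpy as np

def compose_inpainting(y, M, gz):
    """Blend the observed non-dropout part M * y with the generated
    content gz (standing in for G(Z)) over the dropout part, where
    M is 1 at surviving phosphenes and 0 at dropout positions."""
    return M * y + (1 - M) * gz
```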
To aid visual prosthesis wearers in daily perceptual tasks through scene understanding, Melani et al. [94,95] used a strategy combining structural informative edges (SIE) and object mask segmentation (OMS) to help identify objects and rooms. The objects in the indoor scene were highlighted with instance segmentation to reduce the interference of the background, the structural information in the scene was extracted with semantic segmentation, and the edge information was extracted with Canny edge detection. Object recognition and scene recognition tests were conducted to simulate the daily life of subjects within a 20° viewing angle at a 32 × 32 resolution. In the object recognition task under direct low pixelation, the subjects' correct object recognition rate was 36.83%, compared to 62.78% with the SIE-OMS strategy. The recognition success rate in the scene recognition task with the same strategy was also significantly higher than that with direct low pixelation and edge detection. To reduce the difficulty of recognition when multiple target objects within the field of view overlap, Jiang et al. [96] proposed a hierarchical method, based on the segmentation of multiple targets with Mask R-CNN, that assigns different grayscale levels to multiple targets according to an object's location, size, and degree of importance in the scene. Ultimately, the subjects achieved average task completion rates of 87.08 ± 1.92% on the test describing the number of multiple objects in the scene and 60.31 ± 1.99% on the test describing the content of the scene. Considering the depth information of objects, David et al.
[97] proposed an InI-based object segmentation model that extracts objects from the scene based on their depth information, derived from their distance from the camera (simulating a human eye), pixelates the extracted objects to improve their clarity in the visual field, and reduces some distracting spatial and temporal effects. Other scholars have conducted studies with other image acquisition methods, such as Dagnelie et al. [98], who used infrared enhancement to help subjects with cup recognition. Also using infrared images, Liang et al. [99] proposed an infrared image enhancement algorithm based on an improved SAPHE algorithm to enhance image contrast and highlight edge contours for object recognition. The experimental results showed that the subjects achieved an average recognition accuracy of 86.24 ± 1.88% under infrared mode processing, higher than the 64.55 ± 3.34% with direct low-resolution pixelation. The depth information of images plays an important role in obstacle avoidance and navigation tasks. Alejandro et al. [100] used depth information to detect obstacles while guiding the walking direction with a chessboard grid pattern. To make better use of the depth information in the scene, Rasla et al. [101] proposed a scene simplification strategy based on depth estimation and semantic edge detection, using a neurobiologically inspired bionic visual computational model to simulate obstacle avoidance tests; depth estimation was implemented by monodepth2, a self-supervised monocular depth estimation model, to predict the relative depth map of the pixels in each frame. With the scenario simplified by semantic edge detection, the success rate of subjects avoiding all obstacles without collision was above 80% in the obstacle avoidance test under simulated prosthetic vision using the combination of depth information and edges.
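As a rough sketch of how depth information and edges can be combined to simplify a scene (using a plain gradient-magnitude edge detector as a stand-in for semantic edge detection; the depth threshold and edge threshold are assumptions of this sketch):

```python
import numpy as np

def simplify_by_depth(gray, depth, max_depth):
    """Keep only near objects (depth < max_depth) and mark their
    contours with a simple gradient-magnitude edge detector, a crude
    stand-in for the semantic edge detection used in the literature."""
    near = (depth < max_depth).astype(float)
    masked = gray * near                  # suppress far background
    gy, gx = np.gradient(masked)          # image gradients
    return (np.hypot(gx, gy) > 0.1).astype(np.uint8)  # binary edge map
```

The resulting binary edge map could then be mapped onto the phosphene grid for stimulation.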
Methods based on saliency detection and segmentation, which are widely used in the field of visual prostheses, detect the most salient object or objects in the field of view, remove complex backgrounds, and satisfy the requirement of providing limited information at low resolution. Some scholars have carried out studies on the methods of

Summary of Optimization of Information Processing
Since research on the optimization of visual information processing began, different computer vision methods have been applied in the image processing stage, and subjects have shown improved performance on different visual tasks. Table 2 summarizes the image processing optimization methods in the three vision tasks, with or without considering phosphene dropout or array distortion, as well as the optimization and final test results. In the simulation experiments conducted in the reviewed image processing optimization studies, the selected subjects were normally sighted and unfamiliar with artificial vision, to avoid learning effects in the psychophysical experiments. The datasets used in the experiments were mostly created by the researchers themselves. A few studies used public datasets, such as the public dataset of indoor scenes [102] used in the study of Melani et al. [94,95] and the ETH-80 adopted in Li's work [92]. In face and character recognition, some well-known datasets were also utilized, such as AIRS-PFD in Xia's work [81] and the standard MNREAD reading test in the work of Fu et al. [86]. Others captured images directly from the camera in real time as experimental images. However, without public datasets, it is difficult to generalize the researchers' experimental results. Preprocessing was sometimes applied with common methods or computer vision techniques: Boyle et al. [23] cropped images to fit the window size, Wang et al. [76] used noise reduction for face recognition, and Chang et al. [80] extracted the edges of faces with Canny operators. Before layering and optimizing object information, Mask R-CNN was utilized in [96] to obtain the masks of multiple object instances, which can also be regarded as preprocessing.
Meanwhile, certain image processing optimization strategies have high hardware requirements, and computing devices with GPUs are essential for implementing the algorithms in real time. The image processing optimization strategies applied to the different visual tasks brought better visual perception to the subjects to some extent, and the simulation experiments in each study mostly assessed the effectiveness of these strategies quantitatively. However, the majority of them are simulation studies under ideal arrays [21,22,90,91,96,99,103-108], except for [26,78,81,87,92], which consider the irregularities of real artificial vision.

Discussion
Visual prostheses have provided an important research direction for repairing the visual perception of the visually impaired and have advanced into certain clinical applications. Visual prostheses are not a fundamental cure for visual impairment, but they give the visually impaired the opportunity to improve their ability to perform functional visual tasks. This paper has reviewed the visual perceptual ability of implant recipients in clinical trials and studies of image processing optimization in the field of simulated prosthetic vision. Although the experimental results of many studies are promising, some problems still need to be solved. Few prosthetic devices have entered the clinical phase, and some are implanted far from the center of the visual field and are not well perceived by the wearer. Implantation into the patient's eye can take a long time [99,109-113] and carries surgical risks. The number of electrodes in these devices is limited, and there are irregularities in the induced phosphenes. The more widely worn Argus II has an external image processing unit that includes edge detection and enhancement technology. However, these methods are relatively simple and provide global information to the wearer that is only loosely related to specific visual tasks. While image processing algorithm research is spreading in the field of visual prosthesis applications, some issues are worth considering. Most of the image processing algorithms investigated by researchers are image optimization methods for single vision tasks on static images. However, in real life, people typically perform two or more visual tasks at the same time; when putting on clothes, for example, people may perform object detection and eye-hand coordination simultaneously. Such multitasking requires image processing methods that guarantee good performance while ensuring real-time implementation, and few models in the current research can meet these requirements.
Meanwhile, the dropout of phosphene points and the distortion of phosphene arrays in artificial vision have received less consideration. Additionally, the simulated phosphenes are mostly colorless, whereas clinically evoked phosphenes are often colored, e.g., yellow, red, and orange [65,114,115]. Recently, Vernon et al. [116] proposed a hybrid stimulus model that can provide color information without reducing spatial resolution. A better understanding of colored phosphene vision will help improve research on visual function restoration for prosthesis wearers. Some researchers have also improved electrode implantation by proposing an array in a honeycomb configuration [117]; this unique configuration offers great possibilities for improving the spatial resolution of visual prostheses. Building on existing image optimization research, future work will focus on irregular array optimization for different image categories, and enriching the color information of phosphenes should be pursued while exploring improvements in the resolution of visual prostheses.

Conclusions
In studies of the optimization of information processing, most results show that computer vision can be used to improve the visual functions of wearers, such as object recognition, face recognition, and character recognition. Future visual prosthesis devices may have smaller implanted electrodes, allowing the implantation of higher-density microelectrode arrays. However, increasing the density of electrode arrays may not always produce the expected high-resolution artificial visual information; it may introduce phenomena such as virtual electrodes and increase both the risk of tissue damage and the cost of implantation. These issues indicate that the growth in the number of implanted electrodes will be limited in the near future. With the continuing development of artificial intelligence, more accurate and efficient image detection and segmentation techniques are emerging, offering the possibility of improving the image processing modules in prosthetic devices to optimize artificial vision. It is believed that improvements in visual prosthesis hardware and the application of computer vision will complement each other to optimize the elicited vision of artificial visual prostheses, bringing the hope of "seeing" to prosthesis wearers.