CIRO: The Effects of Visually Diminished Real Objects on Human Perception in Handheld Augmented Reality

Abstract: Augmented reality (AR) scenes often inadvertently contain real-world objects that are not relevant to the main AR content, such as arbitrary passersby on the street. We refer to these real-world objects as content-irrelevant real objects (CIROs). CIROs may distract users from focusing on the AR content and bring about perceptual issues (e.g., depth distortion or physicality conflict). In a prior work, we carried out a comparative experiment investigating how the degree of visual diminishment of such a CIRO affects users' perception of the AR content. Our findings revealed that the diminished representation had positive impacts on human perception, such as reducing distraction and increasing the presence of the AR objects in the real environment. However, in that work, the ground truth test was staged with perfect, artifact-free diminishment. In this work, we applied an actual real-time object diminishment algorithm on the handheld AR platform, which cannot be completely artifact-free in practice, and evaluated its performance both objectively and subjectively. We found that the imperfect diminishment and visual artifacts can negatively affect the subjective user experience.


Introduction
Augmented reality (AR) allows users to see a mixed environment in which virtual objects are inserted into and superimposed on the real environment [1]. Such augmentations are often targeted for predesignated objects that are recognized and tracked by the AR system [2]. However, real environments are dynamic such that new objects may appear and inadvertently interfere with the main content [3][4][5][6]. For instance, the Pokémon Go application overlays the Pokémon character onto the video captured environment, making it appear as if it is standing on the ground, but a passerby can break such an illusion (see Figure 1b). We refer to these real-world objects as content-irrelevant real objects (CIROs). CIROs may distract users from focusing on the AR content and bring about perceptual problems [7,8], such as depth distortion [9,10] and physicality conflict [11]. One way to address this issue is to recognize arbitrary CIROs and even allow natural and plausible physical interactions [12,13] with the main AR content [4,11].
Instead, we applied diminished reality (DR), that is, we visually diminished the appearance of the CIROs in hopes of reducing the aforementioned perceptual issues in handheld AR scenes. Diminished reality (DR) is a methodology to reduce the visual gaps between real and virtual scenes by erasing or eliminating some objects (in this case, the CIROs that cause the perceptual problems) from the scene to varying degrees [14]. However, on handheld devices, DR may introduce another issue related to the inherent dual view [15]. For example, even if the CIROs are visually eliminated from the scene as shown through the small screen of the handheld AR device, they are still visible outside the screen in the real world, causing perceptual inconsistency. Thus, complete elimination might not be the best solution. In a prior work [7], we carried out a comparative experiment investigating the effects of DR on user perception as a function of the degree of visual diminishment of the CIRO (see Figure 1a-c). However, that work assumed the ideal ground truth case in which the diminishment was perfect, without any noticeable artifacts.
In this work, we applied an actual implementation of removing a dynamic object (such as a pedestrian) and inpainting it with the hidden background in real time on a handheld AR platform (see Figure 1d). We report on its objective performance and also the results of an experiment, similar to the one reported in [7], in which subjective perceptual performance was assessed. For the experiment, we prepared a scenario in which a pedestrian, that is, the CIRO, walked through the AR interaction space while a user was interacting with a virtual pet. The user would observe the pedestrian invading the AR scene and experience distorted depth sensations due to the absence of occlusion effects. This paper is organized as follows: In Section 2, we review previous research. We also provide a brief summary of the aforementioned prior experiment [7] to better set the stage for this work. Sections 3 and 4 outline the basic experimental set-up, and the real-time inpainting algorithm implemented for the handheld platform and its objective performance, respectively. Section 5 details the subjective perceptual experiment, followed by an in-depth discussion and implications of the results in Section 6. Section 7 concludes the paper and discusses future research directions.

Perceptual Issues in AR
Depth perception in the context of AR refers to the perceived sizes or positions of the virtual objects (e.g., augmentations) as situated in the real environment and in relation to the real objects [8][9][10]. Among others, the correct depiction of occlusion among objects (real or virtual) is important [16][17][18], whereas incorrect rendering is known to adversely affect the sense of realism [19]. However, acquiring three-dimensional information about the environment and the objects in it is not a trivial task [16,20], and as such, virtual objects are often added onto the real scene by simply overlaying them in the foreground of the screen space [21].
Recent advances in tracking [22,23] and depth estimation techniques [21,24,25] are becoming feasible even on mobile platforms. However, gracefully handling unknown objects unexpectedly introduced into the scene, such as CIROs, is still a difficult task, especially on the relatively computationally limited handheld AR devices.
A related problem arises when the rendered AR scene looks physically implausible [6,19], e.g., objects floating in the air or overlapping with real objects (i.e., physicality conflict), because of a lacking or erroneous 3D understanding of the environment. An additional provision, such as enabling the virtual objects to react to objects in the real environment in a physically plausible way, can help improve the sense of co-presence with the virtual object [4,6,26].
Another problem, inherent to the relatively small handheld displays, is the dual-view problem [15]: due to the limited size of the display, large objects are visible both in the display screen and by the naked eye, but in an inconsistent way because of the difference in perspective parameters between the device's camera and the user's eye [27]. Again, employing additional sensors and adjusting the camera focus [28] can alleviate the problem, but this is practically not feasible for handheld platforms, which strive to be stand-alone, self-contained, and inexpensive. In this work, we explore removing the source of the problem, the CIRO (or making it less visible), in the first place.

Visual Stimuli Altering Our Perception
It is well known in psychology that visual stimuli, vision being the dominant human sensory modality, can be manipulated to change one's perception of the world [29,30]. Several studies [31][32][33][34] have applied high-visibility visual effects (VFX) augmentation, based on visual dominance [30], to influence human perception. For example, Weir et al. [31] augmented flame effects on the participants' hands; despite being aware that the flames were not real, participants felt warm. Punpongsanon et al. [35] manipulated softness perception by altering the appearance of objects as participants touched them.
The field of DR seeks to reduce the visibility of real objects in the environment, often for more efficient interaction and focus [36][37][38][39][40]. It can be used not just to hide certain objects but also to affect our perception of the environment or situation in a particular way. For instance, Buchmann et al. [41] and Okumoto et al. [42] used partially transparent (diminished) hands so that objects held by the hand could be seen better. Enomoto et al. [43] used DR to remove the visiting crowd blocking a painting in a gallery so that the user could see it clearly. However, DR is not widely employed due to its heavy computational load (especially on handheld devices), and research explaining how DR might affect user perception is hard to find.

Dynamic Object Removal
Early object removal methods segment the dynamic or static objects (e.g., a pedestrian) from the image and then replace them with the hidden background by leveraging multi-view capture systems [44,45]. Object segmentation itself is a difficult problem. Traditional approaches include image subtraction, template matching, and pixel clustering [46], but recently, deep learning-based approaches have shown dramatically superior results [47]. Performance is approaching near real-time with reasonable accuracy as well [48].
The deep learning-based inpainting approach [49,50] is also quite promising compared to simply relying on multi-view images. Most inpainting algorithms operate offline over the video frames, so that they can refer to and make use of as many frames as possible to find the background and fill in the object mask [51]. VINet [52] is considered one of the state-of-the-art deep learning-based inpainting methods in terms of its computational requirements and accuracy. However, it was not designed or optimized for small handheld devices. In this work, we used and adapted YOLACT [48] for CIRO segmentation and mask creation, and then used a simplified and extended version of VINet to focus on and inpaint an object passing through the scene. We ran it on the mobile platform in real time.

Impacts of Diminishing Intensity on User Perception
In our prior work [7], we investigated the influence of the intensity or degree of visual diminishment on user perception in an AR setting. The AR content featured interaction with a virtual pet, with a pedestrian (the CIRO) suddenly passing by and intruding into the scene.
Three test conditions of how the intruding pedestrian was treated were compared: Default (DF), as is; the CIRO made semi-transparent (STP); and the CIRO made completely transparent (TP). That is, the intensities of diminishment were 0%, 75%, and 100%, respectively (see Figure 1a-c). This work assumed the ideal ground truth case (e.g., TP) in which the diminishment was perfect, without any noticeable artifacts. The experiment was made possible by rendering pre-acquired images rather than using real-time image streams. The results are summarized in Table 1; they generally show that the diminishment helped reduce distraction and physical implausibility, and improved the sense of presence. Refer to [7] for more details. However, it remains to be seen whether such results still hold with the deployment of actual DR methods that may exhibit noticeable artifacts, especially on computationally limited handheld devices.

Table 1. Experimental results of [7] (* p < 0.05; ** p < 0.01; *** p < 0.001).


Basic Experimental Set-Up
The main AR content used in the experiments showcased a virtual pet (panda) augmented into the real environment and exhibiting various interactive and realistic behaviors. The pet possessed 17 user-reactive or autonomous behaviors (e.g., greeting, jumping, falling over, and begging for food) with 13 facial expressions (see Figure 2). The pet was augmented, navigated, and acted on the floor (planar surface) detected using the ARCore SDK [53]. To further strengthen its perceived presence [54], a simple physical-virtual interactive behavior was added where the pet could be commanded to approach and turn on/off an actual IoT-driven lamp. The panda could be controlled by voice commands.
While the user in the video interacted with the virtual pet, a pedestrian walked into the space, wandering back and forth perpendicularly to the AR camera's viewing direction. The pedestrian walked over the virtual pet, which constitutes a physicality conflict that causes perceptual issues (see Figure 3e,f). The physical test environment is shown in Figure 3a. A patterned mat was put on the floor for robust positional tracking and 3D understanding of the physical space. As will be explained in more detail, our experiments were not conducted in situ, but online, with the subjects watching captured video of the AR content (with the passerby intruding into the scene). Figure 3a also depicts the walking path of the pedestrian (i.e., the CIRO). Thus, two smartphones were used: one as the mobile AR platform providing the AR scene, from a first-person but fixed AR viewpoint, and the other for recording the entire situation (both the AR scene as shown through the first smartphone and the larger environment; see Figure 3e-g).

CIRO Diminishing System

System Design
The CIRO diminishing system is mainly composed of two modules: (1) object segmentation and (2) inpainting. For detecting and segmenting the CIRO in this experiment, YOLACT [48] was used, configured to detect and produce a binary mask for the human body. The video frames from the mobile AR client are passed to this module running on the server. The result, the segmented image with the binary mask, is passed back to the mobile AR client, which fills in the hole with the inpainting module. YOLACT uses a fully convolutional deep learning architecture (and a fast non-maximal suppression technique), and is considered faster than any previous competitive approach (reported to be 33.5 fps evaluated on a single Titan Xp [48]). The object segmentation task is broken into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. The final masks are produced by linearly combining the prototypes with the mask coefficients.
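The prototype-and-coefficient mask assembly described above can be illustrated with a minimal NumPy sketch. This is not YOLACT's actual implementation, only an illustration of the final combination step; the tensor shapes and the toy inputs are assumptions for the example.

```python
import numpy as np

def assemble_masks(prototypes, coefficients, threshold=0.5):
    """Combine YOLACT-style prototype masks with per-instance coefficients.

    prototypes:   (H, W, k) array of k prototype masks.
    coefficients: (n, k) array of mask coefficients for n detected instances.
    Returns an (n, H, W) boolean array of binary instance masks.
    """
    # Linear combination of the prototypes, yielding one map per instance.
    combined = np.tensordot(coefficients, prototypes, axes=([1], [2]))  # (n, H, W)
    # Sigmoid squashes the combination into [0, 1]; thresholding binarizes it.
    return 1.0 / (1.0 + np.exp(-combined)) > threshold

# Toy example: four 8x8 prototypes and one detected person (hypothetical values).
protos = np.random.default_rng(0).normal(size=(8, 8, 4))
coeffs = np.array([[1.0, -0.5, 0.2, 0.0]])
mask = assemble_masks(protos, coeffs)
```

The binary mask produced this way is what the server sends back to the AR client for inpainting.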
As for inpainting the CIRO mask, VINet [52], one of the state-of-the-art methods of its kind that executes at near real-time, was adapted. VINet is designed as a recurrent network and internally computes the flow fields from five adjacent frames for the target frame. In addition, it takes in the previously generated frame Ŷ_{t−1}, and generates the inpainted frame Ŷ_t and the flow map Ŵ_{t⇒t−1}. It employs flow sub-networks and mask sub-networks at four scales (1/8, 1/4, 1/2, and 1) to aggregate and synthesize the features progressively. Additionally, it uses 3D-2D encoder-decoder networks to complete the missing contents efficiently, and maintains temporal consistency through recurrent feedback and a memory layer, trained with the flow and warping losses. Despite being considered one of the best-performing inpainting systems, in its original form it is not suitable for real-time application on the handheld platform, because of the heavy computational load (e.g., it takes 65 ms on our machine, equipped with an Intel i7-8700K 3.70 GHz CPU and an NVIDIA GTX 1080 Ti GPU, for an input image of 512 × 512 pixels) and its requirement for future image frames.
We created a lightweight version of VINet to achieve real-time performance on the handheld platform by removing the four source-frame encoder-decoders (which account for the heaviest computational burden in the original architecture). The recurrent feedback and the temporal memory layer (ConvLSTM) were left intact to maintain the temporal consistency of the inpainting. The pre-trained model was used with no extra training. As a result, a processing time of 32 ms was achieved on the same machine. Figure 4 shows the overall configuration of the CIRO diminishing system as a server-client system. The video frames captured by the mobile AR client are passed to the server for segmentation of the human body (i.e., the CIRO) and creation of the CIRO mask, and then passed back to the AR client for inpainting. Considering the communication (TCP) and other overheads (e.g., shared memory between YOLACT and VINet), the total latency (i.e., the inpainted background update cycle) was measured to be about 60 ms (roughly 16 fps).
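The server-client control flow described above can be sketched as follows. The segmentation and inpainting functions here are deliberate stand-ins (the real system runs YOLACT on the server over TCP and the trimmed VINet on the client); the sketch only illustrates the per-frame loop, the recurrent reuse of the previous output, and how per-frame latency can be accounted for.

```python
import time
import numpy as np

# Stand-ins for the real modules: YOLACT segmentation runs on the server,
# the lightweight VINet inpainting runs on the mobile client. Both are
# mocked here purely to illustrate control flow and latency accounting.
def segment_on_server(frame):
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[2:6, 2:6] = True          # pretend the person occupies this region
    return mask

def inpaint_on_client(frame, mask, prev_output):
    out = frame.copy()
    # Recurrent feedback: fill the masked region from the previous output,
    # echoing how the trimmed VINet mainly reuses the prior frame as source.
    out[mask] = prev_output[mask]
    return out

def run_pipeline(frames):
    prev = frames[0]
    outputs, latencies = [], []
    for frame in frames:
        t0 = time.perf_counter()
        mask = segment_on_server(frame)      # would be a TCP round trip
        prev = inpaint_on_client(frame, mask, prev)
        latencies.append(time.perf_counter() - t0)
        outputs.append(prev)
    return outputs, latencies
```

In the real system, the measured per-cycle latency of about 60 ms includes the TCP round trip and shared-memory overheads that this sketch omits.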

Inpainting Performance Evaluation
We compared and evaluated the inpainting performance of the original VINet [52] and our lightweight, real-time version using the structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR) [55,56]. SSIM measures the structural similarity between two images, and PSNR measures the image distortion. We prepared two datasets as follows: (1) 17 videos and ground-truth segmentation masks from DAVIS [57], and (2) 17 videos from DAVIS with segmentation masks generated by YOLACT [48]. We then generated inpainted images from both datasets using our model and VINet, and calculated SSIM and PSNR. Our model showed similar or even better performance than the original VINet on both datasets (see Table 2). Despite this performance, our system has two limitations. First, our segmentation method (YOLACT) may be vulnerable to fast-moving objects. When pedestrians move too quickly, incorrect segmentation masks can be generated, which in turn may distort the inpainted image. Second, if a dynamic object occupies a large area of the image or stays for a long time, our system may not be able to restore the hidden background correctly, because it mainly uses the previous frame as the source for inpainting. Note, however, that the CIRO we consider in this paper, a pedestrian, in most cases simply passes by in the background. Additionally, note that, compared to the ground truth (perfect removal and inpainting), any system, including ours, is bound to exhibit some noticeable artifacts [52,56]. Figure 5 shows the worst-case visual artifacts captured by our inpainting system: part of the pedestrian's foot and the object boundaries are incompletely erased.
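For reference, the two metrics can be computed from first principles as in the sketch below. The PSNR formula is standard; the SSIM shown is a simplified single-window (global) variant rather than the usual sliding-window implementation, so it illustrates the formula but would not reproduce the exact values in Table 2.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(ref, test, max_val=255.0):
    """Simplified single-window SSIM over the whole (grayscale) image."""
    x = ref.astype(np.float64)
    y = test.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2  # stabilizing constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images yield an SSIM of 1.0 and an infinite PSNR; lower values indicate more distortion in the inpainted result relative to the ground-truth frame.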

Study Design
In our prior study [7], we confirmed the positive effect of removing or diminishing the appearance/presence of the CIRO in terms of the AR user experience. However, the test condition assumed perfect inpainting performance. Here, we conducted an experiment to investigate a more realistic situation, that is, inpainting with possible artifacts, using the basic experimental set-up and system design described in Sections 3 and 4.
The main factor was the specific inpainting method, with three values or test conditions:

• Default/as-is (DF): The participant sees the pedestrian pass by both through the AR screen and by the naked eye in the real environment. No diminishing is applied (see Figure 1b). This condition serves as the baseline, in which many perceptual problems (depth perception and dual view) [8] are likely to arise, as also demonstrated in our prior study [7].

• Transparent (TP): The CIRO, the pedestrian, is completely and perfectly removed (staged) from the AR scene and filled in, but is still visible in the real world by the naked eye (see Figure 1c). The staged imagery was prepared offline using video editing tools. Note that, similarly to the prior study, this experiment was not conducted in situ, but used offline video review (for more details, see Section 5.2). This condition serves as the ground truth of perfect CIRO removal.

• Inpainted (IP): The CIRO, the pedestrian, is removed from the AR scene and filled in using the system implementation described in the previous section, possibly with occasional visual artifacts (e.g., due to fast-moving, large pedestrians). The pedestrian is still visible in the real world by the naked eye (see Figure 1d).
In summary, the experiment used a one-factor, three-level (1 × 3) within-subject repeated-measures design.

Video-Based Online Survey
Due to the COVID-19 situation, the experiment was conducted as a video-based online survey. That is, the three test conditions described in the previous section were recorded as videos (see Section 3 for more details) and accessed for evaluation online through a website (https://khseob0715.github.io/DBmeSurvey accessed on 6 April 2021). The website started with sections presenting the experimental instructions and collecting basic demographic and other background information (e.g., previous exposure to and familiarity with AR). Then, a subject visiting the website would be served the video links, presented in a counter-balanced order, with the corresponding questionnaires (see Table 3). Only after completing the evaluation survey could the subjects proceed to the next video.

Subjective Measures
The dependent variables for this experiment measured the AR user experience through responses to survey questions in five categories: two constructs related to the CIRO, (1) distraction and (2) visual inconsistency; and three related to the perception of the virtual pet, (3) object presence [58], (4) object realism [59], and (5) object implausibility. For details, refer to Table 3. These questions were answered right after viewing the video for the given test condition. At the end of all three video viewings and evaluations, the subjects were asked to rank their preferences among the three conditions, and to prioritize the distracting objects or factors among the following: (1) the pedestrian outside the AR screen; (2) the appearance and behavior of the virtual pet; (3) the visual inconsistency between inside and outside the AR screen (dual view); (4) the pedestrian's leg in the AR screen (if visible, or its artifact); and (5) the latency of the system. Finally, the subjects were asked to count the number of jumps made by the virtual pet as a way to measure their concentration on the main AR content.

Table 3. The questionnaire used to assess the users' perception of the visually diminished pedestrian and the virtual content. Distraction, visual inconsistency, object presence, and object implausibility were addressed with a 7-point Likert scale, and object realism with a 5-point Likert scale.

Distraction
DS1: I was not able to entirely concentrate on the AR scene because of the person roaming around in the background.
DS2: The passerby's existence bothered me when observing and interacting with the virtual pet.
DS3: To what extent were you aware of the person passing in the AR scene (or real environment)?
DS4: I did not pay attention to the passerby.

Visual Inconsistency
VI1: The visual mismatch between outside and inside the screen of the passerby was obvious to me.
VI2: The different visual representations of the passerby's leg in the AR scene felt awkward.
VI3: I did not notice the visual inconsistency between the AR scene and the real scene.
VI4: The passerby's leg (or body parts) in the AR scene did not feel awkward at all.

Object Presence
OP1: I felt like Teddy was a part of the environment.
OP2: I felt like Teddy was actually there in the environment.
OP3: It seemed as though Teddy was present in the environment.
OP4: I felt as though Teddy was physically present in the environment.

Object Implausibility
OI1: Teddy's movements/behavior in real space looked awkward.
OI2: Teddy's appearance was out of harmony with the background space.
OI3: Teddy seemed to be in a different space than the background.
OI4: I felt Teddy turned on the lamp.

Object Realism
Please rate your impression of Teddy on these scales.

Participants and Procedure
We recruited 34 participants from a local community, excluding anyone who had participated in our prior work [7]. Amongst them, we omitted data from 8 participants who did not pass our screening criteria: 2 failed trick questions, and 6 had survey completion times that were too short or too long. Thus, data from 26 participants (13 males and 13 females, aged 18-26, M = 23.61, SD = 2.28) were used in the statistical analysis. The participants read the instructions as presented on the website and followed them to provide basic and background information. We asked each participant to rate their AR familiarity on a 5-point Likert scale, which yielded a slightly higher level (M = 3.23, SD = 0.86) than in our prior work (M = 2.67, SD = 1.28) [7]. Then the participants watched each test condition video (in counterbalanced order), counted the number of jumps by the virtual pet (to ensure they focused on the video), and answered the evaluation questionnaire.

Hypotheses
Based on the literature review and in consideration of our experiment conditions, we formulated the primary hypotheses as follows.

Hypothesis 1 (H1).
Subjects are distracted the most by the CIRO among various factors (as seen outside the screen space in the real environment).

Hypothesis 2 (H2).
The more diminished the CIRO is, the more the subjects will prefer the experience. TP > IP > DF.

Hypothesis 3 (H3).
The more diminished the CIRO is, the less distracted the subjects will feel. TP > IP > DF.

Hypothesis 4 (H4).
Diminishing the CIRO can worsen the visual inconsistency between its appearance inside the AR scene and outside it in the real environment. TP > IP > DF.

Hypothesis 5 (H5).
The overall experience, including the object presence and realism, can be affected in a positive way by the CIRO diminishment, despite some visible artifacts. TP > IP > DF.

Results

Table 4 shows the overall results and statistical analysis of our experiment. Non-parametric Friedman tests were applied to the measures at the 5% significance level. For the pairwise comparisons, we used Wilcoxon signed-rank tests with Bonferroni adjustment. Table 4. Experimental and analysis results (* p < 0.05; ** p < 0.01; *** p < 0.001).
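The analysis pipeline (Friedman omnibus test, then Bonferroni-corrected Wilcoxon signed-rank pairwise comparisons) can be sketched with SciPy as below. The ratings are synthetic and illustrative only, not the study's data; the function names and the synthetic score ranges are assumptions for the example.

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def analyze_ratings(ratings, alpha=0.05):
    """Friedman omnibus test over conditions, followed (if significant) by
    Bonferroni-corrected Wilcoxon signed-rank pairwise comparisons.

    ratings: dict mapping condition name -> array of per-subject scores,
             all arrays aligned by subject (repeated measures).
    """
    conds = list(ratings)
    _, p_omnibus = friedmanchisquare(*(ratings[c] for c in conds))
    pairwise = {}
    if p_omnibus < alpha:
        n_pairs = len(conds) * (len(conds) - 1) // 2
        for a, b in combinations(conds, 2):
            _, p = wilcoxon(ratings[a], ratings[b])
            pairwise[(a, b)] = min(p * n_pairs, 1.0)   # Bonferroni adjustment
    return p_omnibus, pairwise

# Synthetic 7-point distraction ratings for 26 subjects (illustrative only).
rng = np.random.default_rng(1)
scores = {
    "DF": rng.integers(4, 8, 26),   # high distraction
    "IP": rng.integers(3, 8, 26),
    "TP": rng.integers(1, 4, 26),   # low distraction
}
p, pairs = analyze_ratings(scores)
```

Running the omnibus test first guards the pairwise comparisons, and the Bonferroni factor (three pairs here) keeps the family-wise error rate at the chosen alpha.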

Discussion
The prior experiment [7] already showed that users felt the least distraction and preferred the AR experience with complete and ideal removal of the CIRO, that is, the TP condition. This experiment assessed the practical situation in which the diminishment is not perfect, in the sense that the object might show up unexpectedly with varying degrees of noticeable artifacts (i.e., the IP condition). Our main underlying thought was that the quality of the proposed CIRO diminishment algorithm can be judged not just quantitatively, but more importantly by how its perceptual experience fares compared with that of the perfect case (TP) or the default case (DF). We discuss the raw results presented in Section 5.6 with regard to the hypotheses stated in Section 5.5.

H1: Subjects Are Most Distracted by the CIRO among Various Factors
Subjects were found to be most distracted by the CIRO appearing within the AR scene, versus the remaining part visible outside the screen in the real environment, as shown in Figure 6a. This result is consistent with the prior experiment [7]. Thus, it reasserts the need to remove the CIRO from the scene if possible. The removed look could cause an inconsistency with the CIRO visible in the real environment, but the results show that subjects did not regard this factor as important as the CIRO's intrusion itself. We also notice that the latency was a relatively significant distractor, at 21%. However, the latency must have been caused by the heavy computational load the mobile platform had to bear for near real-time DR processing and rendering. The interactive virtual pet was also listed as a distractor, possibly due to the lack of graphic realism (in terms of blending into the real environment) and of natural, physically plausible behaviors. Nevertheless, the results show the importance of handling dynamically intruding objects for the best AR experience.

What almost naturally follows from H1 is that subjects would rate the diminished AR content as giving the best user experience with the least distraction; perhaps a lesser degree of dual view or visual inconsistency improves physical plausibility, thereby affecting the level of the object's presence (namely, of the main character, the virtual pet), and even its realism. Indeed, Figure 6b shows this very result: TP was ranked higher than DF and IP. Table 4 shows that TP was judged to be the least distracting and to have less visual inconsistency than IP (both with statistical significance), and to exhibit the highest object presence, plausibility, and realism. However, no difference was found between DF and IP in the various evaluation criteria.
Thus, even though IP showed good quantitative performance in terms of erasing the CIRO, perceptually it was still insufficient to eliminate the negative factors of the CIRO (see Figure 5). Subjects reported that this was much more so when the CIRO stayed within the AR view longer (rather than just passing by quickly). This goes to show that the imperfect, not completely artifact-free IP condition caused significant visual discomfort, almost as much as not erasing the CIRO at all. Note that in our prior study, the semi-transparently visualized CIRO had the effect of less distraction and an overall improved user experience, but no effect on the visual inconsistency (see Table 1). Similar results were found in [60,61]. Thus, we can posit that the problem was not that the CIRO was noticeable, but that the visual artifacts were aggravating. We conclude that H2 and H3 are partially supported.

H4: Diminishing CIROs May Not Worsen the Visual Inconsistency
As already indicated, AR scenes on open mobile screens can suffer from the dual-view problem, in which relatively large objects in the scene are broken into two pieces by the screen boundary, possibly not exactly on the same scale. Erasing or diminishing the CIRO can exacerbate this problem: part of the real object visible in the real world is now "diminished" or gone in the AR view, a physically impossible situation [7]. However, the visual inconsistency for TP was unaffected, i.e., no different from DF. Rather, IP showed a statistically higher visual inconsistency. Note that in the prior experiment (see Table 1), STP (a semi-transparent but artifact-free CIRO representation) had no statistically significant differences from TP and DF for visual inconsistency. Thus, we conclude that H4 is supported only for IP, which exhibited high visual inconsistency, again due to the visual artifacts (e.g., occasional incorrectly inpainted renderings) and possibly other performance problems, such as perceptible latency and instability, rather than the diminished representation of the CIRO itself. This result is consistent with the prior experiment as well; see Table 1.

H5: CIRO Diminishments May Have Positive Effects on User Experience
With less distraction and less direct intrusion of the CIRO into the augmented content, we hypothesized that subjects would perceive the virtual object as part of the real space (object presence) and as more realistic when the visibility of the CIRO was reduced in the AR scene. H5 is partially supported, for reasons similar to those mentioned previously. In the prior study [7], which assumed ideal removal, the secondary effects on the user experience that we considered showed significant differences in a positive direction (see Table 1: object presence, object implausibility, and object realism).
Since the passerby, i.e., the CIRO, walked over or near the virtual pet, participants could observe incorrect occlusion and thus physically implausible situations. Both are well-known factors that can reduce the presence and realism of virtual objects [11,16]. In this respect, removing the pedestrian from the AR scene also means eliminating the causes of these perceptual problems. Thus, this could be interpreted as participants experiencing fewer flaws in the AR content in the TP condition than in the DF condition. Semi-transparency would also have reduced, though not eliminated, the chances of recognizing such flaws.
Although the performance of our inpainting method (see Table 2) was slightly better than that of the existing method [52], the visual artifacts inherently present in the scene seemed to play a much stronger role in the secondary effects in this experiment. Moreover, as Blau et al. [62] reported the counter-intuitive phenomenon that distortion and perceptual quality can be at odds with each other, we do not yet fully understand how the artifacts of inpainting methods affect users' perception. Thus, further investigation in this regard is needed.

Limitations
The main limitation of our study stems from the fact that the experiment was not conducted in situ but assessed through video recordings (albeit due to the current pandemic) from a fixed viewpoint (no camera movement). Recordings of user-controlled and changing AR views could have been used, but this would have interfered with keeping the test conditions equal, and would have introduced much difficulty into the recording process itself. Camera movement is still regarded as an important factor, as it can affect all the assessment criteria, such as distraction, visual inconsistency, and object presence/realism.
Our segmentation and inpainting algorithm was tuned to detect only human interference with simple assumed movement types, such as passing by and not reappearing. Compromises had to be made to port the diminishing system to run in real time on the handheld platform. With more varied CIRO behaviors, more visual artifacts might occur and again affect the various assessment criteria. However, it still seems clear that the artifacts, rather than the choice of diminished representation, will play the most important role in the overall user experience. Thus, future work must focus on identifying which kinds of artifacts cause the most serious problems and on finding algorithmic approaches to eliminating or reducing them further.

Conclusions
In this paper, we proposed a deep learning-based inpainting method for real-time diminishment on the mobile AR platform, and evaluated its performance both objectively and subjectively. The qualitative and perceptual user study indicated that the visual diminishment had some positive impacts on perceptual issues (e.g., depth distortion and distraction) in handheld AR. An ideally diminished CIRO could also improve the realism and spatial presence of the virtual objects in real space. However, we also found that the inconsistent artifacts introduced by the diminishment or inpainting can negatively affect the user's experience, despite the reasonable objective performance of the diminishment algorithm; much depends on the type, frequency, and degree of noticeability of the artifacts. Future mobile implementations will continue to improve and, at some point, should reach a level that is perceived as no different from the perfect case. Until then, given that our implementation is fairly up to date, with good objective performance compared to the current state of the art running on server-class computers, DR on mobile platforms must consider this issue and provide other ways to compensate for and deal with CIROs. In the future, we plan to continue expanding our work to more practical cases in which users (points of view) can move freely.

Institutional Review Board Statement: This work involved human subjects. However, our study used an anonymous web survey rather than face-to-face interviews. Thus, ethical review and approval were not required for the study on human participants, in accordance with the local legislation and institutional requirements.

Informed Consent Statement:
A consent form was not provided because this experiment was conducted through an anonymous web survey.