2. Background
For the creation and development of AR applications, two crucial issues are the accuracy of user tracking and the registration of 2D/3D content to real-world features. Tracking in the literature can be based on different and complementary approaches. The most common is marker-based tracking, which requires introducing particular artefacts into the physical scene, mainly coded targets such as ArUco markers [4]. Other approaches are markerless and track natural visual features visible in the scene without introducing any foreign element. Markerless strategies usually deploy additional non-visual information to ease the recognition of the scene; for instance, they can take advantage of the GPS position, usually available outdoors, or other specific data, such as radio-frequency beacons [5]. In cultural heritage applications, both the marker-based and the markerless approaches to tracking have found suitable use. Clearly, the use of standardized markers makes the tracking task easier and is not site specific. That is, an AR application relying on markers can be replicated in different scenarios and is expected to work under most of the variable conditions that might be encountered outdoors, such as varying light, shadows cast by trembling elements (such as trees and foliage) and partial occlusions. However, introducing such foreign objects might be impossible or impractical in other scenarios, such as hard-to-reach walls or sculptures. Moreover, markers can visually disturb the view of the physical artwork, negatively impacting the fruition of cultural heritage. Therefore, markerless approaches are to be preferred for their more immediate and possibly more engaging access to the augmented content.
However, user perception and smoothness of use are also key factors and have been analyzed in several studies. For instance, in [6], the authors analyze the feedback of users of a mobile AR experience deployed in several locations and countries through interviews and surveys. In [7], the authors analyze a free interaction metaphor between users and heritage landmarks that allows a landmark to be explored across different periods. Metaio SDK is employed to turn this concept into an Android application. Specific examples, namely the Leaning Tower of Pisa, the Cathedral and the Baptistery, are used to validate the proposed solution and collect the users’ evaluations. It is shown that AR applications may attract visitors and can enrich the visit.
It might be argued that losing the target during tracking, or failing to recognize it, is a significant shortcoming during the fruition of AR content and might jeopardize the user’s acceptance. Therefore, it is necessary to take care of the accuracy and robustness of tracking, since the registration of the 2D/3D content strongly depends on it. For further examples, we refer the reader to a complete survey of AR in cultural heritage [8].
In general, recognizing and tracking 2D targets outdoors is critical, and several attempts to solve the problem, more or less effective, can be found in the literature. In [9], a workflow is presented to address outdoor tracking issues. Starting from assumptions similar to ours (a markerless multi-image approach in a real environment and the difficulty of robust outdoor tracking), the research proposes a method to overcome some of these problems using Vuforia’s image recognition approach. The researchers analyzed the lighting dynamics of the site (Parliament Buildings National Historic Site of Canada, in Ottawa) and were thus able to prepare a set of recognition images for each time of day across the seasons. In the system they devised, the user had to stand in the same spot where the target images had been taken, so the experience relied on fixed viewing locations.
Other researchers [10], intending to recognize a series of real-world locations under different illumination conditions, developed an image recognition method based on natural features and a user location system, with part of the data processed on a server. Using location information to discard irrelevant data was critical to the performance of their system. To do so, they quantized the user’s location and considered only data from nearby location cells. They also developed a method for incrementally updating the local feature database on the handset when the user changes location.
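The location-cell idea described above can be illustrated with a minimal Python sketch. It is not the implementation of [10]: the cell size, the square neighbourhood, the dictionary-based database and all function names are assumptions made only for illustration.

```python
import math

CELL_SIZE_DEG = 0.001  # hypothetical cell size in degrees (an assumption)

def cell_of(lat, lon, cell_size=CELL_SIZE_DEG):
    """Quantize a latitude/longitude pair into integer grid-cell indices."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

def nearby_cells(cell, radius=1):
    """Return the cell and its neighbours within `radius` cells (square block)."""
    cx, cy = cell
    return {(cx + dx, cy + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)}

def candidate_features(database, lat, lon, radius=1):
    """Collect only the features stored in cells near the user's position."""
    selected = []
    for cell in sorted(nearby_cells(cell_of(lat, lon), radius)):
        selected.extend(database.get(cell, []))
    return selected

def update_local_cache(cache, database, old_cell, new_cell, radius=1):
    """Incremental update: drop cells that are no longer nearby and
    fetch only the cells that have just become nearby."""
    old = nearby_cells(old_cell, radius) if old_cell is not None else set()
    new = nearby_cells(new_cell, radius)
    for cell in old - new:
        cache.pop(cell, None)
    for cell in new - old:
        if cell in database:
            cache[cell] = database[cell]
    return cache
```

Restricting matching to nearby cells keeps the candidate set small, and the incremental update avoids reloading features for cells that remain in range when the user moves.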
Another approach adopts a system called Indirect AR [11], which replaces the live camera feed with a previously captured panoramic image. One of the most significant benefits of this approach over traditional AR is the greater registration accuracy. In traditional AR, any registration error is directly visible between the physical object and the virtual annotation. In Indirect AR, the same registration error is only visible between the device and its surroundings. That means that the registration between virtual annotations and the panorama representing the real world is always perfect, even if the registration between the device and the real world is not. One of the major drawbacks of this implementation of Indirect AR is the dependence on pre-captured panoramas. For an ideal experience, a panorama captured exactly where the user is standing would be needed, so implementing Indirect AR would typically mean having panoramas everywhere. Since this is not possible, the authors approximated it by using panoramas collected by Navteq while travelling most roads. There are two possible problems with this, exemplified by two questions: would users be able to look at the panorama and find the nearby point of interest in their view of the real world? Is the experience really similar to standard AR? Although the Indirect AR solution presents an interesting alternative approach to solving outdoor image tracking problems, it currently remains a solution that does not allow the level of immersion that characterizes a standard AR experience.
Some authors presented a study [12] that seeks to improve the robustness of outdoor AR applications by mitigating the effect of light sensitivity on marker-based AR. By default, the proposed approach allows a ‘standard’ marker-based AR framework to detect recorded physical objects and accurately overlay AR content on them. In case marker tracking fails due to inadequate illumination conditions, they propose the complementary use of a field-of-view (FoV) estimation technique (typically used in sensor-based AR applications). The FoV estimation algorithm detects whether the physical object is actually within the user’s FoV and then attempts to project the AR content as accurately as possible. The authors “hybridized” the application by incorporating a sensor-based AR capability; the latter was enhanced by the implementation of geolocation-based raycasting to accurately detect when the user is actually in the line of sight with the sides of the registered polygon.
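A sensor-based FoV check of the kind mentioned above can be sketched as follows. This is only a simplified illustration, not the algorithm of [12]: it treats the object as a single geographic point, uses only the horizontal FoV, and omits the raycasting against polygon sides. The function names and the 60° default FoV are assumptions.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0

def in_field_of_view(user_lat, user_lon, heading_deg,
                     obj_lat, obj_lon, fov_deg=60.0):
    """Return True if the object lies within the horizontal FoV centred on
    the device heading (compass degrees, 0 = north, 90 = east)."""
    target = bearing_deg(user_lat, user_lon, obj_lat, obj_lon)
    # Signed angular difference wrapped into [-180, 180)
    diff = (target - heading_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0
```

In such a scheme, the GPS position and compass heading stand in for visual tracking when illumination prevents marker detection, at the cost of the lower accuracy of those sensors.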
The above considerations show that it is worthwhile to carry out research towards the realization of more robust and accurate tracking systems for markerless AR applications. Indeed, the impact on the fruition of AR content in cultural heritage is significant and there is a growing demand and general interest. The following section introduces the proposed methods, the results of which are reported in Section 4. Section 5 discusses the results, while Section 6 concludes the paper with directions for future work.
3. Materials and Methods
It was necessary to identify a development system capable of efficiently managing 2D and 3D animations, with broad support for augmented reality technology on Android and iOS mobile devices. The choice, therefore, fell on the Unity 3D Engine with the AR Foundation framework, which wraps the features of ARKit (iOS) and ARCore (Android). This framework allows one to maintain a single C# codebase for both platforms. Target image recognition and tracking were initially implemented using only one image for each mosaic. To position the 3D content in front of the objects, a custom algorithm was devised that disconnects the animation from the recognized image: in this way, the system does not continually try to correct the positioning, a method that often introduces visible jitter. For this purpose, two augmented reality features were used, namely plane detection and anchors: the user initially chooses a plane from the AR scene so that each subsequent positioning refers to it. Once a target has been recognized, the system places an “anchor” in the scene, to which the animation hooks. To remove the 3D models from the scene, a function was implemented that constantly checks whether the model is still within the camera view. The method is formalized in Algorithm 1.
Algorithm 1 Description of the image recognition and insertion of the AR experience.
while the augmented reality session is live:
    if an image is recognized:
        - wait for 60 frames
        if the image was tracked for 60 frames:
            - instantiate a new GameObject with an anchor component
            - find the digital content for the recognized mosaic
            - add the content as a child of the anchor
The algorithm initially waits for the user to frame one of the mosaics. When the software finds a correlation with a target image present within the recognition library, it waits for a certain number of frames (experimentally fixed at 60) before beginning the instantiation of the animation. If the image is still tracked after this period, the code builds an “anchor” object in the scene to be used as a parent for the animation. In this way, the animations have stable and consistent positioning with the framed environment.
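The control flow of Algorithm 1 can be simulated outside Unity with a small Python sketch. It is not the authors' C# code: the class name, the dictionary standing in for the anchor GameObject, and the choice to reset the counter when tracking is lost are assumptions. Only the 60-frame threshold comes from the text.

```python
STABLE_FRAMES = 60  # threshold reported in the text

class AnchorPlacer:
    """Places content on an anchor once an image has been tracked long enough."""

    def __init__(self, content_library, stable_frames=STABLE_FRAMES):
        self.content_library = content_library  # image name -> digital content
        self.stable_frames = stable_frames
        self.tracked_frames = {}  # image name -> consecutive tracked frames
        self.anchors = {}         # image name -> placed anchor (a plain dict here)

    def on_frame(self, tracked_images):
        """Process one frame; `tracked_images` is the set of names tracked now.
        Returns the names for which an anchor was created on this frame."""
        # Assumption: losing tracking resets the frame counter.
        for name in list(self.tracked_frames):
            if name not in tracked_images:
                del self.tracked_frames[name]
        placed = []
        for name in sorted(tracked_images):
            if name in self.anchors:
                continue  # content already anchored; do not reposition it
            count = self.tracked_frames.get(name, 0) + 1
            self.tracked_frames[name] = count
            if count >= self.stable_frames:
                # Stand-in for: new GameObject with anchor + content as child
                self.anchors[name] = {
                    "anchor_for": name,
                    "content": self.content_library.get(name),
                }
                placed.append(name)
        return placed
```

Because the content is parented to the anchor only once, later pose corrections of the tracked image do not move the animation, which is the behaviour the text credits with avoiding jitter.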
Tracking using the AR Foundation framework proved to be robust during indoor testing with scaled reproduction of three mosaics. However, in the actual outdoor situation, the application often failed to recognize the mosaic and then instantiate the corresponding animation. This issue is caused mainly by the variability in outdoor lighting conditions. Apart from variations due to weather conditions and from different times of the day, the most severe problems depend on shadows cast by plants around monuments, which irregularly invalidate the target images, rendering them unusable.
As explained in Section 2, in general, the recognition and tracking of 2D targets outdoors are always very critical and several attempts, more or less effective, to solve the problem have been investigated.
Some frameworks, such as Vuforia or Wikitude, provide an effective solution by leveraging 3D object and model recognition, thus, extending the ability to track the framed environment. These libraries, therefore, base their operation on the recognition of 3D targets. There are many cases where this type of approach cannot be used: for example, in our case in Pinocchio Park, the targets consist of mosaics, which are basically 2D shapes.
Experiments have shown that using multiple target images, taken under different light and shadow conditions during the day, slightly improves the recognition ability. However, the reliability is still not sufficient for a publicly available application. Further, to have adequate robustness, it is necessary to acquire many images to cover all possible variations, which is impractical and inefficient. To prevent a light variation limited to even a tiny part of the target image from completely invalidating recognition, we chose to divide each target photo into smaller blocks. This results in higher overall reliability; in fact, the probability that at least one block is recognized is much higher (Figure 1).
Our approach can be described in pseudo-code as in Algorithm 2.
Algorithm 2 Steps involved in the creation of content to be attached to each block.
- Find the center of the object using the real measures
For each block:
    - Calculate the size of the block using texture size in pixels
    - Calculate the real size in meters for the blocks:
        - calculate the real distance from the center of the artwork
        - create a new texture with all data embedded in its symbolic name
        - create a new Unity prefab with its transform shifted to be positioned at the centre of the real object, once instantiated
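The geometry behind Algorithm 2 can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the CropTool code: the function name, the naming scheme for blocks, and the axis convention (x to the right, y upwards, texture origin at the top-left) are hypothetical.

```python
def block_offsets(width_px, height_px, width_m, height_m, rows, cols,
                  name="mosaic"):
    """For each block of a rows x cols grid, return its real size in meters
    and the shift from the block centre to the artwork centre."""
    m_per_px_x = width_m / width_px    # horizontal scale, meters per pixel
    m_per_px_y = height_m / height_px  # vertical scale, meters per pixel
    block_w_px = width_px / cols
    block_h_px = height_px / rows
    blocks = []
    for r in range(rows):
        for c in range(cols):
            # Block centre in pixels, origin at the top-left of the texture
            cx_px = (c + 0.5) * block_w_px
            cy_px = (r + 0.5) * block_h_px
            # Shift that moves content from the block centre to the artwork
            # centre, in meters (image rows grow downwards, y grows upwards)
            dx = (width_px / 2 - cx_px) * m_per_px_x
            dy = (cy_px - height_px / 2) * m_per_px_y
            blocks.append({
                "name": f"{name}_r{r}_c{c}",  # hypothetical symbolic name
                "size_m": (block_w_px * m_per_px_x, block_h_px * m_per_px_y),
                "offset_to_center_m": (dx, dy),
            })
    return blocks

if __name__ == "__main__":
    for block in block_offsets(1000, 500, 2.0, 1.0, rows=2, cols=2):
        print(block)
```

Applying each block's offset to the content transform means that, whichever block is recognized, the content ends up centred on the whole artwork.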
We consider that p is the probability of a block recognition and q = (1 − p) the probability of a failure. We assume that p remains unaltered as long as the block’s dimension is still large enough. Assuming that recognitions are independent events and following the Bernoulli law, the probability of recognizing exactly k blocks out of n is expressed in Equation (1).
$$P(k) = \binom{n}{k}\, p^{k} q^{\,n-k} \qquad (1)$$
Therefore, the probability that at least one block is recognized can be expressed as in (2).
$$P(k \geq 1) = 1 - P(0) = 1 - q^{\,n} \qquad (2)$$
For example, if p = 0.2 and we divide the target into 2 × 2 blocks (that is, 4 blocks), Equation (2) gives 0.59, which is nearly three times the original value. However, it is not easy to extend and confirm such a law, because when the single block becomes too small, other intrinsic recognition problems arise, mostly related to the device’s insufficient resolution or optical quality. The optimal subdivision must therefore be found by experimentation. To facilitate this trial-and-error process, we developed a custom tool for the Unity Editor that automates the image division task. This tool also solves the issue of positioning the content in the scene, since the position retrieved for a recognized block is no longer that of the entire object and is different for each block. The tool, called CropTool, divides a texture into blocks using an arbitrary number of rows and columns and, for each block (using the measures in meters of the entire artwork), creates an empty Unity GameObject with its transform shifted so as to place the digital content at the center of the mosaic, as if it had been recognized in the single-image solution. The CropTool is freely available at this address: https://github.com/SolidGorbash/Tool-for-improved-open-air-AR-Image-Recognition (Accessed on 10 July 2022).
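The probability argument behind the block subdivision can be checked numerically with a short Python sketch. It assumes, as the text does, independent recognitions with the same per-block probability p, so that the probability of at least one recognition among n blocks is 1 − (1 − p)^n. The function names are illustrative.

```python
from math import comb

def prob_exactly(k, n, p):
    """Binomial probability of recognizing exactly k of n independent blocks."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least_one(n, p):
    """Probability that at least one of n independent blocks is recognized."""
    return 1 - (1 - p)**n

if __name__ == "__main__":
    p = 0.2  # per-block recognition probability used in the text's example
    for rows, cols in [(1, 1), (2, 2), (3, 3)]:
        n = rows * cols
        print(f"{rows} x {cols} grid: {prob_at_least_one(n, p):.4f}")
```

With p = 0.2, the 2 × 2 grid gives about 0.59, matching the example in the text, and a 3 × 3 grid would give about 0.87 if p stayed constant. In practice p is expected to drop as blocks shrink, which is why the subdivision has to be tuned experimentally.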