Usability Measures in Mobile-Based Augmented Reality Learning Applications: A Systematic Review

Abstract: The implementation of usability in mobile augmented reality (MAR) learning applications has been utilized in a myriad of standards, methodologies, and techniques. The usage and combination of techniques within research approaches are important in determining the quality of usability data collection. The purpose of this study is to identify, study, and analyze existing usability metrics, methods, techniques, and areas in MAR learning. This study adapts systematic literature review techniques by utilizing research questions and Boolean search strings to identify prospective studies from six established databases that are related to the research context area. Seventy-two articles, consisting of 45 journal articles, 25 conference proceedings, and two book chapters, were selected through a systematic process. All articles underwent a rigorous selection protocol to ensure content quality according to the formulated research questions. Following synthesis and analysis, this article discusses significant factors in usability-based MAR learning applications. This paper presents five identified gaps in the domain of study, modes of contributions, issues within usability metrics, technique approaches, and hybrid technique combinations. This paper concludes with five recommendations based on identified gaps concealing potential of usability-based MAR learning research domains, varieties of unexplored research types


Introduction
A study done by Santos et al. in 2014 showed that within 43 studied augmented reality learning environment (ARLE) systems, usability evaluation focuses on improving ease of use, satisfaction, immersion, motivation, and performance [1]. Albert and Tullis in [2] have coined "self-reported" and "performance" as the two types of metrics in usability. Most of the surveyed research works were conducted using self-reported metrics rather than performance metrics in this field of study.

Research Questions
The main objectives of this SLR were to study, comprehend, and summarize the evidence of current usability metrics, methods, and techniques applied in the domain of mobile augmented reality. This study also aspired to critically identify possible research gaps and areas for future opportunities in this research domain to not only implement but to possibly expand the performance of current metrics, methods, and techniques in MAR usability studies. In order to achieve these objectives, this research formulated 4 research questions (RQ) relating to the aim of this study (Table 1).
(Usability OR "User Experience" OR UX) AND (Mobile OR Handheld) AND ("Augmented Reality" OR AR) AND (Learn* OR Educat* OR Train*)
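
As an illustration only, the Boolean string above can be assembled from its four term groups, which makes it easier to adapt to the query syntax of each database. The sketch below is not part of the original study; the names `TERM_GROUPS`, `quote_if_phrase`, and `build_query` are hypothetical.

```python
# Term groups copied from the search string above. The structure and all
# function names are illustrative assumptions, not the authors' tooling.
TERM_GROUPS = [
    ["Usability", "User Experience", "UX"],
    ["Mobile", "Handheld"],
    ["Augmented Reality", "AR"],
    ["Learn*", "Educat*", "Train*"],
]


def quote_if_phrase(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(groups):
    """Join terms within a group with OR, and the groups themselves with AND."""
    clauses = []
    for group in groups:
        clauses.append("(" + " OR ".join(quote_if_phrase(t) for t in group) + ")")
    return " AND ".join(clauses)


query = build_query(TERM_GROUPS)
print(query)
```

Keeping the groups as data means a synonym can be added to one group without retyping the whole string; individual databases may still require their own wildcard or quoting syntax.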

Manual Search
The manual search process for this SLR was conducted after the identification of each potential paper through the automated search. It was applied only to the shortlisted potential papers identified through the automated search process. The manual search employed the snowballing procedure, incorporating forward and backward snowballing [9,10]. This strategy was conducted to extend the search process through the references cited by, and the papers citing, the pool of potential papers. Most papers identified through these references were also located through a Google Scholar search, as recommended by [9].
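
The snowballing procedure described above can be sketched as a traversal of a citation graph. This is a minimal illustration under stated assumptions, not the authors' procedure: the paper identifiers, the `references` and `cited_by` mappings, the `is_relevant` screening function, and the choice to repeat the process for several rounds are all hypothetical.

```python
def snowball(seeds, references, cited_by, is_relevant, max_rounds=3):
    """Expand a pool of papers by backward and forward snowballing.

    references[p] lists papers that p cites (backward snowballing);
    cited_by[p] lists papers that cite p (forward snowballing).
    """
    pool = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        candidates = set()
        for paper in frontier:
            candidates.update(references.get(paper, ()))  # backward step
            candidates.update(cited_by.get(paper, ()))    # forward step
        # Keep only unseen papers that pass the (assumed) screening check.
        new_papers = {p for p in candidates if p not in pool and is_relevant(p)}
        if not new_papers:
            break
        pool |= new_papers
        frontier = new_papers
    return pool


# Toy citation graph with made-up paper identifiers.
references = {"A": ["B", "C"], "B": ["D"]}
cited_by = {"A": ["E"], "C": ["F"]}
relevant = {"B", "C", "D", "E"}

result = snowball(["A"], references, cited_by, lambda p: p in relevant)
print(sorted(result))
```

In this toy graph, paper F is reached through forward snowballing but dropped because it fails the screening check.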

Literature Resources
This SLR utilized searches in 6 major databases for data extraction based on the search strings formulated in Section 2.2.1. While Google Scholar was used as a triangulation mechanism in these searches, most findings in Google Scholar eventually redirected to the 5 other major databases that are significant to computer science, software engineering, and information technology studies from 2009 to 2018.

Search Process
The search process consisted of first executing an automated search followed by the execution of a manual search process. The search was first launched in the six databases, and the papers were first collected through brief comprehension of each paper's title, abstract, and keywords. These potential papers were then grouped to undergo a study selection based on pre-formulated search criteria. This was then followed by an arduous analysis of the related references using the snowballing procedure, and any related papers that were marked as potential papers were added to the group of papers collected earlier through the automated search process.

Study Selection
The initial collection of prospective papers via the automated process generated a pool of 1324 papers, followed by an additional 116 papers collected through the subsequent manual search process. Through an analysis of titles, abstracts, and keywords, duplicated papers collected from different databases were deleted from the pool. The remaining papers were then classified using pre-formulated inclusion and exclusion criteria, as presented in Table 2. After this screening, the papers underwent comprehensive full-paper reading.

Study Scrutiny
From a total of 1324 papers collected through the automated search and 116 papers from the manual search, an analysis through soft reading applied all 6 inclusion and exclusion criteria. Besides re-confirming that the content of prospective papers addressed the 4 RQs, the implementation of these 6 criteria narrowed down the number of prospective papers to a total of 208 papers, excluding papers that did not abide by one or more of the 6 criteria. All selected papers were also examined for credibility by confirming that they originated from credible publication sources such as peer-reviewed journals, conference proceedings, book chapters, and articles. From there, the 208 papers underwent comprehensive reading and were coded through an additional quality assessment, which was built on 5 important questions rating the suitability of these papers for this study.
The additional quality assessment questions (QAs) were implemented by answering 5 questions gauging the content of these papers (refer to Table 3). This coding method was adapted from [11], where the authors provided a credible technique for assessing the suitability of each paper for this SLR. For each QA, there were only 3 possible answers, according to whether the studied paper answered the QA completely ("Yes"), partly ("Partly"), or not at all ("No"). Each answer was given a coded point: "Yes" = 1, "Partly" = 0.5, or "No" = 0. As per the practice of [7,11], the QAs were carried out meticulously, and the results of the assessments were discussed by the authors prior to approval. From this process, a total of 72 papers scored 2.5 or more (50% or more), and these were selected for inclusion in this study.

QA No.  Quality Assessment Questions
1  Does the paper clearly describe the method/methods of usability used?
2  Does the paper highlight the usability evaluation process clearly?
3  Does the paper clearly present the contribution of study?
4  Does the paper clearly present the metrics used relating to types of subject study (between-subjects, within-subjects, or both)?
5  Does the paper add value to contributions towards academia, industry or community?
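
The scoring rule described above ("Yes" = 1, "Partly" = 0.5, "No" = 0, with a paper retained when it scores 2.5 or more out of 5) can be expressed as a short sketch. The function names and the example answers are hypothetical and serve only to illustrate the arithmetic.

```python
# Point values and threshold as stated in the text; names are illustrative.
POINTS = {"Yes": 1.0, "Partly": 0.5, "No": 0.0}
NUM_QUESTIONS = 5
THRESHOLD = 2.5  # 50% of the maximum score of 5


def qa_score(answers):
    """Sum the coded points for one paper's answers to the five QAs."""
    if len(answers) != NUM_QUESTIONS:
        raise ValueError(f"expected {NUM_QUESTIONS} answers, got {len(answers)}")
    return sum(POINTS[answer] for answer in answers)


def is_selected(answers):
    """A paper is retained when its score reaches the threshold."""
    return qa_score(answers) >= THRESHOLD


# Hypothetical papers: one exactly on the threshold, one just below it.
borderline = ["Yes", "Partly", "Yes", "No", "No"]
rejected = ["Partly", "Partly", "Partly", "Partly", "No"]

print(qa_score(borderline), is_selected(borderline))
print(qa_score(rejected), is_selected(rejected))
```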

Data Synthesis
The main objective of data synthesis was to present the evidence from the 72 selected studies that could assist in addressing the research questions formulated in Table 1. This process consisted of data identification, synchronization, and analysis, and it delivered information that can clearly answer the research questions. The data collection focused on extracting areas of usability application in MAR, research types, presented contributions, usability methodologies, and techniques used. Data harvested for RQ1, RQ2, and RQ3 were organized so that visualization tools such as tables, bar charts, and pie charts could be used to present the findings. Each visualization segment was paired with concluding statements to assist readers' comprehension. RQ1, RQ2, and RQ3 also presented the respective classification of research types, presented contributions, usability methodologies, and techniques used. As for RQ4, two-dimensional and three-dimensional mappings were engineered to demonstrate correlational values between parameters in order to identify specific gaps.

Threats to Validity
A study by [7] highlighted four major threats to validity (TTV) in SLRs, and this research acknowledges these shortcomings by strategizing to minimize the risks of these TTVs. The four threats to validity were addressed through four separate strategies:

• Construct validity was confirmed through the implementation of an automated and manual (snowballing) search from the very beginning of data collection, aimed at mitigating calculated risks. In order to further restrain this TTV, major steps of scrutiny plus additional QAs were carried out, complementing the existing RQs and clear selection criteria.

• Internal validity was addressed by adopting a method used by [7]. In order to eliminate biases in paper selection through an exhaustive search, a combination of automated search and snowballing was carried out for a more inclusive selection approach. Every extracted study underwent strict selection protocols after being extracted from all major databases in similar research areas [1,7,11].

• External validity was supported by improving the generalizability of results through a 10-year timeframe of MAR studies with a usability evaluation. The incremental collection of papers by year was parallel to the number of available papers by year, which can be an indicator that this SLR is able to maintain a generalized report aligned with the research's external validity requirements.

• Conclusion validity was managed by implementing the SLR methods and techniques used in this study following the established, specific, and well-defined guidelines explored by scholars from credible publications such as [8]. It is therefore possible for each step in this SLR to be replicated with measurable and near-identical outcomes.

Detailed Information of Selected Studies
In this research, a total of 72 papers from six major databases (Table 4) were selected and categorized into three clusters: 45 papers (62%) came from journal publications, 25 (35%) from conference proceedings, and 2 (3%) from book chapters. From the six databases, 27 papers were collected from IEEEXplore, 24 from ScienceDirect, 12 from Web of Science, 8 from Springerlink, and 1 from ACM Digital Library. While most papers were found through Google Scholar, these papers were redirected to and extracted from their respective source databases. Most papers were published in 2017 (16 papers), followed by 2018 (12 papers), 2014 (12 papers), 2015 (9 papers), 2016 (8 papers), 2013 (8 papers), 2012 (4 papers), 2011 (1 paper), 2010 (1 paper), and 2019 (1 paper). Since the paper selection process was carried out in the middle of 2018 (Figure 2), the deadline for the paper search was set as August 2018 so that the other SLR processes could be carried out according to the planned timeline.
The details of each paper are given in Table 5 based on publication type, publication name, year, and quartile (journal rankings when applicable). In order to support the risk reduction of the aforementioned TTVs, one of the strategies was to incorporate more papers with established quality based on measures like impact factors and journal rankings (Quartile 1 to Quartile 4). Two journal ranking tools were utilized. First, the journal details were extracted from the Thomson Reuters Master Journal List-Clarivate Analytics [84], and they were later triangulated with SCImago Institutions rankings [85] for further information accuracy. The list of journals with impact factors is shown in Table 5 with the values for each journal per related year, followed by the quartile rankings.
The number of citations was extracted both from Google Scholar (which generally displays a higher number of citations) and from each paper's respective database. The study by Liu [12], published in the Journal of Computer Assisted Learning (Q1, Impact Factor: 1.313), appears to have the highest citation count in Google Scholar (Google Scholar: 196, Publisher: 75), while the study by Olsson et al. [16], published in Personal and Ubiquitous Computing (Q2, Impact Factor: 0.938), appears to have the highest citation count in the publisher's database (SpringerLink) (Google Scholar: 181, Publisher: 87). As for proceedings, the study by Liu et al. [57] appears to have the highest number of citations in both Google Scholar and SpringerLink (Google Scholar: 77, Publisher: 19) among all collected proceedings in this research.
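
The breakdowns reported above can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the counts stated in this section; the dictionary names are illustrative, and the shares are printed unrounded so that no rounding convention is assumed.

```python
# Counts as reported in the text above; variable names are illustrative.
TOTAL_PAPERS = 72

by_type = {"journal": 45, "proceedings": 25, "book chapter": 2}
by_database = {
    "IEEEXplore": 27,
    "ScienceDirect": 24,
    "Web of Science": 12,
    "Springerlink": 8,
    "ACM Digital Library": 1,
}
by_year = {
    2010: 1, 2011: 1, 2012: 4, 2013: 8, 2014: 12,
    2015: 9, 2016: 8, 2017: 16, 2018: 12, 2019: 1,
}

# Each breakdown should account for every selected paper exactly once.
for name, counts in [("type", by_type), ("database", by_database), ("year", by_year)]:
    print(f"by {name}: {sum(counts.values())} of {TOTAL_PAPERS}")

# Percentage share of each publication type.
shares = {kind: 100 * count / TOTAL_PAPERS for kind, count in by_type.items()}
for kind, share in shares.items():
    print(f"{kind}: {share:.1f}%")
```

All three breakdowns sum to the 72 selected papers, and the unrounded shares are consistent with the 62%, 35%, and 3% quoted in the text.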

Domains, Research Types, and Contributions in Mobile Augmented Reality Based Usability Studies (RQ1)
In order to cater to RQ1, this section aimed to collect data and find out common domains, research types, and contributions within research works involving both MAR learning and usability studies.

Research Types
There were six research types identified during the process of data synthesis on all 72 collected papers in answering the second part of RQ1. The categorization of research type for each paper was based on the combined understanding of the experienced authors involved, referring to the definition of each research type discussed below:
(1) Exploratory: Exploratory research is often conducted in new areas of inquiry, where the goals of research are: (1) to scope out the magnitude or extent of a particular phenomenon, problem, or behavior; (2) to generate some initial ideas (or "hunches") about that phenomenon; or (3) to test the feasibility of undertaking a more extensive study regarding that phenomenon. In the preliminary phases of research, when a research problem is unclear and the researcher wants to scope out the nature and extent of a certain research problem, a focus group (for an individual unit of analysis) or a case study (for an organizational unit of analysis) is an ideal strategy for exploratory research [86]. According to [87], exploratory assessments generally include thinking aloud, cognitive walkthroughs, and other techniques.
(2) Empirical: Empiricism refers to making observations to obtain knowledge. The term empirical research refers to making planned observations. By following careful plans for making observations, we engage in a systematic and thoughtful process that deserves to be called research. This process includes 7 phases, namely: (1) observing; (2) selecting promising variables; (3) deciding whom to observe; (4) deciding how to observe; (5) deciding when to observe; (6) deciding how to analyze data; and (7) interpreting data to make decisions [88]. By prefixing research with empirical observations, some powerful new ideas are added. According to one definition, empirical research originates in or is based on observation or experience. Another definition holds that empirical means relying on experience or observation alone, often without due regard for system and theory [89].
(3) Comparative: The meaning of "comparative research" is restricted in accordance with what is commonly understood as comparative education in cross-national cases. It refers to "research in social units of given political level, regardless of homogeneity, similarity or difference in their cultures, although it is commonly assumed that nations always differ culturally to some degree" [90]. Traditional understandings "are compared with respect to the same concepts" [90]. In causal-comparative research, which is also called case-control research, one typically compares a group to one or more different groups, or one compares the same group at different times and does not manipulate a variable. Shadish, Cook, and Campbell argued in [91] that causal-comparative research is useful in situations in which an effect is known but the cause is unknown. For example, causal-comparative research might be used to determine what caused students to drop out of an educational program by determining how those who dropped out and those who stayed in the course differed. The key issue when running experiments is the comparison of performance between conditions: Does one condition produce better or worse performance than another? To determine "performance with condition," human participants need to perform tasks associated with the Human-Computer Interaction (HCI) idea being investigated, and measurements of the overall performance for each condition are taken [92].
(4) Experimental: The definition of conditions, tasks, and experimental objects is the initial focus of the experiment design and must be carefully related to the research question. The experiment itself could be described simply as presenting stimuli to human participants and asking them to perform intended tasks. There are, however, many other decisions to be made about the experimental process, as well as additional supporting materials and processes to be considered [92]. In terms of experiments, scientific research may be broadly classified into two categories with slight overlap: theoretical research and experimental research. No theory is valid until it passes one or more crucial tests of the experiment. The need for definitions in experimental research emanates from the fact that experimental research in a given domain of nature is spread out widely over space and time [93].
(5) Quasi-Experimental: Pre-experimental and quasi-experimental research designs are often used to evaluate the effects of social programs, psychotherapy or some other form of psychosocial intervention, or the results of public policy. They are also widely used in medicine to evaluate the effects of medications. Traditionally, research designs used in outcome studies have been broadly categorized into three types. Those which involve the analysis of a single group of clients have traditionally been called pre-experimental designs. Those that involve comparing the outcomes of one group receiving a treatment that is the focus of evaluation to one or more groups of clients who receive either nothing, an alternative real treatment, or a placebo-type treatment have been called quasi-experimental designs. The third type, true experiments, are characterized by creating different groups (those receiving "real" treatment vs. those receiving nothing, alternative treatment, or placebo) by randomly assigning clients (or another unit of analysis) to those various treatment conditions. Some questions that quasi-experimental designs can answer are: (1) What is the status of clients after they have received a given course of treatment? (2) Do clients improve after receiving a given course of treatment? (3) What is the status of clients who have received a given treatment compared to those who did not receive that treatment? (4) What is the status of clients who have received a novel treatment compared to those who received a credible placebo treatment? [94]. A quasi-experimental study might find that clients are worse following therapy, but, absent proper comparison groups, the researcher might not know that the treated clients were actually better off than if they had not received treatment [94]. In short, compared to experimental design, quasi-experimental design manipulates evaluation variables but rarely has randomization in sample group (control or experimental) assignments [95]. Quasi-experimental designs are therefore more prone to bias than experimental designs but serve a purpose when used as a stepping stone to establish the rationale of a research study before subsequently leading to a conventional experimental design [95].
(6) Heuristic: According to Moustakas in [96], heuristic inquiry is a process that begins with a question or problem which the researcher seeks to illuminate or answer. The question is one that has been a personal challenge and puzzlement in the search to understand one's self and the world in which one lives. The heuristic process is autobiographic, yet, as with virtually every question that matters personally, there is also a social and perhaps universal significance. Heuristics is a way of engaging in scientific search through methods and processes aimed at discovery: a way of self-inquiry and dialogue with others aimed at finding the underlying meanings of important human experiences. The deepest currents of meaning and knowledge take place within the individual through one's senses, perceptions, beliefs, and judgements [96]. Heuristic research started out more like an informal process of assessing and meaning-making than as a research approach. Clark Moustakas, the originator of heuristic inquiry, stated that the approach came to him as he searched for a proper word to meaningfully represent certain processes he felt were foundational to explorations of everyday human experience (1990) [96]. The methodology itself was introduced in a more formalized manner to the world of research methods with the publication of Moustakas's book, in which he depicted his experience of that phenomenon as he dwelled with a decision tied to his daughter's need for heart surgery [97]. Moustakas used his personal knowledge of and relationship with loneliness as a foundation for exploring the phenomenon to others [97]. Moustakas described heuristic enquiry as a qualitative, social constructivist, and phenomenologically aligned research model [97]. In the context of social science and educational research, a heuristic enquiry has also been identified as an autobiographical approach to qualitative research. Other descriptors and characterizations of heuristic inquiry that are not highly elaborated in the professional literature include the following: a research process that studies living experience (interrelated, interconnected, and continuing experience) rather than lived experience [97]. The word "heuristic" originates from a Greek word that means discover and explore in a wider sense [98]. Heuristics are also known as approximate techniques [98]. The main goal in a heuristic search is to construct a model that can be easily understood and that provides good solutions in a reasonable amount of computing time [98]. Such techniques consist of a combination of scientific problems such as mathematical logic, statistics, and computing, as well as human factors such as experience and, in many cases, a good insight into the problem that needs to be addressed [98].
Heuristic design has a different perspective of research definition from the other research designs. Exploratory designs are employed when research problems are unclear in terms of scope and magnitude. Empirical research, on the other hand, uses systematically planned observation to gain knowledge. Comparative design adopts observation of two or more competitive evaluations to derive research conclusions. Unlike the aforementioned three, experimental and quasi-experimental designs are preferred when there are clear definitions and focused conditions, tasks, and experimental objects. Heuristic research, however, differs from the others due to its nature of self-realization and research discoveries, which take place incrementally and without systematic or clearly focused research problems.
In answering part of RQ1, a majority of 43 authors conducted the exploratory method, followed by 37 performing the comparative method, 7 performing the empirical method, 6 carrying out the experimental approach, 3 carrying out the quasi-experimental approach, and 2 following heuristic research guidelines. Among all collected papers, 48 authors carried out only one type of research methodology, 22 carried out a combination of two research methods, and 2 carried out a combination of three research approaches. Table 7 shows a complete breakdown of the research type details parameterized by number of combinations (Comb.), research types (Type), and references (Refs). Both exploratory and comparative research methods were employed by most scholars conducting research in MAR-based usability studies.
Next, this research also correlated research types with the publication types, including the quartile (Q) journals, non-indexed journals (NI), proceedings (P), and book chapters (BC). It can be seen in Table 7 that exploratory research was published most in Q1 journals (9), followed by Q2 journals (3), Q3 journals (1), and proceedings (12). Exploratory research appears to have been published in the highest number of high impact journals. In addition, exploratory research adapted in combination with other research types also produced the highest number of high impact factor papers. There were 3 Q1 publications when combined with empirical research, 4 Q1 and 1 Q3 publications when combined with comparative research, and finally 3 Q1 and 1 Q2 publications when combined with experimental research. The second highest research type in high impact factor publications was comparative (Q1 = 5, Q2 = 4, Q3 = 2, NI = 1, P = 8, and BC = 1). Comparative research was also the second highest when combined with other research types. In summary, Table 7 shows that most high impact journal publications adopted exploratory and comparative approaches.
Relating back to the summary of citations for each paper mentioned earlier in this paper, among all published journals, the study by Liu [12] with the highest number of citations in Google Scholar (Google Scholar: 196, Publisher: 75) adopted a combination of comparative and experimental research approaches. On the other hand, the study by Olsson et al. [16] with the highest number of citations in the publisher's database (SpringerLink) (Google Scholar: 181, Publisher: 87) appears to have adopted the exploratory research approach. The works in [12] and [16] were published in Q1 and Q2 journals, respectively. As for proceedings and book chapters, the study by Liu et al. [57] with the highest number of citations in both Google Scholar and the publisher's database (SpringerLink) (Google Scholar: 77, Publisher: 19) also adopted the exploratory research approach.

Research Contributions
Referring to RQ1, it is important to highlight the types of contribution and novelty each research paper offers. From the 72 collected papers, this research work has categorized all contribution types into five different categories defined as follows:
(1) Tool: This type of contribution focuses on producing MAR-based software tools, including systems, applications, learning packages, authoring tools, simulation tools, and prototypes, all of which can be integrated with other frameworks.
(2) Method: This type of contribution concentrates on procedures and systematic processes supplementing MAR and usability research. Methods categorized here refer to learning methodologies, pedagogies, usability methodologies, and algorithms promoted with the use of MAR.
(3) Model: This type of contribution investigates the relationships, comparisons of proposed techniques, existing challenges, or classification among papers [99]. Models categorized here refer to learning models, new approaches, original concepts, and innovative usability theories.
(4) Technique: This type of contribution focuses on proposing new techniques to add value to MAR research. Techniques categorized here refer to MAR technical approaches that help innovate the technology.
(5) Case Study/Experience Paper: This type of contribution presents evidence on case studies and user experience involving utilization of MAR technology. Some of the contributions include exploratory findings, experimental results, user requirement studies, and comparative outcomes.
In answering the final part of RQ1, it can be seen in Table 8 that most research contributions in the area of MAR-based usability studies primarily focus on producing tools for problem solution (41), followed by formulating methods (10), designing models (9), reporting on case studies or experiences (9), and the introduction of new techniques (3).

Usability Metrics (RQ2)
RQ2 set out to find common usability metrics for measuring usability factors in the MAR learning environment. According to the International Organization for Standardization (ISO) 9241-11 standard [100], three common metrics for measuring usability are effectiveness (accuracy and completeness in given tasks leading to objectives), efficiency (resources such as time, effort, costs, and materials to achieve goals), and satisfaction (users' physical, cognitive, and emotional responses). However, throughout decades of usability research, these three metrics have been interpreted and varied in many different forms and terminologies based on the structure recognized by ISO. In practice, usability metrics are measured through performance metrics, self-reported metrics, or a combination of both. Performance metrics are objective (quantitative) data collections, whereas self-reported metrics are mostly associated with subjective (qualitative) data collections; both can be applied to different arrangements of user groups, namely within-subject or between-subjects. The next two sections will briefly discuss the mentioned metrics before detailing commonly used metrics in MAR-based usability studies.

Performance vs. Self-Reported
Tullis and Albert in [2] clearly defined performance metrics as objective methods in comparison to self-reported metrics, which are mostly subjective. Performance metrics include the usability measures that are collected mostly through observation methodologies and which do not consider participants' opinions, while self-reported metrics rely solely on users' opinions. From the 72 collected papers, there are three categories of metric segregation practiced by scholars: some practice only performance metrics, some practice only self-reported metrics, and some combine both metrics in collecting their usability data. In answering RQ2 on the most common usability metrics in MAR learning, Table 9 shows that a significant majority of the authors (49) collected only self-reported data, 20 studies collected a combination of both performance and self-reported data, and only 2 authors collected pure performance data.
According to Tullis and Albert in [2], within-subject evaluation refers to studies that perform repeated measures on experimental subjects. Commonly in usability studies, within-subject evaluation refers to having participants evaluate more than one of the tested items. The advantage of within-subject evaluation is that there is no need for a large sample size; however, it risks the participant carryover effect and prior experience biases. Between-subjects evaluation, on the other hand, refers to comparing results for different participants, where every participant evaluates only once [2]. This evaluation type is capable of giving experimenters a clean data collection without the risks of the carryover effect and prior experience biases; however, it requires more effort and time to gather a larger sample pool. In answering another part of RQ2, Table 10 shows that a significant majority of the authors (48) performed between-subjects evaluation, 19 studies performed within-subject evaluation, and only 4 performed a combination of both.
From the collected 72 research papers, 18 categories of usability metrics were identified with multiple interchangeable terminologies in each category (Figure 4). In finalizing the answers to RQ2, it can be seen that the most used metric is satisfaction, which is in line with the majority of self-reported data presented earlier. Table 11 presents the commonly used metrics together with other related terminologies within the same group. While most of the metrics presented are derived from the three major ISO 9241-11 metrics, each metric's expression is worth mentioning for the unique value it adds.

Usability Methods, Techniques, and Instruments (RQ3)
RQ3 was formulated to find the common methods, techniques, and instruments used in gathering usability data. From the data synthesis process, many different usability methods and techniques were extracted from the 72 studies collected for this research. However, through rigorous analysis, the methods, techniques, and instruments used in these collected studies can be clustered into seven categories according to the nature of how they were executed. In answering part of RQ3, Figure 5 shows that the questionnaire is the most used technique, registering 83 counts of usage, though some studies use more than one type of questionnaire instrument. The next cluster is time-based tracking, which incorporates all techniques applied in the form of time collection; this registered 19 counts of usage. There were 13 occurrences of error-tracking techniques, where error counts were used to measure usability. There were 11 discussion-based techniques, which incorporate group or individual interviews. There were 10 counts of behavior tracking, 17 counts of performance-based measures, and 9 procedural or heuristics protocols in MAR-based usability studies. Table 12 shows a detailed explanation of the various questionnaire instruments used by the selected studies, parameterized by questionnaire type (Type), instruments used (Instruments), number of Likert ranges (Lik), and authors utilizing the instruments (Refs.).

Open-Ended Questionnaires
According to [2], most questionnaires in usability studies include some open-ended questions in addition to the various kinds of rating-based questions. Open-ended questionnaires are instrumental in identifying ways to improve products, although, unlike close-ended questionnaires, they are limited when it comes to metric calculation. According to [2], a common use of open-ended questions in usability is to ask users for five things they liked about the product. However, summarizing the responses to this type of question is always a challenge due to their subjectivity. Based on a definition from [101], "a questionnaire is a form designed to obtain information from respondents." According to [102], open-ended questions are suitable for exploratory studies and as a supplement to close-ended questionnaires. However, open-ended questionnaires can be demanding for respondents, require significant coding effort, make results difficult to compare, have a higher nonresponse rate, and require more time to answer.

Close-Ended Questionnaires
According to [2], even though questionnaires can contain open- or close-ended questions, in practical statistics for user research most questionnaires are multiple choice, with respondents selecting from a set of alternatives or points on a rating scale. According to [102], close-ended questionnaires are easier for respondents to answer since they are guided, are easier to code and analyze, and are appropriate when a study is certain of the possible responses. However, close-ended questionnaires can also leave respondents feeling that the answer they wanted to give is missing. Most close-ended questionnaires can also be procedural, based on the usability standards discussed in the next section.

Standardized Questionnaires
According to [101], a standardized usability questionnaire is designed for repeated use, with a specific set of questions, a specific order within a specific format, and specific rules. It is also customary for a questionnaire developer to report measurements of the reliability, validity, and sensitivity of the questionnaire (psychometric qualification). Standardized questionnaires have several advantages, including objectivity, replicability, quantification, and economy. Further details on these advantages and on ways of accessing standardized questionnaires are given in [101]. According to the same source [101], the four classic standardized usability questionnaires (used post-study) are the Questionnaire for User Interaction Satisfaction (QUIS), the Software Usability Measurement Inventory (SUMI), the Post-Study System Usability Questionnaire (PSSUQ), and the System Usability Scale (SUS).
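To make the idea of "specific rules" concrete, the sketch below scores one respondent's SUS answers. It assumes the commonly published SUS scoring procedure (ten items on a 1-5 scale, odd items positively worded, even items negatively worded, sum scaled by 2.5); the reviewed studies do not describe their scoring code, and the function name `sus_score` is purely illustrative.

```python
def sus_score(responses):
    """Return the SUS score (0-100) for one respondent's ten answers (each 1-5)."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS expects exactly ten responses on a 1-5 scale")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded: contribution is answer - 1.
            total += answer - 1
        else:
            # Even-numbered items are negatively worded: contribution is 5 - answer.
            total += 5 - answer
    # The summed contributions (0-40) are scaled to a 0-100 range.
    return total * 2.5
```

A respondent who answers 3 to every item, for example, receives the midpoint score of 50.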

Time-Based Tracking
In this paper, all usability techniques that use time measures as evaluation parameters are classified in this category. According to [2], there are two common time measures that go hand in hand: the time of completion (also known as time-on-task), which refers to how quickly users can get their tasks done, and how successfully those tasks can be completed within that time. According to [2], it is more reliable to register time only for tasks that were done correctly, since this reflects the duration users need to perform the given task correctly.
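As a minimal sketch of the recommendation to register time only for correctly completed tasks, the snippet below computes mean time-on-task over successful attempts alongside the completion rate. The data layout, function names, and sample values are hypothetical and are not taken from any of the reviewed studies.

```python
from statistics import mean

def mean_time_on_task(attempts):
    """Mean completion time in seconds over correctly completed attempts only.

    attempts: list of (seconds, completed_correctly) tuples.
    Returns None when no attempt was completed correctly.
    """
    successful_times = [seconds for seconds, ok in attempts if ok]
    return mean(successful_times) if successful_times else None

def completion_rate(attempts):
    """Fraction of attempts that were completed correctly."""
    return sum(1 for _, ok in attempts if ok) / len(attempts)

# Hypothetical attempts by four participants on one task.
sample_attempts = [(42.0, True), (55.0, True), (120.0, False), (38.0, True)]
print(mean_time_on_task(sample_attempts))  # the failed 120 s attempt is excluded
print(completion_rate(sample_attempts))
```

Reporting the two values together keeps the speed measure from being distorted by abandoned or incorrect attempts.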

Error Tracking
All usability techniques that use error registration as an evaluation parameter are classified in this category. Sauro and Lewis in [101] defined errors as any unintended action, slip, mistake, or omission a user makes while attempting a task. According to [2], some professionals believe that errors and usability issues are essentially the same. According to the same source [2], errors are a useful way of evaluating user performance: "While being able to complete a task successfully within a reasonable amount of time is important, the number of errors made during the interaction is also very revealing." Albert and Tullis in [2] have also highlighted three general situations where error registration is useful in usability studies. According to Tullis, an error is defined as entering incorrect data, making wrong menu-based choices, or failing to take a key action.

Discussion-Based
All usability techniques that involve user interviews and focus groups are classified in this category. Interviewing, according to [103], is a technique that favors depth over sample size, which means it is not suitable for every problem. A focus group, on the other hand, is a moderated discussion with between four and twelve participants in a research facility, often used to explore preferences among different solutions [103]. Both user interviews and focus groups are classified under the qualitative (insights) category in the taxonomy of research techniques presented in [103].

Expression Observation
In a taxonomy presented in [103], a research technique to gather data based on user behaviors simply means registering what people do. An example would be assessing users' behavior by reading their facial expressions. In [104], it was mentioned that changes in human facial expression reflect the individual's current emotional state, which can be a means of communicating emotional information. Therefore, all usability techniques that register users' expressions as a data collection approach are classified in this category.

Performance-Based Tracking
This category groups all usability techniques that collect data on users' performance through given experimental tasks. Tasks can take the form of navigating a product's functionality or learning the content a product delivers. Based on insights given by [2], performance metrics rely on user behaviors and measure success based on given tasks. Performance metrics are also best used in evaluating effectiveness and efficiency [2]. Since educational assessments are often benchmarked, performance-based assessment techniques are also used because of outcome-based content standards [105].

Procedural
This category comprises all reported works that use usability standards, procedures, heuristics, models, and protocols established in the domain. According to [106], the term "standard" can refer to documents approved by a recognized body or to a "de facto standard" that has not been approved by any recognized body but is accepted through widespread use. According to [104], usability standards and procedural-based evaluation improve the speed and cost of mobile application development, since designers do not have to reinvent the wheel in their development processes. Besides providing better consistency, standards also improve the quality of user experience [104].
Table 13 presents the categories of usability instruments, parameterized by category definition (Category), techniques or instruments used for measurement (Techniques/Instruments), and references of the respective authors (Refs.).
In conjunction with RQ3, this study also breaks down the more commonly used usability techniques: observation, questionnaires, discussion-based techniques (interviews), and procedural/heuristic techniques (cognitive walkthroughs, think aloud, heuristics, expert reviews). Figure 6 shows the seven techniques and their percentages of usage across the 72 selected studies. The questionnaire (57%) is the most preferred technique, accounting for more than half of all recorded uses, followed by observation (23%), interviews (11%), expert review (3%), think aloud (3%), heuristic (2%), and cognitive walkthrough (1%).
In Table 14, the following abbreviations apply: questionnaire (Q), interview (Iw), observation (Obs), think aloud (TA), expert review (ER), heuristic (Hc), and cognitive walkthrough (CW). At most three usability techniques were combined at a time (Table 14). Figure 7 shows that although many techniques are used, some are used within different combination modes. Table 14 and Figure 7 show that, among the 72 collected articles, 8 used a combination of three techniques (Figures 8 and 9), 18 used a combination of two techniques (Figures 10 and 11), a majority of 45 used only one technique at a time (Figure 12), and 1 did not clearly specify the technique used. Figure 13 shows the frequency of all correlated techniques.
For the three-technique combinations, the most used combination was observation, questionnaire, and interview (4), followed by observation, questionnaire, and think aloud (1); observation, think aloud, and interview (1); observation, expert review, and interview (1); and heuristic, questionnaire, and think aloud (1). For the two-technique combinations, the most used combination was observation and questionnaire (14), followed by questionnaire/cognitive walkthrough (1), heuristic/expert review (1), expert analysis/interview (1), and observation/interview (1). Scholars who used only one technique in their research work fell into the following quantities: questionnaire only (39), interviews (4), and observation (2).
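The combination counts above can be tallied into per-technique usage with a short script. This is only a sketch of how such a tally might be done, not the authors' procedure; it uses the Table 14 abbreviations and assumes that "expert analysis" denotes expert review (ER). Under that assumption the tallies sum to 105 technique uses, and the rounded shares match the percentages reported for Figure 6.

```python
from collections import Counter

# Number of studies per technique combination, as reported in the text above.
COMBINATIONS = {
    ("Obs", "Q", "Iw"): 4,
    ("Obs", "Q", "TA"): 1,
    ("Obs", "TA", "Iw"): 1,
    ("Obs", "ER", "Iw"): 1,
    ("Hc", "Q", "TA"): 1,
    ("Obs", "Q"): 14,
    ("Q", "CW"): 1,
    ("Hc", "ER"): 1,
    ("ER", "Iw"): 1,  # reported as "expert analysis/interview"; mapping to ER is an assumption
    ("Obs", "Iw"): 1,
    ("Q",): 39,
    ("Iw",): 4,
    ("Obs",): 2,
}

def technique_usage(combinations):
    """Count how many studies used each technique, alone or in combination."""
    usage = Counter()
    for techniques, n_studies in combinations.items():
        for technique in techniques:
            usage[technique] += n_studies
    return usage

def usage_percentages(usage):
    """Share of each technique among all recorded technique uses, rounded to whole percent."""
    total = sum(usage.values())
    return {technique: round(100 * n / total) for technique, n in usage.items()}

usage = technique_usage(COMBINATIONS)
percentages = usage_percentages(usage)

# Number of studies grouped by how many techniques they combined.
studies_by_size = Counter()
for techniques, n_studies in COMBINATIONS.items():
    studies_by_size[len(techniques)] += n_studies

print(usage.most_common())
print(percentages)
print(dict(studies_by_size))
```

Counting technique uses rather than studies is what lets the shares sum to 100% even though 26 studies contribute more than one technique.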
In order to answer the fourth research question, two-dimensional and three-dimensional mappings were employed to demonstrate the correlational factors of the collected data attributes. First, two main components, namely research types and contribution types, were mapped against the general usability metrics of performance, self-reported, or a combination of the two. It can be seen in Figure 14 that the largest group of studies is exploratory research using self-reported metrics (32), followed by comparative research that also mostly uses self-reported metrics (24). The least used combinations of research types and metrics are exploratory with performance (1), empirical with performance (1), and comparative with performance (1). On the other hand, most studies that contributed tools employ self-reported metrics (27), followed by a hybrid of performance and self-reported metrics (13). The least used combinations of contribution types and metrics are tool with performance (1), technique with performance (1), and model with hybrid metrics (1).
The second correlation relates to a three-dimensional mapping of the mentioned metrics (performance (white), self-reported (grey) and hybrid metrics (dark grey)), research types, and contribution types. The mapping in Figure 15 shows that the largest group of studies conducted exploratory research using self-reported metrics and contributed a MAR tool (18), followed by 12 papers that conducted comparative research using self-reported metrics with a MAR tool as their major contribution.
Figure 16, on the other hand, presents a two-dimensional mapping of research types against the seven commonly used usability techniques, followed by contribution types against the same seven techniques. It can be seen in Figure 16 that the largest population belongs to scholars who carried out exploratory research with a questionnaire (36), followed by comparative research with a questionnaire (34). As for combinations of techniques and contributions, the largest pools are research that combines a questionnaire with a tool contribution (33), followed by observation with a tool contribution (14).
Appl. Sci. 2019, 9, x FOR PEER REVIEW 27 of 38
Figure 17 shows a two-dimensional mapping of the relationships of research types with types of evaluation and of contribution types with types of evaluation. It can be interpreted from Figure 17 that most researchers who conduct exploratory research evaluate respondents using the between-subjects technique (32), followed by comparative research, which also uses the between-subjects technique (20). On the other hand, the largest pool of researchers who produce tool-related contributions also uses the between-subjects method (30), followed by tool-related contributions with within-subject testing (7) and model-related contributions with between-subjects testing (7). It can be derived from the analysis that the largest pool of researchers uses questionnaires when measuring through between-subjects testing (41), followed by questionnaires with within-subject testing (15). Subsequently, most scholars who use between-subjects testing collected self-reported metrics (36), followed by self-reported metrics with within-subject testing (11) and hybrid metrics with between-subjects testing (11).
Referring to Figure 18, a three-dimensional mapping has been constructed to better understand the three-way correlation between common usability techniques, evaluation types, and the metrics used. It can be derived that the largest pool of scholars who applied questionnaire instruments collected self-reported metrics in a between-subjects testing fashion (30). The second largest group employed a questionnaire but collected both types of metrics in a between-subjects testing setup (11). The third largest pool employed observation techniques and also collected both types of metrics in a between-subjects experimental setup (10).
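The two- and three-dimensional mappings described above are, in essence, cross-tabulations of coded study attributes. The following sketch shows one way such counts could be produced; the records are invented purely for illustration and do not correspond to any of the 72 reviewed studies, and the field names and the `cross_tab` helper are assumptions.

```python
from collections import Counter

# Hypothetical coded records, invented for illustration only.
studies = [
    {"research": "exploratory", "contribution": "tool", "metric": "self-reported"},
    {"research": "exploratory", "contribution": "tool", "metric": "self-reported"},
    {"research": "comparative", "contribution": "tool", "metric": "hybrid"},
    {"research": "comparative", "contribution": "model", "metric": "self-reported"},
    {"research": "empirical", "contribution": "technique", "metric": "performance"},
]

def cross_tab(records, *dimensions):
    """Count records for every combination of values along the given dimensions."""
    return Counter(tuple(record[d] for d in dimensions) for record in records)

# Two-dimensional mapping: research type against metric type.
research_by_metric = cross_tab(studies, "research", "metric")

# Three-dimensional mapping: research type, contribution type, and metric type.
three_way = cross_tab(studies, "research", "contribution", "metric")

for cell, count in three_way.most_common():
    print(cell, count)
```

Because every study lands in exactly one cell, the cell counts of any mapping sum to the number of coded records.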

Research Findings on Identified Gaps
This section presents five research gaps (G1-G5) derived from results and discussion presented in Section 4 above.

Educational Domains versus Others (G1)
Figure 3 and Table 6 show that, of 12 major domain categories, half of the selected papers (50%) conducted usability studies on MAR in the education domain, which can be broken down into several sub-categories such as engineering, architecture, and language. While performing usability studies in MAR-based educational research is promising, concentrating most studies within the education domain risks making the effort obsolete and can be seen as complacency in research. While other domains, such as navigational MAR research, are catching up, the pairing of MAR and usability research is still in its infancy in other exploratory areas such as automotive [36], basic skills improvement [82], AR technical research (such as the works in [44,52]), gaming [38], and security [29]. In the domains of medical health, architecture, construction, management, and marketing and advertising, much technical AR and applied research has been carried out, but it has not included usability as one of the measured factors.

Modes of Contributions (G2)
Referring to Table 7, it can be derived that exploratory [13] and comparative [20] research types dominated among the 72 collected works. The majority of the papers produced contributions and insights relating to MAR tools. However, it is apparent that research contributions in MAR learning and usability are still lacking in the other four types of contribution, which relate to research novelty in methodologies, models (especially usability models tailored for MAR), techniques, and case studies (experience papers). Exploratory and comparative research has also become overly saturated in the area of MAR learning and usability. The findings of this paper show that no research has used experimental and empirical approaches on their own, let alone in combination with either exploratory or comparative studies. According to the data synthesis of this research, the use of quasi-experimental and heuristic methods has been minimal.

Standardization of Usability Metrics (G3)
While 18 categories of usability metrics used across all 72 collected studies can be seen in Table 11, most of the metrics used are in one way or another inherited from the three major usability components given by ISO 9241-11 [100]. Other de facto standards, such as those recommended by prominent figures in usability studies like Alan Dix [134] and Jakob Nielsen [115], have been instrumental in many newly emerging metrics in the usability domain. There are also some notable metrics such as escapism [49], facilitating conditions [39], a bundle of identification (HQ-I), pragmatic quality (PQ), and stimulation (HQ-S) [14], novelty [53], price value [27], and social influence [39], which can be considered new and closely related to measures of usability from an array of different perspectives. While metrics such as effectiveness, efficiency, satisfaction, and learnability are among the most applied usability metrics across the selected papers, many other emerging metrics may face validation issues, since some of them have yet to be accepted by the majority. Relating to G2, there are still many research gaps in contributing models and methodologies that can help classify and validate the many developing usability metrics introduced in MAR. While one reference work [83] has been identified as using the procedural usability principles proposed by [132] specifically for AR in a mobile environment, research on formulating standards for usability metrics in MAR is still absent, since most works adapted models and guidelines from diverse application areas.

Limited Quality versus Large Sample Convenience (G4)
A well-known trade-off in usability evaluation is that between performance and self-reported metrics, where the risk of bias must be weighed against the quality of data. In usability, self-reported or subjective measures are opinion-based input given by respondents expressing their experiences. They are also based heavily on the subjective judgement of respondents, channeled through instruments such as questionnaires and interviews [135]. As mentioned by Olsson in [120], user experience measurements, in general, should essentially be self-reported in order to cover the subjective nature of user experience. However, according to [2,136], data collected through self-reported techniques can be subject to social desirability bias and central tendency bias. This can lead self-reported data to suffer from biases, inconsistencies, and validity problems. However, self-reported techniques can reach larger audiences, especially through scaled close-ended questionnaires. This might explain why the data synthesis shows that 40 out of 72 scholars used only a questionnaire to track usability in MAR, followed by combinations of a questionnaire with other commonly used techniques. According to [101], it is important to establish a questionnaire's reliability through several correlational approaches. While a total of 60 authors (57%) used questionnaires as measures, only a handful (12 authors) validated the questionnaires' reliability despite the risks of the self-reported biases mentioned above. The reliability measures performed include Cronbach's Alpha [12,14,32,40,42,49,57], Cohen's Kappa [15,24], and Pearson's correlation [15,22]. Techniques such as performance metrics, which are evidently more reliable than the self-reported approach, were rarely used, under the assumption that these processes are much more technical, time-consuming, and tedious. Only a small pool of 24 authors (23%) executed performance metrics. A self-reported approach such as a questionnaire is still used primarily on the supposition that the process is swifter and able to reach bigger audiences than the performance approach. Authors who used only performance metrics reached as few as 1 sample [28] and 6 samples [36]. Authors like [32,34] used only self-reported measures and managed to reach 978 and 318 samples, respectively. There are also works that used a questionnaire with smaller samples, such as [62] (11 samples) and [58] (10 samples). Therefore, it is really a matter of opting for quality data within a small pool of respondents using stricter protocols, or reaching a bigger audience with convenience by executing simpler procedures despite the risk of questionable and biased data collection. Hence, the gap identified here concerns the justification of why self-reported metrics are still widely used, despite the risk of bias, in preference to performance approaches.
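Since Cronbach's Alpha is the reliability measure most often reported above, a minimal sketch of its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), is given here. The reviewed studies do not state how they computed it, so the function name, the data layout (one row of item scores per respondent), and the sample responses are all assumptions made for illustration.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != k for row in rows):
        raise ValueError("every respondent must answer the same number of items")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*rows)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical Likert responses: three respondents, two items.
consistent = [[1, 1], [2, 2], [3, 3]]  # items move together
noisy = [[1, 2], [2, 1], [3, 3]]       # items partly disagree
print(cronbach_alpha(consistent))
print(cronbach_alpha(noisy))
```

Items that rise and fall together push alpha toward 1, while disagreement between items lowers it, which is why the statistic is used as an internal-consistency check on questionnaire scales.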

Limitation of Hybrid Usability Methods (G5)
Figures 6-13 show the limited approaches used with usability technique combinations. As can be seen in Table 14, only 8 papers promoted three technique combinations, while 18 papers proposed two technique combinations. The majority of 45 papers still preferred only one type of technique at a time. According to the ISO 9241-11 standard [100], the ideal evaluation, measuring all three usability components (efficiency, effectiveness, and satisfaction), uses a combination of both performance (efficiency and effectiveness) and self-reported (satisfaction) metrics. Besides providing more data angles for analysis, a hybrid of the performance and self-reported approaches enables data triangulation, for example as a countermeasure for checking the validity of the data collected. However, perhaps due to the complexity of these procedures, less than half (26 papers) relied on technique combinations. The reliability of a single technique, as mentioned in G4, can be questionable: user performance measures commonly suffer from small audiences, while self-reported approaches carry risks of invalid or inconsistent data. Despite the disadvantages of both approaches, some authors like [137] and [79] could still produce results with big sample audiences (50 samples or more) and report tangible results using hybrid usability techniques.
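As an illustration of the hybrid approach described above, the sketch below summarizes one set of test sessions along the three ISO 9241-11 components, taking effectiveness and efficiency from performance data and satisfaction from a self-reported rating. The `Session` record, the choice of completion rate and mean time on completed tasks as indicators, and the 1-5 rating scale are assumptions made for this example only. They are common choices rather than operationalizations prescribed by the standard or by the reviewed papers.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    """One participant attempting one task (hypothetical record layout)."""
    completed: bool   # performance: did the participant finish the task?
    seconds: float    # performance: time spent on the task
    rating: int       # self-reported: satisfaction on an assumed 1-5 scale


def summarize(sessions):
    """Return one indicator per ISO 9241-11 component for a list of sessions."""
    if not sessions:
        raise ValueError("no sessions to summarize")
    completed = [s for s in sessions if s.completed]
    return {
        # Effectiveness: share of attempts that were completed.
        "effectiveness": len(completed) / len(sessions),
        # Efficiency: mean time on task, over completed attempts only.
        "efficiency_seconds": mean(s.seconds for s in completed) if completed else None,
        # Satisfaction: mean self-reported rating over all attempts.
        "satisfaction": mean(s.rating for s in sessions),
    }


# Hypothetical data: four participants attempting the same task.
sessions = [
    Session(completed=True, seconds=30.0, rating=4),
    Session(completed=False, seconds=60.0, rating=2),
    Session(completed=True, seconds=50.0, rating=5),
    Session(completed=True, seconds=40.0, rating=5),
]
summary = summarize(sessions)
```

Keeping both kinds of data in the same record is what makes triangulation possible: a participant who fails the task but reports high satisfaction, or the reverse, can be flagged for closer inspection.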

Potential of MAR Usability in a Myriad of Domains
Based on G1 and the findings presented in Section 4.2.1, MAR learning has been applied mostly in the education and navigation research areas. While there is much potential in many of the other areas mentioned in Section 4.2.1 and Table 6, the involvement of MAR learning in these areas is significantly lower than in the education and navigation domains. Given the saturation of combined MAR learning and usability research in these two areas, there are many untapped opportunities to conduct similar research in real industries, aligned with the requirements of Industrial Revolution 4.0.

Implementation of Research Types
Based on G2, while there is no harm in conducting more MAR-based usability studies through exploratory and comparative methodologies, scholars might also want to look into study ventures using other research methodologies, such as empirical, experimental, quasi-experimental, and heuristic approaches, since these have already been kick-started by other researchers in the domain of MAR-based usability research. More work can also be carried out in several contribution types that are lacking in references, such as the introduction of new methodologies, models, and techniques, reports of case studies, and experience papers.

Validation of New Usability Metrics in MAR
Referring to G3, there is still little to no research that focuses on tailoring new standards and validating usability metrics for MAR. While it is crucial to produce usability models for MAR, it is also important to validate them against recognized metrics that are relevant to usability studies. Even though this SLR paper has managed to group these usability metrics into several category types (Table 11), there is still work to be done in systematically categorizing metrics within established terminologies and de facto standards through rigorous, evidence-based future studies.

Utilization of Performance Metrics
It is of utmost importance to highlight in G4 that performance metrics are manifestly underused despite the sound logic that they yield better data. New models and methodologies can be proposed for utilizing performance metrics in ways that are beneficial and, at the same time, eliminate the commonly known limitations of objective measures. Although performance metrics have several known disadvantages, there are many opportunities to improve the protocols so that they can be utilized more in usability-based MAR studies. While self-reported metrics also have their own set of advantages, there are likewise many opportunities to mitigate their risks through reliable, statistically driven countermeasures.

Potential of Hybrid Techniques in MAR Usability Evaluation
The discussion of G5 has highlighted the potential of hybrid usability models that can maintain data consistency while reaching larger groups. Research and standardization work in hybrid usability approaches has opened a new gap for introducing models and methodologies that can serve the objective of improving usability in MAR. While this paper reports several works by authors who utilized technique combinations, the amount of research in this area can still be increased in order to generate more result patterns achieved through hybrid usability techniques in MAR.

Correlational Research
Figure 14 shows that few studies combine hybrid usability approaches with technique contributions, or performance metrics with model, method, and case study/experience paper contributions. There have also been few to no studies using performance metrics in experimental, quasi-experimental, and heuristic research approaches. Similarly, no study has applied hybrid metrics through quasi-experimental and heuristic approaches. Figure 17 shows that there have been no studies using an empirical approach contributing to a method or model; no studies using an experimental approach contributing to a technique or case study/experimental results; no studies using a quasi-experimental approach contributing to a model, technique, or case study/experimental results; and no studies using a heuristic approach contributing to a method, model, technique, or case study/experimental results. As also shown in Figure 17, the mapping reveals plenty of opportunities to investigate combinations of research type, subject study, and contribution. These include experimental with within-subject; experimental, quasi-experimental, and heuristic with hybrid metrics; between-subjects studies that contribute to technique; and hybrid metrics that contribute to the method, model, technique, and case study/experimental results. Figures 14-18 further visualize the limited research areas that could be explored in future work on usability-based MAR studies.
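The gap identification behind these mappings amounts to cross-tabulating two study attributes and listing the combinations that no study covers. The sketch below shows that idea under stated assumptions: the function name `find_gaps` is hypothetical, and the records are invented placeholders rather than the 72 reviewed papers.

```python
from collections import Counter
from itertools import product


def find_gaps(records, row_levels, col_levels):
    """List every (row, column) combination with no study in `records`.

    `records` holds one (row_value, col_value) pair per study, for example
    (research type, contribution type).
    """
    counts = Counter(records)
    return [
        (row, col)
        for row, col in product(row_levels, col_levels)
        if counts[(row, col)] == 0
    ]


# Hypothetical classification of a handful of studies (not the review's data).
research_types = ["exploratory", "empirical", "heuristic"]
contribution_types = ["tool", "method", "model"]
records = [
    ("exploratory", "tool"),
    ("exploratory", "tool"),
    ("exploratory", "method"),
    ("empirical", "tool"),
    ("heuristic", "tool"),
]

for research_type, contribution in find_gaps(records, research_types, contribution_types):
    print(f"no study: {research_type} x {contribution}")
```

The same function extends to a three-dimensional mapping by passing triples as records and taking the product over three level lists instead of two.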

Quality of Work
Given the rigorous effort of carefully comparing each paper, formulating research questions (RQs), checking the inclusion/exclusion criteria, and finally applying the quality assessment questions (QAs), confidence in the quality of each selected paper can be presumed to be high. However, there is still a risk that quality is defined differently for each paper, depending on the set of objectives by which it is understood. Despite rigorous data collection leading to a synthesis of all selected papers, we can only classify the effort put into this article as our level best, and not as 100% error free, in assessing the quality of each selected paper. As mentioned in Section 3.1, automated and manual (snowballing) methods were carried out to reduce, as much as possible, inaccuracy, incompleteness, and validity risks in the data collected.

Biases in Paper Selection
As mentioned in Section 3, both methods used in Section 7.1 were implemented to reduce biases in paper selection, but there is still no guarantee that this research has not overlooked some related papers. Nevertheless, all protocols were carried out specifically to avoid anomalies between the data collection processes.

Data Synthesis
In any review paper, threats to external validity and conclusion validity, as discussed in Section 3, can arise where the validity of the data collected is questionable and the data are not generalizable. Even with a clear process implemented from the start, no process is carried out without miscalculations, including the processes conducted in this study. However, because the risks of possible threats were assessed against identified parameters and SLR methods were employed consistently, any errors are assumed to be minimal. This research followed established SLR techniques suggested by earlier authors who had carried out similar approaches, with clear evidence of minimizing the risks of TTV.

Conclusion
This paper aimed to study existing usability implementation in mobile augmented reality within a specific scope determined through four research questions. These research questions were formulated primarily to identify the existing domains of application, research types, usability metrics, methods, techniques, and approaches, with the goal of understanding current issues and gaps through systematic identification. Starting from an initial pool of 1324 papers, followed by an additional 116 papers found using both automated and manual searches, an arduous multi-layer process was implemented to narrow the selection down to only 72 articles that met the pre-determined quality criteria. Data synthesis allowed the authors of this review to understand and analyze the pre-designed objectives, which eventually contributed to: (1) the classification of research demographics; (2) the categorization of usability metrics, methods, and techniques; (3) two-dimensional and three-dimensional correlational mapping between research parameters; (4) the identification of relevant research gaps; and (5) recommendations for future research in usability-based MAR derived from the identified gaps and correlational mappings.

The findings of this research have managed to answer the four research questions formulated earlier in this paper. RQ1 has shown evidence that the most dominant research domain in MAR learning is education, followed by navigational exploration. RQ1 also highlighted exploratory research as the most adopted research type and MAR tool production as the most registered research contribution. RQ2 was answered when evidence showed self-reported metrics to be the most used usability metrics, between-subjects testing to be the most preferred evaluation, and user experience to be the most measured usability parameter. In RQ3, the questionnaire was shown to be the most preferred usability technique. Answers to RQ3 also explained in detail the adopted combinations of usability methods, metrics, and techniques. RQ4 showed the mapping of research types, contribution types, and usability metrics from several different perspectives. Besides contributing detailed evidence of correlational usability variables in MAR learning, by answering all four RQs this research has also highlighted five research gaps: the variety of related domains, a lack of contributions in several research outputs, a lack of usability standardization in MAR learning, a significant gap in usability metric utilization, and the limitation of hybrid usability methods. This paper then concluded with five recommendations founded on the identified gaps in MAR learning research. It is hoped that the findings, synthesis, identified gaps, relational mappings, and recommendations will add value to future research and serve as sources that initiate more concrete studies in usability-based MAR.

Figure 5. Used usability metrics instruments by count.

Figure 6. Percentage of usability techniques used.

Figure 7. Research papers and usability technique combinations.

Figure 8. Papers with three technique combinations.

Figure 9. Works with a three technique combination.

Figure 10. Papers with two technique combinations.

Figure 11. Number of two technique combinations.

Figure 12. Papers with one main technique.

Figure 14. Two-dimensional mapping of research types, contribution types, and metrics.

Figure 15. Three-dimensional mapping of research types, metrics, and contribution types.

Figure 16. Two-dimensional mapping of research types, common techniques, and contribution types.

Figure 17. Two-dimensional mapping of research types with evaluation types and contribution types with evaluation types.

Figure 18. Three-dimensional mapping of common usability techniques with evaluation types and used metrics.

Table 2. Inclusion and exclusion criteria.
Exclude all: 1. Articles published in languages other than English; 2. Articles that discuss only application development and do not implement usability measures; 3. Articles that present studies other than handheld mobile MAR learning applications.

Table 3. Quality assessment questions.

Table 4. Collected articles from different online databases.

Table 5. List of publications.

Table 6. Research domains and sub-domains.

Table 7. Combination of research types.

Table 8. Types of research contribution.

Table 9. Types of usability metrics category.
Table 11 presents the commonly used metrics together with other related terminologies within the same group. While most of the metrics presented below are derived from the three major ISO 9241-11 metrics, the specific expression of each metric is still worth noting for the unique value it adds.

Table 11. Usability metrics and interchangeable terminologies used by selected studies.