Gesture Elicitation Studies for Mid-Air Interaction: A Review

Abstract: Mid-air interaction involves touchless manipulations of digital content or remote devices, based on sensor tracking of body movements and gestures. There are no established, universal gesture vocabularies for mid-air interaction. On the contrary, it is widely acknowledged that the identification of appropriate gestures depends on the context of use; thus, the identification of mid-air gestures is an important design decision. The method of gesture elicitation is increasingly applied by designers to help them identify appropriate gesture sets for mid-air applications. This paper presents a review of elicitation studies in mid-air interaction based on a selected set of 47 papers published in 2011–2018. It reports on: (1) the application domains of mid-air interactions examined; (2) the level of technological maturity of systems at hand; (3) the gesture elicitation procedure and its variations; (4) the appropriateness criteria for a gesture; (5) participant number and profile; (6) user evaluation methods (of the gesture vocabulary); (7) data analysis and related metrics. This paper confirms that the elicitation method has been applied extensively but with variability and some ambiguity, and discusses under-explored research questions and potential improvements of related research.


Introduction
Mid-air interaction is about touchless manipulations of digital content or remote devices, based on tracking of body movements, postures and gestures with non-intrusive sensors (or minimally-intrusive, mainly based on computer vision). Over the last few years, mid-air interaction has evolved into a distinguishable interaction style of human-computer interaction (HCI). The origins of mid-air interaction can be traced back to the late seventies, in the MIT Media Room and the "Put That There" demo [1], and to the eighties, in the live music performances of Vincent John Vincent and Francis MacDougall (https://www.youtube.com/watch?v=-zQ-2kb5nvs). In the last decade, we have witnessed popular gaming platforms that rest on mid-air interactions like the Wii and Xbox, impressive demos like the wearable SixthSense project [2] and several public installations in museums and technology-enhanced rooms [3]. Lately, mid-air interaction has been explored as an alternative or complementary interaction style in several application domains that require touchless manipulation, like mobile [4] and desktop micro gestures [5], gesture-based control of the TV [6] and other "smart" home appliances [7], remote interaction with distant displays in the wild [8] and in particular contexts (e.g., operating rooms) [9,10], interaction with smartwatches [11], secondary driving tasks [12,13] and so forth.
There is no established, universal gesture vocabulary for typical mid-air manipulations in any of the aforementioned application domains. On the contrary, it is widely acknowledged that "each input method is best at something and worse at something else" [14] and that "there is no [...]
1. What are the application domains of mid-air interaction elicitation studies? Given that mid-air interaction is researched in a wide application scope, it is useful to provide an overview of the diverse domains and possibly identify application trends.
2. What is the level of technological maturity of the systems at hand? Since the elicitation method is applied early in the requirements or design phases of the interactive systems development lifecycle, on which design ground do users make gesture proposals?
3. What is the basic process followed and its variations? Given that the method does not have a strict procedure, it is important to identify the main steps and their outcomes.
4. What are the dimensions of appropriateness for gesture selection? Various dimensions are met in related work (like discoverability, memorability, performance, reliability, comfort, usability); it is useful to identify the degree to which these are considered in elicitation studies.
5. What is the profile of participants in elicitation studies? As in every user-centred method, the number and the characteristics of participants are critical for the validity and generalizability of the results.
6. How are user proposals (about gestures) evaluated? Various methods have been employed; it is useful to identify them as well as possible trends in this respect.
7. What data analysis of user proposals is conducted and based on which metrics? Several metrics have been proposed and applied, with variation, in gesture elicitation studies; these need to be reviewed and discussed.
The rest of the paper is structured as follows: Section 2 presents the paper selection process and criteria and the characteristics of the papers reviewed. Section 3 presents the findings of the survey in terms of the aforementioned research questions. Section 4 discusses the findings of this survey, identifying further challenges and trends. Section 5 presents the conclusions.

Method
A systematic method for paper selection and analysis was followed, which adopted the Quality of Reporting of Meta-Analyses (QUOROM) statement [18] and included the following steps. The selection and analysis of papers took place in two periods: first in late May 2018 and then in late August 2018 (when the paper corpus was finalized).
Step 1. Potentially relevant publications identified and screened for retrieval
Source selection. We selected the following online digital libraries: ACM Digital Library, IEEE Xplore, Taylor & Francis Online, Springer (link), Elsevier (ScienceDirect) and the Google Scholar search engine. These online services provide direct access to the vast majority of high-quality journals and conferences of Human-Computer Interaction (HCI).
Search queries. At first, we explored search results with a few search terms that combined ("mid-air interaction" or "in-air interaction" or "touchless interaction" or "kinaesthetic interaction") and ("elicitation" or "user defined gestures" or "empirical study"). We observed that the most comprehensive lists of search results were obtained when we used two queries: ("mid-air interaction" and "elicitation study") and ("mid-air interaction" and "user defined gestures"). Therefore, we decided to retain these queries for all online sources.
Search constraints (refinement). We refined the results for the years 2011–2018 for all online sources. We set 2011 as the starting year because it was then that the first affordable sensors were released in the market. We sorted the results by "relevance". We examined approximately the first 100 search results of each online library (if available), since we (gradually) observed that results after the first 100 were less relevant to our search. The online sources of ACM DL, Elsevier and Google Scholar returned several hundred results on these queries, while the other three online sources returned only a few results. In addition, we used the "cited by" feature for highly cited papers, to identify more results that may not have been identified directly by the queries employed.
Criteria for screening. We screened more than 1000 search results (based on their title and summary) on the following criteria: (a) it potentially referred to an elicitation study of mid-air interaction (disambiguation was required since some search results were clearly out of context), (b) it was a scientific paper (i.e., not a book, a thesis, an editorial, etc.), (c) it was accessible through the subscription of our academic institution (e.g., not a mere citation without a link to the source publication), (d) it was a relevant paper to this study, even if not an elicitation study per se (e.g., other review or survey papers of mid-air interaction). A total of 247 papers met those criteria.
Step 2. Publications retrieved for detailed evaluation
Further screening criteria. In this phase, the papers were reviewed by abstract and rapid examination of structure and content to: (a) exclude other relevant papers (identified in the previous screening, criterion (d)), (b) eliminate duplicates and (c) identify if they referred to an elicitation study, typically by containing a section on "user elicitation", or "gesture design" or "user definition of gestures". A total of 106 papers remained from this examination.
Step 3. Potentially appropriate publications for the review
Further screening criteria. In this phase, papers were cross-examined to (a) validate that each paper referred to a unique study (when more than one paper referred to the same study, the most comprehensive was kept), (b) exclude short papers and works in progress. Furthermore, papers were reviewed in further detail to ensure that they included reasonably sufficient information about the elicitation study. A total of 47 papers remained.
Step 4. Publications included for the review
General characteristics of the papers selected. The papers included in this survey are distributed over the 2011–2018 period as follows (Table 1): 2011: 2 papers (4.3%); 2012: 5 papers (10.6%); 2013: 3 papers (6.4%); 2014: 6 papers (12.8%); 2015: 4 papers (8.5%); 2016: 8 papers (17%); 2017: 11 papers (23.4%); 2018: 8 papers (17%). Given that the paper selection was finalized in August 2018, papers appearing later in 2018 have not been included in this review. From the corpus of 47 publications (Table 2), there were 11 papers (23.4%) published in journals (in 10 journals) and 36 (76.6%) in conferences. A considerable number of the papers selected (9 papers, 19%) were published at the ACM CHI conference. Most papers were published in conferences rather than in journals, which may be attributed to the fact that the scope of elicitation studies is in user research and design. Papers on gesture elicitation do not often lead to development and evaluation of interactive systems. All publication venues are related to the field of HCI, most at its core, like the ACM CHI (Computer-Human Interaction), DIS (Designing Interactive Systems) and ITS (Interactive Tabletops and Surfaces) conferences and the International Journal of Human-Computer Studies, and some in relation to applications of HCI, like for example the Ubicomp conference.
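The per-year shares reported above can be re-derived from the raw counts. The following is a minimal sketch in Python; the helper name `shares` is hypothetical, and the last entry is taken to be 2018, on the assumption that the yearly counts cover the 2011–2018 search window.

```python
# Paper counts per publication year, as listed in the text (Table 1).
# Assumption: the final entry refers to 2018, the last year of the search window.
papers_per_year = {
    2011: 2, 2012: 5, 2013: 3, 2014: 6,
    2015: 4, 2016: 8, 2017: 11, 2018: 8,
}


def shares(counts):
    """Return (total, {key: percentage of total, rounded to one decimal})."""
    total = sum(counts.values())
    return total, {key: round(100 * n / total, 1) for key, n in counts.items()}


total, pct = shares(papers_per_year)
print("corpus size:", total)
for year, share in pct.items():
    print(year, f"{share}%")
```

The same helper can be applied to any of the other tallies in this review (for example, journal versus conference papers) to check that counts add up to the corpus size.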

Findings
In this section, we report on the findings based on the research questions of our review (posed in Section 1).

Application Domains
There is a considerable variety of application domains of mid-air interaction research, as well as variability in terms of the technological maturity of systems at hand (Table 3).
The mid-air control of various types of media, like music, videos, (types of) image collections and so forth, is a challenging design issue. A considerable number of elicitation studies (7/47, 14.9%) focus on mid-air media control, like the work of Ruiz et al. [34], who elicit a gesture set for controlling 360-degree videos, or the work of Siddhpuria et al. [40], who explore the use of discrete micro-gestures through a smartwatch for control of remote media.
Several studies employ mid-air interaction to improve the UX of (smart) TVs (6 out of 47, 12.8%), which have evolved beyond the passive TV-watching paradigm into interactive multimedia devices with features like web browsing, content manipulation, media playback and so forth. Notably, in Reference [28] freehand gesture vocabularies for controlling the TV have been proposed, while in Reference [37] the gesture vocabulary is for blind users.
Mid-air control of public displays appeared in 12.8% of all studies examined (6/47). For example, Di Geronimo et al. [49] conduct an elicitation study to identify mid-air gestures to share data among mobile devices, Ref. [60] elicited user-defined gestures for defining virtual interaction spaces within a pervasive environment, while Rodriguez and Marquardt [44] present a gesture elicitation study on how to opt-in and opt-out from interactions with public displays to address the need for user registration and avoid "false-positive" content activation.

Technological Maturity of Systems
Another issue that emerged from this survey is related to the technological development of systems at hand. Generally, elicitation studies are concerned with user research at the early stages of the design process, when it is not always possible or desirable to have a system or prototype. Therefore, the technological development of the system at hand may vary a lot, which affects the type of study.
We have identified three levels of technological design maturity:
Systems (fully-developed). These are employed in the gesture elicitation study (that may happen "in the wild") and they are possibly redesigned according to the results. For example, in the work of Lee et al. [46], an elicitation study was conducted with a "walk up and use" application about academic information installed on a public display on a university campus. Nine out of 47 elicitation studies (19.1%) were conducted with a fully developed system at hand.
Working prototypes. These are functional in the sense of providing interactive digital content and gesture tracking and demonstrating a working component of the system; however, they are not fully-developed systems. For example, in the work of [54], the elicitation study takes place with reference to a working prototype of an image gallery. Sixteen out of 47 (34%) elicitation studies were conducted with working prototypes.
Referents. In this case, a set of referents (typically user commands) about a known system (e.g., the TV) is provided to users. For example, in the work of [19], three elicitation studies are conducted to identify different gesture sets based on different body parts for intense gameplay. Almost half of the elicitation studies identified (22 out of 47, 46.8%) were conducted on the basis of referent sets about a known concept.

Gesture Elicitation Process and Variations
One of the most adopted user-elicitation methods is the "Guessability study" that was first introduced by Wobbrock et al. [62], as a unified approach for maximizing and evaluating the guessability of symbol input that was entered by users on a touchpad. This method was initially applied for interactive surfaces and later for mid-air gesture interactions. The main idea is that it is unrealistic to expect from novice, as well as expert users, to have the time or desire to undergo extensive training to learn new ways of interacting with a system.
In general, the guessability study starts by having participants presented with a referent (the effect of an action). Then, they are asked to propose a gesture that better matches the intended use (or is easy, intuitive, etc.). During the process, data collection takes place (video-audio recording, semi-structured interviews or think-aloud protocol, Likert-scale questionnaires, etc.). At the end of the process, the gesture set is derived after data analysis with various quantitative and qualitative metrics and categorized into various taxonomies. Although the guessability study was initially applied to surface-based systems, it appears to be the most adopted method (36 out of 47, 76.6%) for eliciting gestures for mid-air interaction systems (Table 4). Of the 47 papers reviewed, 9 papers (19.1%) followed Wobbrock's methodology by the book ((a) applying a Wizard-of-Oz technique, (b) analysing proposed gestures with the metrics that Wobbrock et al. proposed (Level-of-agreement or Agreement-rate) and (c) categorizing gestures into taxonomies), while 4 papers out of 47 (8.5%) enriched that methodology by applying additional metrics for analysing the gesture set.
An interesting variation of the usual guessability method is the choice-based elicitation study, which consists of two phases. The first one is a guessability study. The second phase, which can also be considered as a refinement technique, re-examines only the referents that scored low agreement during the previous phase. A survey is conducted and users select the most appropriate gestures from a predefined list of gestures, produced by the experts or the users. For instance, Dim et al. [37], while investigating mid-air gestures that would enable blind people to control the TV, conducted two elicitation studies. In the first one, gestures proposed by users were analysed on the basis of consensus for each referent. In the second study (also referred to as a "choice-based elicitation study" [19,37]), for those referents that scored a low consensus rate, a pre-defined list of representative gestures (proposed by experts) was presented to participants. Similarly, in research investigating concurrent body gestures during intense gameplay, Silpasuwanchai and Ren [19] conducted a choice-based elicitation study, since the notion of simultaneous gestures was relatively uncommon for users. In their study, the list of pre-defined gestures was not proposed by the authors but was populated from the two previously conducted elicitation studies. A similar approach of a user-defined (instead of expert-defined) choice-based elicitation study was also utilized in the research of Dong et al. [22], in which the most frequently proposed gestures from a preliminary study were presented to the users in a multiple-choice manner. Although the choice-based elicitation study is a time-consuming process, it is claimed to be a necessary complementary approach that improves creativity, especially for novel interfaces where users are not familiar with the design space.
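The selection step of a user-defined choice-based study can be illustrated with a short sketch. This is not the procedure of any particular paper reviewed: the consensus measure (share of participants proposing the most frequent gesture), the 0.5 threshold, the `top_k` cut-off, the function names and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def consensus(proposals):
    """Share of participants who proposed the most frequent gesture (assumed measure)."""
    counts = Counter(proposals)
    return counts.most_common(1)[0][1] / len(proposals)


def second_phase_choices(elicited, threshold=0.5, top_k=3):
    """Map each low-consensus referent to its top_k most frequent proposals.

    `elicited` maps a referent to the list of gesture labels proposed in the
    first (guessability) phase. Referents at or above the threshold are
    considered settled and are not re-examined.
    """
    choices = {}
    for referent, proposals in elicited.items():
        if consensus(proposals) < threshold:
            ranked = Counter(proposals).most_common(top_k)
            choices[referent] = [gesture for gesture, _ in ranked]
    return choices


# Hypothetical first-phase data: one settled and one low-consensus referent.
elicited = {
    "volume up": ["swipe up", "swipe up", "swipe up", "raise palm"],
    "mute": ["fist", "palm to mouth", "cross arms", "fist", "pinch"],
}
print(second_phase_choices(elicited))
```

In an expert-defined variant, the candidate list for each low-consensus referent would be supplied by the researchers instead of being ranked from the first-phase proposals.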
Another user-elicitation method is the one proposed by Nielsen et al. [15], also referred to as the "Intuitive and Ergonomic" Method. Drawing upon usability principles, they highlight the importance of designing gestures that are easy to perform and remember, intuitive, ergonomic and metaphorically logical. They imply that mid-air gesture interaction is not a panacea for every application and therefore it should be examined beforehand whether it is the most appropriate interaction technique for the system to be developed. Then, in order to produce intuitive gestures for a system, they propose a human-to-human non-verbal communication approach with the use of scenarios where users interact with the "operator" (i.e., the person who conducts the experiment) or another user by using gestures that are more appropriate for the specific function. In cases when the design of the interface or the feedback is to be tested, a human-to-computer approach can be applied by using a Wizard-of-Oz technique [63]. After collecting the proposed gestures and evaluating their ergonomic characteristics, the resulting gesture vocabulary is benchmarked in terms of memorability, stress and guessability. From our review, it appears that two studies (2 of 47, 4.3%) adopted the general approach of Nielsen's "Intuitive and Ergonomic" Method. In particular, the authors of [23] applied the intuitive and ergonomic method to investigate and compare gestures proposed by experts and novices using a vision-based anaesthesia-related system within an operating room, while the authors of [26] adopted Nielsen's approach to elicit the set of commands of a music player and the gestures that are more appropriate for each task.
Also, there are two studies (4.3%) that conducted an extensive user-based elicitation study by combining Wobbrock's and Nielsen's approaches. More specifically, in Reference [7], a user-based guessability study was conducted to find the most appropriate gesture set for mid-air interactivity within a smart environment, followed by the Intuitive and Ergonomic method to investigate memorability and performance aspects of the proposed gestures. The authors of [27] investigate gesture-based TV control by adopting Wobbrock's methodology to collect the user-defined gestures for every task and Nielsen's techniques to initially identify the available commands of the intended system and then benchmark the gesture set in terms of memorability, comfort and gesture-command matching degree.

Controlling the Legacy Bias
A primary concern in user elicitation studies is legacy bias, which refers to "the prior experience with interfaces and technologies, that makes it hard to uncover new gestures for an emerging medium" [17]. According to Morris et al. [16], users tend to transfer prior knowledge to new technologies because biased interaction techniques minimize physical and mental effort and because sometimes users cannot understand the fundamental capabilities of novel technologies. Undoubtedly, legacy bias has a direct effect on the proposed gestures. Some researchers mention legacy bias as a pitfall of elicitation studies since it might limit the potential for producing gestures that take full advantage of emerging technologies and sensing capabilities [16,17].
Morris et al. [16] proposed three techniques (the 3-Ps) that might reduce the phenomenon of legacy bias: Priming, Production and Partners. Priming involves exposing users to a stimulus that sub-consciously influences the responses to future stimuli. Production means requiring users to propose many gestures for each referent, and Partners suggests recruiting users in groups in order to leverage their ideas. From our survey, 15 out of 47 (31.9%) studies adopted at least one of the aforementioned techniques to reduce legacy bias, with Production being the most frequently used technique (12 out of 15, 80%). For example, in Reference [51], users were kinaesthetically primed by lifting and moving boxes before the elicitation study, in References [54,55] users were prompted to produce more than 3 gestures for each referent and in Reference [41] users were recruited in groups to brainstorm several interactions and then to come up with the most preferable one.
Although legacy bias is considered a factor that may limit originality in gesture proposals, sometimes it is considered to have positive effects in elicitation studies. Köpsel and Bubalo [64] argue that biased gestures have, in most cases, the advantage of being simple and of not requiring much time to be learned or effort to be guessed, resulting in high agreement scores in elicitation studies. Such gestures are appropriate in cases when users do not have the time or desire to learn new interaction methods, or when the user's cognitive load should not be burdened. It becomes apparent that whether or not to tackle legacy bias is a design decision, which mainly depends on whether the end product/system is meant to be a walk-up-and-use system or a system that would take full advantage of the novel interaction techniques. In our survey, 32 out of 47 studies (68.1%) did not utilize a technique to reduce legacy bias.

Referent Presentation
Following Wobbrock's terminology, the referent is the effect which is triggered by a gesture. Referents can be presented to the participants in various ways. Depending on the type and maturity of the prototype used, referents were either demonstrated through GUI animations [41,54], described as a text message on the screen, or verbally [4,28,37,61], presented as a video [51,61] or still images [19,24,34,55], or presented by manipulating the actual artefact [7,45].

Think-Aloud Protocol
Although the main methods of elicitation studies are similar to Wobbrock's [62,65] or Nielsen's [15] approach, various complementary techniques were employed during the stage of user participation and gesture proposals. The most apparent one is the think-aloud protocol [66,67], which involves prompting participants to verbalize their thoughts, such as why they chose the proposed gesture or whether they refer to other systems or previous experiences [54], as well as to describe the gesture, especially the beginning and the end of it. The benefit here is two-fold: it helps the observer to understand the gesture delimiters [68] and it gives valuable insight into the mental models of users [69]. In our survey, 22 out of 47 (46.8%) studies implemented a think-aloud technique.

Wizard of Oz
The Wizard of Oz (WOz) method [70,71] has been claimed to be a useful user-centred inquiry approach for research on novel interfaces, when the design space is unknown or under investigation [69]. In the traditional WOz method, the participant has the impression that she is directly communicating with the system. In fact, the system's responses are produced by an expert (the Wizard) who, in most cases, is hidden. The main process of the WOz method is, first, to inform the participant about the task to be done, then to allow her to start gesturing with the gestures that she prefers (with no expert intervention) and then to present the effect of the gesture, giving the impression that the interaction is direct.
However, in the studies reviewed, although the term Wizard of Oz was frequently referenced, the approach adopted was not the traditional one but rather a "reversed Wizard of Oz" process. In most studies, the effect (referent) was first presented to the participant, who was then prompted to suggest an appropriate gesture. Moreover, the expert was not hidden from the participant, suggesting that there was no significant interest in giving users the illusion of direct human-to-system communication.
In general, the Wizard of Oz technique that was employed in most studies (45 out of 47, 95.7%) was actually about having an expert control the system and present the referent (to help participants better understand the task in question) and then ask the user to suggest a gesture.

On Gesture "Appropriateness"
The main goal of an elicitation study is to elicit appropriate gestures for mid-air interactions, but how is appropriateness interpreted in elicitation studies of mid-air interaction?
A deeper look into the meaning of "appropriateness" in the papers examined (Table 5) reveals that a considerable number of studies (14 out of 47, 29.8%) investigate a gesture set that is a "better match" or "fit for purpose" for its intended use (without analysing this into a more specific meaning). For example, elicitation studies were conducted to elicit appropriate gesture sets for controlling a media player [54] or for 3D travelling within a pseudo-universe [59], with the aim of improving the user interaction and experience in general. Many elicitation studies focus on finding gestures that are easy to perform (12 out of 47, 25.5%). For example, Ruiz et al. [4] were interested in gaining user insights about the ease of gesture application, after asking participants to repeat each gesture five times.
A similar number of studies (11 out of 47, 23.4%) investigate whether gestures are intuitive or natural, in the sense that they "enable users to use the interface with little or no instructions" [72]. For instance, Jahani et al. [52] highlighted the importance of finding a set of mid-air gestures for in-vehicle secondary tasks that are natural and intuitive to drivers, in order to avoid increasing their cognitive load.
Another considerable number of studies (7/47, 14.9%) examined the memorability of proposed gestures which reflects on "how easy users can recall the gesture set after some time of inactivity" [72]. For example, Bostan et al. [36] investigated hand-specific on-skin gestures, showing that intuitive gestures were easily memorable, while Kühnel et al. [7], investigated the correlation of memorability with gesture suitability and perceived effort.
Most of the studies reviewed are driven by at least one dimension of appropriateness. These dimensions may be regarded as falling into two more general categories: the mental model of the users (i.e., memorability, intuitiveness, discoverability, learnability and guessability) and the ergonomic characteristics of the gestures (ease of application, fatigue, simplicity, comfort, number of hand/body parts and concurrent gestures). Each dimension of gestural interaction is investigated with various empirical methods and techniques as shown in Section 3.6.

Participants: Number and Profile
In a user-centred approach, it is important to carefully select the participants, so they represent the user population adequately and appropriately. Participant selection (or recruitment) must address the questions of "how many participants are enough?" (sample size) and "what participant profiles are representative of the population?" The latter is based on qualitative criteria or possible previous analysis (i.e., user segmentation, personas, etc.). The number of participants varied considerably (Table 6): two studies employed only four participants [47,55] and another one employed 9 participants [26], while four studies employed from 35 to 89 users [22,29,53,57]. All other studies recruited between 10 and 30 participants. Generally, there was not much discussion about the required number of participants employed in gesture elicitation. However, the validity of the outcomes and recommendations of elicitation studies is significantly affected by the number of users employed.
Regarding the participant profile (Table 7), we saw that about one-half of participant groups were drawn from the academic environment, that is, they were either students (23.4%) or a mix of students and researchers (academic staff, 23.4%). These are often more accessible than other types of users and they often volunteer to participate in user-centred activities, to learn about the method or the technology, especially if they are rewarded. However, they may not always be representative of the user population; unfortunately, there was not much discussion about the representativeness of participants employed in gesture elicitation studies. The other half of participant groups (48.9%) were adult users of various characteristics. In general, participants were aged between 18 and 60 years old. Only in two works was the age range of the participants wider, including elderly people [37] as well as a few children (an in-the-wild study) [41].
Regarding prior experience with mid-air gesture interaction, 17 of the 47 (36.2%) studies employed users with mixed experience with the technology examined. Another 9 studies (19.1%) exclusively employed experienced participants and 7 studies (14.9%) exclusively non-experienced participants, depending on whether they wished to control the legacy bias effect. Notably, 14 (29.8%) studies did not provide sufficient information about participants' prior experience. Finally, 8 out of 47 studies (17%) applied gender balancing within the participant group.

User Evaluation of Proposed Gestures
An important aspect of an elicitation study is the user evaluation of proposed gestures according to a possible number of criteria, scales, as well as qualitative comments and remarks (Table 8).
Most elicitation studies (42 out of 47, 89.4%) include one or more user evaluation methods for gesture proposals (23 of the 42 include two or three methods). Generally, user evaluation methods may occur during the production of gestures (concurrently), and/or after the production of a single gesture ("post-task methods" of Table 8) and/or after the end of production of all gestures ("post-test methods").
More than one-third of the elicitation studies examined involve concurrent user evaluation. This was largely conducted with the think-aloud protocol (46.8%), in which participants are encouraged to speak out their thoughts, feelings and opinions. A single study [44] placed users in pairs to discuss their gesture productions and proposals, therefore adopting a co-discovery learning approach; this is an interesting approach since researchers can concurrently observe the rationale of participant proposals, while participants themselves may be stimulated to elaborate further during the production of a gesture.
More than half of the studies conduct user evaluations just after the production of a single gesture, repeatedly (post-task). In particular, 25/47 (53.2%) studies adopt a post-task rating scale for the gesture produced; this includes one or more questions about the appropriateness of the gesture from the user perspective (e.g., ease of use, fit for the task, etc.). An additional pair of studies (2/47) adopt post-task (short) interviews instead. One study adopted a post-test memorability test. Another significant number of studies (19/47, 40.4%) conduct user evaluation at the end of the elicitation procedure with various methods. Post-test interviews were employed in 21.3% of elicitation studies. A questionnaire was adopted in 14.9% of studies examined, which was often wider in scope than Likert scales, including questions about user opinions that extend beyond gesture production issues. Post-test surveys were adopted in a couple of studies [52]; these were online, confirmatory in nature and involved participants other than those of the elicitation study.
Finally, there are two studies [49,54] that proceeded to the technical development of gestures in a prototype and conducted usability evaluations of alternate gesture sets. These usability tests included typical usability metrics (time to task, errors) as well as measures of perceived usability (questionnaires) and fatigue.

Data Analysis and Metrics
The data analysis is conducted by the researchers after user participation has ended. It involves gathering, organizing and coding the data recorded during the user-elicitation study, applying various metrics to extract the most appropriate gesture set and, in many cases, categorizing gestures into taxonomies.

Metrics
A large number of studies (Table 9, 34 out of 47, 72.3%) utilized at least one metric to extract the gestures that best match a specific referent, or to understand the conceptual complexity of a referent. The most common metric (30 out of 34, 88.2%), especially for those studies that follow Wobbrock's guessability study [65], is the "Level of agreement", which shows the level of consensus among participants for each referent. The Level of agreement was initially introduced by Wobbrock in 2005 [62] and was refined by Vatavu and Wobbrock in 2015 [74] into the "Agreement Rate". Even though the Agreement Rate was presented as an improved version of the Level of agreement, only about half of the papers published after 2015 (8 out of 15, 53.3%) have utilized it [24,28,30,31,38–40,44].
Apart from Level-of-Agreement and Agreement-Rate (30 out of 47, 63.8%), the next most frequent metrics were time-related (10 out of 47, 21.3%): "Time-of-Thinking" and "Time-of-Gesture-articulation". Time-of-thinking is the time a participant needs, after the referent has been presented, to think before proposing a gesture. Hoff et al. [51] utilized gesture thinking time to examine whether priming has a positive or negative effect on gesture proposal. Dim et al. [37] also used thinking time as an indicator of how easily blind people can imagine their gestures. Kühnel et al. [7] examined the correlation of referents' conceptual complexity with thinking time, showing that the longer the time needed, the higher the referent's complexity. Gesture articulation time is the time the user needs to perform a gesture. According to the findings of Kühnel et al. [7], gesture articulation time negatively affects ratings of ease of performance.
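To make the two consensus metrics concrete, the following minimal sketch computes both for a single referent. It assumes the commonly cited definitions: the original agreement score A(r) = sum over i of (|P_i| / |P|)^2, and the revised agreement rate AR(r) = |P| / (|P| - 1) * A(r) - 1 / (|P| - 1), where P is the set of proposals for referent r and each P_i is a group of identical proposals. The function names and gesture labels are illustrative and are not taken from the reviewed studies; deciding which proposals count as "identical" is assumed to have been done by the researchers during coding.

```python
from collections import Counter


def agreement_score(proposals):
    """Original agreement score for one referent (assumed 2005 definition).

    `proposals` holds one coded gesture label per participant.
    """
    n = len(proposals)
    if n == 0:
        raise ValueError("at least one proposal is required")
    # Group identical proposals and sum the squared group proportions.
    group_sizes = Counter(proposals).values()
    return sum((size / n) ** 2 for size in group_sizes)


def agreement_rate(proposals):
    """Revised agreement rate for one referent (assumed 2015 definition)."""
    n = len(proposals)
    if n < 2:
        raise ValueError("at least two proposals are required")
    # Rescales the score so that total disagreement yields exactly 0.
    return (n / (n - 1)) * agreement_score(proposals) - 1 / (n - 1)


# Hypothetical referent: 20 participants, 15 propose "swipe", 5 propose "point".
example = ["swipe"] * 15 + ["point"] * 5
print(round(agreement_score(example), 3))  # 0.625
print(round(agreement_rate(example), 3))   # 0.605
```

Under these definitions the agreement rate is 1 when all participants propose the same gesture and 0 when every proposal is different, whereas the original score never reaches 0 for a non-empty set of proposals.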

Gesture Taxonomy
Almost half of the papers reviewed (22 out of 47, 46.8%) analysed various characteristics of the proposed gestures by classifying them into different categories, called taxonomies. Taxonomies help researchers to gain insights into the mental model of users [65] and guide designers in understanding the types of gestures that are appropriate for various referents [40].
The most frequent taxonomy type (Table 10, 13 out of 47, 27.7%) is the Nature of the gesture, which denotes the relationship between gesture and meaning/object [37] and has various dimensions such as symbolic, metaphorical, abstract, physical and deictic. Symbolic gestures are depictions of symbols, such as drawing a question mark in the air. Metaphorical gestures are linked to their meanings (not to their visual similarities), while abstract gestures map the interactive task arbitrarily. Physical gestures directly manipulate the content/object, such as scaling or rotating an object. Deictic gestures usually involve a stretched index finger, a palm or multiple fingers to indicate objects and directions. The number of hands (unimanual or bimanual) or body parts involved in a gesture was another frequent categorization (12 of the 22 studies with taxonomies, 25.5% of all 47). A similar number of studies (11 of the 22, 23.4% of all 47) also classified gestures in a taxonomy type called Form, which includes Static gestures (postures that do not vary over time), Dynamic gestures (gestures that involve body or hand movement) and Static gestures with path (gestures that involve hand movement while the hand pose remains the same). Gesture flow was another frequently used taxonomy (9 of the 22, 19.1% of all 47), which distinguishes gestures into continuous and discrete, describing whether the referent occurs during the gesture or after it, respectively.
In a few studies, there has been an attempt to derive design conclusions by contrasting the results of the metrics analysis with the taxonomy of the gesture vocabulary. For example, physical gestures were found to have higher agreement rates, while abstract gestures require a longer time to propose and articulate [7]. Although gesture taxonomization helps researchers to better understand the users' mental model, it was often the case that this categorization did not yield design guidance. Therefore, it appears that there is a need for more practical guidelines in this respect, for example regarding the number of body parts/joints employed for a gesture and respective estimations of stress or fatigue.

Discussion
In this section, we discuss our findings with respect to the current and future practice of conducting elicitation studies for mid-air interaction.

Variability of Application Domains and Systems' Technological Maturity
Our survey has identified that the gesture elicitation method is evolving into a standard practice within user-centred approaches to mid-air interaction design. Its adoption by researchers is growing in various mid-air interaction scenarios involving interactive technologies that range from TV control to interaction with smartwatches or drones, and it may also concern applications or services provided through these technologies.
It is interesting and fruitful that a user-centred design method is applied to novel interaction contexts and can yield suggestions about design directions. However, it is equally important to capitalize on the results of elicitation studies in order to identify some prevailing gestures or gesture sets for basic mid-air interactions in any of these domains. This requires careful reviews and assessments of the content of design suggestions for particular interactive technologies or contexts of use. For example, a systematic survey of mid-air hand gestures with interactive surfaces and displays is presented in [75]. More content-based surveys of this sort can help the community to summarize and reflect on previous results and provide input and inspiration for analogous design contexts.
This survey identifies that there is considerable variability of the technological design of the system under investigation. In many studies, there is not a prototype or system at hand but the referents are presented verbally or with cards (about user commands), which presupposes that the users have a good mental model about the referent. Other studies present prototypes to users or apply the WOz approach, which offers an orientation of the technological context but might 'contaminate' the gesture production process with legacy bias. Of course, the form of the referent mediates the response (production of gestures) and therefore the validity and quality of results. Further research is required to identify the ways by which the form and media of the stimuli affect the responses in elicitation studies.

On the Steps and Process of an Elicitation Study
Analysis of the elicitation studies conducted in the papers reviewed revealed some similarities in the process itself and the steps taken, as well as some variations. Three approaches were evident in the studies reviewed: the methods proposed by Wobbrock and by Nielsen, and choice-based elicitation.
In particular, Nielsen's methodology includes a pre-development step called "Find the functions" [15], in which users suggest the functions needed by the application with the help of scenarios. In the next step (Collect the gestures), a human-to-human non-verbal communication approach is adopted, where users, with the help of scenarios, are given the functions and are asked to find the matching gestures. A user evaluation of the proposed gestures in terms of memorability, guessability and stress is conducted during the final step.
Wobbrock's "Guessability Study" includes two stages. The purpose of the first stage is to collect gestures from users by showing them the referent (the effect of a gesture) while asking them to propose appropriate gestures. The second stage involves analysing the data collected using metrics to measure the consensus level among users for each referent and is conducted by the researchers.
A choice-based elicitation study can be considered an enhanced variation of the guessability study. It consists of four stages, the first two of which are similar to the guessability study. The difference in this method is that a second round of investigation is conducted for those referents that scored a low consensus level. In the third stage, experts create a list of gestures that are more appropriate for each of these low-consensus referents. In the fourth stage, that predefined list is presented to the users, who select the gesture that best matches the corresponding referent. In a choice-based elicitation study, user participation is necessary during the first and the last stage.
In general, the design and conduct of elicitation studies vary with the assumptions and research goals related to the envisaged contexts of use. It seems that the strong points of Wobbrock's approach are that it is a simple and practical procedure accompanied by solid mathematical groundwork on how to analyse gestures before committing them to the gesture set. Nielsen's approach focusses on user evaluation of the resulting gesture set, considering several criteria beyond guessability; in this method, user participation is essential in all the steps of the study [69]. Last but not least, the choice-based elicitation method can provide further validation of intermediate gesture proposals by conducting surveys with additional user groups.
All three variations of the elicitation method are designed to be conducted in a controlled environment, like a classroom or a computer lab, and with an instrumental (task-based) procedure. Of course, controlled studies have particular advantages for research, but they inherently ignore contextual factors like the presence of other people and the variability of environmental conditions (e.g., lighting, noise, etc.), and they may bypass important aspects of gesture appropriateness like gesture variability [76], which can be measured and captured automatically. There is recent work on gesture elicitation in more authentic contexts ("in-the-wild"), without a task-based procedure, such as [77]. Further work can compare the results of gesture elicitation between lab and field studies, as well as adapt the elicitation method to in-the-wild settings. This must also consider recent developments of agreement metrics in between-subjects elicitation studies [74].
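The hand-over from the consensus analysis to the second round can be sketched as a simple filter over per-referent consensus values. This is only an illustration: the function name, the referent labels and the cut-off of 0.3 are assumptions made for the example, since the review does not state how "low consensus" is operationalized in the individual studies.

```python
def referents_for_second_round(consensus_by_referent, threshold=0.3):
    """Return the referents whose consensus is below `threshold`, lowest first.

    `consensus_by_referent` maps a referent name to its consensus value
    (for instance an agreement rate between 0 and 1). The default threshold
    is a hypothetical choice, not a value prescribed by the method.
    """
    low = [
        (value, referent)
        for referent, value in consensus_by_referent.items()
        if value < threshold
    ]
    # Sort by consensus value so the most contested referents come first.
    return [referent for value, referent in sorted(low)]


# Hypothetical consensus values from the first (open) elicitation round.
first_round = {
    "next channel": 0.62,
    "mute": 0.12,
    "open menu": 0.25,
    "volume up": 0.48,
}
print(referents_for_second_round(first_round))  # ['mute', 'open menu']
```

In this reading, only the returned referents would be passed to the experts who prepare the predefined gesture lists for the final selection stage.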

On the Number and Profile of Participants in Gesture Elicitation Studies
Throughout our survey, we found little discussion of the criteria for participant selection and recruitment in elicitation studies. In many studies, we saw relatively homogeneous groups of users, like students or researchers, despite the wide appeal of the application domains examined. This may harm the validity of design suggestions, since it is plausible to assume that agreement scores of gesture proposals would differ (significantly) between diverse user groups of mid-air interaction applications. In a similar vein, there was not much discussion about the number of participants required in elicitation studies.
Lessons from other user-centred methods like usability and card-sorting studies indicate that there is no "magic number" of minimum users for every study, but that a reasonably small number of carefully selected participants can yield wide-in-scope results and recommendations in particular contexts, possibly within a user-centred approach that includes repeated studies. More specifically, there has been a long-standing discussion about the minimum number of users in usability tests, with more recent opinions agreeing that "if you are interested in identifying major usability issues as part of an iterative design process, you can get useful feedback from three or four representative participants . . . as the design gets closer to completion, you need more participants" [78]. In the context of card-sorting studies, according to [79], "reasonable structures are obtained from 20-30 participants." Therefore, further research is required to provide methodological guidance on how many users are needed for an elicitation study, which can help designers and practitioners to better validate the results of their studies.

On the Dimensions of Gesture Appropriateness, User Evaluation and Data Analysis
Our survey identifies a number of dimensions considered by researchers in search of appropriate gestures for mid-air interaction, the most frequent being "ease of application", "intuitive" and "fit for purpose". Some studies are more exploratory, attempting to identify these dimensions during the elicitation process, for example, with the think-aloud protocol. In most studies, researchers make use of self-developed Likert scales about those dimensions (either post-task or post-test). In a few studies, standardized questionnaires have been employed, like the NASA-TLX (perceived mental and physical effort) [80]. Thus, the perceived appropriateness of a gesture is a source of differentiation in elicitation studies. Further work is needed towards the proposal and validation of an instrument for post-task user assessment of the dimensions of gesture appropriateness. Additionally, these user estimations of the dimensions of gesture appropriateness might be taken into account in the calculation of agreement scores.
As expected, the level of agreement on gesture proposals is the metric most often used in elicitation studies. However, we have identified that some studies attempt to go further than agreement scores and assess gesture appropriateness on other grounds, like measured usability [54] and measured physiological risk [24]. In addition, other measures of fatigue may be integrated into elicitation studies, like consumed endurance [81] and the distance travelled by the hands [82], which can be calculated automatically provided that an interactive system with gesture sensing capabilities is in place. Thus, another area of further work is to validate the results of gesture elicitation with technical tests of measured usability and fatigue. There are some works in this respect that need to be combined with elicitation studies, like the work on consumed endurance [81].
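As an illustration of how a distance-based measure could be computed automatically from sensor data, the sketch below sums the straight-line distances between consecutive tracked hand positions. It assumes that the sensor delivers an ordered sequence of 3D hand coordinates in a common unit (for example metres) and that the hand distance mentioned above refers to this path length; it is not the procedure of [82], and the function name is hypothetical. Consumed endurance [81] relies on a biomechanical model and is not covered here.

```python
import math


def hand_path_length(positions):
    """Total distance travelled by a tracked hand.

    `positions` is an ordered sequence of (x, y, z) samples from the sensor.
    The result is the sum of Euclidean distances between consecutive samples,
    in the same unit as the input coordinates.
    """
    return sum(
        math.dist(previous, current)
        for previous, current in zip(positions, positions[1:])
    )


# Hypothetical samples (in metres) of one hand during a single gesture.
samples = [(0.0, 0.0, 0.0), (0.3, 0.4, 0.0), (0.3, 0.4, 1.2)]
print(round(hand_path_length(samples), 3))  # 1.7
```

Such a value could be logged per gesture proposal and compared across candidate gestures for the same referent, alongside agreement scores.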

On the Results of Elicitation Studies: Implications for Design
The typical outcome of an elicitation study is a (set of) gesture(s) for each referent (operation or user command), based on user agreement rates (or other metrics). Many elicitation studies have produced tables with listings of gestures for referents in the aforementioned domains of mid-air interaction. We need to ask ourselves whether this is sufficient information for a designer or a developer to carry out detailed design and system implementation.
For example, a short description of a gesture (e.g., "swipe") does not specify important details of the gesture, such as whether it is performed with the fingers or the hand, which human joints participate in the gesture and need to be monitored by the sensor, what the duration of the gesture is and so on. A few studies have identified such factors, like the work of Riener et al. [13], who investigate the interaction space, that is, the physical 3D space that is available or preferable for a driver to apply mid-air gestures for secondary driving tasks. Therefore, an area of further work is to develop a protocol for reporting results of mid-air elicitation studies that specifies detailed design information.
Another related issue is whether the gestures users prefer at design time are indeed the most usable at the end of the process. Some studies indicate otherwise, such as Koutsabasis and Domouzis [54], who proceed to an implementation of alternate mid-air gestures (produced from an elicitation study) and test their usability with many metrics: task time, errors, perceived usability, perceived effort and so forth. They conclude that the most usable and most preferred gesture for the manipulation of image collections (hand sideways extension) was different from the gesture originally preferred in the user elicitation study (swipe). Notably, the same users participated in both studies. This is an interesting result, which indicates that the context in which users produce gesture proposals is very important and should be carefully prepared so that it is realistic. Furthermore, although Morris et al. [83] have shown the benefits of elicited gestures over designed gestures, additional factors (notably those related to performance, actual fatigue and usability) affect user acceptance at the system implementation level. Investigating these factors at design time is a big challenge for future elicitation studies.

Limitations of This Survey
The aim of this paper was to provide a review of elicitation studies in mid-air interaction design. As with any survey, our approach has limitations. Our survey is a process-based analysis, focusing on the constituent elements and phases of the application of the method, with a breadth of studies examined in terms of their domain of application. As a consequence, the discussion of the content and results (gesture proposals) of elicitation studies has been brief.
In addition, we have reviewed a sample of elicitation studies that was determined by the query method employed and the selection criteria, which have inevitably constrained the sample in ways that may not be easily assessed at the time of writing. For example, one of the criteria for the selection of papers was to focus on full papers, which might have left some high-quality short papers outside the corpus examined.
This review is limited to mid-air gesture elicitation studies alone. These are a large corpus of elicitation studies but there are also other gesture types, such as (multi-)touch, whole-body, user-defined gesture input for wearables, that have been investigated with this methodology. We did not broaden the scope of our review to these domains for reasons of motivation and also because this would lead to an inflation of surveys. Furthermore, most of these studies rest on referents rather than on working systems or prototypes.

Conclusions
This paper presented a survey of elicitation studies in mid-air interaction design. The survey is systematic in the sense that it followed an analytical approach to the selection and examination of related papers. It is critical to the extent that it discusses several issues and possible shortcomings of the elicitation studies identified and points out a number of directions for further work. We envisage that this survey can contribute to a better understanding of elicitation studies in current and future mid-air interaction scenarios and applications. We also hope that researchers and practitioners in mid-air interaction design will be stimulated by the facts and ideas presented in this survey to reflect on the issues identified, enrich their knowledge of the state of the art in conducting elicitation studies and possibly re-think and improve their own work and practice.