Encoding Variables, Evaluation Criteria and Evaluation Methods for Data Physicalizations: A Review

Data Physicalization focuses on understanding how physical representations of data can support communication, learning and problem-solving. As an emerging area, Data Physicalization research needs conceptual foundations to support thinking about and designing new physical representations of data. Yet, it remains unclear at the moment (i) what encoding variables are at the designer's disposal during the creation of physicalizations, (ii) what evaluation criteria could be useful, and (iii) what methods can be used to evaluate physicalizations. This article addresses these three questions through a narrative review and a systematic review. The narrative review draws on the literature from Information Visualization, HCI and Cartography to provide a holistic view of encoding variables for data. The systematic review looks closely into the evaluation criteria and methods that can be used to evaluate data physicalizations. Both reviews offer a conceptual framework for researchers and designers interested in designing and studying data physicalizations.


Introduction
Data physicalisation, or the physical visualisation of data, focuses on representing data using geometric or material properties of physical media [1]. While data visualisations primarily focus on the sense of vision and on creating data representations that can be "seen", data physicalisations have the potential to create data representations that can not only be seen, but also touched, smelled, heard, or tasted. Thus, they enable new ways of interacting with data and multisensory data experiences [2]. They are becoming a means for narrowing the gap between people and data and have shown cognitive benefits that come with their tangible nature (e.g., supporting self-reflection, attention, and access to data) [3][4][5][6][7][8]. Furthermore, data physicalisations have the potential to reach audiences (for example, people with disabilities) that are difficult to reach with traditional data visualisations. Although physical representations of data have existed for many years, data physicalisation is still emerging as a research area [1]. In recent years, there has been a growing interest in establishing theoretical and design foundations for data physicalisation. For example, design principles and guidelines for physicalizing data have started to emerge (e.g., Hogan and Hornecker [9], Sauvé et al. [10], Bae et al. [11], Sosa et al. [12], Hogan [13], Willett et al. [14]). Yet, the development of design guidelines and theoretical foundations for data physicalisation research still has a long way to go. In this research, we aim to address two important dimensions of data physicalisation research: encoding variables and evaluation.
Why encoding variables and evaluation: Encoding variables (i.e., the properties of the material used to encode data) are a key design dimension of any data communication activity. This is especially important for designing multisensory and immersive data experiences. Design guidelines for data visualisations assume non-disabled populations; thus, the resulting visualisations inhibit people with intellectual and developmental disabilities (IDD) from accessing and effectively engaging with data [15]. Understanding encoding variables that are perceivable through different human sensory channels is important for designing inclusive and accessible data representations for these user groups. Although visual variables have been well explored and are established in visualisation research, a shared understanding and a common vocabulary for variables of other perceptual modalities (beyond the sense of vision) still need to be established, especially to develop guidelines for designing multisensory and immersive data experiences [9,13,16,17]. There have been some efforts in the past to provide a partial inventory of encoding variables and develop a grammar about experiential encoding variables (e.g., [18][19][20][21][22]). However, a compilation of encoding variables for all human sensory modalities and their usage is not available yet. Thus, finding out what multisensory encoding variables are available and which ones are practically used in data physicalisations is useful for filling this research gap.
In addition, evaluation is important for researchers to assess the quality and impact of their data physicalisations (for example, to ensure that users perceive the data embedded in physical representations, and to understand what short-term and long-term impact the physicalisations could have on people). Methods and criteria for evaluating data visualisations, especially their ability to support the effective and efficient analysis and discovery of information, are well established. Physicalisations differ substantially from visualisations, for example, in their ability to engage people, spark interest, trigger interaction, and stimulate emotions. Therefore, new criteria for describing and evaluating the value of physicalisations are emerging. For example, Wang et al. [23] introduced a model for describing the value of a physicalisation based on its creativity and its ability to engage beyond the raw information content (affective, physical, intellectual, and social engagement). However, the evaluation methods and criteria that are currently used, as well as the aspects that are evaluated, are not sufficiently known. For example, there is no common knowledge about what evaluation methods are available and used for evaluating aspects related to information discovery/analysis, hedonic aspects, or more open-ended aspects (for example, behavioural stimulation or initiating a social dialogue). This paper tries to fill these two research gaps by seeking answers to the following research questions (RQs):
• RQ1: Which encoding variables can be used to create data physicalisations?
• RQ2: Which evaluation criteria are relevant to the study of data physicalisations?
• RQ3: Which evaluation methods are relevant to the study of data physicalisations?
Method and contributions: Methodically, these questions are examined through two complementary reviews. Encoding variables (also called 'perceptual variables' [16]) are mentioned at different places in the literature, often using different terms to refer to the same notion, and the same term to refer to different notions, due to the interdisciplinary nature of data physicalisation research. For this reason, a systematic review is not appropriate to answer RQ1. Instead, a narrative review holds more potential to cover the breadth of ideas originating from the overlapping fields with data physicalisation research. A narrative review (see e.g., [24,25]) identifies potentially relevant research that has implications for a topic and synthesizes these using meta-narratives. This narrative review uses knowledge from the Visualisation, HCI and Cartography literature and yields a synthesis of the scattered literature on encoding variables into a coherent framework (Contribution 1). The Cartography literature was included in the narrative review because the two fields of Information Visualisation and Cartography (i) share an object of study (i.e., maps), and (ii) there is evidence of their mutual interplay. Most importantly, the use of nonvisual modalities to communicate geographic information has been extensively studied by cartographers (e.g., for the design of tactile [26] and sonic [27] maps), and some of the insights in that context can benefit data physicalisation research. Answering RQ2 and RQ3 is done through a systematic review of papers published between 2009 and 2022. The systematic review helps to learn about evaluation criteria and evaluation methods relevant to data physicalisation research (Contribution 2).

Existing Design Spaces for Physicalisations
Previous work has suggested several design spaces/concepts that describe the dimensions that characterize data physicalisations. These design dimensions can be used to guide the design and evaluation of data physicalisations. We first analysed these design spaces to understand the extent to which they covered the two aspects of our focus: encoding variables and evaluation. Although each individual design framework uses different terms, they cover a total of 13 different design dimensions, as outlined in Table 1. The exact terms used by each design framework are summarised in Appendix A and Table A1. These distinct dimensions include Data, which describes the nature of data represented by the physicalisation; Audience, which refers to the type of the target audience; Representational Intent, which describes the purpose of the physicalisation (e.g., the effective and efficient discovery of information, evoking specific feelings, and initiating social dialogue); Representational Material, which refers to the material used for the physicalisation; Sensory Modality, which refers to the human sensory channel used to perceive data; Encoding Variables, which describes the physical variables used to encode data; Representational Fidelity, which describes the metaphorical relationship between data and the materials used to encode data; Interaction, which describes the type and the nature of interactions that the physicalisation allows; Proximity to the Data Referent, which refers to the degree of embodiment (i.e., proximity/situatedness) of the data physicalisation with respect to the data it represents (data referent); Proximity to the User, which describes the degree of embodiment (proximity/situatedness) of the data physicalisation with respect to the user/user's environment; Physical Setup, which details the distribution of components of the physicalisation (i.e., the physical setup of components); Mobility, which indicates whether the physicalisation is bound to a specific location or not; and Narrative Formulation, which describes how data physicalisation facilitates the discovery of information through its external physical form and through any interactive affordances it provides [28].
As Table 1 shows, existing design frameworks, especially the full-scale design spaces for data physicalisation that were introduced recently (e.g., [10,11]), identify Encoding Variables as a key design dimension. Nonetheless, they leave it under-specified. In particular, the encoding variables that are available for each perceptual modality and how these variables have been practically used in existing data physicalisations are not fully discussed. Furthermore, none of the frameworks covers evaluation aspects (see Table 1). That is, the criteria to evaluate the merits of a physicalisation and the methods that can be used to evaluate it remain largely unexplored. Previous research on data physicalisation (e.g., [1,13]) also identifies these two aspects as important and in need of further exploration and detail. The following sections address these two gaps through a narrative review and a systematic review.

Narrative Review: Encoding Variables for Physicalisations
Which encoding variables can be used to create physicalisations (RQ1)? The starting point for the review conducted to answer this question is the definition by Jansen et al. [1]: "A data physicalisation (or simply physicalisation) is a physical artefact whose geometry or material properties encode data". This definition suggests one important axis for data physicalisation research, namely, that of data encoding. The data encoding axis is referred to in the literature as the representation dimension (see [32,33]). Representation happens through a representational medium, i.e., an artefact that is used to encode and store information. Representational media make use of one or more representational materials and have different information channels (i.e., "perceptual aspect of some medium which can be used to carry information" [34]). Colour, shape, and orientation are examples of information channels. Information channels are manipulated through one or more variables: visual variables (properties of visual information channels), haptic variables (properties of haptic information channels), olfactory variables (properties of olfactory information channels), and so on.
The choice of a material inevitably restricts the space of possibilities regarding the encoding variables. For instance, the choice of sound as the material for a scenario precludes the use of visual variables to encode information for that scenario. The number of materials that can be used to encode data is potentially infinite. Hogan and Hornecker [9] have provided 37 examples (e.g., glass, water, bread, electronic motors, infrared light, and many more) based on a review of 154 physicalisations. Once a material is chosen, several variables (physically, these can include five variable types related to sensory channels, and one variable type related to change) are at the designer's disposal. These are briefly reviewed below and summarised in Table 2. Most definitions for the variables intentionally start with 'variations/changes of...' to stress that a property by itself is not a variable; it is used as a variable when changes in this property communicate information. The number of potential variables is given in brackets next to each variable type.
Physical variables (∞): Physical variables are variations in material properties that are used to encode information. An exhaustive listing of these variables is still an area of ongoing research, but a few candidates were brought forth in previous work. Hence, the number of physical variables is initialized to infinity for now. Jansen et al. [1] mentioned smoothness, hardness (called compliance in [19]) and sponginess as examples of physical variables. Additional examples include viscosity [35], permeability [35], slipperiness [19], weight [19,20], reflectance [20], density [20], thermal diffusivity [20], stiffness [20], pyrotechnic color [20], tensile strength [20], electrical resistance [20], and thermal expansion [20]. An important remark about physical variables is that, while the encoding activity (i.e., what [29] calls data mapping) is done using material properties, the decoding can only be done using sensory information channels. Consider, for instance, viscosity. Though it is a property of the material, information encoded using it can be perceived through the haptic and the visual information channels.
Visual variables (13): Seven visual variables were originally proposed by [36]. These were extended to a list of 12 by [37], and recently synthesized in [38,39].
The following definitions are largely taken from [39]: visual size (variations in the length, area, volume or repetitions of a symbol); visual shape (variations in the appearance or form of a symbol); color hue (variations in the dominant wavelength of visible light, e.g., red, blue, and green); color value (light or dark variations of a single hue); color saturation (the intensity of a single hue); visual orientation (variations in the direction or angle of rotation of a symbol); (visual) pattern arrangement (variations in the distribution of individual marks that make up a symbol); (visual) pattern texture (variations in coarseness of the pattern within a symbol); transparency (variations in the blend level of a symbol and a background layer); crispness (variations in the sharpness of boundaries); resolution (variations in the level of detail at which the map symbol is displayed); and visual location (variations in the x, y position of a symbol relative to a frame of reference). Next to these 'atomic' visual variables, previous work has also pondered the question of 'composite' visual variables. MacEachren [37] proposed to consider 'pattern' as a higher-level visual variable consisting of units that have shape, size, orientation, texture, and arrangement. This is already reflected in the naming of the variables above (e.g., pattern arrangement, pattern texture). Caivano [40] proposed three dimensions of texture, namely, directionality (i.e., dimension that depends on the proportionality of units of texture), size (i.e., surface of the texturing element) and density (i.e., relation of the texturing elements to the background). This suggests that texture itself is a composite variable. Finally, Kraak et al. [41] mentioned 'numerousness' (arrangement combined with size) as a composite variable used in dot density maps. Hence, numerousness was included as the 13th visual variable in Table 2.
Haptic variables (13): A few haptic variables were mentioned in [42]: actuator position, force-strength, vibration frequency, and surface texture. A similar list is found in [30]: force, position, vibration, texture, and temperature. An earlier, much more comprehensive set of haptic variables was proposed in [43], which decomposes haptic sensations into three categories of variables: those derived from touch (tactile), those derived from kinesthesia (kinesthetic), and those derived from visual analogues (i.e., variables that can be perceived by both vision and touch). Tactile sensations are perceived when the skin comes into contact with an object; kinesthetic sensations are stimulated by bodily movements and tensions. Based on these lists and the summary of [39], the following haptic variables can be mentioned. There are four tactile variables: vibration amplitude (also called force, see [44]), vibration frequency (also called flutter, see [39,43], or speed), pressure (changes in the perceived physical force exerted upon a surface or body), and temperature (changes in the perceived temperature of a surface). The perceived intensity of vibration patterns is a function of both their amplitude and frequency (see [45]). There are three kinesthetic variables: resistance (felt when attempting to deform a surface, e.g., pushing a button), friction (felt when the hand moves across or through a surface), and kinesthetic location (changes in the location of the hand in relation to the body).
Haptic variables derived from their visual analogues include the following: tangible size (changes in length, area, or volume), tangible elevation (changes in z locations), tangible shape (changes in form), tangible texture/grain (changes in patterns), tangible orientation (changes in alignment), and tangible location (changes in x,y locations). (Tangible location generalizes what [42] called the actuator position; tangible texture is synonymous with surface texture from [42]; the terms 'pressure' [43] and 'force-strength' [42] describe the same reality from different perspectives: the encoder uses the strength of the force to communicate data, and the decoder perceives a pressure.) The adjective 'tangible' is added to make clear that the information can be perceived by the haptic senses. For instance, a bar chart printed on a T-shirt [46] can be perceived by the eyes (visual size), but not through touch or kinesthesia. Thus, in that example, visual size is used to communicate information while tangible size is not.
Olfactory variables (5): Patnaik et al. [53] discussed how scent can be used to convey data, introducing olfactory marks and a few olfactory variables. Olfactory marks (i.e., glyph, bouquet, or burst) are analogous to visual marks (i.e., points, lines, or polygons) and refer to the most primitive blocks that can be used to encode scent. Attributes of these marks form the olfactory variables and include the following: the scent type (i.e., the signature of the mark), the direction of the mark (e.g., changes in the position in space where the scent originates), the saturation, a.k.a. chemo-intensity (changes in the concentration of odour molecules in the air), the airflow rate, a.k.a. kinetic intensity, the air quality (e.g., humidity, temperature, and other non-olfactory properties of the air that can be used to encode information), and the temporal pattern, a.k.a. scent animation. Since dynamic variables are discussed separately as an orthogonal dimension to all other variables, the temporal pattern is not listed here as an olfactory variable. For a related discussion on the olfactory design space, see [54]. The four key dimensions identified in [54] (chemical, emotional, spatial, and temporal) overlap to a great extent with the variables from [53].
Gustatory variables (2): How can the gustatory channel be used to encode information? This is a slightly different question from "which properties of food can be used to encode information?", which has been discussed, for example, in [55,56]. (For example, one may use the food's shape and colour to communicate information, as discussed in [55]; shape and colour are visual variables, not gustatory variables.) It is also different from the question "how do people describe taste sensations?", which was discussed in previous work (e.g., [57]). In that respect, two gustatory variables can be mentioned: the signature of the taste carrier (i.e., changes in the taste type) and the temperature of the taste carrier (e.g., hot or cold). Note that 'taste type' here does not only refer to the basic taste types mentioned in the literature (sweet, salty, sour, bitter, and umami, see e.g., [58]), but more broadly to any taste that can be uniquely distinguished from another one. For instance, nominal data values can be mapped to different, uniquely distinguishable tastes, while ordinal values can be mapped to distinguishable temperature levels (e.g., the hotter, the higher). In practice, taste variables are used in conjunction with other modalities (e.g., smell and sight) during food consumption, and there is documented evidence in the literature that inputs from other modalities (e.g., visual) affect gustatory perception [59][60][61].
Sonic variables (6): Sonification is the transformation of data relations into perceived relations in an acoustic signal for communication or interpretation purposes (see [62]). The following properties of sound mentioned in [39,63] can be used to this end: sound source location (variations in the perception of the placement of the sound's source in a two/three-dimensional space; that perception depends on the physical location of the sound source, the environmental acoustics, and the shape of the ear, see [64]), loudness (variations in the magnitude of the sound), pitch (variations in the frequency of the sound, i.e., highness or lowness), register (variations in the location of a pitch within a range of pitches), timbre (variations in the general prevailing characteristic or quality of the sound), and attack/decay (variations in the time needed by the sound to reach its maximum or minimum). Duration (variations in the length of time during which a sound or silence is heard), rate of change (variations in the relation between the duration of sound and silence over time), and order (variations in the sequence of sounds over time) were also mentioned in [39,63] as sonic variables, but are not included here, because these are dynamic variables discussed below. Finally, rhythmic patterns, mentioned for example in [30], can be used to encode information. Rhythms result from grouping separated sounds into periodic patterns (see e.g., [65,66]). In principle, they can be generated through a combination of other variables (e.g., pitch, timbre, duration, and order). Nonetheless, since they can be used on their own to communicate information (at least theoretically, nothing prevents the use of different patterns to communicate variations in the data), they are mentioned in the table. Rhythmic patterns are thus a composite sonic variable. Melodic patterns, which are the basic units for musification (see [67]), are an example of rhythmic patterns.
Dynamic variables (6): Representing change over time is a recurrent need during the creation of artefacts encoding information, and dynamic variables are useful to this end. Dynamic variables are helpful when designing animations and self-reconfigurable physicalisations. Discussions on dynamic variables in the literature have focused separately on visualisations [37,68,69,70,71], sound [39,63], and scent [53]. Nonetheless, given that dynamic variables are orthogonal to all other variables, an account that abstracts from the specifics of sensory modalities is needed. The review of previous descriptions led to the observation that a unifying notion across all modalities is currently missing. We propose the concept of the representational state (or state for short) to fill this gap. A representational state refers to the particular condition of a representation (i.e., visualisation, tactile/kinesthetic sensation, scent, taste sensation, and sound) at a given point in time. Then, similarly to other variables (visual variables are properties of visual marks, sonic variables are properties of sound, and so on), dynamic variables could be conceptualised as properties of representational states. Six variables inspired by the works mentioned above and illustrated in Figure 1 are relevant: perception time (variations in the moments in time, a.k.a. temporal locations, at which the user perceives representational states); temporal order (variations in the sequences in which the representational states are perceived); duration (variations in the temporal life of the representational states, i.e., how long a representational state is perceived); temporal frequency (variations in the temporal distances between representational states, i.e., how fast/slow new representational states are communicated to users; this variable can also be called 'rate of occurrences of representational states' or 'number of identifiable representational states per unit time'); rate of change (variations in the difference in magnitude of change per unit time for a sequence of representational states); and synchronization or phase correspondence (variations in the temporal correspondences of two or more time series). Synchronization is useful to highlight the potential relationship between two phenomena (see e.g., [37]).
Examples of encoding variables from papers of the systematic review. Physical: different types of material are used to represent the users' core academic interests (Yellow stands here for 'folding paper') and their additional research interests (Orange stands here for 'acrylic'). For the original figure, see [47]. Visual: the average effort of users during a running segment is encoded as the length of a pin on the board [48]. Haptic: indoor air quality data is encoded as vibration in the haptic probe from [49]. Sonic: the muscle tension of flutists is used to create live water sounds as they play their flutes [50]. Olfactory: the fan's speed is used to control the airflow rate [51]. Dynamic: the LED ring encircling the device fades in/out slowly or quickly to convey if the overall emotional experience of a participant is positive or negative [52].
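As an illustration only, the inventory reviewed in this section can be written down as a small data structure. The sketch below is not part of the original framework: the names `VARIABLE_COUNTS`, `HAPTIC_VARIABLES`, `count_variables`, and `finite_total` are hypothetical, while the counts and haptic variable names are taken from the text above. It checks that the three haptic categories add up to the 13 haptic variables stated in the heading.

```python
import math

# Number of encoding variables per type, as given in brackets in the text.
# Physical variables are open-ended, so their count is set to infinity.
VARIABLE_COUNTS = {
    "physical": math.inf,
    "visual": 13,
    "haptic": 13,
    "olfactory": 5,
    "gustatory": 2,
    "sonic": 6,
    "dynamic": 6,
}

# Haptic variables grouped by the three categories described in the text.
HAPTIC_VARIABLES = {
    "tactile": [
        "vibration amplitude", "vibration frequency", "pressure", "temperature",
    ],
    "kinesthetic": ["resistance", "friction", "kinesthetic location"],
    "visual analogue": [
        "tangible size", "tangible elevation", "tangible shape",
        "tangible texture/grain", "tangible orientation", "tangible location",
    ],
}


def count_variables(groups):
    """Return the total number of variables across all categories."""
    return sum(len(names) for names in groups.values())


def finite_total(counts):
    """Sum the counts of all variable types that have a finite count."""
    return sum(n for n in counts.values() if math.isfinite(n))


if __name__ == "__main__":
    print(count_variables(HAPTIC_VARIABLES))  # haptic variables listed above
    print(finite_total(VARIABLE_COUNTS))      # all non-physical variables
```

Keeping the physical count as infinity mirrors the text's point that an exhaustive list of material properties is still open.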

Systematic Review: Encoding Variables, Evaluation Criteria, and Methods
The systematic review presented in this section is an attempt to understand the evaluation criteria and methods used to assess data physicalisations (RQ2 and RQ3). In addition to the annotation of articles with evaluation criteria/methods, and since the encoding variables are a dimension related to representation (see Section 3), we have included all dimensions from Table 1 that touch on an aspect of data representation (as opposed to interaction) in the annotation process. This is useful to assess how representational dimensions relate to each other on the one hand and to the evaluation dimension on the other hand. Hence, the annotation focused on the following dimensions: data type represented, representational material, representational intent, representational fidelity, and the encoding variables. A by-product of the review is to learn about the completeness of the encoding variables derived from the current textbooks (mostly originating from work on visualisation). The remainder of this section describes the procedure used for the literature search, the screening criteria, the annotation of the papers, and the coding schemes.

Searching and Retrieving Publications
We employed an analytical approach using a representative sample of publications on empirical work on data physicalisations and followed a systematic procedure similar to previous CHI reviews [72][73][74][75][76]. We used the ACM Digital Library and Scopus as the scientific repositories for our search, as many of the publications related to data physicalisation are included in these outlets. We limited our search to articles written in English and published from 2009 to 2022. The search was carried out in February 2022. We used data physicalization and physical visualization as keywords for the search and we searched within article title, abstract, and author keywords. The following search queries were used:
Search query for the ACM full-text collection:
"query": Title:("data physicalization"; "physical visualization") OR Abstract:("data physicalization"; "physical visualization") OR Keyword:("data physicalization"; "physical visualization") "filter":
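The matching logic described above can be sketched locally. This is a minimal illustration under stated assumptions, not the repositories' own query engine: the `Record` structure and `matches_search` function are hypothetical, the two keyword phrases are treated as alternatives, and matching is a case-insensitive substring test on title, abstract, and author keywords, restricted to 2009 to 2022.

```python
from dataclasses import dataclass, field

# Keyword phrases from the search description.
KEYWORDS = ("data physicalization", "physical visualization")


@dataclass
class Record:
    """Hypothetical bibliographic record (not a real repository schema)."""
    title: str
    abstract: str = ""
    author_keywords: list = field(default_factory=list)
    year: int = 0


def matches_search(record, first_year=2009, last_year=2022):
    """Return True if the record falls in the year range and any keyword
    phrase occurs in its title, abstract, or author keywords."""
    if not first_year <= record.year <= last_year:
        return False
    fields = [record.title, record.abstract, " ".join(record.author_keywords)]
    # Assumption: case-insensitive substring matching of either phrase.
    return any(kw in text.lower() for text in fields for kw in KEYWORDS)


if __name__ == "__main__":
    example = Record(title="A Data Physicalization of Running Effort", year=2020)
    print(matches_search(example))
```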

Screening and Paper Selection
All retrieved articles were screened using the following inclusion/exclusion criteria:
• Criterion 1: Articles that were not original peer-reviewed full papers (e.g., late-breaking works, workshop papers, pictorials, posters, speeches, and doctoral consortium papers) were excluded, to ensure that the selected papers contained a complete, full-scale evaluation of a data physicalisation.
• Criterion 2: Only articles that discussed a data physicalisation artefact and empirically evaluated that physicalisation were selected. Therefore, publications that introduced frameworks, theories, processes, opinions, methodologies, concepts, and reviews, as well as publications that did not empirically evaluate a physicalisation, were excluded.
• Criterion 3: Articles that discussed augmented physicalisations (for example, [77]) were excluded from the analysis.
• Criterion 4: Articles that discussed a data physicalisation already covered in another article were removed, as our objective was to review different data physicalisation artefacts.
The removal of duplicates and the application of inclusion/exclusion criterion 1 resulted in 64 articles. The subsequent screening using criteria 2, 3, and 4 resulted in 36, 34, and 31 articles, respectively (Figure 3). Therefore, our final set of papers selected for the analysis consisted of 31 articles (n = 31). The majority of the selected publications were from CHI (n = 9), followed by TEI (n = 6), IEEE TVCG (Transactions on Visualisation and Computer Graphics) (n = 4), IEEE CG&A (Computer Graphics and Applications) (n = 3), DIS (n = 3), NordiCHI (n = 2), and AI & Society, Elsevier C&G (Computers and Graphics), ASSETS and SVR (n = 1 each). These 31 articles included 50 different data physicalisation artefacts (some papers contributed more than one data physicalisation, see Table 6). For example, [78] contains four physicalisations that used four different physical modalities (light, vibration, movement, and air) to encode data.
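The screening numbers above can be checked with a few lines of arithmetic. The sketch below uses only the figures reported in this section; the names `remaining_after_criterion`, `excluded_per_step`, and `venue_counts` are hypothetical helpers introduced for illustration.

```python
# Articles remaining after each screening criterion, as reported in the text.
remaining_after_criterion = {1: 64, 2: 36, 3: 34, 4: 31}

# Selected articles per venue, as reported in the text.
venue_counts = {
    "CHI": 9, "TEI": 6, "IEEE TVCG": 4, "IEEE CG&A": 3, "DIS": 3,
    "NordiCHI": 2, "AI & Society": 1, "Elsevier C&G": 1, "ASSETS": 1, "SVR": 1,
}


def excluded_per_step(remaining):
    """Return how many articles each criterion removed, relative to the
    set that survived the previous criterion."""
    steps = sorted(remaining)
    return {
        cur: remaining[prev] - remaining[cur]
        for prev, cur in zip(steps, steps[1:])
    }


if __name__ == "__main__":
    print(excluded_per_step(remaining_after_criterion))
    print(sum(venue_counts.values()))
```

Under these figures, criteria 2, 3, and 4 removed 28, 2, and 3 articles, respectively, and the venue counts sum to the final 31 articles.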

Paper Annotation
Every paper was coded independently by two coders, and the coding results of all papers were afterwards discussed between the two coders. When there were conflicting codes, the reasons for the individual decisions were discussed before the inconsistencies were resolved. The unit of analysis was not the paper, but the artefacts presented in each paper. One paper could, therefore, present more than one physicalisation. That is, when an evaluation criterion was used in one paper to assess several physicalisations, we counted one use of the criterion per physicalisation. The rationale was simple: given the exploratory nature of the work, the number of artefacts over which a criterion has been used matters more at this stage than the number of different authors/research groups who used that criterion/method.
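The difference between counting per artefact and counting per paper can be made concrete with a small sketch. The annotation rows below are invented placeholders (hypothetical papers, artefacts, and criteria), not data from the review, and the two counting functions are illustrative assumptions about how such a tally could be implemented.

```python
from collections import Counter

# Hypothetical annotation rows: (paper, artefact, evaluation criterion).
# Artefact identifiers are assumed to be unique across papers.
annotations = [
    ("paper A", "artefact 1", "criterion X"),
    ("paper A", "artefact 2", "criterion X"),
    ("paper B", "artefact 3", "criterion X"),
    ("paper B", "artefact 3", "criterion Y"),
]


def count_by_artefact(rows):
    """Count each criterion once per artefact (the scheme described above)."""
    seen = {(artefact, criterion) for _, artefact, criterion in rows}
    return Counter(criterion for _, criterion in seen)


def count_by_paper(rows):
    """Count each criterion once per paper (the alternative scheme)."""
    seen = {(paper, criterion) for paper, _, criterion in rows}
    return Counter(criterion for _, criterion in seen)


if __name__ == "__main__":
    print(count_by_artefact(annotations))
    print(count_by_paper(annotations))
```

In this toy example, "criterion X" is counted three times per artefact but only twice per paper, which is the effect the rationale above accepts.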

Coding Schemes
We used the following coding schemes to annotate the artefacts with respect to their evaluation criteria/methods and all the dimensions of Table 1 that touch upon an aspect of data representation.
Data scale: This refers to the type of data that is encoded. Drawing on Stevens [79]'s seminal taxonomy, a three-fold classification for data has become widely used in Information Visualisation and HCI research: nominal (categorical data without a natural order or rank); ordinal (ranked categories); and numerical (quantitative data).
Type of representational material: Bae et al. [11] draw a useful distinction between electronic and non-electronic material: an electronic material has at least one electronic component, while a non-electronic material has none. An electronic component is an entity that has the ability to control electric current (e.g., microcontrollers, sensors, computers).
Representational fidelity: This refers to the metaphorical distance between the physicalisation and the data it represents. Vande Moere and Patel [28] proposed three types of representational fidelity: iconic (the physicalisation bears some relationship to the data being represented through a defined metaphorical relationship); indexical (the physicalisation bears a direct relationship [either physical or causal] to the data being represented); and symbolic (the physicalisation bears no resemblance to the data being represented, and the relationship between the two must be learned using a defined convention). A detailed discussion of the concept of metaphorical distance and examples for each type of representational fidelity are available in [29]. Though only three types of representational fidelity were proposed in [28], a fourth type was identified during the annotation of the articles. The representational fidelity is dynamic if it can vary between iconic, indexical, and symbolic within the physicalisation (not necessarily automatically). For example, in PhysiAir [78], air (coming from a fan) was used to represent ambient air quality (e.g., CO2 or NO2 level); thus, it can be considered "iconic". However, PhysiAir can also be configured so that air represents other ambient parameters such as humidity and temperature (in that case, the fidelity becomes "symbolic"). Therefore, the representational fidelity in PhysiAir [78] can vary between iconic and symbolic depending on the configuration; thus, the fidelity is "dynamic" (in this case via manual reconfiguration).
Representational intent: This refers to the system designer's intention for encoding the data (see [9]). Utilitarian representations were defined as those that 'target a specific audience to reveal data insight related to an explicit task' [9]; casual representations instead are 'intended for a much broader audience and the exploration of data may be more open-ended and not related to a work task' [9]. The dimension of intent was also mentioned in [8,80]. Dragicevic et al. [8] distinguished between the motivation to discover/present and the motivation to enjoy. Nonetheless, the utilitarian/casual distinction was preferred in this work because, as Dragicevic et al. [8] noted, it is challenging to determine in hindsight whether a physicalisation was created for the purpose of analysing data (discovery) or for the sole purpose of communicating/teaching data insights (presentation). A similar argument applies to Djavaherpour et al. [80]'s distinction between physicalisations with a pragmatic goal (present information in a way that allows the user to thoroughly understand the data) or artistic intent (communicate a concern, rather than show data). The two goals are not mutually exclusive, which makes it challenging to know in hindsight whether the motivation of the designer was solely one or the other. In this work, a simple rule was used to classify a physicalisation as utilitarian or casual a posteriori. The intent is classified as utilitarian if the physicalisation is designed to support a specific task and an evaluation concerning that task (e.g., of efficiency or effectiveness/understanding) has been carried out. Otherwise, it is classified as casual.
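The a posteriori classification rule for intent can be written as a two-condition check. This is a minimal sketch; the function and parameter names are illustrative, and judging whether each condition holds for a given artefact still requires reading the paper.

```python
def classify_intent(supports_specific_task: bool, task_evaluated: bool) -> str:
    """Classify representational intent using the rule stated in the text.

    utilitarian: the physicalisation is designed to support a specific task
                 AND an evaluation concerning that task has been done.
    casual:      every other case.
    """
    if supports_specific_task and task_evaluated:
        return "utilitarian"
    return "casual"


# A task-oriented artefact whose task performance was never evaluated
# is still classified as casual under this rule.
print(classify_intent(True, True))    # utilitarian
print(classify_intent(True, False))   # casual
```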
Evaluation criteria and methods: For the evaluation criteria, we used the performance-related criteria mentioned in [81] and UX-related criteria mentioned in [82] as a starting point for the annotation. As for evaluation techniques, the methods to evaluate the UX identified in [74] were used as a starting point.
Encoding variables: We used the dimensions identified and described in Section 3 to annotate the articles with the encoding variables. The guiding questions used to identify the presence/absence of a variable type in a physicalisation a posteriori are presented in Appendix B.
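The coding schemes above can be summarised as a small record type, with one field per annotated dimension. This is only a sketch of one possible representation: the class and field names are assumptions, and the example record is hypothetical rather than taken from Table 6.

```python
from dataclasses import dataclass
from enum import Enum


class DataScale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    NUMERICAL = "numerical"


class Material(Enum):
    ELECTRONIC = "electronic"
    NON_ELECTRONIC = "non-electronic"


class Fidelity(Enum):
    ICONIC = "iconic"
    INDEXICAL = "indexical"
    SYMBOLIC = "symbolic"
    DYNAMIC = "dynamic"  # fourth type identified during annotation


class Intent(Enum):
    UTILITARIAN = "utilitarian"
    CASUAL = "casual"


@dataclass(frozen=True)
class ArtefactAnnotation:
    """One annotated artefact (hypothetical structure, one field per dimension)."""
    artefact: str
    data_scales: frozenset          # set of DataScale values
    material: Material
    fidelity: Fidelity
    intent: Intent
    encoding_variables: frozenset   # e.g. {"visual", "haptic"}
    evaluation_criteria: frozenset  # e.g. {"user experience"}


def material_type(n_electronic_components: int) -> Material:
    """Electronic if at least one electronic component is present."""
    if n_electronic_components >= 1:
        return Material.ELECTRONIC
    return Material.NON_ELECTRONIC


# Hypothetical example record, for illustration only.
example = ArtefactAnnotation(
    artefact="example artefact",
    data_scales=frozenset({DataScale.NUMERICAL}),
    material=material_type(2),
    fidelity=Fidelity.DYNAMIC,
    intent=Intent.CASUAL,
    encoding_variables=frozenset({"visual", "dynamic"}),
    evaluation_criteria=frozenset({"user experience"}),
)
```

Keeping each dimension as a closed set of categories makes it straightforward to cross-tabulate any two dimensions later.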

Systematic Review: Results
The main objective of the systematic review was to explore how the data physicalisations were evaluated (i.e., what evaluation criteria are relevant (RQ2) and what evaluation methods can be used (RQ3)). In addition, we also wanted to explore the representation dimension (i.e., what encoding variables have been used in practice, and how the dimensions related to representation relate to each other and to the evaluation dimension). The following sections thus present the results of the systematic review along these lines: evaluation criteria/methods (Section 5.1), the connection between evaluation criteria and the intent of the data physicalisations (Section 5.2), the lessons learned about the encoding variables (Section 5.3.1), and the interrelationships observed between the dimensions related to representation and evaluation (Section 5.3.2).

Evaluation Criteria and Methods
The systematic analysis of the physicalisation artefacts revealed several evaluation criteria that can be used to assess the impact of data physicalisations (Table 3) and a wide variety of methods to collect data about these criteria (Table 4). This subsection summarises these evaluation criteria and the methods used to implement them. Table 3 summarises the evaluation criteria used to assess the data physicalisations in our sample, ordered by frequency of use (most to least frequent). There is the intuition that the criteria used in HCI/Information Visualisation can be used to evaluate data physicalisations when appropriate (see e.g., [9]), but an open question is whether some criteria are distinctive to data physicalisation research. UX and performance-related criteria that are widely used in HCI/Information Visualisation were also used to evaluate data physicalisations. We also discovered several evaluation criteria that seemed particular to data physicalisations. Thus, we grouped the evaluation criteria into those that were not particular to, and those that seemed particular to, data physicalisation research (the dashed line in Table 3 serves that purpose). Table 4 summarises the evaluation methods used to evaluate data physicalisations, presented in order of frequency of use (most to least frequently used). The methods that are widely used in HCI/Information Visualisation were frequently used in data physicalisation research. While lab-based and field-based experiments were used with equal frequency in our sample, the percentage of longitudinal/repeated studies was considerably lower than that of one-time studies.
It is useful for a data physicalisation researcher to understand the methods that have been employed to evaluate each criterion. Table 5 presents the evaluation methods used to assess a criterion (the evaluation criteria are presented in their order of appearance in Table 3, i.e., most to least frequent in the sample). A reference next to an evaluation method stands for an example of an article implementing it. A detailed description of the meaning of the evaluation criteria is provided in Appendix C. A broad discussion on the takeaways and implications of the results of our review of evaluation criteria and methods is available in Section 6.

Hogan and Hornecker [9] commented that "more attention is needed to evaluate representations whose purpose is more open-ended" and Jansen et al. [1] pointed out that finding appropriate ways of studying how people engage in data exploration when no clear task is defined is a pending evaluation-specific challenge for data physicalisation research. Thus, we also analysed how the criteria from our sample related to the utilitarian and casual representational intents (Figure 4). The findings outlined in Figure 4 can inform designers about how they can evaluate their physicalisations, should these have a similar type of intent. Figure 4 shows how the criteria have been used so far to evaluate one type of intent (either utilitarian or casual) or both:
• Criteria used for physicalisations with a casual intent: intellectual engagement, social engagement, affective engagement, the potential for self-reflection, motivational potential, creativity, user's reactions, quality of the design, potential for self-expression, quality of the information content, aesthetics, and remote awareness of physiological states.
• Criteria used for physicalisations with a utilitarian intent: effectiveness, efficiency, size judgement, confidence, and orientation consistency.
• Criteria used for both types of physicalisations: user experience, utility, understanding (qualitative), attitude change/behavioural stimulation, memorability, enjoyment/satisfaction, ease of use, design parameters, learning curve/ease of learning, social acceptance/ease of adoption, and physical engagement.
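The three groups of criteria listed above can be turned into a simple lookup that suggests candidate criteria for a given intent. This is a minimal sketch: the criteria names are copied from the lists above, while the structure and function name are illustrative.

```python
# Criteria grouped by the intent(s) they were used for in the sample.
CASUAL_ONLY = {
    "intellectual engagement", "social engagement", "affective engagement",
    "potential for self-reflection", "motivational potential", "creativity",
    "user's reactions", "quality of the design", "potential for self-expression",
    "quality of the information content", "aesthetics",
    "remote awareness of physiological states",
}
UTILITARIAN_ONLY = {
    "effectiveness", "efficiency", "size judgement", "confidence",
    "orientation consistency",
}
BOTH = {
    "user experience", "utility", "understanding (qualitative)",
    "attitude change/behavioural stimulation", "memorability",
    "enjoyment/satisfaction", "ease of use", "design parameters",
    "learning curve/ease of learning", "social acceptance/ease of adoption",
    "physical engagement",
}


def criteria_for(intent: str) -> set:
    """Return criteria that the sample used for the given intent."""
    if intent == "casual":
        return CASUAL_ONLY | BOTH
    if intent == "utilitarian":
        return UTILITARIAN_ONLY | BOTH
    raise ValueError(f"unknown intent: {intent!r}")


print(sorted(criteria_for("utilitarian")))
```

The lookup reflects what has been done in the sample so far; it does not imply that a criterion is unsuitable for the other intent.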

Representation Dimensions
A second objective of this review is to explore how encoding variables have been used in practice and how dimensions related to representation relate to each other. This section presents the results: lessons learned about the encoding variables (Section 5.3.1) and interrelationships between the dimensions (Section 5.3.2). Table 6 summarises the coding of the dimensions related to representation and Table 7 shows the frequencies of the encoding variables found in the sample.

Lessons Learned about the Encoding Variables
During the coding process, we discovered the following extensions to the initial list of encoding variables:
• Physical variables: Material should be added to the list in addition to the properties of the material. A nice example can be found in [47], which used the tokens' material (folding paper vs acrylic) to differently encode information related to the core academic background and the additional academic interests of the users.
• Haptic variables: The list of haptic variables that are derived from visual analogues can be extended with at least two variables: tangible arrangement (variations of the distribution of individual marks that make up a symbol) and tangible numerousness (arrangement combined with size), as both can be perceived through touch. For instance, the number of squares and their size were used in the 'Dressed in Data' clothes to communicate data about indoor air chemicals, and this resulted in a lace pattern [46]. That lace pattern (both the arrangement of the squares and their numbers) can be perceived by touch. This is an example of both tangible arrangement and tangible numerousness.
• Dynamic variables: The list of dynamic variables should be extended with change pattern (variations in animation/movement patterns used to communicate change) as a new variable. For instance, the PhysiMove physicalisation [78] used counterclockwise movements to indicate decreases in value, clockwise movements for increases in value, and no movement for the lack of change; Keefe et al. [93] used different animation effects to communicate the occurrence of different weather events (i.e., rain, snow, and cloud cover) and Pepping et al. [52] used the slow/fast fading of LED lights to communicate whether or not an emotional experience was positive/negative.
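As an illustration of the change pattern variable, the mapping described for PhysiMove [78] (clockwise for increases, counterclockwise for decreases, no movement for no change) can be sketched as a small function. The function name and the `tolerance` parameter are assumptions made for this example; they are not taken from the PhysiMove implementation.

```python
def change_pattern(previous: float, current: float, tolerance: float = 0.0) -> str:
    """Map a change in value to a movement pattern.

    `tolerance` is a hypothetical dead band: differences whose magnitude
    does not exceed it are treated as 'no change'.
    """
    delta = current - previous
    if delta > tolerance:
        return "clockwise"          # increase in value
    if delta < -tolerance:
        return "counterclockwise"   # decrease in value
    return "no movement"            # lack of change


readings = [20.0, 21.5, 21.5, 19.0]
patterns = [change_pattern(a, b) for a, b in zip(readings, readings[1:])]
print(patterns)
```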
In addition, though previous work (e.g., [17]) mentioned the ambiguity surrounding the use of 'size' as an encoding variable (as size can have different aspects), a systematic account of possible usage is still lacking. Our annotations led to the following dimensions of "size" for data physicalisation research: length [47,99], height [7,17,46,85,95,96,98,99], diameter [17,85,87,89,95], area [46], surface area [47], and volume [4,6,91]. Size was also used, not as an encoding variable, but to denote the overall size of the physicalisation. This use of size in the sense of a design parameter that influences the user experience was investigated in [92] (the authors used 'scale' as a synonym for size in their work). The multiplicity of interpretations for 'size' suggests that authors investigating it should state precisely which sense they mean: an encoding variable (and which dimension of size) or a design parameter.

Interrelationships between Dimensions
Several pipelines describing the process of creating physicalisations were proposed in previous work, which include, for example, the extended version of the infovis pipeline to accommodate the physical rendering of data [104], the data sensification workflow that focuses on encoding data in the experience people have with representations [13], and the pipeline for the digital fabrication of physicalisations from [80]. While these pipelines are valuable, none has explicitly linked the dimensions we have examined in our systematic review. Hence, we looked into possible connections between the dimensions examined as a first step towards a theory of representation in data physicalisation research. Such a theory would inform researchers and designers about the consequences of their choices during the process of building and evaluating physicalisations (e.g., how the choices made at early stages impact the options available at later stages). There are four important elements of theory development according to [105]: (1) extract key concepts, (2) identify the (causal) relationships between these concepts, (3) elaborate on the rationales for these relationships, and (4) clarify the range of application of the theory. We address the four elements in turn.
Key concepts: These are the dimensions considered during the annotation of the articles: all dimensions from existing design spaces touching on data representation, plus the evaluation dimension (see Section 4). The interaction concept is key to the design of data physicalisations (see Table 1) and is hence included in the model, even if it was not explicitly examined during the work.
Relationships: We proposed to link the dimensions considered sequentially into a seven-stage model for designing and evaluating data physicalisations, as shown in Figure 5. Blue arrows indicate a statistically significant association between two dimensions. The interaction dimension is coloured grey because we did not study this dimension in our systematic review. The process is iterative, but arrows describing iterations are omitted in the figure to ease readability.
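The steps in this seven-stage model are the following:
1. Define the purpose of the physicalisation (representational intent: utilitarian or casual);
2. Select a dataset (categorical, ordinal, numerical, or a mix of these);
3. Choose the representational material (electronic or non-electronic);
4. Choose the representational fidelity (iconic, indexical, symbolic, or dynamic);
5. Choose the encoding variables;
6. Design the interaction (not discussed in this article, but useful references can be found in [10,35]);
7. Evaluate the artefact (examples in Section 5.1).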
In a nutshell, the researcher interested in studying data physicalisations starts with a purpose and selects a dataset in line with that purpose. Afterwards, they choose a representational material, which is a choice that is strongly tied to the choice of the representational metaphor. Since both representational material and fidelity strongly determine each other, they are given an equal footing on the diagram. The choice of the encoding variables follows that of the material. In practice, the design of the interaction happens concurrently with the design of other aspects of the representation, but since the choice of the encoding variables (e.g., visual vs sonic) constrains the interaction possibilities, they are shown sequentially. The systematic evaluation of the artefact happens last (and is the step that distinguishes the researcher from the designer in this model). The meaning of the arrow → is 'precedes and constrains'. The whole process is iterative, which means that designers can come back to any stage from any stage, but, for simplicity, arrows representing iterations are not shown in Figure 5.
Range of application: The relationships proposed above and tested below are based on the operationalisations of the concepts related to data physicalisation described in Section 4.4. They may not be valid for other operationalisations (e.g., other taxonomies for data type [80] or representational intent [8,80]).
Quantitative analysis: We computed Fisher's exact test [106] (or, when appropriate, Pearson's chi-squared test instead) and Cramér's V for all pairs of dimensions (Table 8). A significant value for Fisher's exact test or the chi-squared test indicates a non-random association between two categorical variables, while Cramér's V indicates the strength of the association (0 = no association; 1 = complete association). We now report on the findings for all pairs of consecutive dimensions of the model:
• Intent-dataset (p-value < 0.001; Cramér's V = 0.67): There were differences in proportions for nearly all types of datasets. Most notably, physicalisations with a casual intent used the combination of categorical and ordinal and numerical datasets more often than those with a utilitarian intent; they also used numerical data much more often than those with a utilitarian intent. The non-random association observed here could be due to some bias in the sample: all physicalisations where the type of dataset was 'not documented' were those having a utilitarian intent (these physicalisations were used to investigate the theoretical properties of physicalisation in [17,48,99]: orientation consistency, size judgment, and graph physicalisation).
• Dataset-material type (p-value < 0.001; Cramér's V = 0.65): Physicalisations encoding three types of datasets (categorical and ordinal and numerical) all used electronic material. The non-random association observed here could also be due to some bias in the sample: all physicalisations where the type of dataset was 'not documented' were those using non-electronic material (investigation of theoretical properties).
• Dataset-representational fidelity: The Fisher's exact test between the data type and the representational fidelity was not significant. Nonetheless, the association between the number of datasets and the fidelity was significant (p-value = 0.01; Cramér's V = 0.46). In particular, there was no physicalisation with two/three datasets that had an indexical fidelity (i.e., the physicalisation bore a direct relationship [physical or causal] to the data being represented) in our sample.
• Material type-encoding variables (p-value < 0.001; Cramér's V = 0.69): Physicalisations combining variables beyond the visual and haptic dimensions (e.g., visual and sonic and haptic and olfactory) all used electronic material.
• Representational fidelity-encoding variables: The Fisher's exact test was not significant.
• Encoding variables-evaluation criteria: We grouped the evaluation criteria into three categories: traditional, novel, and traditional and novel. 'Traditional' refers to the criteria above the dashed line (except physical engagement), whereas 'novel' refers to physical engagement and criteria below the dashed line. The Fisher's exact test was not significant.

Table 8. Relationships between the different dimensions. A number within a cell is the Cramér's V (strength of the association) for a statistically significant association (i.e., a possible non-random association between two dimensions). A '-' indicates statistically non-significant associations. To improve readability, some names were abbreviated in the table: evaluation (evaluation criteria), n_modalities (number of modalities), n_datasets (number of datasets), material (material type). Dimensions listed in Table 8: Intent, Fidelity, Evaluation, Variables, n_Modalities, Dynamicity, Data_Type, n_Datasets, Material.

Table 8 summarises the results from the analysis. The key observations from Table 8 are the following:
• The Cramér's V between n_modalities/variables, n_datasets/data_type, and dynamicity/variables was 1 because the dimensions were derived from one another. In particular, n_modalities counted the number of encoding variables used, n_datasets counted the number of data types used, and 'dynamicity' documented whether (or not) dynamic variables were part of the encoding variables.
• Only non-random associations between two consecutive dimensions are highlighted in Figure 5. Nonetheless, the data suggests that there were more non-random associations (e.g., intent/evaluation, intent/data, and data/material). Overall, the material dimension exhibited significant correlations with other non-derived dimensions most often (4/5: intent, evaluation, variables, and data type), followed by the data type dimension (4/5: intent, evaluation, variables, and material) and the intent dimension (3/5: evaluation, data type, and material). The fidelity dimension correlated with other non-derived dimensions the least often.
Overall, the fact that the material/data type/intent dimensions exhibited non-random associations with other dimensions is in line with intuition. The number of datasets to encode and the combination of encoding variables emerged as determinants to watch, but since this study is the first to assess the interrelationships between these different representational dimensions and given the size of the sample, more work is needed to unveil the exact nature of the influences between dimensions. Thus, the observations above should be taken as working hypotheses [107] about the relationships between the representation dimensions in physicalisation research.
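For readers who want to reproduce this kind of analysis on their own annotations, the sketch below computes Pearson's chi-squared statistic, Cramér's V, V = sqrt(chi² / (n · (min(r, c) − 1))), and a two-sided Fisher's exact test. It rests on several assumptions: the software used for the original analysis is not stated in the text, the Fisher implementation here handles 2×2 tables only (larger tables need a generalised version), all row and column totals are assumed to be non-zero, and the contingency tables are made up for illustration.

```python
from math import comb, sqrt


def chi_squared(table):
    """Pearson's chi-squared statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V: 0 = no association, 1 = complete association."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_squared(table) / (n * k))


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2 x 2 table.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are at most as likely as the observed one.
    """
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_observed = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(
        prob(x) for x in range(lo, hi + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # small slack for float ties
    )


# Made-up table where one dimension is derived from the other,
# mirroring why derived dimensions yield a Cramér's V of 1.
derived = [[10, 0], [0, 15]]
# Made-up table with no association at all.
independent = [[5, 5], [5, 5]]

print(cramers_v(derived), cramers_v(independent))
print(fisher_exact_2x2([[3, 0], [0, 3]]))
```

With small samples such as the one analysed here, an exact test is generally preferable to the chi-squared approximation, which is consistent with the choice of Fisher's exact test as the default in the text.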

Discussion
So far, this work has provided a synthesis of encoding variables for physicalisations (Section 3 and Table 2), a snapshot of evaluation criteria, as well as examples of methods to apply these criteria (Section 5.1) and working hypotheses about the relationships between different representational aspects of data physicalisation (representational intent, material type, representational fidelity, data type, and encoding variables, see Section 5.3.2). We now discuss general observations made about encoding variables and the evaluation criteria/methods, as well as their implications.

Encoding Variables
Takeaways: One takeaway from the narrative review is that data encoding as an object of study is a fertile ground for interdisciplinary research. Indeed, several variables synthesized in Table 2 were mentioned separately (and sometimes under slightly different names) in the literature on Information Visualisation, Cartography, Human-Computer Interaction, Sonification, Immersive Analytics, and Neuroscience. A case in point is dynamicity (the representation of change), for which the variables were 'rediscovered' separately for the visual, sonic, and olfactory modalities. As for the systematic review, one takeaway is that inclusiveness is always realized to a certain extent, namely to the extent to which a given sensory modality is supported. In that sense, none of the physicalisations in the sample was fully inclusive (Table 6). Finally, we have observed that only a few physicalisations actually used physical variables (i.e., changes in material properties) to convey messages about phenomena. A similar observation was made by Sauvé et al. [10], who reported that information communicated through (a change in) physical or material form has so far been rare in practice.
Implications: Looking forward, researchers can use the framework from Table 2 as a vocabulary to describe experiments assessing the effectiveness of variables (across disciplines). That is, the variables can serve as 'boundary objects' [108] between data physicalisation researchers and researchers from the fields mentioned just above (Information Visualisation, Cartography, Sonification, and so on). Boundary objects are concepts shared by different communities, which can be viewed or used differently by each. (Interpretive flexibility is only one distinguishing characteristic of boundary objects. Another important characteristic is the arrangement of how to operate and collaborate [109]. As Vuillemot et al. [110] put it, "Groups can work on common objects locally, making them more tailored to their local use and needs, i.e. something that is not interdisciplinary, and then share it back in a way that works across the various groups". It is challenging to predict exactly what this arrangement will look like, as several disciplines use the variables. Hence, arrangement is not further specified in this article.) For instance, a subject that can benefit from a plurality of perspectives is the study of users' perception of time-varying representations across different modalities (visual, haptic, aural, etc.). Here, the dynamic variables from Table 2 can serve as a boundary object between the different communities investigating the user experience of time-varying representations. Another subject that can benefit from a plurality of perspectives is the notion of variable syntactics. As indicated in [38], variable syntactics prescribe the use of a variable given a type of dataset (e.g., nominal, ordinal, or numerical). That is, variable syntactics tell how effective/ineffective a given variable is with respect to encoding a given data type.
While variable syntactics have been suggested (mostly for the visual [37], aural [63], and dynamic variables [37]), researchers have so far been using different schemes while relating variables to data types. For example, 'unacceptable/acceptable' was used for Bertin's original visual variable syntactics [37], 'not effective/effective' was used for auditory variables in [63], and 'poor/marginally effective/good' was used for visual and dynamic variables in [37]. Data physicalisation research will benefit from the harmonization of these schemes so that they abstract from the specifics of sensory modalities, in a way similar to what was done for dynamic variables in Section 3. Next to researchers, designers of physicalisations can use the framework to identify design opportunities (e.g., through unexplored variables or an untried mix of variables). In that sense, the framework can be useful to support their goal, discussed in [111], of creating the not-yet-existing.
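To make the idea of harmonising these rating schemes concrete, one possible approach is to place the labels of each scheme on a shared ordinal scale. The sketch below is purely illustrative: the scheme keys and the normalisation to the range 0 to 1 are assumptions of this example, not a proposal from the cited works, and treating a two-level and a three-level scheme as commensurable is itself a simplification that would need empirical justification.

```python
# Labels as quoted in the text; the scheme keys are hypothetical names.
SCHEMES = {
    "bertin_visual": ["unacceptable", "acceptable"],
    "auditory": ["not effective", "effective"],
    "visual_dynamic": ["poor", "marginally effective", "good"],
}


def harmonise(scheme: str, label: str) -> float:
    """Map a scheme-specific label to a score in [0, 1].

    The lowest label of a scheme maps to 0.0 and the highest to 1.0;
    intermediate labels are spaced evenly (an assumption of this sketch).
    """
    levels = SCHEMES[scheme]
    return levels.index(label) / (len(levels) - 1)


print(harmonise("bertin_visual", "acceptable"))            # 1.0
print(harmonise("visual_dynamic", "marginally effective")) # 0.5
```

A shared scale of this kind would let syntactics reported for different sensory modalities be compared side by side, at the cost of losing scheme-specific nuances.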

Evaluation Criteria and Methods
Takeaways: Jansen et al. [1] identified several key challenges for evaluating data physicalisations: (i) finding appropriate ways of studying how people engage in data exploration when no clear task is defined, (ii) assessing the merits of data representations that go beyond pure time and error metrics, (iii) exploring methodologies to understand how people reason, collaborate, and communicate with physicalisations, and (iv) finding fair alternative representations to use as a baseline for comparison. Our sample and analysis suggest that research that provides answers to challenges (i) and (ii) is ongoing. Notably, several criteria for evaluating aspects that go beyond the traditional time and error metrics have emerged (see Section 5.1). However, we found fewer answers relevant to challenges (iii) and (iv) in our sample. For example, many of the papers in our corpus used on-screen visualisations (e.g., [98]), paper representations (e.g., [6]) and VR representations (e.g., [85]) as a baseline for comparing the effects of data physicalisations. The extent to which these baselines provide a fair ground for comparison still needs to be systematically assessed and discussed.
Furthermore, the majority of the studies in our corpus (70%, Table 4) were one-time studies. The evaluation of variables such as behavioural change and the impact on learning/skills development requires long-term studies to understand the long-term effects. This long-term assessment would be relevant, for instance, to work using personal data physicalisations for teaching (e.g., [112]), or personal data physicalisations in real-world contexts (e.g., [113]). Besides, the evaluation of some aspects of physicalisations has not appeared in our sample, and this suggests that they could be under-explored or not explored at all. Since the systematic review has focused on representational aspects, we mention here a few that relate primarily to representation. These include the following: strategies to communicate uncertainties in the underlying dataset; the impact of the material and representational fidelity on meaning-making and memorability; the connection between material properties and data types (e.g., would viscosity be a good material property to represent numerical/ordinal/categorical data?); evaluating aspects related to affordances (cognitive affordances, physical affordances, sensory affordances, and functional affordances) in relation to data physicalisation; evaluating the multisensory perception of data; evaluating the interplay between representation and situatedness (e.g., physicalisations that are situated in close spatial proximity to their data referents [14] compared to non-situated data physicalisations); and evaluating representation strategies and their adequacy for diverse user groups (e.g., children or elderly).
Implications: We have already mentioned above that, despite the progress, there are still many unanswered questions. In particular, challenges (iii) and (iv) mentioned above deserve more attention. To these, the gap related to the long-term assessment of physicalisations' impacts and the need for more systematic accounts of the impact of representational features of physicalisations on users can be added.

Relationships between Representational Dimensions
Takeaways: Though the precedence links connecting the dimensions remain conjectural at this point, our analysis has highlighted that there is a non-random association between several dimensions touching on representational aspects for physicalisations. The results of the quantitative analysis suggest plausible relationships between the most important dimensions related to representation (i.e., material, data type, and intent). We were, however, surprised to see that the fidelity dimension did not seem to strongly connect with other dimensions related to representation. This may be due to the fact that the majority of the physicalisations in our sample (78%) had a symbolic fidelity, and, hence, bore no resemblance to the data represented.
Implications: Looking forward, the observations of the non-random associations encourage further research towards structural equation models for data physicalisation research. The exact nature of these non-random associations will be uncovered with more examples (the working hypotheses about the associations mentioned in Section 5.3.2 can be used as a starting point). Models that describe the expected consequences of design choices during the process of building and evaluating physicalisations will benefit researchers and designers alike. These models will need to provide an account of the indirect relationships between dimensions. For instance, there was no significant association between encoding variables and evaluation, but there still was a non-random association between intent and evaluation (Table 8). The documentation of researchers' work, using a consistent vocabulary to facilitate cross-comparison (e.g., the coding scheme from Section 4.4), will be needed to catalyze progress along these lines.

Reflections on the Methodical Approach
As Roberts and Walker [16] pointed out, we need a body of research that helps researchers tackle questions such as 'what are the perceptual variables that are available?', 'what are their limitations?', and 'what guidelines are there for each variable?'. While it is clear that these questions still deserve attention, what is less clear are the methodical steps to arrive at general answers. This article has addressed the first question through a combination of a narrative and a systematic review. The narrative review has given the flexibility to draw ideas from different disciplines and reconcile differences in terminologies where appropriate. The systematic review has highlighted how work done in Information Visualisation and Human-Computer Interaction has been implementing the framework from the narrative review (Table 2). We anticipate that the framework from the narrative review can serve as a starting point for answering the other questions above. We also anticipate that replicating the study using articles from other communities will help to progressively extend that framework. For example, given the current corpus with predominant papers from the ACM Digital Library as input for the analysis, physicalisations from the Geography and Cartography communities (e.g., [114,115]), the variables that they use, and the lessons learned about them were not taken into account. Furthermore, since we wanted to learn about evaluation criteria and methods, we restricted ourselves to articles that evaluated their physicalisations in some way. Hence, some articles that could have possibly been useful to extend the list of variables (e.g., [116]) were excluded. Replicating the study by removing this constraint would also be useful as we expand our understanding of perceptual variables for physicalisation research. In summary, though not without flaws, the combination of narrative and systematic review seems promising as we seek answers to the questions mentioned above.

Limitations
Our search criteria might have excluded some relevant papers, such as those that did not contain the search keywords that we used (e.g., data sculpture, composite physicalisation, constructive visualisation) or that did not contain an empirical evaluation of a physicalisation. Also, we did not search for particular forms of physicalisations such as, for example, 'sonification', 'haptification' or 'olfaction'. Consequently, our findings are dependent on the sample of papers we selected. Hence, the work does not claim to be exhaustive with respect to the evaluation criteria/methods collected. For instance, one may draw a distinction between physicalisations as designed artefacts, and physicalisations as printed artefacts. The former are physicalisations that were outcomes of a design process (some components of these physicalisations may be 3D printed and some not), while the latter denote physicalisations that were created entirely through printing (i.e., data is rendered as a physically fabricated object, see [80]). Our sample is biased towards the former. Thus, we likely missed criteria relevant to the evaluation of physicalisations as printed artefacts (e.g., the accuracy of the printed artefact was proposed in [117] to document the errors introduced by the printing process and does not appear in Table 3).

Conclusions and Future Work
This research provides two contributions to data physicalisation research: (i) a synthesis of the scattered literature on perceptual variables into a coherent framework, and (ii) a snapshot of evaluation criteria and methods relevant to the study of physicalisations. These two contributions can serve as a starting point for further work on the theories and guidelines for data physicalisations, notably on the empirical effectiveness of encoding variables for physicalisations and the applicability of perceptual variables to data communication/analysis scenarios.
A question that could guide follow-up reviews to this article is the following: 'what do we know to be true of all perceptual variables, empirically?' In addition to including more examples of physicalisations, follow-up reviews could also cover more dimensions (e.g., reconfigurability discussed in [1], the interaction discussed in [1,9], and the audience mentioned in [10]) and the relationships, if any, with encoding variables, as well as evaluation criteria/methods.
Regarding encoding variables, there is a need for a more systematic investigation of which of these are atomic and which are composite. For instance, air quality is currently listed as an olfactory variable, but is in fact an umbrella term for many variables (e.g., air temperature and air humidity). Another direction for future research is the investigation of the effect of redundant sensorization (i.e., the combined use of several modalities) on user experience. There are works in the literature documenting the positive effects of redundant symbolization for the visual channel (e.g., [118,119]), as well as the visual and haptic channels used in combination (e.g., [48]), and more work along these lines is needed to increase our understanding of the use of redundancy during data encoding more broadly.
Finally, some criteria would benefit from being broken down into the factors that constitute them. This is the case, for example, for "users' reactions", "design quality", and the "aesthetics of the physicalisation". Developing standardized questionnaires that support the evaluation of these criteria, and more generally of criteria unique to data physicalisation research, is also an interesting direction for future work.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available in the article.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Dimensions of Existing Design Spaces
The dimensions of existing design spaces along with their original names are summarised in Table A1.

Appendix B. Guideline: Identifying When a Variable Type Has Been Used
How to recognize the presence of a variable a posteriori: a variable is used if the sensory modality can be used independently to perceive differences in data. Guiding questions:
• Imagine I were blind; would I still perceive differences in the data?
• Imagine I could not touch; would I still perceive differences in the data?
• Imagine I could not smell; would I still perceive differences in the data?
• Imagine I could not hear; would I still perceive differences in the data?
• Imagine I could not taste; would I still perceive differences in the data?
If any of these questions is answered with "No", this is evidence that only the variable type corresponding to the sensory encoding channel mentioned in that question (visual, haptic, olfactory, sonic, or gustatory) has been used. If a question is answered with "Yes", this is evidence that a sensory encoding channel different from the one mentioned in the question has been used to encode data. Whether or not dynamic variables are used can be determined by asking the question: is animation or self-reconfiguration implemented?
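The guiding questions above can be read as a small decision procedure. The following Python sketch is only an illustration under stated assumptions: the function name `infer_sole_channel`, the dictionary format of the answers, and the choice to return `None` whenever the answers do not single out exactly one channel are ours and are not part of the guideline.

```python
from typing import Dict, Optional

# The five sensory encoding channels named in the guideline.
CHANNELS = ("visual", "haptic", "olfactory", "sonic", "gustatory")


def infer_sole_channel(still_perceivable: Dict[str, bool]) -> Optional[str]:
    """Apply the guiding questions of Appendix B (hypothetical helper).

    still_perceivable[channel] holds the answer to the question
    "Imagine I lacked this sense; would I still perceive differences
    in the data?" (True for "Yes", False for "No").

    Returns the channel when exactly one answer is "No", which the
    guideline treats as evidence that only this variable type was used.
    Returns None otherwise (assumption: several channels may be in use,
    or the answers are inconsistent, so no single channel is reported).
    """
    missing = [ch for ch in CHANNELS if ch not in still_perceivable]
    if missing:
        raise ValueError(f"No answer given for: {', '.join(missing)}")

    no_answers = [ch for ch in CHANNELS if not still_perceivable[ch]]
    if len(no_answers) == 1:
        return no_answers[0]
    return None


def uses_dynamic_variables(animated: bool, self_reconfiguring: bool) -> bool:
    """Dynamic variables are in use if animation or self-reconfiguration is implemented."""
    return animated or self_reconfiguring
```

For a physicalisation whose data differences disappear only for a blind observer, `infer_sole_channel` would report the visual channel. When every question is answered with "Yes", the sketch deliberately reports nothing, since the guideline then only indicates that some other channel also encodes data.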

Appendix C. Definitions of Evaluation Criteria
This supplementary material provides additional details (e.g., definitions) about the evaluation criteria mentioned in the article. Some evaluation criteria were defined explicitly in the literature, and, for these, we add the references of the original articles next to their names. The remaining criteria were either (i) mentioned without explicit definition in the articles annotated, or (ii) needed relabelling to reflect the deeper notion they point at. For these criteria, we provide a tentative definition congruent with the article(s) annotated. The criteria are mentioned in their order of appearance in the article (most frequent to least frequent).
Criteria not particular to data physicalisation research.

• Intellectual engagement [23]: Refers to the ability to engage the user in intellectual activities such as recognition, analysis, and contemplation.
• Social engagement [23]: This is present when observers talk with companions, but also when laughing, gesturing, and mimicking the body postures of others. It was assessed, for instance, in [83] through the use of a confederate.
- Confederates are individuals recruited by lead experimenters to play the role of a bystander, participant, or teammate (see e.g., [120]).
• Affective engagement [23]: Refers to the emotional experience of users. The arousal of feelings such as awe, respect, wonder, concern, fear, disgust, anger, or intimidation is an indicator of affective engagement.
• Engagement over time: The evolution of engagement over a given time period.
• User experience [100,101]: The review of definitions by Law et al. [101] pointed out that the ISO definition of UX, "A person's perceptions and responses that result from the use or anticipated use of a product, system or service", is in line with what most UX researchers associate with the concept. In essence, UX refers to all aspects of the users' interaction with a product. It has pragmatic attributes and hedonic attributes [100].
• Utility [81,102]: The usefulness of an interface for completing the user's desired set of objectives [102].
• Effectiveness (question answering) [103]: In the sample analyzed, effectiveness was measured through the accuracy with which participants completed information retrieval tasks [6,89] and interaction tasks [96].
- Information retrieval tasks are specifically directed at retrieving information (e.g., cluster, maxima, or minima of a dataset), whereas interaction tasks are more open-ended (e.g., data analysis tasks such as annotation, filtering or navigation). Hence, not every interaction task is an information retrieval task.
• Efficiency (question answering) [81,103]: This is the time taken by participants to complete information retrieval tasks or interaction tasks.
• Potential for self-reflection: This is the ability of the physicalisations to prompt users to think about themselves. Thudt et al. [113] identified four types of personal reflection in the context of data physicalisation: reflection on (their) data, reflection on (their) context, reflection on (their) action, and reflection on (their) values.
• Understanding (qualitative): This refers to the assessment of the understanding of datasets through qualitative feedback during an interview [85] or as a rating on a self-developed questionnaire [83,91].
- This assessment may touch upon the understanding by an individual (in that case we talk about personal understanding, see [87,91]), or a group of people (in that case, we talk about collaborative understanding, see [90]).
• Attitude change/behavioural stimulation: This refers to the extent to which a physicalisation can change the attitudes of users (e.g., do they care more about a given subject?) or inspire them to take some action [83].
• Memorability [7,82]: Memorability is the capability of maintaining and retrieving information [82]. It has different facets, for instance, recognition or recall (see [121]), explicit or implicit memorability (see [7]), and the storage of information in short-term memory or long-term memory (see [122]).
• Enjoyment/satisfaction [82,103]: Enjoyment is a feeling that causes a person to experience pleasure [82]. Satisfaction denotes the freedom from discomfort and positive attitudes towards the use of the product [103].
• Motivational potential: The ability of the physicalisation to promote gradual changes in individuals' behaviour or sustain the changes over time. It was evaluated through self-developed questionnaires [95].
• Ease of use: The perceived ease of use.
• Design parameters: Some studies intended to find optimal design parameters and conducted a systematic evaluation of these parameters to that end. For instance, Daniel et al. [89] systematically varied motion speeds to find out the best speed to animate the CairnFORM physicalisation. López García and Hornecker [92] systematically varied the size of two physicalisations and assessed the impact of these changes on ease of viewing and understanding.
• Learning curve/ease of learning: This refers to the perceived learning curve.
• Social acceptance/ease of adoption [51]: This refers to participants' opinions about the possible introduction of the physicalisation in their lives or sentiments regarding the ease of adoption of the physicalisation.
• Size judgment: Although this was assessed primarily through the accuracy of participants on information retrieval tasks in [17] (and, hence, could have been said to belong to the assessment of the effectiveness of the physicalisation), we still kept this criterion as separate, because it is important for the development of theories of perceptual effectiveness of variables. Ratio estimation [123] and constant sum [124,125] are two methods to collect data about participants' judgments.
• Confidence [51]: This refers to the self-reported confidence levels of users.
• Creativity [23]: The ability of the physicalisation to support the introduction of new and original ideas.
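As defined in the list above, effectiveness was measured as task accuracy and efficiency as task completion time. A minimal Python sketch of how these two measures might be computed from trial records follows. The `Trial` record, its field names, and the use of the proportion of correct answers and the mean completion time as summaries are assumptions made for illustration; the reviewed studies may have aggregated their data differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Trial:
    """One participant response to a task (hypothetical record format)."""
    correct: bool    # whether the task was answered correctly
    seconds: float   # time taken to complete the task


def effectiveness(trials: List[Trial]) -> float:
    """Accuracy: proportion of tasks completed correctly."""
    if not trials:
        raise ValueError("At least one trial is required")
    return sum(t.correct for t in trials) / len(trials)


def efficiency(trials: List[Trial]) -> float:
    """Mean completion time in seconds (lower means more efficient)."""
    if not trials:
        raise ValueError("At least one trial is required")
    return mean(t.seconds for t in trials)


# Illustrative data only, not taken from any reviewed study.
example = [Trial(True, 10.0), Trial(False, 20.0), Trial(True, 30.0), Trial(True, 20.0)]
print(effectiveness(example))  # 0.75
print(efficiency(example))     # 20.0
```

Keeping the two measures as separate functions mirrors the separation of the two criteria in the list; a study could report them per task type (information retrieval versus interaction) by filtering the trials beforehand.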
Criteria that seem particular to data physicalisation research.

• Physical engagement [23]: It invites people to spend time touching and interacting with the data (even if just in imagination), moving around it to take different perspectives, bending down to read a label, and employing senses including smell and hearing.
• Users' reactions: Some articles used the term 'user reaction' [88,96] or 'ad-hoc impression' [49] to refer to how the users react to a physicalisation. While there are overlaps with engagement (e.g., the user reactions mentioned in [96] could be classified as an assessment of physical engagement, and part of the reactions documented in [88] could be classified as an assessment of affective engagement), we still keep this evaluation criterion as distinct, because it could be useful for exploratory studies.
• Orientation consistency [99]: The consistency of user responses to information retrieval tasks across different orientations.
• Quality of the design: This touches upon participants' general feedback about design decisions and material choices. It was evaluated, for instance, through self-developed questionnaires [95], post-it note feedback [84], and unstructured interviews [88].
• Potential for self-expression: The extent to which the physicalisation can help users express some personal characteristics (e.g., academic profile or running performance). It has at least two components mentioned in [47]: representational possibilities (what the user can say through the physicalisation) and representational precision (how accurately they can say what they intend to say).
• Quality of the information content: Evaluated in [95] through self-developed questionnaires.
• Aesthetics of the physicalisation: This touches upon the appearance of the physicalisation. It was evaluated using self-developed questionnaires in [95].
• Remote awareness of physiological states: Some studies [52,88] explored the use of physicalisations as a means for remote monitoring. That is, a user uses a physicalisation to infer the physiological state (e.g., emotional state [52] or arterial blood pressure [88]) of another distant user.