In this section, we provide an overview of user interactions with digital systems, including natural interactions, and summarize the research findings on defining accessibility parameters with a focus on environments with 3D interactions.
2.1. User Interactions with Digital Learning Systems
A digital learning system is essentially a collection of digital tools and resources, such as texts, videos, quizzes, and simulations, that work together to enhance learning. These systems comprise the hardware, software, and digital materials that allow students to access educational content, monitor their progress, and interact with teachers and other students [
16]. The ability to interact with such a system is a fundamental requirement for the learning process, regardless of whether it takes place in a traditional or online context [
17].
Usability factors have an important influence on user attitudes toward e-learning applications and indirectly on the results achieved. Previous studies have shown that the user’s satisfaction when using an application is influenced not only by the quality of the information but above all by the user’s attitude toward the application and the interface [
18]. Users initially explore the various functions of the system to familiarize themselves with its capabilities. Over time, they focus on a stable set of functionalities that best suit their tasks. An increase in the range of functionality used has a positive effect on perceived performance, as well as on objective performance measurements in later phases of use [
19]. If the task is clearly defined and performed with a familiar tool, previous experience can help users solve tasks with new technologies. In practice, however, it has been shown that users’ habits and previous experience can reduce the effectiveness of use if the task is open-ended and the new technology behaves differently from what the user is accustomed to [
20].
User interaction latency in an XR system refers to the time delay between the user’s physical action and the corresponding system response [
21]. This latency is an important factor that affects the user experience, especially in terms of perceived realism, immersion, and task performance [
13]. High latency can disrupt the sense of presence, cause motion sickness, and reduce task accuracy, which is particularly problematic for applications that require fine motor control or fast reactions. Experiments in non-VR environments suggest that a latency of less than 100 ms has no impact on the user experience, whereas other research suggests that latency in VR environments should remain below 50 ms to feel responsive, with values below 20 ms recommended [
21].
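As an illustration, the following Python sketch compares a measured action-to-response delay against the latency thresholds cited above; the threshold constants and timestamps are illustrative values taken from the studies mentioned here, not normative limits.

```python
import time

# Illustrative thresholds based on the values cited above (not normative limits):
# 100 ms for non-VR interaction, 50 ms for "responsive" VR, 20 ms recommended.
NON_VR_LIMIT_S = 0.100
VR_RESPONSIVE_LIMIT_S = 0.050
VR_RECOMMENDED_LIMIT_S = 0.020

def classify_latency(action_timestamp: float, response_timestamp: float) -> str:
    """Classify the delay between a user's physical action and the system response."""
    latency = response_timestamp - action_timestamp
    if latency <= VR_RECOMMENDED_LIMIT_S:
        return "within recommended VR latency"
    if latency <= VR_RESPONSIVE_LIMIT_S:
        return "acceptable for responsive VR"
    if latency <= NON_VR_LIMIT_S:
        return "acceptable for non-VR interaction"
    return "likely to degrade the user experience"

# Example: an action registered at t0 whose response is rendered 35 ms later.
t0 = time.monotonic()
t1 = t0 + 0.035
print(classify_latency(t0, t1))  # -> "acceptable for responsive VR"
```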
Estimating the size of objects that users interact with is a particular challenge in XR environments. The accuracy of such estimates can vary considerably depending on the research context, the technological setup, and the specific characteristics of the virtual environment. A consistent finding of most studies dealing with the estimation of object size in VR environments is a tendency to underestimate, which often depends on the actual size of the objects being evaluated. One study reports that the size of virtual objects is underestimated by about 5%, regardless of their position in space [
22]. Other research shows that when using VR devices with binocular disparity, such as head-mounted displays (HMDs), users tend to perceive virtual objects as 7.7% to 11.1% smaller than their actual size, regardless of the shape of the object [
23]. In AR environments with handheld controllers, the detection threshold for size changes ranged from 3.10% to 5.18% [
24]. In contrast, other studies reported even finer perception thresholds: less than 1.13% of the height of the target object and less than 2% of its width. These values are usually expressed as the point of subjective equality (PSE), i.e., the deviation at which users have a 50% probability of correctly estimating the size of a virtual object [
25].
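To make the PSE concept concrete, the following sketch estimates a PSE from hypothetical two-alternative size-comparison judgments by fitting a logistic psychometric function; the data points and parameter choices are purely illustrative and do not come from the cited studies.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical 2AFC data: relative size difference of the comparison object (in %)
# and the proportion of trials in which it was judged "larger" than the reference.
size_diff_pct = np.array([-10, -6, -3, 0, 3, 6, 10], dtype=float)
p_judged_larger = np.array([0.05, 0.15, 0.30, 0.45, 0.70, 0.88, 0.97])

def logistic(x, pse, slope):
    """Psychometric function: probability of judging the comparison as larger."""
    return 1.0 / (1.0 + np.exp(-(x - pse) / slope))

(pse, slope), _ = curve_fit(logistic, size_diff_pct, p_judged_larger, p0=(0.0, 2.0))

# The PSE is the size difference at which both responses are equally likely (50%);
# a PSE different from zero indicates a systematic misperception of size.
print(f"PSE = {pse:.2f} % size difference, slope = {slope:.2f}")
```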
User interaction encompasses the actions and reactions of a user in a digital environment, that is, the user’s activity when working with a computer or application. Most people are used to interacting with computers through standard 2D interfaces using a mouse and keyboard, with the screen serving as the output device. However, with the advancement of technology, especially in virtual environments, conventional input is not always practical [26]. This raises the question of which way of interacting with such virtual systems is the most natural and which devices are best suited for it.
In contrast, 3D interaction is a type of interaction between humans and computers in which users can move and interact freely in a 3D space. The interaction itself involves information processing by both humans and machines, where the physical position of elements within the environment is critical to achieving the desired results. The space in such a system can be defined as a real physical environment, a virtual environment created by computer simulation, or a hybrid combination of both environments. If the real physical space is used for data input, the user interacts with the machine via an input device that recognizes the user’s gestures. If, on the other hand, it is used for data output, the simulated virtual 3D scene is projected into the real world using a suitable output device (hologram, VR glasses, etc.). It is important to note that interactive systems that display 3D graphics do not necessarily require 3D interaction. For example, if a user views a building model on their desktop computer by selecting different views from a classic menu, this is not a form of 3D interaction. If, however, the user clicks on a target object within the same application to navigate to it, the 2D input is converted directly into a virtual 3D location, and this type of interaction is called 3D interaction [
27].
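As a minimal illustration of how a 2D input can be converted into a virtual 3D location, the following sketch unprojects a mouse click into a world-space ray, assuming a simple pinhole camera model; the viewport size, field of view, and camera pose are hypothetical values.

```python
import numpy as np

def screen_click_to_ray(px, py, width, height, fov_y_deg, cam_pos, cam_rot):
    """Convert a 2D screen click into a 3D ray (origin, direction) in world space.

    Assumes a pinhole camera: `cam_pos` is the camera position and `cam_rot` a
    3x3 rotation matrix whose columns are the camera's right, up, and forward axes.
    """
    aspect = width / height
    # Normalized device coordinates in [-1, 1], y pointing up.
    ndc_x = 2.0 * px / width - 1.0
    ndc_y = 1.0 - 2.0 * py / height
    tan_half_fov = np.tan(np.radians(fov_y_deg) / 2.0)
    # Direction in camera space, then rotated into world space.
    dir_cam = np.array([ndc_x * aspect * tan_half_fov, ndc_y * tan_half_fov, 1.0])
    dir_world = cam_rot @ dir_cam
    return np.asarray(cam_pos, dtype=float), dir_world / np.linalg.norm(dir_world)

# Example: a click in the centre of a 1920x1080 viewport maps to the camera's forward axis.
origin, direction = screen_click_to_ray(960, 540, 1920, 1080, 60.0,
                                        cam_pos=[0, 1.6, 0], cam_rot=np.eye(3))
print(origin, direction)  # -> [0. 1.6 0.] [0. 0. 1.]
```

Intersecting such a ray with the scene geometry then yields the virtual 3D location that corresponds to the 2D click.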
3D interactions are usually more complex than 2D interactions because they require new interface components, which are typically realized with specialized devices. These devices offer many opportunities for designing new, user-experience-based interactions [28]. The main categories of such devices are standard input devices, tracking, control, navigation, and gesture interfaces, 3D mice, and brain–computer interfaces.
Two different methods of interaction in the XR environment are direct and indirect. When considering their impact on user experience, engagement, and learning outcomes, both offer different benefits that can complement each other in a learning context. Direct interaction in XR environments mimics real-life activities, increasing physical involvement and potentially boosting intrinsic motivation. Indirect interactions with classic UI elements provide greater precision through an interface, although they usually require more cognitive effort and affect usability and motivation in different ways. A combination of both interaction methods can create a balanced and effective learning environment, as this approach supports hands-on learning in the initial phase and facilitates precision tasks in more advanced phases [
29].
Various approaches have been developed to overcome the limitations of mapping 2D inputs to 3D spaces. One of the earliest methods is the triad cursor, which allows the manipulation of objects using a mouse relative to their projected axes [
30], while another method uses two cursors to perform translation, rotation, and scaling simultaneously [
31]. Alternatives include a handle box [
32] and virtual handles [
33] for applying transformations to virtual objects, while Shoemake presented a technique that allows object rotation by drawing arcs on a virtual sphere with the mouse [
34]. Although these techniques are more than 30 years old, they are still used in the form of widgets in modern tools such as Unity3D. Other tools, such as AutoCAD, 3D Studio Max, and Blender, use orthogonal views instead of widgets for more precise manipulation of 3D objects [
35].
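The following sketch illustrates the core of Shoemake’s virtual-sphere (arcball) idea in simplified form: two 2D mouse positions are mapped onto a virtual unit sphere, and the arc between them defines a rotation axis and angle. The viewport dimensions and drag coordinates are hypothetical.

```python
import numpy as np

def map_to_sphere(px, py, width, height):
    """Map a 2D mouse position onto a virtual unit sphere (arcball-style)."""
    # Scale to [-1, 1] with y pointing up.
    x = 2.0 * px / width - 1.0
    y = 1.0 - 2.0 * py / height
    d2 = x * x + y * y
    if d2 <= 1.0:
        z = np.sqrt(1.0 - d2)  # the point lies on the sphere
    else:
        x, y, z = x / np.sqrt(d2), y / np.sqrt(d2), 0.0  # project onto the rim
    return np.array([x, y, z])

def arc_rotation(p_start, p_end):
    """Rotation axis (unnormalized) and angle defined by the arc between two sphere points."""
    axis = np.cross(p_start, p_end)
    angle = np.arccos(np.clip(np.dot(p_start, p_end), -1.0, 1.0))
    return axis, angle

# Example: dragging from the centre of an 800x600 view towards the right
# yields a rotation about the vertical (y) axis.
a = map_to_sphere(400, 300, 800, 600)
b = map_to_sphere(500, 300, 800, 600)
axis, angle = arc_rotation(a, b)
print(axis, np.degrees(angle))
```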
Numerous studies have investigated different approaches for the selection and manipulation of objects in virtual 3D spaces. One of the basic techniques is the use of a “virtual hand”, which enables direct object manipulation by mirroring the user’s real hand movements onto a virtual counterpart. This type of interaction is very intuitive for humans [
35], while the size of the object and the task goals influence grasping kinematics in 3D interactions. When designing a VR assessment, it is therefore important to consider the size of the virtual object and the goals of the task as factors that influence participants’ performance [
36]. The user’s perception of object size in XR environments is important for natural and intuitive interaction, as it influences the feeling of presence and navigation, and, indirectly, learning efficiency. Misperception of object size can lead to reduced efficiency and frustration for the user when interacting with virtual content. The point of subjective equality (PSE) is the point at which a user perceives two objects to be the same size, even if they are physically different sizes. In XR environments, the PSE is often used to quantify perceptual errors in size estimation, which helps to optimize the display of objects so that the user experience is as realistic and intuitive as possible [
25].
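A minimal sketch of the virtual-hand mapping might look as follows; the tracker coordinates, scene anchor, and scale factor are hypothetical, and a scale of 1.0 corresponds to a virtual hand that simply mirrors the real hand movement.

```python
import numpy as np

def virtual_hand_position(tracked_pos, tracker_origin, scene_anchor, scale=1.0):
    """Map a tracked real-hand position into the virtual scene.

    `tracked_pos` and `tracker_origin` are tracker-space coordinates (metres);
    `scene_anchor` is the scene-space point corresponding to the tracker origin.
    With scale == 1.0 the mapping is isomorphic; a scale > 1.0 extends the
    user's reach at the cost of precision.
    """
    offset = np.asarray(tracked_pos, dtype=float) - np.asarray(tracker_origin, dtype=float)
    return np.asarray(scene_anchor, dtype=float) + scale * offset

# Example: hand 20 cm to the right of the tracker origin, with amplified reach.
print(virtual_hand_position([0.2, 0.0, 0.3], [0, 0, 0], [0, 1.0, 0.5], scale=1.5))
```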
Distracting objects in 3D interactions can affect task performance and decision making. Distraction can slow down behavior and increase costly body movements. Most importantly, distraction increases the cognitive load required for encoding, slows visual search processes, and decreases reliance on working memory. While the impact of visual distraction during natural interactions may appear localized, it can still trigger a range of downstream effects [
37].
There are various devices, such as wearable sensors, touch screens, and computer vision-based interaction devices, that allow users to interact realistically in 3D [
38,
39]. Among the sensors that successfully recognize natural hand gestures, the Leap Motion controller [
40] stands out. Leap Motion is a small peripheral device that is primarily used to recognize hand gestures and finger positions of the user. The device uses three infrared LEDs and two CCD sensors. According to the manufacturer of Leap Motion, the accuracy of the sensor in detecting the position of the fingertip is 0.01 mm. The latency of the Leap Motion Controller is influenced by various factors, including hardware, software, and display systems. As discussed in Leap Motion’s latency analysis, the latency of the overall system can be reduced to less than 30 milliseconds with specific adjustments and under optimal conditions [
41]. In practice, however, latency can vary, and the reported average motion time error is around 40 milliseconds [
42]. Due to its high recognition accuracy and fast processing speed, Leap Motion is often used by researchers for gesture-based interactions [
27,
43].
One of the challenges in implementing hand gesture interaction is the fact that there is no standard, model, or scientifically proven prototype for how the user can interact with a 3D object [
26]. Although webcams are one of the most important devices for exploring possible interactions with 3D objects, no solution has yet been found for optimally estimating hand positions under environmental and hardware constraints [
44]. In [
45], a system for hand tracking in VR and AR using a web camera was presented. In contrast to the cameras built into VR devices, which restrict the positioning of the hand and are uncomfortable for some users, this system allows greater freedom of hand movement and a more natural position. The system achieves a high gesture prediction accuracy of 99% and enables simultaneous interaction of multiple users, which significantly improves collaboration in 3D VR/AR environments. The use of markers as one of the possible solutions for realizing interactions via webcams is presented in [
46].
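For illustration, a webcam-based hand-tracking loop can be sketched with the MediaPipe Hands solution and OpenCV. This is not the system described in [45]; it only shows the general approach, and the camera index and confidence values are assumptions.

```python
import cv2
import mediapipe as mp

# Hand-landmark tracking from a standard webcam using the MediaPipe Hands solution.
hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)
capture = cv2.VideoCapture(0)  # index 0: default webcam (assumption)

while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    # MediaPipe expects RGB images, while OpenCV delivers BGR frames.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            tip = hand.landmark[mp.solutions.hands.HandLandmark.INDEX_FINGER_TIP]
            # Normalized image coordinates (0..1) of the index fingertip.
            print(f"index fingertip: x={tip.x:.2f}, y={tip.y:.2f}")
    cv2.imshow("hand tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc key stops the loop
        break

capture.release()
hands.close()
cv2.destroyAllWindows()
```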
2.2. Natural Interactions
Natural interaction, where humans interact with machines in the same way as in human communication—through movements, gestures, or speech—is one of the possible solutions for intuitive interaction with 3D interfaces [
14].
A natural user interface (NUI) allows users to interact with digital systems through intuitive, human-like actions, such as speech, hand gestures, and body movements that are similar to the way humans interact with the physical world. This approach moves away from traditional input devices such as keyboard, mouse, or touchpad [
47]. Natural interactions are not necessarily 3D interactions, even though they are often used in 3D interfaces, and the term NUI is commonly associated with them. Their advantage is that the user can draw on a wider range of basic skills when interacting than in conventional GUIs (i.e., UIs that rely mainly on the mouse and keyboard for interaction) [
48].
Much of the concept of natural interactions and interfaces is based on Microsoft’s definitions and guidelines. Bill Buxton, a senior researcher at Microsoft, points out that NUI “exploits skills that we have acquired through a lifetime of living in the world, which minimizes the cognitive load and therefore minimizes the distraction” [
49]. He also emphasizes that NUI should always be developed with the context of use in mind [
49].
Gestures are seen as a technique that can provide more natural and creative ways of interacting with different software solutions. An important reason for this is the fact that hands are the primary choice for gestures compared to other body parts and serve as a natural means of communication between people [
50]. In general, it is assumed that a UI becomes natural through the use of gestures as a means of interaction. However, when looking at multi-touch gestures, e.g., on the Apple iPad [51], which are generally considered an example of natural interaction, it becomes clear that the reality is somewhat more complex. Some gestures on the iPad are natural and intuitive, such as swiping left or right with one finger on the screen to turn pages or move content from one side of the screen to the other, simulating the analog world. Other gestures need to be learned, such as swiping left or right with four fingers to switch between applications. Such gestures are not intuitive for users and require additional learning, as it is not obvious how the interaction should be performed; the user must understand the relationship between the gesture and the action it triggers [
49].
By interaction type, NUIs can be divided into four groups: multi-touch (the use of hand gestures on a touch screen), voice (the use of speech), gaze (the use of visual interaction), and gestures (the use of body movements) [
52]. Another new group of interfaces based on electrical biosignal sensing technologies can be added to this classification [
53].
The emergence of multi-touch displays represents a promising form of natural interaction that offers a wide range of degrees of freedom beyond the expressive capabilities of a standard mouse. Unlike traditional 2D interaction models, touch-based NUIs go beyond flat surfaces by incorporating depth, enhancing immersion, and simulating the behavior of real 3D objects. This allows users to interact and navigate in fully spatial, multi-dimensional environments [
54]. Touchscreen interactions are considered the most widely used form of natural interaction, as they are available on everyday devices such as smartphones and tablets.
Voice user interfaces (VUIs) have experienced significant growth due to their ability to enable natural, hands-free interactions. They are usually supported by machine learning models for automatic speech recognition. However, they still face major challenges when it comes to accommodating the enormous diversity and complexity of human languages [
55]. A major problem is the limited support for numerous global languages and dialects, many of which are underrepresented in existing language datasets. This underrepresentation can lead to misinterpretation and a lack of inclusivity in VUI applications [
56]. In addition, nuances such as context, ambiguity, intonation, and cultural differences pose difficulties for current natural language processing systems. Addressing these issues is critical to the broader adoption and effectiveness of VUIs, especially in multilingual and multicultural environments [
57]. Recent advances in VUI technology include emotion recognition, where the system can recognize and respond to the user’s emotional tone of voice, and multilingual support, which enables interactions in multiple languages [
58]. In two-way voice communication, natural language processing enables the system to understand the meaning of the spoken words, while speech synthesis generates responses in the form of human-like speech. Modern VUIs often integrate systems that handle complex conversations with multi-turn interactions [
59].
Gaze-tracking UIs use eye movements, in particular the direction of the user’s gaze, to control or manipulate digital environments. They interpret eye movements and translate them into commands or actions on a screen or in a virtual environment [
60]. As hands-free interfaces, they are particularly beneficial for assistive technologies, XR, and public installations where traditional input methods can be impractical [
52]. To ensure that the device works as accurately as possible, it must be individually calibrated for each user to calculate the gaze vector [
61]. The two main types of gaze-based interaction are gaze pointing and gaze gestures. Gaze pointing uses the user’s gaze as a pointer and allows the user to select or interact with objects on a screen without having to physically touch the device. The UI responds to the location on the screen or in space that the user is looking at, often with a cursor or focus indicator [
62]. Gaze gestures recognize and interpret complex eye movement patterns such as blinking or switching gaze between targets as commands, enabling a wider range of gestures and interactions [
63,
64]. Gaze-recognizing systems can adapt the content or interface depending on where the user is looking [
65]. Recent advances include multimodal approaches that combine gaze with speech or gestures to improve the user experience [66], or that combine gaze with technologies measuring brain activity to create a brain–computer interface [67]. Work in this area aims to improve accuracy and to overcome interaction errors such as the “Midas touch”, which occurs when the system’s response to a user action does not match the outcome the user expected [
68]. New findings in the field of deep learning have also influenced gaze-tracking technologies, so that more and more authors are working on the application of machine and deep learning in this field. Numerous authors have created and used their own annotated datasets for deep learning, some of which are publicly available, such as the Open EDS Dataset [
69] or Gaze-in-Wild [
70]. Among the approaches used, convolutional neural networks (CNNs) showed the best results in eye tracking and segmentation [
71,
72].
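A common way to reduce “Midas touch” activations in gaze pointing is dwell-time selection, in which a target is only triggered after the gaze has rested on it for a minimum time. A minimal sketch, with a hypothetical dwell threshold, might look as follows.

```python
class DwellTimeSelector:
    """Activate a gaze target only after the gaze has rested on it for `dwell_s` seconds.

    A simple way to reduce unintended ("Midas touch") activations in gaze pointing;
    the 0.6 s threshold is a hypothetical value that would need tuning per application.
    """

    def __init__(self, dwell_s: float = 0.6):
        self.dwell_s = dwell_s
        self.current_target = None
        self.gaze_started_at = None

    def update(self, target_id, timestamp: float):
        """Feed the currently gazed-at target (or None); returns a target to activate, if any."""
        if target_id != self.current_target:
            # Gaze moved to a new target (or away from all targets): restart the timer.
            self.current_target = target_id
            self.gaze_started_at = timestamp
            return None
        if target_id is not None and timestamp - self.gaze_started_at >= self.dwell_s:
            self.gaze_started_at = timestamp  # avoid immediate re-activation
            return target_id
        return None

# Example: the gaze rests on "button_play" long enough to trigger it at t = 0.7 s.
selector = DwellTimeSelector()
for t in (0.0, 0.2, 0.4, 0.7):
    activated = selector.update("button_play", t)
    if activated:
        print(f"activated {activated} at t={t}s")
```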
Gesture-based NUIs allow users to interact with devices by using certain gestures, such as hand or body movements or facial expressions, as input commands. These technologies use body movements to interact with virtual objects and perform tasks, and they depend heavily on the ergonomics of the individual interactions [
73]. One form is hand gesture recognition, where systems use cameras and different sensors (e.g., Leap Motion or Kinect) to capture hand movements and interpret them as commands for tasks such as selecting objects or navigating menus [
44,
74]. Gesture recognition also makes use of wearables such as data gloves, as well as video cameras and infrared or radar sensors [75]. Wearables provide more accurate and reliable results and can give haptic feedback, which gives them an advantage over vision-based systems. A third category comprises systems that combine both approaches. Nowadays, sensors and cameras are preferred for hand tracking as they offer greater freedom of movement [
76]. Another form of gesture-based NUIs is whole-body gesture recognition, which captures body movements for more immersive interactions, as often seen in VR or AR applications [
77,
78]. Similar to hand gesture recognition, cameras and depth sensors are often used to recognize full-body gestures. Body-tracking suits are also used, which capture limb movements, posture, and orientation via sensors such as accelerometers, gyroscopes, and magnetometers [
44]. On a larger scale, ultrasonic or radar sensors are used to recognize the position and movement of the body in space [
75], as well as computer vision algorithms that enable the recognition of whole-body gestures through video analysis [
79,
80]. Facial gesture recognition focuses on recognizing facial expressions as input so that devices can respond to emotions or gaze, which is common in accessibility technologies [
81]. Finally, multimodal gestures combine different types of inputs, such as hand, voice, and gaze inputs, to create a more intuitive and efficient interaction experience [
82].
Technologies for recording electrical biosignals, such as brain–computer interfaces (BCIs) based on electroencephalography (EEG) and electromyography (EMG), are often referred to under the collective term ExG [
53]. With BCIs, it is possible to send commands to a computer using only the power of thought. A standard non-invasive BCI device that does not require brain implants is usually available in the form of an EEG headset or a VR headset with EEG sensors. The design of such devices has evolved, and today smaller devices are produced in the form of EEG headbands that measure brain activity and other data, such as blood flow in the brain, while the mapping of the collected signals to the user’s gestures is complex and relies heavily on a learned model [
83]. EMG signals are used for various purposes, independently or in combination with other sensors, for example, for interpreting the user’s intentions when interacting with virtual objects [
84], for device management [
85,
86], for analyzing muscle mass during rehabilitation [
87], or for speech recognition [
88,
89]. The application of electrical interfaces to capture biosignals faces a number of problems, such as reliably detecting the biological signals and precisely localizing the underlying activity. Biological signals are variable, and errors in detection and classification increase the error rate, resulting in a lower speed of information processing compared to classical interaction methods. An additional problem is the so-called “Midas touch” effect, which has already been mentioned for gaze-tracking UIs. In addition to the technical challenges, there are also ethical concerns, as, for example, the EEG data processed by the system can reveal personal information about the user [
53]. The use of such interfaces in mobile and XR applications is limited by their accuracy and their stability between sessions [
83], but they are essential in certain areas such as rehabilitation, prosthetics, and assistive technologies, especially for people with severe disabilities for whom such interfaces are the only possible form of interaction with systems [
53].
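As an illustration of the signal-processing step that typically precedes gesture or intent classification from EMG, the following sketch band-pass filters a raw EMG trace and extracts its linear envelope; the sampling rate, filter band, and cut-off values are common illustrative choices rather than values from the cited studies.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def emg_envelope(raw, fs=1000.0, band=(20.0, 450.0), env_cutoff=5.0):
    """Band-pass filter a raw EMG trace and extract its linear envelope.

    `fs` is the sampling rate in Hz; the 20-450 Hz band and the 5 Hz envelope
    cut-off are illustrative defaults, not values taken from the cited works.
    """
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype="bandpass")
    filtered = filtfilt(b, a, raw)
    rectified = np.abs(filtered)
    b_env, a_env = butter(4, env_cutoff / nyq, btype="lowpass")
    return filtfilt(b_env, a_env, rectified)

# Example with a synthetic one-second signal: noise plus a short burst of "muscle activity".
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
signal = 0.05 * np.random.randn(t.size)
signal[400:600] += 0.5 * np.sin(2 * np.pi * 80 * t[400:600])
envelope = emg_envelope(signal, fs)
print(f"peak envelope amplitude: {envelope.max():.3f}")
```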
Achieving naturalness in every context of use and for all users is a challenge. While gestures, speech, and touch play an important role in many NUIs, they only feel truly natural if they match the users’ abilities, the specific usage scenario, and the context of the application. Given the technology available at the beginning of the 21st century, it is almost impossible to create a 3D UI that feels natural to all users. Instead of trying to make NUIs universal, one should focus on tailoring each of these interfaces to specific users and contexts [
49].
2.3. Accessibility Parameters
UIs for gesture-based interaction should be designed according to established principles to ensure an intuitive and effective user experience. The system must be able to determine exactly when gesture recognition should begin and end to ensure that unintended movements are not misinterpreted. The order in which actions are performed is also crucial. Designers must clearly define each interaction step required to complete a process. In addition, the UI should be context-aware and provide immediate feedback to the user to confirm a successful action [
38], while individual spatial abilities should be considered, as these abilities significantly affect the user’s performance [
90]. In a global and culturally diverse environment, it is important to consider cultural differences, as they can influence the meaning, pace, and execution style of gestures [
8].
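As a small illustration of the first principle, determining when gesture recognition should begin and end, the following sketch segments gestures from a stream of hand-speed samples; all thresholds are hypothetical and would need tuning for a concrete system.

```python
class GestureSegmenter:
    """Decide when gesture recognition should start and stop, based on hand speed.

    Movements slower than `start_speed` are treated as unintended, and a gesture
    ends once the hand stays slow for `min_idle_frames` frames (hypothetical values).
    """

    def __init__(self, start_speed=0.25, stop_speed=0.05, min_idle_frames=15):
        self.start_speed = start_speed      # m/s needed to begin recognition
        self.stop_speed = stop_speed        # m/s below which the hand counts as idle
        self.min_idle_frames = min_idle_frames
        self.recognizing = False
        self.idle_frames = 0

    def update(self, hand_speed: float) -> str:
        """Feed the current hand speed; returns 'start', 'stop', or 'none'."""
        if not self.recognizing:
            if hand_speed >= self.start_speed:
                self.recognizing = True
                self.idle_frames = 0
                return "start"
            return "none"
        if hand_speed <= self.stop_speed:
            self.idle_frames += 1
            if self.idle_frames >= self.min_idle_frames:
                self.recognizing = False
                return "stop"
        else:
            self.idle_frames = 0
        return "none"

# Example: the hand accelerates on the third frame, so recognition starts there.
segmenter = GestureSegmenter()
print([segmenter.update(v) for v in (0.02, 0.10, 0.40, 0.30)])  # ['none', 'none', 'start', 'none']
```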
Universal design aims to ensure that interactive systems can be used by as many users as possible, regardless of their abilities or experience [
91]. In this context, in 3D environments, usability parameters such as object size, UI responsiveness, and intuitiveness of interaction must be adaptable to different physical, cognitive, and perceptual needs. A flexible design that allows users to customize interaction modes or change spatial layouts improves accessibility and supports user autonomy. This inclusive approach not only improves the overall user experience but also promotes equity in accessing immersive 3D technologies.
The most neglected accessibility principles in serious games are “operability” and “robustness”: operability concerns user control and interaction, while robustness can be improved through support for assistive technologies. To improve accessibility, it is recommended to include features such as automatic transcriptions, sign language, photosensitivity control, external VR devices, and contextual help. It is emphasized that developers of serious games must make considerable efforts to improve accessibility [
92].
As the field of accessibility matures, there are many studies that focus on specific subfields of accessibility research (e.g., people with autism or visual impairment) or on accessibility in different types of applications (e.g., websites, mobile applications, and VR games) [
93]. Most existing accessibility guidelines for applications are based on the W3C Web Content Accessibility Guidelines [
94], the W3C Mobile Accessibility Guidelines [
95], and the Guide to Applying WCAG 2 to Non-Web-based Information and Communication Technologies [
96]. They offer a series of recommendations to make digital content more accessible for all users, based on four basic principles: Perceivable, Operable, Understandable, and Robust (POUR). Applying these principles in development ensures that all users, including those with different types of impairments, can access, navigate, and understand the content. Certain industries, such as the gaming industry, have recently invested considerable resources in improving the accessibility of their products.
Previous research on accessibility in AR/VR environments has shown that these technologies present significant accessibility barriers, which can lead to problems in using their features or to the exclusion of certain user groups, particularly people with certain forms of impairment. Identified issues include limited evaluation of the effectiveness of solutions, a lack of standardized testing methods, and technical barriers such as limited device resources and the need for environmental monitoring. Successful systems emphasize UI adaptability and user involvement in the design process to ensure accessibility [
97]. These findings could be used to identify key areas where further work is needed to develop immersive platforms that are more accessible to people with certain forms of disabilities [
98].
When designing immersive applications for accessibility, there are some important principles to consider. Immersive applications need to provide redundant output options such as audio descriptions or subtitles to ensure that users can interact with the content in a way that best suits their abilities. They should also support redundant input methods and allow the user to interact with the environment by either changing the orientation of the head or using a controller. These applications need to be compatible with a wide range of assistive technologies so that users can use their own way of interacting between the physical world and the virtual space [
99]. In addition, customization options for content presentation and input modalities should allow users to tailor the experience to their specific needs [
100]. Applications should also offer direct assistance or facilitate support when the user encounters difficulties, through dynamic customization of the UI, intelligent agents, or the possibility of external support from friends or caregivers [
99]. Finally, implementing the principles of inclusive design will ensure that text content and controls are accessible to a broad population (e.g., by displaying text in an appropriate size and color) [
101].
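For illustration, the principles above could be represented in an application as a per-user accessibility profile; the following sketch uses hypothetical field names and default values.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AccessibilityProfile:
    """Hypothetical per-user accessibility profile for an immersive application.

    The fields mirror the principles discussed above: redundant output channels,
    redundant input methods, customizable presentation, and external assistance.
    """
    captions_enabled: bool = True                  # redundant output: subtitles for audio
    audio_descriptions_enabled: bool = False       # redundant output: narration of visuals
    input_methods: List[str] = field(default_factory=lambda: ["controller", "head_gaze"])
    text_scale: float = 1.0                        # content customization: UI text size
    high_contrast: bool = False
    photosensitivity_safe_mode: bool = False       # suppress flashing effects
    external_assistance_allowed: bool = True       # allow help from a friend or caregiver

# Example: a profile for a user who relies on voice input and larger text.
profile = AccessibilityProfile(input_methods=["voice", "controller"], text_scale=1.5)
print(profile)
```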
3D interactions can be characterized by three aspects: usability, reliability, and comfort [102]. Usability describes how easy an interaction is to learn, understand, and remember; highly usable interactions are simpler to perform than others. Reliability refers to the likelihood that the tracking device will interpret the interaction correctly, so interactions with high reliability are expected to be recognized most of the time. Comfort describes the physical effort and discomfort involved in performing gestures; highly comfortable interactions can be performed with ease and minimal effort. To make their devices more usable, manufacturers of devices that offer 3D interaction claim to study human behavior and apply evidence-based design practices to identify the gestures, interactions, and haptic sensations that work best for their users [
103]. They apply this knowledge in the design of their hardware and software, which programmers can use to control their devices.
The move from 2D accessibility to accessibility in 3D environments is a major shift. Users are confronted with new types of interface components and interaction styles that may seem strange, unintuitive, or unnatural. Beyond the challenge of adapting to these new elements, further complexity arises because accessibility principles from the 2D world need to be merged with those of real-world interaction. While expertise in 2D interaction is fundamental to translating accessibility to 3D, a perfect 1:1 conversion is often not possible.
As far as the authors are aware, there is no comprehensive research in the literature on the accessibility parameters of 3D user interactions in educational software systems that include 3D visualization.