Review

A Review of Human Intention Recognition Frameworks in Industrial Collaborative Robotics

1 Department of Electrical Engineering, Tshwane University of Technology, Pretoria 0001, South Africa
2 F’SATI, Department of Electrical Engineering, Tshwane University of Technology, Pretoria 0001, South Africa
3 Laboratoire d’Ingénierie des Systèmes de Versailles, UVSQ, Paris-Saclay, 10 Avenue de l’Europe, 78140 Velizy, France
* Author to whom correspondence should be addressed.
Robotics 2025, 14(12), 174; https://doi.org/10.3390/robotics14120174
Submission received: 21 October 2025 / Revised: 14 November 2025 / Accepted: 21 November 2025 / Published: 24 November 2025
(This article belongs to the Section Industrial Robots and Automation)

Abstract

The integration of intention recognition systems in industrial collaborative robotics is crucial for improving safety and efficiency in modern manufacturing environments. This review examines frameworks that enable collaborative robots to understand human intentions. This ability is essential for providing effective robotic assistance and promoting seamless human–robot collaboration, particularly in enhancing safety, improving operational efficiency, and enabling natural interactions. The paper discusses learning techniques such as rule-based, probabilistic, machine learning, and deep learning models. These technologies empower robots with human-like adaptability and decision-making skills. It also explores cues for intention recognition, categorising them into physical, physiological, and contextual cues. It highlights how combining these sensory inputs sharpens the interpretation of human intentions. Additionally, the discussion assesses the limitations of current research, including the need for usability, robustness, industrial readiness, real-time processing, and generalisability across various industrial applications. This evaluation identifies research gaps whose resolution could improve the effectiveness of these systems in industrial settings. This work contributes to the ongoing conversation about the future of collaborative robotics, laying the foundation for advancements that can bridge the gap between human and robotic interactions. The key findings point out the significance of predictive understanding in promoting safer and more efficient human–robot interactions in industrial environments and provide recommendations for its use.

1. Introduction

Modern manufacturing environments are experiencing a significant change with the use of collaborative robots [1]. This shift is driven by changing market demands and the need for greater operational flexibility [2]. These systems combine human skills with robotic precision. This creates workflows that greatly improve productivity while maintaining high safety standards [3]. At the core of this technological revolution lies the critical capability of intention recognition [4,5]. This refers to a robot’s ability to understand and respond to human intentions during collaborative tasks [6,7].
In industrial settings, human intentions include both the desired outcomes and the planned actions during human–robot interaction [8,9]. For example, in assembly work, a worker’s intention to place a component accurately requires the collaborative robot to anticipate this action and provide the right support, such as correctly orienting parts or tools [10]. This ability goes beyond just executing commands. It also involves understanding the subtle behavioural cues and context that shape human decision-making [11]. In industrial settings, humans and robots work together under a supervisory system. This system assigns tasks to both humans and robots. The robot can infer the context and goals of the task, allowing it to determine how to assist and work alongside the human [12].
The importance of robust intention recognition systems is evident in three key areas. First, safety improvement is crucial. Intention recognition provides the robot with a better understanding of the exact human safety level, enabling it to operate at an optimal speed. This helps balance human safety in every work situation while also increasing productivity [13]. Second is operational efficiency. Clear intention interpretation allows robots to provide timely assistance, improving workflow continuity and shortening task completion times [14]. Third is the facilitation of fluid human–robot interaction. The collaboration between the robot and the human will be more natural, fluent, and effective when the robot can predict the object the human user needs next [15].
Figure 1 presents a model proposed for recognising human intentions in collaboration between humans and robots, based on a study by Lin et al. [8]. This model is adopted because it seeks to find an ideal course of action by taking a holistic view of the entire task from the starting point to the final step. The figure highlights a principle of teamwork, where the robot acts as both a co-worker and a supportive helper. The human block represents the starting point for task execution. Implicit goals turn into visible actions, like gestures and object manipulation. The perception block captures these actions using sensing devices, such as cameras, motion trackers, or force sensors, to extract important features of human behaviour and the environment.
The learning block has computational models that analyse the captured perception data to find patterns in how humans make decisions and what actions they take. The processed information is then stored in the database block, which keeps a structured record of demonstrations, object states, and context. Together, the learning block and database block form the initial pipeline that changes raw human behaviour into structured knowledge suitable for further interpretation.
The human intention recognition block builds on this pipeline to infer the goals of the human collaborator. Using the learning models, the system turns perceived actions and environmental states into a clear plan for intended human behaviour. Before the robot executes any response, the human approval block acts as a validation mechanism to ensure that the robot’s suggested actions match the human expectations. This merges at the action integration block, promoting safety and trust in industrial workspaces.
Finally, the task block represents the stage of collaborative execution. Here, the robot acts as both a co-worker and a helper. It performs complementary steps, offers corrective assistance, or makes up for incomplete actions. This sequence of blocks shows a closed-loop process. In this process, human demonstrations are perceived, learned, recognised, validated through human approval or rules, and ultimately turned into effective and adaptive human–robot collaboration [8].
Early generation intention recognition systems were built on rigid algorithms and basic sensor data. These methods were not effective in dynamic industrial settings [16,17]. This pushed the development of Artificial Intelligence (AI) and Cognitive Robotics [18,19]. As a result, we now have advanced learning algorithms, the integration of various sensors, and processing models that consider context. Modern systems have made significant strides in accuracy and responsiveness [20].
However, even with these technological advances, challenges still exist in achieving seamless human–robot collaboration. One major challenge is the need for real-time processing in complex environments. Another is the requirement for systems that can work across different industrial applications, which is the issue of generalisability [18,21]. Additionally, as collaborative tasks become more intricate, intention recognition systems need to manage multiple sensor inputs while staying reliable in operation [21].
At first, intention recognition in industrial collaborative robotics relied heavily on pre-programmed routines or simple sensors [10,22]. These provided robots with explicit cues to interpret basic human actions, such as pressing a button to start a task or moving a hand into the robot’s space [10,22]. These methods had significant limitations for safety and efficiency. Pre-programmed routines restricted robots to a limited range of expected human behaviours. This increased the risk of accidents in unpredictable situations [23]. Basic sensors, such as limit switches or proximity sensors, often lacked the precision and adaptability needed to understand human intentions fully. This further compromised safety [24]. Additionally, these simple methods limited the effectiveness of human–robot collaboration. They required explicit programming and could not easily adapt to the changing nature of real-world interactions [25]. As a result, early industrial collaborative systems struggled to fit seamlessly into human workflows. This highlighted the need for better recognition technologies that can understand both implicit and explicit communication to improve safety and operational efficiency in collaborative environments [18].
Recent developments in Cognitive Robotics and AI are transforming industrial human–robot collaboration, making it more efficient and safer [26]. Modern intention recognition systems have moved past the limits of rigid, pre-programmed routines. They now aim to imitate human abilities such as learning, decision-making, and sensing [27]. These improvements attempt to equip collaborative robots with the ability to adapt to dynamic environments and respond effectively to unexpected situations [28]. Ideally, collaborative robots will learn to predict human needs in industrial settings [29,30]. By observing human actions and proactively preparing tools and materials, these robots will cut down on wasted time looking for items and reduce idle periods, leading to a significant boost in overall efficiency.
This review paper looks at how intention recognition systems in industrial collaborative robotics have evolved. We critically examine two key advancements: learning techniques and cues for recognising human intention. We evaluate their potential for transformation while identifying ongoing limitations. Based on this analysis, we suggest important research directions to tackle current challenges. This work aims to ultimately advance the development of more adaptive, efficient, and safe human–robot collaborative systems. To present this analysis clearly, the paper is structured as follows. Section 2 examines the learning algorithms that enable human-like adaptability and decision-making. Section 3 discusses cues for intention recognition, examining various sensory modalities for interpreting intentions. Section 4 critically evaluates the limitations of current research and highlights promising gaps for future study. Finally, Section 5 concludes with a summary of the key findings.

2. Learning Techniques for Intention Detection Towards Human-like Adaptability

Learning models are changing industrial robotics. They enable robots to intelligently perceive, interpret, and respond to human actions and intentions in shared workspaces, similar to human cognition [31]. Unlike traditional industrial robots that are limited to specific and pre-programmed tasks, modern collaborative robots with learning algorithms can adapt their behaviour based on real-time cues and human interactions [31]. This section explores the development of learning algorithms in industrial collaborative robotics for intention recognition. As shown in Figure 2, the learning techniques are categorised as Rule-Based, Probabilistic, Machine Learning, and Deep Learning, as described in this section.

2.1. Rule-Based Approaches

In the field of industrial collaborative robotics, rule-based methods are essential for understanding human intentions and guiding robot behaviour. They are often used as modelling techniques for intention inference [32]. These structured methods depend on explicit logic. Experts manually encode task knowledge; as a result, the models operate strictly with discrete states and events. This makes them easy to understand and transparent. The robot’s reasoning can be traced directly to a specific human-readable rule or a transition in a formal graph [33,34]. Table 1 provides rule-based methods for recognising human intentions in industrial collaborative robotics. It outlines the types of computational models, input data, sensing devices, scenarios, key architecture, performance, and notable limitations in each work.
Several different rule-based approaches have been developed to address different parts of intention recognition and robot control. Each method has its own strengths in managing collaboration.
Approaches like Finite-State Machines (FSMs) are often used because they can model sequential processes and the changes between defined states [35,42,43]. In this context, an FSM can represent different stages of a collaborative task. Human actions trigger transitions between states. This allows the robot to predict the human’s next move and adjust its behaviour accordingly. Zhao et al. [35] developed a human–robot collaborative assembly method that uses FSM in a virtual environment to tackle efficiency issues in traditional human–robot collaboration. The FSM acts as a state transfer module. It interprets human gestures and eye gaze data to control the robot’s operations and switch between “instruction” and “mapping” modes. Hand data is collected using a Leap Motion Sensor. It recognises nine gestures with 100% accuracy based on finger joint angles. Eye gaze movement is captured with a Tobii Eye Tracker 4c (Tobii, Stockholm, Sweden). This, combined with hand inputs, assists in selecting objects. The collaborative task features a Six-Degrees-of-Freedom (DOF) robot that assembles building blocks in a virtual environment. This eye–hand interaction model proved more efficient. It significantly cut the total assembly time by 57.6%, compared to traditional key press methods. The average task completion time was 5.9 to 6.2 min versus 13.5 to 15.2 min, respectively. The main limitations noted include the virtual environment’s inability to fully simulate real-world factors like delay and stability. There is also a lack of sensory feedback, such as grip force, for the user.
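To make the FSM idea concrete, the following minimal Python sketch shows a state-transfer module of the kind described above, switching between hypothetical "idle", "instruction", and "mapping" states on recognised gesture events; the state and event names are illustrative assumptions, not those of Zhao et al. [35].

```python
# Minimal sketch (not the authors' implementation): a finite-state machine that
# switches a virtual-assembly controller between "instruction" and "mapping"
# modes on recognised gesture events. State and event names are illustrative.

class AssemblyFSM:
    def __init__(self):
        self.state = "idle"
        # (current_state, event) -> next_state
        self.transitions = {
            ("idle", "gesture_select"): "instruction",
            ("instruction", "gesture_confirm"): "mapping",
            ("mapping", "gesture_release"): "idle",
            ("mapping", "gesture_abort"): "instruction",
        }

    def on_event(self, event):
        """Advance the FSM if (state, event) matches a defined transition."""
        next_state = self.transitions.get((self.state, event))
        if next_state is not None:
            self.state = next_state
        return self.state


fsm = AssemblyFSM()
for evt in ["gesture_select", "gesture_confirm", "gesture_release"]:
    print(evt, "->", fsm.on_event(evt))
```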
Behaviour Trees (BTs) provide a modular and hierarchical structure for organising complex robot behaviours and decision-making logic [36,44]. They break down high-level tasks into smaller, manageable sub-behaviours [45]. This offers a flexible and strong way to handle the robot’s responses to human actions and intentions. Styrud et al. [36] introduced Behaviour Tree Expansion with Large Language Models (BETR-XP-LLM). This method automatically expands and sets up BTs as robot control policies using natural language input. They demonstrated it with robotic manipulation tasks. The input data includes natural language instructions, a list of objects in the scene, and formal goal condition options. Sensing is performed with an Azure Kinect camera that uses YoloWorld for object detection and NaoSAM for segmentation. This setup, along with depth data and heuristics, provides estimates of object positions. The system has been tested in real robot experiments with an ABB YuMi robot. It performs tasks like picking and placing cubes or inserting test tubes. Its key features include a combination of Large Language Models (LLMs), specifically GPT-4-1106, with a reactive task planner. This approach allows the LLM to go beyond just interpreting goals; it also helps resolve errors and propose new preconditions during planning and execution. An improved prompt and LLM remove the need for reflective feedback. The system updates the BT policy permanently when it fails, which improves robustness and retains transparency and readability. The method reached nearly perfect accuracy in interpreting goals and received perfect scores in solving missing predictions. It also reduced LLM calls to save time and cost, as the planner works almost instantly. Limitations include uncertainty about scalability with thousands of objects or conditions. It also struggles to resolve unclear instructions without extensive user communication, and it cannot currently generate missing skill library actions.
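The sketch below illustrates the basic behaviour-tree mechanics discussed above with plain Python classes; the Sequence/Fallback semantics are standard, but the leaf names and the pick-and-place policy are illustrative assumptions rather than the BETR-XP-LLM implementation.

```python
# Minimal behaviour-tree sketch (illustrative, not BETR-XP-LLM): a Sequence node
# succeeds only if all children succeed; a Fallback (Selector) node succeeds as
# soon as one child succeeds. Leaves are plain callables returning True/False.

class Sequence:
    def __init__(self, children): self.children = children
    def tick(self):
        return all(child.tick() for child in self.children)

class Fallback:
    def __init__(self, children): self.children = children
    def tick(self):
        return any(child.tick() for child in self.children)

class Leaf:
    def __init__(self, name, fn): self.name, self.fn = name, fn
    def tick(self):
        result = self.fn()
        print(f"{self.name}: {'SUCCESS' if result else 'FAILURE'}")
        return result

# Hypothetical pick-and-place policy: grasp the cube if it is visible,
# otherwise fall back to asking the human for clarification.
cube_visible = lambda: True
tree = Sequence([
    Fallback([Leaf("cube_visible", cube_visible),
              Leaf("ask_human", lambda: False)]),
    Leaf("pick_cube", lambda: True),
    Leaf("place_cube", lambda: True),
])
tree.tick()
```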
Ontology-based rules are used to formally represent knowledge within the collaborative workspace. This includes objects, tools, tasks, and the relationships between them [37]. Olivares-Alarcos et al. [37] proposed OCRA, an Ontology for Collaborative Robotics and Adaptation. It uses a formal logic framework, with axioms and rules in First Order Logic (FOL) and Web Ontology Language Description Logic (OWL DL). This enables reasoning about plan adaptation and collaboration to interpret human intent. The ontology aims to formalise knowledge so robots can understand the causes of adaptation, like a human filling a target compartment. The validation use case involves an industrial kitting task, where a human and a robot work together to fill a tray. The system processes input data, including hand pose and velocity measurements taken at 100 Hz with devices like the HTC Vive tracker and RFID-based detection on the board. It computes concepts such as collision risk, known as Time-To-Contact (TTC). The main structure uses an OWL 2 reasoner (HermiT) to perform intention inference over the populated knowledge base (ABox). The validation was qualitative. It showed the ontology’s ability to answer competency questions and maintain robustness in difficult cases. One noted limitation is the reduced expressiveness of the computational OWL DL formalisation compared to the complete FOL version. This limitation particularly affects the representation of ternary relationships over specific time intervals and can lead to inaccuracies in temporal reasoning during certain difficult cases.
In another case, Akkaladevi et al. [38] used a semantic knowledge-based reasoning framework to learn and recognise human activities from demonstrations. They linked these activities to the assembly process to understand human intentions during teaching. It generates rules online through abstract semantic interaction with the user. This method overcomes the limitation of needing pre-defined rules and allows the framework to manage new situations and process variations. The input data in this case comes from perception data that relates to human action recognition and object tracking. The system relies on components that can track objects and recognise actions in real-time. The framework supports human–robot collaborative teaching in industrial assembly and acts as a user guidance system. It employs a knowledge-driven description logic with GraknAI for semantic representation, outlining process states, events, and their logical, temporal, and spatial relationships. The qualitative benefits include better knowledge transfer and less need for training data compared to single-layer methods. This work mainly addresses issues of manually modelling event rules by generating them online.
Fuzzy rules deal with uncertainty and imprecision that often occur in human–robot interaction [39]. Unlike traditional binary logic, fuzzy logic allows for degrees of truth. This enables the robot to handle unclear human inputs or changing environmental conditions smoothly [39,46]. As a result, this leads to more fluid and natural collaboration. Zou et al. [39] proposed a novel approach for predicting human intention for a handover in human–robot collaborations. They used a wearable data glove for sensing and fuzzy rules for prediction. The system uses a data glove with six Inertial Measurement Units (IMUs) to detect human handover intentions, focusing on hand gestures represented by the bending angles of the five fingers. This method overcomes the drawbacks of vision-based methods, such as visual occlusion and a limited sensing area, as well as physical contact methods that pose safety risks. A key feature of this approach is a fast Human Handover Intention Prediction (HHIP) model based on fuzzy rules. These rules are defined using thresholds for the bending angles of five fingers, which reduces the computational load compared to using the original quaternions. The approach also seeks to eliminate interference from physiological electrical signals typically found in Electromyography (EMG) sensors. It collects signals from finger movements to predict intentions shown through gestures. In experiments, the method achieved an average prediction accuracy of 99.6% across 12 human handover intentions. It had a low standard deviation of prediction errors, around 0.008. This shows its reliability and effectiveness in real-time handovers between humans and robots, as well as in adjusting robot motion modes. However, the method has limitations. Its prediction accuracy is not yet 100%, and it mainly focuses on sensing and prediction. Future work is needed to better integrate it with motion planning and possibly include other types of information, like human gaze or speech, to improve the performance.
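As a rough illustration of rule-based handover prediction from finger bending angles, the sketch below maps angles to coarse linguistic states and then to intention labels; the 45° threshold and the two example gestures are assumptions, not the published fuzzy rule base of Zou et al. [39].

```python
# Illustrative sketch of threshold rules over finger bending angles (degrees),
# in the spirit of the fuzzy-rule handover predictor; the threshold and the
# two example gestures below are assumptions, not the published rule base.

def finger_state(angle_deg, bent_above=45.0):
    """Map a bending angle to a coarse linguistic state."""
    return "bent" if angle_deg >= bent_above else "straight"

def predict_handover_intention(angles):
    """angles: dict of bending angles for thumb..pinky."""
    states = {f: finger_state(a) for f, a in angles.items()}
    if all(s == "straight" for s in states.values()):
        return "open_palm_request"          # e.g. requesting an object
    if states["thumb"] == "straight" and all(
            states[f] == "bent" for f in ("index", "middle", "ring", "pinky")):
        return "thumb_up_confirm"           # e.g. confirming a handover
    return "unknown"

print(predict_handover_intention(
    {"thumb": 10, "index": 70, "middle": 65, "ring": 80, "pinky": 75}))
```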
Temporal Logic is important for reasoning about sequences of events and actions that depend on time. It helps the robot understand how a person’s actions progress over time [40,47,48]. Cao et al. [40] use a rule-based reasoning method based on Temporal Logic to infer operator intention and determine robot responses. It handles time-related aspects by combining the current assembly action with part recognition information. This approach accounts for flexible assembly sequences and helps infer completed and upcoming tasks. The input data includes assembly video sequences captured by an Azure Kinect DK camera. From these, skeletal data is extracted using OpenPose for action recognition, while images are used for part recognition. The collaborative task is a human–robot collaborative decelerator assembly. It involves tasks such as assembling the key, gear, left bearing, bushing, and right bearing. The main framework combines Spatial–Temporal Graph Convolutional Networks (ST-GCN) for skeleton-based assembly action recognition, which learns spatial–temporal patterns, with an improved YOLOX model for part recognition. The improvement to YOLOX includes the Convolutional Block Attention Module (CBAM) to better focus on parts and the Focal Loss function to address sample imbalance, improving the recognition of difficult, occluded, or small parts. In terms of performance, ST-GCN reached a Top-1 accuracy of 40% for action recognition at about 15 fps. The improved YOLOX model showed a mean average precision (mAP) of 96.89% at 54.37 fps, indicating better recognition for challenging parts. Its limitations include a lower accuracy of assembly action recognition due to similar actions, occlusion, and inconsistent body movements. This highlights the need for a combined human–object integrated approach for reliable intention inference.
Finally, Petri Nets offer a strong graphical and mathematical tool for modelling concurrent and asynchronous processes, common in human–robot collaboration [49]. They represent the flow control and resources effectively. This allows the robot to synchronise its actions with humans and grasp the overall progress of the collaborative task [50]. Llorens-Bonilla and Asada [41] developed a system using Coloured Petri Nets (CPNs) to control and coordinate Supernumerary Robotic Limbs (SRLs) with human workers. The CPNs model performed specialised aircraft assembly tasks, like securing an intercostal on an aeroplane’s fuselage. These tasks often require multiple workers due to tight spaces and the tasks’ complexity. The main feature of this task model is its ability to handle the concurrent and uncertain nature of these tasks. It addresses challenges such as parallel processes, resource allocation, and timing. CPNs combine graphical notation—places as states, transitions, arcs, and tokens for tools and resources—with high-level programming to define how to identify resources using “colours”, express constraint laws with arc expressions, and set transition criteria with guard functions. The input data for CPN transitions comes from detecting the human worker’s intentions, which are based on specific gestures and postures. These intentions are sensed through wearable Inertial Measurement Units (IMUs) placed at the wrists and back of the head. For example, Z and Y gyro readings help detect nods, while Euler angles from accelerometers assist in recognising postures. The system demonstrated the ability to successfully detect 90% of nods and accurately identify the correct final postures in validation tests. The CPN model itself underwent thorough checking with the state space method. This confirmed that the task was completed without token duplication, misplacement, or starvation, resulting in the desired end state. A limitation of the current system is its dependence on simple gestures that can be classified linearly.
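The following toy example shows ordinary Petri-net token flow for a single shared assembly step; it is far simpler than the Coloured Petri Nets used for the SRL system, and the place and transition names are hypothetical.

```python
# Minimal Petri-net sketch (ordinary, uncoloured) to illustrate token flow for
# one shared assembly step; place and transition names are hypothetical.

places = {"human_ready": 1, "tool_available": 1, "step_done": 0}
transitions = {
    "perform_step": {"inputs": ["human_ready", "tool_available"],
                     "outputs": ["step_done"]},
}

def enabled(t):
    """A transition is enabled when every input place holds a token."""
    return all(places[p] > 0 for p in transitions[t]["inputs"])

def fire(t):
    """Consume input tokens and produce output tokens if enabled."""
    if not enabled(t):
        return False
    for p in transitions[t]["inputs"]:
        places[p] -= 1
    for p in transitions[t]["outputs"]:
        places[p] += 1
    return True

print(fire("perform_step"), places)
```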
Table 2 provides a comparison of rule-based computational models for understanding intentions in collaborative robotic tasks. It outlines their advantages, disadvantages, representative application scenarios, and potential improvements for future research.

2.2. Probabilistic Models

Probabilistic models are useful for recognising human intent in industrial collaborative robotics. They can accurately predict how a robot should respond to human actions. This improves the reliability and interpretability of intent prediction in changing human–robot interactions [43].
They are especially well-suited for industrial collaborative tasks because they can handle the inherent uncertainty and variability in human behaviour [51]. Unlike rule-based models, probabilistic models assign probabilities to various possible intentions. This allows the robotic systems to make more informed decisions [52]. Thus, the collaborative robot can select the safest and most probable action, making decision-making in dynamic collaborative environments more flexible. Table 3 shows applications of probabilistic models in industrial collaborative robotics for intention recognition.
Bayesian Networks (BNs) are useful because they clearly show causal relationships between variables. This improves interpretability, handles temporal dependencies, and combines prior knowledge with current data. As a result, it increases the reliability of intent prediction in dynamic human–robot collaboration [51]. Hernandez-Cruz et al. [51] developed a novel Bayesian intention framework for improved human–robot collaboration. They demonstrated it in a tabletop pick-and-place scenario involving a UR5 robot making cereal. This framework uses a BN to model causal relationships between variables. It combines information from top-down and bottom-up approaches. The input includes head orientation, hand orientation, and hand velocity, which are all extracted from RGB-D frames captured with an Intel RealSense D455 depth camera. This probabilistic method updates prior probabilities in real-time. This allows for predicting human intent, which helps with task adaptation, trajectory replanning, and collision avoidance by creating a virtual ellipsoid obstacle. The Bayesian Intention model predicts intent in 2.69 milliseconds (ms), achieving 89.55% accuracy, 91.80% precision, and an 89.77% F1 Score. This represents an 85% increase in accuracy, a 36% increase in precision, and a 60% increase in F1 Score over the best single-modality baseline. While this framework is interpretable and can handle temporal dependencies, it may show occasional mispredictions due to highly curved hand movements or unclear object affordances.
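The snippet below gives a toy flavour of probabilistic intention fusion: a discrete prior over three hypothetical intentions is updated with assumed likelihoods for two cues (gaze target and hand motion). It is a naive-Bayes stand-in for illustration only and does not reproduce the network structure of Hernandez-Cruz et al. [51].

```python
import numpy as np

# Illustrative Bayes update over three hypothetical intentions, fusing two
# discretised cues (gaze target, hand speed) with assumed likelihood tables;
# this is a toy stand-in for a full Bayesian network, not its structure.

intentions = ["reach_bowl", "reach_cereal", "idle"]
prior = np.array([1 / 3, 1 / 3, 1 / 3])

# P(cue value | intention); rows follow the order of `intentions`.
p_gaze_on_bowl = np.array([0.8, 0.3, 0.2])
p_hand_moving = np.array([0.9, 0.9, 0.1])

posterior = prior * p_gaze_on_bowl * p_hand_moving   # naive-Bayes fusion
posterior /= posterior.sum()

for name, p in zip(intentions, posterior):
    print(f"P({name} | cues) = {p:.2f}")
```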
Hidden Markov Models (HMM) are useful for intention recognition because they work well with smaller sample sizes for learning and prediction. They are also easy to interpret, handle recognition errors well, and can improve through incremental learning [53,56]. Qu et al. [53] introduced a new method for predicting operator assembly intent in Human–robot Collaboration Assembly (HRCA) specifically for reducer assembly. Their approach uses an HMM, which analyses action state sequences captured from video data by devices like the Xiaomi 11. The HMM has important features such as integration with assembly task constraints, treating assembly tasks as hidden states and actions as observed states, and a learning method that adjusts parameters over time.
This design helps reduce the effects of action recognition errors and improves how well predictions can be understood. The method reached an assembly intention prediction accuracy of 90.6%, which is a 13.3% increase compared to HMMs without task constraints. Limitations include the need for more optimisation to improve generalisation and real-time performance for broader practical use.
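A minimal forward-algorithm sketch illustrates the HMM inference principle used here: hidden assembly steps are inferred from a sequence of recognised actions. The two-state model and all probabilities below are toy assumptions, not the parameters learned by Qu et al. [53].

```python
import numpy as np

# Toy forward-algorithm sketch: infer the most probable hidden assembly step
# from a sequence of recognised actions. The two-state model and all matrices
# are illustrative assumptions.

states = ["fit_gear", "tighten_bolt"]
A = np.array([[0.7, 0.3],        # transition probabilities between steps
              [0.2, 0.8]])
B = np.array([[0.6, 0.3, 0.1],   # P(observed action | step)
              [0.1, 0.2, 0.7]])
pi = np.array([0.5, 0.5])        # initial step distribution
obs = [0, 0, 2]                  # indices of recognised actions over time

alpha = pi * B[:, obs[0]]        # forward recursion
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
belief = alpha / alpha.sum()
print(dict(zip(states, belief.round(3))))
```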
Particle filters have shown promising results in tracking dynamic and continuously changing low-level human intentions [52]. They output a probability distribution over task intentions. This is important for enabling seamless, strong, and flexible human–robot collaboration. Huang et al. [52] use a Mutable Intention Filter (MIF), a type of particle filtering, for tracking intentions in a human–robot collaboration during an industrial assembly task. In this task, a human and a UR5e robot work together to assemble four pairs of Misumi Waterproof E-Model Crimp Wire Connectors. The human aligns the parts while the robot performs forceful pushing actions. The system is designed to recover well from failures. The MIF captures observed 3D human wrist movements and possible task intention areas. These wrist movements are captured at 30 Hz using Intel RealSense RGBD cameras and OpenPose with a Kalman Filter. The MIF is improved with an Intention-aware Linear Model (ILM) as its prediction model. It works in situations where the robot does not influence human actions. Therefore, this generates a probability distribution over task intentions for assembly goals and failure recovery.
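The toy particle filter below conveys the idea of tracking a distribution over candidate goals from wrist motion: each particle carries a goal hypothesis whose weight grows when the observed motion heads towards that goal. The goal positions, noise scale, and likelihood model are assumptions, not the MIF/ILM formulation of Huang et al. [52].

```python
import numpy as np

# Toy intention particle filter: each particle holds one of three candidate
# goal indices; its weight grows when the observed wrist motion points towards
# that goal. Goals, noise scale, and the likelihood are assumptions.

rng = np.random.default_rng(0)
goals = np.array([[0.4, 0.0], [0.0, 0.4], [-0.4, 0.0]])   # candidate targets (m)
n = 300
particles = rng.integers(0, len(goals), size=n)            # goal index per particle
weights = np.full(n, 1.0 / n)

def update(wrist_xy, prev_xy, sigma=0.05):
    """Re-weight and resample particles given one new wrist observation."""
    global particles, weights
    motion = wrist_xy - prev_xy
    for g in range(len(goals)):
        direction = goals[g] - prev_xy
        direction /= np.linalg.norm(direction) + 1e-9
        # likelihood: alignment of observed motion with the heading to goal g
        err = np.linalg.norm(motion - np.linalg.norm(motion) * direction)
        weights[particles == g] *= np.exp(-0.5 * (err / sigma) ** 2) + 1e-12
    weights /= weights.sum()
    idx = rng.choice(n, size=n, p=weights)                  # resample
    particles = particles[idx]
    weights = np.full(n, 1.0 / n)
    return np.bincount(particles, minlength=len(goals)) / n

# Wrist moves roughly towards the first goal -> its probability mass increases.
print(update(np.array([0.10, 0.01]), np.array([0.0, 0.0])).round(2))
```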
The Probabilistic Dynamic Movement Primitive (PDMP) model is another probabilistic method useful in dynamic human–robot collaboration. Lou and Mai [54] developed a framework that uses the PDMP model as the main learning algorithm for inferring human intentions and predicting hand motion in real-time. The learning process has two stages. The first stage is offline, where multiple PDMPs are built by training them on a set of demonstrated 3D human hand motion paths that provide the input data. These paths are captured with a Microsoft Kinect V1 as a sensing device. The second stage is online. In this stage, the trained PDMPs analyse the ongoing motion to infer human intention and predict the next movement of the hand. The framework performed well, achieving accurate intention inference and trajectory prediction. One of its main strengths is its ability to generalise and adjust to new, previously unseen environments. Nonetheless, there are significant prediction errors during the early stages of movement. Other errors include differences between the assumed and actual target position and timing mismatches.
Gaussian Mixture Models (GMMs) are useful for recognising human intentions in industrial collaborative robotics [55]. Lyu et al. [55] used GMMs for estimating human intentions or targets in shared-workspace human–robot pick-and-place and assembly tasks. The GMMs were trained using the unsupervised Expectation–Maximisation (EM) algorithm on data from human palm trajectories. They take both observed and short-term predicted human-arm and hand-palm trajectories as inputs. This data is captured by a PhaseSpace Impulse X2 motion-capture system using LED markers placed on the human shoulder, elbow, wrist, and palm. A key feature of the method is the combination of observed and predicted trajectories, which significantly improves target estimation accuracy during the early stages of motion. This is important for 12 closely spaced targets, where initial trajectories appear highly similar. The approach provides much more accurate and robust estimates once at least 30% of the human motion is observed. GMM results are updated at about 20 Hz. However, a limitation is that some false classifications still occur between certain nearby targets, such as targets 3 and 10, or 6 and 12, where human hand trajectories first pass over intermediate targets. Table 4 presents a focused comparison of different probabilistic models for understanding intentions in collaborative robotic tasks. It clearly shows their advantages, disadvantages, representative application scenarios, and potential improvements. This information will help in selecting an appropriate model for complex industrial collaborative tasks.
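As a closing illustration of the probabilistic family, the sketch below fits a two-component GMM with the EM algorithm (via scikit-learn) on simulated 2D palm positions and returns component responsibilities for a partially observed sample; the data and component count are placeholders for the motion-capture trajectories and twelve targets described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative target estimation with a GMM fitted by EM on simulated 2D palm
# positions around two candidate targets; the real system uses motion-capture
# trajectories and many more targets.

rng = np.random.default_rng(1)
target_a = rng.normal([0.3, 0.1], 0.02, size=(100, 2))
target_b = rng.normal([0.5, 0.3], 0.02, size=(100, 2))
X = np.vstack([target_a, target_b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                                   # unsupervised EM fitting

observed_palm = np.array([[0.32, 0.12]])     # partially observed motion sample
print("component responsibilities:", gmm.predict_proba(observed_palm).round(3))
```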

2.3. Machine Learning Models

In industrial collaborative robotics, Machine Learning (ML) techniques have significantly improved human intention recognition. This has led to key results for enhancing human–robot collaboration [2]. Table 5 provides the implementation of ML models in industrial collaborative tasks for intention recognition.
Effective human–robot collaboration depends on a robot’s cognitive model. This model gathers inputs from the environment and the user. It processes and translates this information into data that enables the robot to adjust its behaviour [61]. ML is central to this and is embedded in the robot’s behavioural block [62]. ML algorithms fall into three main types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning [2]. Each of these approaches offers different benefits for creating cognitive models important for human–robot collaboration.
Supervised Learning is essential for developing cognitive models that can recognise human intentions in industrial collaborative tasks. This capability boosts the safety and efficiency of human–robot collaboration [63]. This approach involves collecting and carefully labelling human action data [10]. For instance, Olivares-Alarcos et al. [10] developed a system for industrial collaborative robots that relies heavily on Supervised Learning to interpret human operators’ intentions. This technique uses force data. The system collects data from an ATI Multi-Axis Force/Torque Sensor Mini40-SI-20-1, attached to the robot’s wrist. The sensor samples at a frequency of 500 Hz. The input data consists of labelled force and torque signals, with samples ranging from 0.5 to 3 s, padded with zeros for uniformity, and obtained from six sensor axes. The collaborative task is inspired by a car emblem manufacturing line, where a robot picks up and places items, while a human inspects and polishes. At the core of the learning algorithm is the K-Nearest Neighbours (KNN) algorithm, mainly used for Raw Data-based Classification. Validation with 15 users showed a positive adaptation trend. An F1 score of 98.14% and inference time of 0.85 s were achieved. The key limitations included a non-symmetric confusion matrix with bias toward the “move” intent, which was the easiest to identify.
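A minimal scikit-learn sketch of the KNN step is shown below, classifying six-axis force/torque feature windows into intention labels; the synthetic data and the three labels are placeholders inspired by the task description, not the authors' dataset or feature pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Sketch of KNN classification of wrist force/torque windows into intention
# labels; the synthetic 6-axis data and the three labels are placeholders.

rng = np.random.default_rng(2)
n_per_class, n_axes = 50, 6
classes = ["polish", "inspect", "move"]

X = np.vstack([rng.normal(loc=i, scale=0.5, size=(n_per_class, n_axes))
               for i in range(len(classes))])
y = np.repeat(classes, n_per_class)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

new_window = rng.normal(loc=2, scale=0.5, size=(1, n_axes))  # resembles "move"
print("predicted intent:", knn.predict(new_window)[0])
```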
Unsupervised Learning methods provide useful solutions for understanding human intentions in collaborative robotics. These approaches allow robots to find hidden patterns, structures, and relationships in unlabelled human action data, such as raw sensor streams and joint trajectories. This helps robots infer intentions without needing explicit prior knowledge or human labels [57,58,59,60,64]. Lou et al. [57] demonstrated the effectiveness of an unsupervised online learning algorithm. This method allows for on-the-fly model building and adaptation to new motion styles. Also, it was able to adjust to noisy observations without needing labelling or offline training. Input data included human palm position (PP) for early recognition and arm joint centre positions (AJCP) for trajectory prediction. The data was captured by a VICON system at 100 frames per second. The framework used an Unsupervised Online Learning Algorithm (UOLA), which updates two layers of GMMs through incremental Expectation–Maximisation (EM) or initialises new ones with Random Trajectory Generation (RTG). This technique used a “ratio prior” to reduce the impact of atypical motion and trajectory prediction through Gaussian Mixture Regression (GMR). In real-time experiments, the system achieved high success rates (99.0% in simple, 93.0% in realistic scenarios), fast re-planning (0.5032–1.02 s), and 70% accuracy within 0.5–1.0 s. Its limitations included lower early prediction accuracy in complex settings, inaccurate goal inference when objects were grasped, and the need for manual motion segmentation.
Similarly, Vinanzi et al. [58] used unsupervised learning, specifically X-Means dynamical clustering. This approach was implemented to analyse human skeletal data captured by the iCub robot’s eye cameras in a collaborative block-building game scenario. Their artificial cognitive framework included a low-level component for processing skeleton data, which used OpenPose for feature extraction, Principal Component Analysis (PCA) for reducing dimensions, and X-Means for clustering to represent actions as clusters. It also featured a high-level Hidden Semi-Markov Model (HSMM) supported by an anticipator for making probabilistic intention predictions. The system showed 100% accuracy in predicting intentions, with an average latency of 4.49 s. This allowed collaboration to begin when the human partner had completed approximately 57.5% of their action. A major advantage of this unsupervised approach is its lightweight nature, robustness, and ability to learn an open-ended set of goals without requiring handcrafted plan libraries or large datasets. Its current reliance solely on skeletal input is a major limitation noted in the study.
Furthermore, Xiao et al. [59] developed an unsupervised robot learning method to predict human motion. They demonstrated this in a lab kitchen/dining area. The robot collects short, pre-processed trajectories from people using RGBD cameras or LIDAR. A pre-trained Support Vector Machine (SVM) classifier first separates these tracks into similar motion classes. The main unsupervised learning occurs when Partitioning Around Medoids (PAM) clustering extracts prototypical motion patterns from these classes. It also uses a modified distance function to improve similarity measurement. For online prediction, these prototypes match partially observed trajectories. The system achieved prediction accuracies of 70% for the top three matches in 10-point trajectories and 95% for correct predictions within a threshold of 3. The limitations include its reliance on a relatively small dataset, the current need for manual selection of prototype number, and the fact that more prototypes are required for better predictions in open spaces.
Most recently, Zhang et al. [60] introduced an unsupervised learning framework for video-based Human Activity Recognition (HAR). This approach enabled robots to autonomously learn disassembly tasks. The framework aims to assist in remanufacturing plants. By using unlabelled video frames captured by a digital camera, the system focused on Hard Disk Drive (HDD) disassembly. It specifically dealt with variations in the number of screws and the need to identify unexpected actions. The main architecture includes a Sequential Variational Autoencoder (Seq-VAE), which combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This combination helps extract important spatiotemporal features. These features are then input into an HMM to automatically segment activity states. A Support Vector Machine (SVM) classifier was then trained to validate the feature-activity match against ground truth labels. The framework showed strong performance, achieving an average recognition accuracy of 91.52%. This was better than other methods, such as Principal Component Analysis (PCA) and Autoencoder (AE). However, a significant limitation is that the initial case study assumed the disassembled HDD was in good condition. The impact of large variations in product quality needs further validation in future work.
Developments in Machine Learning approaches, especially Reinforcement Learning (RL) and its variants, have become an important method for developing advanced data-driven models. These models can predict human intentions and behaviour in sequential decision-making contexts. This helps tackle a key challenge in human–robot collaboration [8].
Lin et al. [8] focused on Inverse Reinforcement Learning (IRL) as a key part of their human intention learning framework in human–robot collaboration. They chose this method to address the challenge of manually defining the reward function in Markov Decision Processes (MDPs), which can result in sub-optimal policies. Instead, IRL directly derives the optimal reward function from observed human demonstrations, which is essential for finding the optimal policy. The main benefit of IRL is its ability to consider the entire task, from start to finish. It aims to find a globally optimal policy by maximising the margin from the optimal value to others. This, in turn, accelerates the learning time for obtaining the MDP policy. The recognised states and actions of the task, based on human gesture recognition and object attributes, are used to compute this reward function. In experiments, the IRL approach demonstrated its advantages over frequency-based methods, which only provided locally optimised solutions. For example, in the coffee-making task, IRL ensured the selection of globally optimal actions, such as “place spoon” when the coffee powder was already in the cup, rather than unnecessarily repeating “spoon up coffee” as a frequency-based method might do. In the pick-and-place task, the system effectively predicted successive human actions and could detect and suggest corrections when individuals deviated from the learned plan, directing them back to the desired sequence.
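To illustrate why a recovered reward yields a globally optimal policy, the toy value-iteration sketch below solves a three-state coffee-making MDP under an assumed reward of the kind IRL would return; the states, actions, transitions, and reward values are illustrative, not those of Lin et al. [8].

```python
import numpy as np

# Toy value iteration over a three-state coffee-making MDP, assuming the reward
# has already been recovered (e.g. by IRL); all numbers are illustrative.

states = ["empty_cup", "powder_in_cup", "done"]
actions = ["spoon_up_coffee", "place_spoon"]

# transition T[s][a] -> next state index
T = {0: {0: 1, 1: 0},    # spooning fills the cup; placing the spoon early does nothing
     1: {0: 1, 1: 2},    # re-spooning is redundant; placing the spoon finishes the task
     2: {0: 2, 1: 2}}    # terminal state
R = np.array([[0.0, -1.0],   # assumed recovered reward R[s, a]
              [-1.0, 5.0],
              [0.0, 0.0]])

V, gamma = np.zeros(3), 0.9
for _ in range(50):          # value iteration
    V = np.array([max(R[s, a] + gamma * V[T[s][a]] for a in range(2))
                  for s in range(3)])
policy = [actions[int(np.argmax([R[s, a] + gamma * V[T[s][a]] for a in range(2)]))]
          for s in range(3)]
print(dict(zip(states, policy)))   # powder_in_cup -> place_spoon
```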
Table 6 provides a comparison of machine learning models for understanding intentions in collaborative robotic tasks. It outlines their advantages, disadvantages, representative application scenarios, and potential improvements.

2.4. Deep Learning Models

Recent advancements in deep learning have greatly improved the ability to predict human intentions in industrial collaborative tasks. This enhancement supports better human–robot interaction and is vital for ensuring safety and efficiency in environments where humans and robots work closely together [14,65]. Various deep learning models, such as LSTM [66], CNN [67], ConvLSTM [68], RNN [69], Transformer [70], and NN [71], have been used for predicting human intentions, each achieving different levels of success. When these models are incorporated into human–robot collaboration frameworks, they show promising results in terms of accuracy and response time. Both factors are crucial for seamless collaboration. These systems analyse data from sensors, cameras, and other inputs to understand the subtle cues in human behaviour, such as eye movement, body posture, and gestures, allowing the robot to predict a human’s next move. Table 7 provides an overview of deep learning model applications for intention recognition within industrial collaborative tasks. The table summarises state-of-the-art works that examine the types of computational models used, the input data types for recognising intent, the sensing devices integrated, the application scenarios, key architectural features, the models’ performance, and notable limitations.
Long Short-Term Memory (LSTM) networks, first introduced in 1997, are a specialised type of Recurrent Neural Network (RNN) designed to capture and retain long-term dependencies in sequential data [72]. Rekik et al. [66] used LSTMs for the real-time prediction of human intention within a human–robot collaboration context. The model’s input consists of a time-series three-dimensional tensor that represents human hand key point coordinates and their depth values. By processing these sequential hand motion data, the LSTM model outputs a classification of the expected human hand destination, categorising it into specific locations such as “Bin-Bottom”, “Bin-Raspi”, “Bin-Top”, or “Bin-Fan”. This deep learning approach solves the issue of delayed responses. The collaborative task studied is the assembly of a microcontroller housing, which involves several components such as a 3D printed case, a Raspberry Pi, and a DC fan. The task can start with different initial human actions, requiring the robot to adapt dynamically. In terms of performance, the LSTM model achieved a validation accuracy of approximately 89% with optimal results using a learning rate of 1 × 10⁻⁴ and stacking three LSTM layers. This intention prediction method showed significantly faster response times compared to an object detection-based approach. On average, human intentions were predicted in 0.43 s, allowing the robot to react 0.81 s quicker than if it had waited for an object detection module to figure out the action.
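A hedged Keras sketch of such a stacked-LSTM classifier over hand key-point sequences is given below; the three stacked LSTM layers and the 1 × 10⁻⁴ learning rate follow the description above, while the sequence length, feature count, layer width, and random training data are assumptions.

```python
import numpy as np
import tensorflow as tf

# Sketch of a stacked-LSTM classifier over hand key-point sequences; the tensor
# shape (30 frames x 63 assumed features) and the dummy data are placeholders.

num_frames, num_features, num_bins = 30, 63, 4

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_frames, num_features)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_bins, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

X = np.random.rand(8, num_frames, num_features).astype("float32")  # dummy batch
y = np.random.randint(0, num_bins, size=8)
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:1], verbose=0).round(2))   # probabilities over the four bins
```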
CNNs have become important in human intention prediction because they effectively extract local spatial patterns and detailed motion and gesture information from multi-dimensional input [67]. Kamali Mohammadzadeh et al. [67] applied CNNs for human intention prediction in industrial collaborative robotics. They took advantage of CNNs’ capability to capture complex spatial patterns and subtle motion and gesture details from rich, multi-dimensional datasets. Their study used high-resolution data on body movement trajectories gathered from Virtual Reality (VR) environments. This data included position, rotation, velocity, and angular velocity from HTC-Vive trackers and Head-Mounted Displays (HMDs). They also gathered detailed hand and finger joint movements from Leap Motion Sensors, which were processed into a 3D tensor. Intention recognition, facilitated by CNNs, is essential for improving human–robot collaboration. This allows robots to predict human actions, which can boost efficiency, effectiveness, and safety. As a standalone model, the CNN achieved a strong overall precision, recall, and F1 score of 0.95. However, it struggled with identifying “walking” and “standing” activities. It showed lower precision and recall scores for these categories, which indicates challenges in capturing complex gait.
The Convolutional Long Short-Term Memory (ConvLSTM) is a hybrid deep learning architecture. It combines CNNs with Long Short-Term Memory (LSTM) layers [68]. Keshinro et al. [68] implemented and evaluated a ConvLSTM for predicting human intention from RGB images within the context of human–robot collaboration. The ConvLSTM architecture integrates CNN layers to extract spatial features from individual video frames and LSTM layers to model temporal sequences using these features. This setup allows the network to learn both spatiotemporal characteristics effectively through end-to-end training. Keras ConvLSTM2D recurrent layers specifically handle three-dimensional input (width, height, and number of channels). For this study, RGB images from the UTKinect-Action 3D Dataset were used, featuring ten participants performing four selected actions: Pick, Throw, Wave, and Carry. These actions relate to collaborative tasks, such as a robot assisting in assembling a piece of furniture or constructing a box. Another example includes a table tennis game between a human and a robot, requiring anticipatory action selection by the robot. Images sized at 480 × 640 pixels were preprocessed by resizing to 64 × 64 pixels and normalising pixel values between 0 and 1. This process helped speed up training and improve convergence, with an 80% training and 20% validation/testing split. The ConvLSTM achieved a prediction accuracy of 74.11% on the test dataset. While it successfully predicted the underlying human intentions, the researchers noted that a 74% accuracy was not high enough. They identified limitations, such as focusing on only four actions and conducting demonstrations without a robot.
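The following Keras sketch shows a ConvLSTM2D classifier over short clips resized to 64 × 64 pixels with values in [0, 1], mirroring the preprocessing above; the sequence length, filter count, and random training data are placeholders.

```python
import numpy as np
import tensorflow as tf

# Sketch of a ConvLSTM2D classifier over short RGB clips resized to 64x64;
# the sequence length, filter count, and training data are placeholders.

seq_len, height, width, channels, num_actions = 16, 64, 64, 3, 4

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, height, width, channels)),
    tf.keras.layers.ConvLSTM2D(16, kernel_size=(3, 3), return_sequences=False),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_actions, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

clips = np.random.rand(2, seq_len, height, width, channels).astype("float32")
labels = np.array([0, 2])                     # e.g. "Pick" and "Wave"
model.fit(clips, labels, epochs=1, verbose=0)
print(model.predict(clips[:1], verbose=0).round(2))
```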
Recurrent Neural Networks (RNNs) show great promise in industrial collaborative robotics for predicting human intentions. They can process sequential data, provide continuous real-time decisions, and reduce latency for early and accurate intention recognition [69]. Maceira et al. [69] used this deep learning method, which is effective for handling time series data by capturing both instantaneous and previous information. This allows for continuous real-time decisions and reduced system latency during inference. The network framework feeds each force sensor measurement through the RNN to update hidden states. A fully connected layer then uses these states to classify among three possible user intentions, with a SoftMax layer computing probabilities for each class. Based on the study, the researchers used the RNNs to interpret force data that users naturally provide while manipulating a shared object. The collaborative task considered is an industrial setting where the robot and a human work together to clean and polish an object. The operator’s intentions are classified as polishing, grabbing the object for inspection, or moving the robot to a different position. It reads six-dimensional force/torque sensor signals. In terms of accuracy and performance, the method shows better classification accuracy and faster response times. It achieved an F1 measure of 0.937 with a response time of 0.103 s. The method also proved robust, maintaining good performance even when trained with limited data.
Transformer-based models are effective in modelling the dependencies between agents and forecasting their joint behaviours [70,73,74]. Kedia et al. [70] developed the INTERACT framework, taking advantage of the capabilities of conditional transformer models. It models inter-agent dependencies and predicts their joint behaviours, which is crucial when dealing with the “chicken-or-egg problem” in human–robot collaboration. The main innovation of INTERACT is that it conditions human intention predictions on the robot’s planned future actions. The framework uses a two-stage training process: first, it pre-trains on large-scale human–human interaction datasets, such as AMASS and extended CoMaD. Next, it fine-tunes on a smaller, specifically collected human–robot dataset. This dataset is gathered using a novel teleoperation technique that defines a low-level link between human hand/wrist movements and the robot’s end-effector to ensure that human actions and motion data are paired correctly. The framework is tested on human–robot manipulation tasks performed in close proximity, including “Cabinet Pick”, “Cart Place”, and “Tabletop manipulation”. Human intent is defined as a T-horizon sequence of future human poses, specifically nine upper body 3-D joint positions: upper back, shoulder, elbows, wrists, and hands. The performance is mainly measured by the Mean Per Joint Position Error (MPJPE) as the prediction loss and the Final Displacement Error (FDE) as the key evaluation metric. The FDE measures the distance between predicted joint positions and the actual joint positions at the end of a 1 s forecast. The model showed a lower FDE.
Neural Networks (NNs) are highly useful for capturing complex temporal dependencies in human actions. This improves prediction accuracy, allows for better proactive assistance, reduces response times, and provides high accuracy in predicting interactions. As a result, the user experience is enhanced [71]. In this context, Dell’Oca et al. [71] studied human intention prediction in collaborative robotics employing a multi-input Neural Network (NN). This NN architecture consists of three fully connected layers interleaved by two dropout layers, a final dense layer, and SoftMax activation. It is trained with an Adam optimiser and a sparse categorical cross-entropy loss function. The system uses an existing work-cell camera setup as a sensing device. The input data for the NN includes the Tower of Hanoi (TOH) game configuration, visual features such as eye, nose, right-hand, and left-hand positions, and contextual information like dominant hand and the number of times the participant has played. In a collaborative, turn-based TOH game between a human and a robot, the cobot predicts and anticipates human intentions. The trained model achieved a test accuracy of over 95% in predicting human moves. The study’s limitations include its focus on a single use case, where human-related features were found to have minimal influence on predictions compared to game configuration. Table 8 provides a comparison of deep learning models for understanding intentions in collaborative robotic tasks.
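As a closing illustration for this subsection, the sketch below assembles a multi-input dense network in the spirit of the architecture just described (three fully connected layers interleaved with two dropout layers and a final SoftMax output, trained with Adam and sparse categorical cross-entropy); the input dimensions, layer widths, and dropout rates are assumptions.

```python
import numpy as np
import tensorflow as tf

# Sketch of a multi-input dense network combining a game-configuration vector
# with visual/contextual features; input sizes, widths, and rates are assumed.

game_cfg = tf.keras.Input(shape=(9,), name="toh_configuration")
human_feats = tf.keras.Input(shape=(10,), name="visual_and_context")

x = tf.keras.layers.Concatenate()([game_cfg, human_feats])
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.3)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.3)(x)
x = tf.keras.layers.Dense(32, activation="relu")(x)
out = tf.keras.layers.Dense(6, activation="softmax", name="predicted_move")(x)

model = tf.keras.Model(inputs=[game_cfg, human_feats], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

cfg = np.random.rand(4, 9).astype("float32")      # dummy game states
feats = np.random.rand(4, 10).astype("float32")   # dummy human features
moves = np.random.randint(0, 6, size=4)
model.fit([cfg, feats], moves, epochs=1, verbose=0)
```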
In Section 3, we delve into cues for recognising human intentions in an industrial collaborative setting, to facilitate seamless human–robot interaction.

3. Cues for Human Intention Recognition

Human intention recognition relies heavily on the analysis of sensory information, where diverse data sources provide complementary insights into human behaviour [75].
As shown in Figure 3, these approaches are categorised into physical, physiological, and contextual cues to infer what a human is likely to do in a collaborative workspace with a robot [76]. This section provides insights into sensory techniques to infer human intention in industrial collaborative robotics. By combining information from multiple modalities, these methods allow for accurate and robust predictions of human behaviour, which ultimately improves safety, efficiency, and adaptability in shared workspaces.

3.1. Physical Cues

Physical cues refer to observable movements and bodily expressions that show a human’s intentions. They provide important information for anticipating human actions in industrial collaborative settings. These cues offer rich, real-time data streams that can be captured through vision-based systems, wearable inertial sensors, or force-feedback interfaces. This enables the robots to predict and adapt to human behaviour. The main physical cues usually analysed include force interaction, hand gestures, body posture, hand-arm motions, and gaze direction [77].

3.1.1. Hand Gestures

In industrial human–robot collaboration, hand gestures serve as a type of non-verbal communication. Specific hand poses or movements send commands, provide information, or express intent to a robotic system. This way of interacting is seen as natural and intuitive. It proves useful in noisy industrial environments where verbal communication might not work well [78]. Human gestures are important in industrial collaborative robotics. They allow for direct control of robotic systems without using traditional input devices [79]. They provide clear and easy-to-understand signals to the robot. This helps improve safety and operational efficiency [80]. Human hand gestures are generally classified into two types: static and dynamic. Static gestures consist of fixed hand poses. Dynamic gestures involve a sequence of poses that develop over time. In the studies reviewed, most researchers using hand gestures for recognising human intentions prefer static gestures. This preference is due to their simplicity, reliability, and ease of recognition. In contrast, dynamic gestures require recognising sequential motion patterns [80]. Figure 4 depicts examples of static hand gestures.
Lin et al. [8] use static hand gestures to recognise human intentions and classify five predefined gesture types: Empty, Grip, GripTrans, Spoon, and Hold. These gestures are used for a coffee-making task. They apply CNNs on binarised and pose-calibrated hand images. These classified static gestures are encoded as states within a Markov Decision Process (MDP) to infer human intentions. The core of the intention recognition system relies on vision-based recognition of static hand gestures. They use a robust skin colour model based on a GMM. They capture images of human gestures, pre-process them with pose calibration for continuous actions, and then input them into the CNN for classification into predefined gesture types. This helps the robot to understand both the human’s immediate actions and their overall task plan. While their method shows promising results in classifying human hand actions, it might not fully capture the complexity of human intention. This complexity should be seen as a problem of optimising task planning. Furthermore, the system is prone to gesture recognition errors. Transitive hand gestures are simplified by assigning them to the nearest preceding or succeeding frame, which may not always represent the human’s true continuous action.
Zou et al. [39] achieve static hand gesture recognition for human handover intention prediction by using a wearable data glove. This glove senses the static shape of the five fingers, representing twelve different human hand gestures that correspond to various handover intentions. Their method has benefits over vision-based approaches by avoiding visual occlusion and excessive physical contact-based methods by minimising safety threats.
Six Inertial Measurement Units (IMUs) are built into the data glove. The IMU data are processed to calculate the bending angle of each finger, which quantitatively describes the human gesture. This approach efficiently predicts twelve different human handover intentions with an average prediction accuracy of 99.6%. However, the approach still misclassifies a small fraction of gestures. The researchers acknowledge this limitation and suggest that future studies explore multimodal information, such as combining wearable sensing devices with human gaze or speech, to improve performance further. In other words, supplementing the glove-based gesture data with additional modalities would provide a fuller understanding of human intention.
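The following simplified sketch shows how per-finger orientations from glove IMUs can be converted into bending angles and matched to gesture prototypes; the prototype table is purely illustrative, and a nearest-prototype lookup stands in for the fuzzy-rule classification used in [39].
```python
# Simplified sketch: finger bending angles from IMU orientations, matched to
# illustrative gesture prototypes (the fuzzy rules of [39] are not reproduced).
import numpy as np

def bend_angle(q_hand: np.ndarray, q_finger: np.ndarray) -> float:
    """Angle in degrees between two unit quaternions (w, x, y, z)."""
    d = abs(float(np.dot(q_hand, q_finger)))
    return float(np.degrees(2.0 * np.arccos(np.clip(d, -1.0, 1.0))))

# Hypothetical prototypes: mean bending angles (thumb..little finger) per gesture.
PROTOTYPES = {
    "open_palm_handover": np.array([10.0, 5.0, 5.0, 5.0, 10.0]),
    "precision_pinch":    np.array([45.0, 60.0, 15.0, 10.0, 10.0]),
    "power_grasp":        np.array([60.0, 80.0, 80.0, 80.0, 75.0]),
}

def classify(angles: np.ndarray) -> str:
    return min(PROTOTYPES, key=lambda g: np.linalg.norm(angles - PROTOTYPES[g]))

# One IMU on the back of the hand plus one per finger yields five bend angles.
q_back = np.array([1.0, 0.0, 0.0, 0.0])
finger_quats = [np.array([0.96, 0.28, 0.0, 0.0])] * 5           # stand-in readings
angles = np.array([bend_angle(q_back, q) for q in finger_quats])
print(classify(angles))
```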
Llorens-Bonilla and Asada [41] use both dynamic gestures and static postures to recognise intentions. An example of a dynamic gesture is nodding, detected by analysing the Z and Y gyro readings from wearable IMUs placed on the wearer’s wrists and the back of the head. The detection uses threshold classifiers to identify nodding actions based on maximum velocities observed in training data, and it is robust to the wearer’s general posture. However, one main limitation of this approach is that the gesture and posture detection algorithm works best for simple gestures that can be classified linearly.
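A minimal threshold-based nod detector in this spirit is sketched below, reduced to a single pitch gyro axis; the threshold and window length are illustrative rather than values reported in [41].
```python
# Minimal nod detector: flag a nod when head-pitch angular velocity swings
# above a threshold in both directions within a short window (values assumed).
import numpy as np

def detect_nod(gyro_pitch: np.ndarray, threshold: float = 1.5) -> bool:
    """gyro_pitch: angular velocity samples (rad/s) over roughly one second."""
    return bool(gyro_pitch.max() > threshold and gyro_pitch.min() < -threshold)

window = np.concatenate([np.linspace(0.0, 2.0, 25), np.linspace(2.0, -2.0, 50)])
print(detect_nod(window))   # True: a down-up pitch swing exceeding the threshold
```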

3.1.2. Body Postures

Body postures play a key role in intention recognition in industrial collaborative robotics. They provide non-verbal cues that enable robots to understand human actions and goals. This ability is important for ensuring safety, improving efficiency, and supporting seamless human–robot interaction in shared workspaces [71]. Body postures include the position and orientation of a human’s torso, limbs, and head. They offer a valuable source of information about a human collaborator’s state and intentions. Unlike discrete gestures or verbal commands, posture can communicate continuous information about a human’s focus of attention, readiness to act, or discomfort [81]. The robot can predict a human’s next move by analysing their posture. For example, if a person leans forward with a hand extended towards an object, they are likely trying to grab it. The robot can use this information to clear a path or prepare to hand over a tool, which makes the collaborative process more efficient [82]. Different postures can indicate different stages of a task. A person standing upright and facing a workbench might be in a preparatory phase, while a person stooping down could be focused on a delicate assembly task. Recognising these postures helps the robot to understand the current state of the workflow and offer suitable assistance. Body postures are represented as skeletal data, as shown in Figure 5, where a human body is mapped to a set of interconnected joint points.
Vinanzi et al. [58] derive high-level goals from observed human body movements by first extracting skeletal data with OpenPose, which provides an 18 × 2 feature vector of 2D keypoints. To make this representation spatially invariant and more compact, a normalisation step reduces it to a 10 × 2 representation by discarding certain keypoints and centring the torso joint. The resulting 20-dimensional feature vectors are then projected onto a 2D space with Principal Component Analysis (PCA) to mitigate the curse of dimensionality in clustering. X-means clustering groups similar postures, and an action is encoded as a sequence of cluster transitions; consecutive repetitions of the same cluster are discarded to ensure temporal invariance and independence from action speed.
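This posture-encoding pipeline can be sketched as follows; plain K-means with a fixed number of clusters stands in for X-means, and random arrays stand in for OpenPose output.
```python
# Sketch of the posture-encoding pipeline: centre the skeleton on the torso,
# reduce with PCA, cluster postures, and encode the action as the sequence of
# cluster transitions. KMeans (fixed k) approximates X-means for brevity.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.random((200, 10, 2))                     # 200 frames, 10 keypoints (x, y)

frames = frames - frames[:, :1, :]                    # centre on the torso joint (index 0)
flat = frames.reshape(len(frames), -1)                # 20-D posture vectors

poses_2d = PCA(n_components=2).fit_transform(flat)    # 20-D -> 2-D
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(poses_2d)

# Temporal invariance: collapse consecutive repeats so action speed does not matter.
sequence = [int(labels[0])] + [int(l) for i, l in enumerate(labels[1:], 1) if l != labels[i - 1]]
print("action encoding:", sequence[:10])
```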

3.1.3. Hand and Arm Motions

Hand and arm motions provide a rich, intuitive, and efficient way to communicate non-verbally and infer human intentions in industrial collaborative robotics. In noisy and dynamic industrial settings, where verbal commands may not work well, these movements serve as the main communication channel [83]. The kinematics of hand and arm motions offer the robot clues about a worker’s upcoming actions. For example, the robot can predict that a human is about to grab an object by analysing the trajectory and velocity of their hand and the shape of their fingers as they approach the object. Similarly, repetitive or patterned arm motions can indicate routine tasks, while sudden or abrupt changes may signal unexpected actions or the need for corrective intervention. These motions are essential for creating a natural and fluid interaction, reducing cognitive load on the human worker, and improving the overall efficiency and safety of human–robot collaboration [84]. In the end, integrating hand-arm motion recognition into robotic systems helps build trust, adaptability, and mutual understanding in collaborative industrial settings. Figure 6 shows a scenario in which hand and arm motions are continuously tracked to capture and interpret human intentions within a collaborative environment.
Luo et al. [57] employed human reaching motions to infer intentions in a collaborative task through a two-layer framework for unsupervised online prediction. The first layer classifies the observed motion from human palm position features, which were selected for their superior recognition performance. This initial classification guides the second layer, which models the positions of the human arm joints, including the palm, wrist, elbow, and shoulder, to predict the entire arm trajectory. Predicting the complete arm trajectory is crucial for accurately computing the human workspace occupancy, allowing collaborative robots to avoid interference. They deliberately avoid inferring the full arm configuration from the palm position via Inverse Kinematics (IK), since the human arm has redundant degrees of freedom and it is hard to determine which IK solution a human will choose. The human’s intended goal is also interpreted by calculating the Euclidean distance between the predicted palm position and various target regions.
In contrast, Luo et al. [54] focus on predicting human hand motion from real-time sensor measurements, using both the velocity and position of the hand. These hand movements, often combined with gaze direction, help determine the virtual object the user wants to interact with. The strategy identifies the desired point from a list of virtual locations, either by selecting the point closest to the hand or by applying a distance threshold to the hand position to detect the intention to interact.
Rekik et al. [66] use human hands as the main input, providing time-series 3D hand keypoint coordinates (X, Y, Z) obtained through depth sensing and segmentation. The sequential hand movements are interpreted to identify where the hand is likely headed, such as specific component bins. A simple baseline infers the intended object as the bin closest to the hand keypoint in terms of Euclidean distance. Their method combines depth information with hand keypoint coordinates and evaluates the highest probabilities within a specific window of observed hand motions to predict human intention.
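The nearest-target rule shared by these hand-motion approaches [54,57,66] can be expressed compactly as in the sketch below; the bin coordinates and distance threshold are illustrative.
```python
# Nearest-target intention rule: the intended goal is the bin whose centre is
# closest to the tracked hand keypoint, gated by a distance threshold.
import numpy as np

BINS = {"bin_A": np.array([0.40, 0.10, 0.05]),
        "bin_B": np.array([0.40, -0.10, 0.05]),
        "bin_C": np.array([0.55, 0.00, 0.05])}

def infer_target(hand_xyz: np.ndarray, threshold: float = 0.25):
    name, centre = min(BINS.items(), key=lambda kv: np.linalg.norm(hand_xyz - kv[1]))
    dist = float(np.linalg.norm(hand_xyz - centre))
    return (name, dist) if dist < threshold else (None, dist)

print(infer_target(np.array([0.45, 0.08, 0.10])))     # ('bin_A', ...)
```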

3.1.4. Gaze and Eye Direction

Gaze and eye direction are important for recognising human intentions in industrial collaborative tasks. They provide immediate and predictive cues about a person’s focus of attention [85]. Unlike hand gestures, which happen during an action, gaze and eye direction often precede an action, letting a robot predict a human’s next move. This proactive ability is crucial for maintaining safety and making shared workspaces run more efficiently. Figure 7 shows gaze detection for intention inference in industrial collaborative robotics.
Ban et al. [86] used a two-camera eye-tracking setup that combines gaze and eye direction to recognise user intent for real-time control of a robotic arm. It includes a commercial eye tracker that employs the Pupil Centre-Corneal Reflection (PCCR) method to track gaze and identify user attention and intentions. At the same time, a webcam, equipped with a CNN model, detects and classifies four different eye directions with 99.99% accuracy. The system translates these eye directions directly into commands for the robotic arm. It features an all-in-one interface that tracks and classifies this combined eye information in real time to determine user intent and synchronise the fused data. This allows for complex control of the robotic arm with a high degree of freedom. It also distinguishes intentional blinks from natural ones by capturing four input combinations within a 1 s window. This innovative system provides important support in industrial human–robot collaboration by allowing hands-free, precise control of robotic arms in hazardous or complex environments. However, the control actions involve inherent delays.
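A simplified sketch of turning classified eye directions into discrete robot-arm jog commands is shown below; the command set, debouncing rule, and frame rate are assumptions for illustration and do not reproduce the interface in [86].
```python
# Sketch: map classified eye directions to jog commands, acting only when a
# full window of frames agrees, to filter single-frame misclassifications.
from collections import deque

COMMANDS = {"left": (-0.05, 0.0), "right": (0.05, 0.0),
            "up": (0.0, 0.05), "down": (0.0, -0.05)}     # metres per step in (x, y)

class EyeJogController:
    def __init__(self, window: int = 10):                # ~1 s of frames at 10 fps
        self.history = deque(maxlen=window)

    def update(self, eye_class: str):
        self.history.append(eye_class)
        if len(self.history) == self.history.maxlen and len(set(self.history)) == 1:
            return COMMANDS.get(eye_class)
        return None

controller = EyeJogController()
for _ in range(10):
    command = controller.update("left")
print("jog command:", command)                           # (-0.05, 0.0)
```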
Similarly, Zhao et al. [35] integrated gaze data into a method for human–robot collaboration during assembly. This primarily aims to understand human intention for selecting and placing objects in a virtual environment. The system collects gaze data using the Tobii Eye Tracker 4c sensor. In the “instruction” mode, the user focuses on a block while pointing at it with their index finger. A fixation-detection algorithm based on Identification using the Dispersion Threshold (I-DT) selects the object if both the eye and hand are close to it and remain within a set boundary for 200 milliseconds. This approach uses eye movement to indicate a user’s attention clearly, which significantly improves reliability and speeds up the interactive response when paired with hand gestures. However, a major drawback of the gaze data is that using eye movement alone to control objects with multiple degrees of freedom is challenging. Additionally, even though the eye tracker is calibrated for each user, it functions as an open-loop system, displaying the line of sight directly. The stability of the eye tracker’s output is usually less reliable than that of hand indicator points, which users fine-tune through “Brain-hand-eye feedback”. The average offset between the eye’s interaction point and the target for various objects ranged from 0.41 cm to 1.26 cm at a 40 cm distance from the screen.
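The joint eye-hand dwell rule can be sketched as follows; the boundary radius and 50 Hz sample rate are assumptions, while the 200 ms dwell time follows the description above.
```python
# Sketch of a joint eye-hand dwell rule: an object is selected once both the
# gaze point and the fingertip stay inside its boundary for 200 ms.
import numpy as np

DWELL_S, RATE_HZ, RADIUS = 0.2, 50, 0.03
NEEDED = int(DWELL_S * RATE_HZ)                       # consecutive samples required

def select_object(gaze, hand, objects):
    """gaze, hand: (T, 2) coordinate arrays; objects: name -> 2D centre."""
    counters = {name: 0 for name in objects}
    for g, h in zip(gaze, hand):
        for name, centre in objects.items():
            near = (np.linalg.norm(g - centre) < RADIUS and
                    np.linalg.norm(h - centre) < RADIUS)
            counters[name] = counters[name] + 1 if near else 0
            if counters[name] >= NEEDED:
                return name
    return None

objs = {"block_1": np.array([0.10, 0.20]), "block_2": np.array([0.30, 0.20])}
track = np.tile(np.array([0.11, 0.21]), (15, 1))      # 0.3 s of samples near block_1
print(select_object(track, track, objs))              # block_1
```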

3.1.5. Force Interaction

Force interaction serves as a direct physical signal for understanding human intentions in human–robot collaboration. Humans naturally use force when interacting with objects and expressing their intentions [87]. For example, a person might use more force on a tool to show they want to tighten a bolt, or less force on a delicate object to indicate caution. Force sensors capture this type of interaction, providing immediate, real-time data about human movements and applied forces [10], which allows the robot to respond instantly and adjust its behaviour according to human intentions. By examining the patterns and strength of the forces a human applies, the robot can predict the human’s next move. Figure 8 shows how force sensing estimates the human collaborator’s arm force to understand their intention during a collaborative task.
In a study by Zhou et al. [88], force signals are the key to understanding human intent and controlling the robot. The system measures the operator’s three-dimensional arm force. This measured force signal has several important uses in the study framework. Firstly, it acts as the main interface for human–robot interaction. The estimated force directly translates into robot motion commands, enabling the operator to guide the robot intuitively and adjust its path by changing the applied force. Secondly, the estimated force and how quickly it changes make up the state space. Thirdly, the force signal helps measure skill criteria based on human demonstrations, such as collaborative comfort and smoothness. Finally, the strength of the arm force and its rate of change serve as inputs for a fuzzy rule-based system that breaks the task into different sub-motion phases.
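An admittance-style sketch of such a force interface is given below: the applied force is mapped to a velocity command, and the force magnitude plus its rate of change gate a coarse sub-motion phase; the gains and phase thresholds are illustrative assumptions, and the fuzzy rules of [88] are not reproduced.
```python
# Admittance-style sketch: arm force -> velocity command, with force magnitude
# and its rate of change gating a coarse motion phase (thresholds assumed).
import numpy as np

DAMPING = 40.0            # N·s/m: larger values give a slower response to force
DT = 0.01                 # control period (s)

def velocity_command(force_xyz: np.ndarray) -> np.ndarray:
    return force_xyz / DAMPING

def motion_phase(force_norm: float, force_rate: float) -> str:
    if force_norm < 2.0:
        return "idle"
    return "adjustment" if abs(force_rate) > 20.0 else "guiding"

prev_norm, position = 0.0, np.zeros(3)
for force in [np.zeros(3), np.array([8.0, 2.0, 0.0]), np.array([8.2, 2.0, 0.0])]:
    norm = float(np.linalg.norm(force))
    rate = (norm - prev_norm) / DT
    position += velocity_command(force) * DT
    print(motion_phase(norm, rate), position.round(4))
    prev_norm = norm
```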
Another study by Olivares-Alarcos et al. [10] presented a robotic system that infers human operators’ intentions in industrial human–robot collaboration by using force data as physical cues. In a scenario modelled after a car emblem manufacturing line, a human polishes and inspects an emblem held by the robot. Force signals serve as the primary source of information for the robot to understand human intent. The system employs an ATI Multi-Axis Force/Torque Sensor Mini40 SI-20-1, attached to the robot’s wrist, to collect force and torque signals at 500 Hz. These signals help distinguish three specific human intentions: ‘polishing’, which requires the robot to increase stiffness and hold the object firmly; ‘moving the robot’, prompting the robot to decrease stiffness and move to a more ergonomic position; and ‘grabbing the object’, leading the robot to open its gripper and release the piece. The benefits of this force-based approach include enabling more natural gestures and aiming to reduce overly mechanical movements. However, the system struggled with ambiguity in natural force patterns, which resulted in bias in classification.
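A sketch of window-based classification of such force/torque data is shown below; the feature set, the SVM classifier, and the synthetic training windows are stand-ins for illustration, not the pipeline used in [10].
```python
# Sketch: extract simple features from 0.5 s windows of wrist force/torque
# data and classify them into three intention labels with an SVM (stand-in
# features, classifier, and synthetic data; not the authors' pipeline).
import numpy as np
from sklearn.svm import SVC

RATE_HZ, WINDOW_S = 500, 0.5
N = int(RATE_HZ * WINDOW_S)

def window_features(ft: np.ndarray) -> np.ndarray:
    """ft: (N, 6) window of [Fx, Fy, Fz, Tx, Ty, Tz] samples."""
    force_mag = np.linalg.norm(ft[:, :3], axis=1)
    return np.array([force_mag.mean(), force_mag.std(),
                     float(np.abs(ft[:, :3]).mean(axis=0).argmax()),
                     float(np.abs(ft[:, 3:]).mean())])

rng = np.random.default_rng(1)
X = np.array([window_features(rng.normal(scale=s, size=(N, 6)))
              for s in (0.5, 2.0, 5.0) for _ in range(20)])
y = np.repeat(["grabbing", "moving", "polishing"], 20)   # placeholder labels

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([window_features(rng.normal(scale=5.0, size=(N, 6)))]))
```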

3.2. Physiological Cues

Physiological cues are important for recognising intentions in industrial human–robot collaboration. They deliver objective and real-time information about a person’s internal state that conventional behavioural cues cannot capture. By monitoring physiological signals, robots can understand a person’s emotional and cognitive states, including stress, fatigue, and frustration. This helps robots respond proactively and change their behaviour according to the person’s condition. As a result, collaboration becomes safer, more efficient, and more natural [89]. For example, a robot could help when it detects that a human is under a heavy cognitive load. This ability enables collaborative robotics to go beyond simply performing tasks and fosters a more supportive and human-centred partnership. Physiological cues used for recognising intentions are usually divided into two main types: Central Nervous System (CNS) and Peripheral Nervous System (PNS) [89].

3.2.1. Central Nervous System Cues

CNS cues are linked to brain activity. They are considered the most reliable signs of a person’s cognitive and emotional state [89]. Electroencephalography (EEG) is one of the most common techniques for assessing and validating the internal state of a human co-worker during collaborative tasks [90,91]. As shown in Figure 9, EEG provides a direct, non-invasive measure of brain activity recorded with a wearable electrode cap. An EEG signal can help identify cognitive states like mental workload, fatigue, and attention. These states are important for understanding a human collaborator’s intent, especially during complex tasks [90,91].
Lyu et al. [92] use EEG for intention recognition in industrial human–robot collaboration by employing a spatially coded Steady-State Visual Evoked Potential (SSVEP) Brain–Computer Interface (BCI). This system detects a person’s gaze direction relative to a single flicker stimulus projected directly onto the shared workspace. This method predicts where a person is about to reach, taking advantage of the natural tendency for people to look at the location where they intend to act. This makes it easier to integrate the system into the person’s workflow. A key benefit is that this early prediction gives the robot enough time to adjust its movements. This significantly improves the efficiency of human–robot collaboration and increases safety distances compared to relying only on arm tracking data. Additionally, the study uses the Signal-to-Noise Ratio (SNR) of the SSVEP response to monitor the operator’s awareness level. This allows the robot’s speed to be adjusted based on vigilance, increasing for highly alert operators, and decreasing for those with lower alertness. This Brain–Computer Interface Vigilance-Controlled Velocity (BCI + VCV) strategy further improves robot performance without reducing safety, even with faster robot arm speeds, since vigilance and target prediction offer separate information channels. However, a limitation of this approach is the variability in BCI classification accuracy among participants. Some experience “BCI illiteracy” and achieve accuracies as low as 50%. Another identified limitation is the 2 s gaze duration required to gather enough EEG data for reliable classification, which introduces delays.
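The SNR measure used for vigilance monitoring can be computed along the lines of the sketch below: power at the flicker frequency divided by the mean power of neighbouring bins; the sampling rate, the 15 Hz flicker frequency, and the synthetic EEG trace are illustrative.
```python
# Sketch of an SSVEP signal-to-noise ratio: power at the flicker frequency
# divided by the mean power of neighbouring frequency bins (values assumed).
import numpy as np

FS, F_FLICKER = 250.0, 15.0                    # sampling rate and flicker frequency (Hz)

def ssvep_snr(eeg: np.ndarray, n_neighbours: int = 5) -> float:
    spectrum = np.abs(np.fft.rfft(eeg)) ** 2
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / FS)
    target = int(np.argmin(np.abs(freqs - F_FLICKER)))
    neighbours = (list(range(target - n_neighbours, target)) +
                  list(range(target + 1, target + 1 + n_neighbours)))
    return float(spectrum[target] / spectrum[neighbours].mean())

t = np.arange(0.0, 2.0, 1.0 / FS)              # a 2 s gaze window, as described above
eeg = 0.5 * np.sin(2 * np.pi * F_FLICKER * t) + np.random.default_rng(2).normal(0, 1, t.size)
print(round(ssvep_snr(eeg), 1))                # well above 1 when an SSVEP is present
```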

3.2.2. Peripheral Nervous System Cues

Peripheral Nervous System (PNS) signals come from the body’s physical responses. These reactions are usually automatic and show emotional arousal [93]. An example of a PNS technique is Electromyography (EMG), with an application shown in Figure 10.
Peternel et al. [94] use EMG signals, which are direct PNS cues captured by surface electrodes, to estimate human muscle activity and stiffness trends. This information helps adjust the robot’s stiffness. It also incorporates measurements of human arm force manipulability, based on kinematics obtained through an optical motion capture system. This allows the robot to change its task-frame configuration based on the human’s intended main directions of interaction forces. The key benefits include better collaboration and coordination. This capability allows the robot to change its behaviour and provide suitable assistance with minimal task-level programming. However, there are some limitations. The system requires some predetermined hybrid controller parameters. It also needs to select specific muscles for EMG based on the task. Additionally, accurately estimating full human impedance is complex, although simpler estimations can be sufficient.
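A minimal sketch of mapping an EMG activation envelope to robot stiffness is given below; the envelope filter, normalisation constant, and stiffness range are illustrative assumptions rather than the controller in [94].
```python
# Sketch: rectify and smooth an EMG signal into an activation envelope, then
# scale robot stiffness between assumed minimum and maximum values.
import numpy as np

K_MIN, K_MAX = 200.0, 1200.0        # N/m, assumed admissible stiffness range
MVC = 1.0                           # envelope value treated as maximum contraction

def emg_envelope(raw: np.ndarray, win: int = 50) -> np.ndarray:
    rectified = np.abs(raw - raw.mean())
    return np.convolve(rectified, np.ones(win) / win, mode="same")

def stiffness_from_emg(envelope_sample: float) -> float:
    activation = float(np.clip(envelope_sample / MVC, 0.0, 1.0))
    return K_MIN + activation * (K_MAX - K_MIN)

raw = np.random.default_rng(3).normal(0, 0.4, 1000)   # stand-in for a surface EMG burst
envelope = emg_envelope(raw)
print(round(stiffness_from_emg(envelope[-1]), 1))
```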

3.3. Contextual Cues

Contextual cues for understanding intentions in industrial collaborative robotics are important. They give robots proactive, non-intrusive information about a human’s goals. Instead of waiting for a direct command, a robot can predict a worker’s next step by observing the state of the shared workspace. This approach leads to seamless, safer, and more efficient collaboration [95,96].
Schlenoff et al. [97] implemented a method in which the cues for understanding intentions come mainly from a state-based representation of the environment rather than from conventional activity recognition. This approach makes extensive use of spatial relationships modelled with a three-dimensional version of Region Connection Calculus 8 (RCC8), applying the eight basic relationships along the x, y, and z dimensions. The main device used is the robot itself, which has sensor systems designed to observe the environment and various end effectors, such as grippers and vacuum tools. The states of these end effectors are crucial to the contextual cues. The benefits of this state-based approach are clear. States are often easier for sensor systems to recognise than actions, they are independent of the agent that produced them, and the ontology is more reusable because state information is often more flexible. By understanding detailed state relationships, the robot can predict what a human might do next. This allows it to plan either to assist the human or to avoid creating unsafe situations.
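As a small illustration of such state representations, the function below computes an RCC8-style relation between two axis-aligned intervals; the full method in [97] applies this along the x, y, and z dimensions and reasons over the resulting state tuples.
```python
# Sketch: RCC8-style relation between two 1-D intervals; a 3D version applies
# this along x, y, and z for axis-aligned object extents.
def rcc8_1d(a, b):
    """a, b: (min, max) intervals. Returns one of the eight RCC8 relations."""
    a0, a1 = a
    b0, b1 = b
    if a1 < b0 or b1 < a0:
        return "DC"                       # disconnected
    if a1 == b0 or b1 == a0:
        return "EC"                       # externally connected
    if (a0, a1) == (b0, b1):
        return "EQ"                       # equal
    if b0 <= a0 and a1 <= b1:             # a is a proper part of b
        return "TPP" if a0 == b0 or a1 == b1 else "NTPP"
    if a0 <= b0 and b1 <= a1:             # b is a proper part of a
        return "TPPi" if a0 == b0 or a1 == b1 else "NTPPi"
    return "PO"                           # partial overlap

gripper_x, part_x = (0.10, 0.20), (0.18, 0.30)
print(rcc8_1d(gripper_x, part_x))         # PO: the gripper overlaps the part along x
```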
Zhang et al. [98] developed a prediction-based model for human–robot collaboration in assembly tasks. This model predicts human intention by using contextual cues and an embedded learning from demonstration technique. The robot observes and learns the human worker’s intentions from their movements, focusing on the history of human hand manipulation actions and the movement of parts in the collaboration workspace. It uses a state-enhanced Convolutional Long Short-Term Memory (ConvLSTM) framework to extract high-level spatiotemporal features from the shared workspace. This helps predict future actions and allows for smooth task transitions. The main sensing device for data collection is an overhead vision sensor that captures real-time image data at 10 Hz. The advantage of using this contextual cue is that the robot can assist human workers by anticipating their needs and delivering the right components at the right time. However, this technique can struggle with occlusions because of the camera’s limited field of view.
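A minimal Keras sketch of a ConvLSTM classifier over a short sequence of overhead frames is given below; the input resolution, sequence length, and number of action classes are assumptions, and the state-enhanced architecture of [98] is not reproduced.
```python
# Minimal ConvLSTM sketch: classify the next action from a short sequence of
# overhead workspace frames (untrained, illustrative dimensions).
import numpy as np
import tensorflow as tf

SEQ_LEN, H, W, N_ACTIONS = 8, 64, 64, 5

inputs = tf.keras.Input(shape=(SEQ_LEN, H, W, 3))
x = tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(N_ACTIONS, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)                 # train on labelled sequences

frames = np.random.rand(1, SEQ_LEN, H, W, 3).astype("float32")   # 0.8 s at 10 Hz
print("predicted next action:", int(model.predict(frames, verbose=0).argmax()))
```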

4. Discussion

This review of human intention recognition frameworks in industrial collaborative robotics shows a major shift from basic, pre-programmed systems to more complex and adaptable solutions. This change emphasises the important role of intention recognition in improving safety and efficiency in today’s manufacturing settings. The main goal is to help collaborative robots accurately understand human intentions. This understanding goes beyond simple commands and includes subtle behavioural cues and context. Doing so encourages seamless human–robot collaboration. This ability is crucial for proactive robotic assistance, enhancing workflow, accelerating task completion, and facilitating natural interactions.

4.1. Progression of Learning Models and Sensory Approaches

The development of intention inference techniques is evident, starting from Rule-Based models (e.g., Finite State Machines, Behaviour Trees, Ontology-Based, Semantic Knowledge-Based, Fuzzy Rules, and Petri Nets), which, due to their explicit logic and structured representations, remain highly interpretable and suitable for safety-critical environments [35,36,38,39,40,41,97]. These are followed by Probabilistic models (e.g., Bayesian Networks, Particle Filters, Gaussian Mixture Models, Probabilistic Dynamic Movement Primitives, and Hidden Markov Models), adept at handling the inherent uncertainty and variability in human behaviour by assigning probabilities to different intentions [51,53,54,55,99]. The most recent advancements lie in Machine Learning models [8,10,57,58,59,60] and Deep Learning models [66,67,68,69,70,71], which empower robots with human-like adaptability, superior pattern recognition, and decision-making capabilities, essential for moving beyond inflexible routines and achieving predictive understanding. These algorithms are fundamental to allowing robots to anticipate human needs, proactively prepare tools, and eliminate wasted time, thereby boosting overall efficiency and safety. Table 9 provides a comparative analysis of the intention recognition computational models.
Complementing these algorithmic developments is a focus on combining different sensory inputs to understand human intentions more fully. The paper categorises these inputs into physical cues, such as hand gestures, body posture, hand-arm motions, gaze, and force interactions; physiological cues, like EEG and EMG; and contextual cues, including environmental states and spatial relationships. By integrating these different types of information, robots can grasp both clear commands and subtle signals. This approach goes beyond just completing tasks. It helps in understanding the nuanced behaviour and context that shape human decision-making. For example, gaze offers immediate predictive signals [100], force interaction provides real-time feedback, and the EEG reflects internal cognitive states like workload and alertness [90]. Together, these factors lead to a deeper understanding of human intent.

4.2. Interaction Types and Application Cases

Table 10 categorises the interaction types found in the examined human–robot collaboration studies. These interactions occur in both industrial and experimental settings and show how different collaboration tasks have unique requirements for intention recognition. The table organises each interaction type based on its characteristics, related application case, and the cues the robots use to understand human intention. This organised view highlights the variety of collaboration modes. It also shows how context and physical and cognitive processes shape intention recognition approaches.

4.3. Key Limitations

Despite these significant technological advances, challenges still exist in achieving smooth and effective human–robot collaboration. These issues highlight important gaps for future research. The main limitations focus on the need for real-time processing and the ability to generalise across different industrial applications. Many existing methods, especially in deep learning, struggle with the heavy computing power needed for real-time analysis in complex, changing industrial settings. Additionally, the lack of strong generalisation means that models trained on a specific task, or a small group of users, often do not work well in different real-world situations, with product variations, or with various human operators [71]. The specific limitations identified across the various methods are:
  • Scalability and Complexity: Rule-based models are interpretable, but they often lack flexibility. As tasks grow, their complexity can increase quickly, making them difficult to manage in changing environments and limiting their ability to adapt to new interactions. They also depend on simple, linearly classifiable gestures, which restricts their applicability to more complex human actions [35,36,38,39,40,41,97].
  • Accuracy and Robustness: Probabilistic models can show temporary mispredictions when dealing with ambiguous inputs or similar initial trajectories [51,53,54,55,99]. Machine learning [8,10,57,58,59,60] and deep learning [66,67,68,69,70,71] models are only as robust as their training data, so their accuracy degrades under conditions that differ from those seen during training.
  • Latency in Prediction: While the goal is to provide proactive assistance, several methods introduce delays, either through extensive data collection or through heavy processing. These delays hinder real-time responsiveness and reduce the robot’s ability to act in advance [14].
  • Multimodal Integration Challenges: Although multimodal input is important, effectively combining different types of sensory data is still a challenge. Each cue has its own limitations. For example, gaze can be delayed and less stable [86]. Force interaction may create confusion [88]. Physiological signals can vary widely among users [89]. Additionally, contextual cues from overhead vision can be blocked [98].
  • Cognitive Understanding and Implicit Cues: Current systems often emphasise explicit actions and treat human intention primarily as a task-planning problem. There is a significant gap in understanding subtle, implicit human signals and internal cognitive states. Many models continue to rely on predictions made at a single point in time rather than capturing the complete and evolving intent behind human behaviour [70].
  • Data Requirements: Machine Learning [8,10,57,58,59,60] and Deep Learning [66,67,68,69,70,71] approaches depend on extensive and meticulously labelled datasets, which are difficult to obtain for varying industrial scenarios. Studies are often conducted with small or limited datasets, thus frequently failing to capture the full variability of real-world conditions.
  • Interpretability: Machine Learning and Deep learning models often function as “black boxes”, lacking transparency [101].
  • Lack of Standardised Evaluation: Frameworks are often tested in very specific, constrained, or virtual environments. This makes it difficult to assess how well they perform in real industrial conditions.
  • Usability: Several systems need a complicated setup, careful adjustments, or manual parameter selection from users. This makes them difficult to deploy and adapt.
  • Industrial Readiness: Many proposed frameworks have high computational demands and latency. This makes real-time processing difficult, which is crucial for dynamic industrial applications.

4.4. Recommendations

Future research in the context of human intention recognition should prioritise the following key aspects:
  • Sophisticated Sensor fusion: Combining different sensory inputs. This will help overcome the limits of individual sensors.
  • Generalisable learning models: Focus on creating models that can perform well across a variety of tasks and for different users.
  • Operation in complex environments: Ensuring systems work well in real-time, changing, and unstructured industrial environments.
  • Address prediction latency: Cutting down the time it takes to predict human intentions. This will enable more proactive and seamless collaboration.
  • Deeper cognitive understanding: Creating models that move past basic action recognition. The goal is to reach a deeper understanding of human intent, which includes grasping ambiguous and implicit signals.
  • Account for Human and Environmental Variability: Human intentions are not static. Humans continually adapt their actions in response to the robot’s behaviour and dynamic environmental conditions. This makes intention prediction highly context-dependent and variable over time.
  • Extensive exploration on hybrid models: Developing next-generation hybrid models that can combine the strengths of various algorithmic approaches.
  • Ethical aspects: Investigating the ethical issues, such as the possibility of algorithmic bias in models and the wider effects of using these technologies in the workplace.
  • Integration of Complementary Control Approaches: Future research can improve by including complementary control strategies. For example, focusing on learning the desired motion profile to satisfy a force objective [102]. It can also learn environmental dynamics to allow the robot to follow the user’s motion intent with high manoeuvrability [103].
  • Standardisation efforts: Establishing standardised benchmark datasets and evaluation protocols. This would address the current limitation where systems are often tested in very specific or limited scenarios. Standardisation would allow for fairer comparisons and speed up the development of truly strong and industry-ready solutions.
  • Moving Beyond Prescriptive Safety: The current safety standards for collaborative robotics, such as ISO/TS 15066, offer important but often rigid guidelines. The fast development of technology in collaborative workspaces can benefit from a shift from reactive safety to proactive safety measures [98].
  • Addressing Handedness Bias in Collaborative Robot Gesture Recognition: Future research must develop balanced, hand-agnostic models and datasets and gesture-based systems to ensure fair safety and efficiency for varying handedness.
  • Prioritising Elicitation Studies for Intent-Driven Human–robot Collaboration: Future frameworks require a user-centred system design that integrates comprehensive elicitation studies early in the development process. These studies are important for collecting rich behavioural data. This data captures human intent more effectively than kinematic information alone. Using this human-centred information helps create strong predictive models, enabling the robots to be more proactive than reactive.

5. Conclusions and Future Work

This review paper aimed to examine the changing landscape of human intention recognition frameworks in industrial collaborative robotics. It highlighted their crucial role in improving safety and efficiency in modern manufacturing environments. We explored how accurately understanding human intentions is essential for effective robotic assistance and seamless human–robot collaboration. This capability directly boosts safety, operational efficiency, and natural interaction.
The paper discussed advancements in learning techniques, tracing their development from interpretable Rule-Based and Probabilistic models to adaptive Machine Learning and sophisticated Deep Learning approaches. These algorithms together give robots human-like adaptability and decision-making skills, which are vital for dynamic and complex industrial tasks. At the same time, the review emphasised the need for integrating sensory modalities, categorised as physical, physiological, and contextual cues. When these are combined, they greatly improve the interpretation of human intentions through both explicit commands and implicit signals.
Despite the notable advancements, several critical open challenges remain that need to be addressed for widespread use in the industry. Based on our review, a significant limitation is that modern Machine Learning and Deep Learning models rely on large, carefully labelled datasets, which are difficult to acquire for different industrial situations.
Furthermore, many advanced models operate as “black boxes”, posing interpretability challenges in safety-critical industrial settings. Current systems also struggle to achieve real-time processing within complex environments and to generalise across industrial applications.
Looking ahead, future research should focus on addressing these issues. Crucially, although this review focuses on industrial collaborative robotics, future research may take a broader context, encompassing the collaboration of service robots and humanoids with humans in natural and social environments. One key area is the goal of achieving a deeper understanding of cognition. This means developing models that go beyond simple action recognition to capture subtle, ambiguous, and implicit human signals and internal thought processes. Making this shift is important for providing truly proactive assistance and ensuring smooth collaboration. Additionally, research should focus on developing next-generation hybrid models that effectively merge the strengths of different algorithmic methods, such as combining the interpretability of rule-based systems with the adaptability of probabilistic, machine learning, and deep learning techniques. To ensure that these systems are ready for industrial applications, future efforts must also account for the changing nature of human actions and environmental conditions. Finally, there should be efforts to standardise benchmark datasets and evaluation protocols to enable fairer comparisons and speed up the development of robust, industry-ready solutions. Addressing these issues will help close the current gap between human and robotic interactions, leading to truly adaptive, efficient, and safe collaborative robotic systems.

Author Contributions

Conceptualization, M.K., S.D. and N.S.; methodology, M.K., S.D. and N.S.; investigation, M.K.; writing—original draft preparation, M.K.; writing—review and editing, M.K., S.D., N.S., A.B. and H.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data sharing not applicable.

Acknowledgments

The authors acknowledge the Department of Electrical Engineering and French South African Institute of Technology (F’SATI) at Tshwane University of Technology, Pretoria, South Africa, for their continued support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AE	Autoencoder
AI	Artificial Intelligence
AJCP	Arm Joint Centre Positions
BCI	Brain–Computer Interface
BCI + VCV	Brain–Computer Interface Vigilance-Controlled Velocity
BETR-XP-LLM	Behaviour Tree Expansion with Large Language Models
BN	Bayesian Network
BTs	Behaviour Trees
CBAM	Convolutional Block Attention Module
CC	Creative Commons
CNN	Convolutional Neural Network
CNS	Central Nervous System
ConvLSTM	Convolutional Long Short-Term Memory
CPNs	Coloured Petri Nets
DOF	Degree-of-Freedom
DTW	Dynamic Time Warping
EEG	Electroencephalography
EM	Expectation–Estimation/Expectation–Maximisation
EMG	Electromyography
FDE	Final Displacement Error
fps	Frames per Second
F’SATI	French South African Institute of Technology
FSM	Finite State Machines
GMM	Gaussian Mixture Model
GMR	Gaussian Mixture Regression
HAR	Human Activity Recognition
HDD	Hard Disk Drive
HHIP	Human Handover Intention Prediction
HMDs	Head-Mounted Displays
HMM	Hidden Markov Model
HRCA	Human–Robot Collaboration Assembly
HSMM	Hidden Semi-Markov Model
I-DT	Identification using Dispersion Threshold
IK	Inverse Kinematics
ILM	Intention-Aware Linear Model
IMUs	Inertial Measurement Units
IRL	Inverse Reinforcement Learning
KNN	K-Nearest Neighbours
LLM	Large Language Model
LSTM	Long Short-Term Memory
mAP	Mean Average Precision
MDPs	Markov Decision Processes
MIF	Mutable Intention Filter
ML	Machine Learning
MPJPE	Mean Per Joint Position Error
ms	Milliseconds
NN	Neural Network
OCRA	Ontology for Collaborative Robotics and Adaptation
OWL	Web Ontology Language
PAM	Partitioning Around Medoids
PCA	Principal Component Analysis
PCCR	Pupil Centre-Corneal Reflection
PDMP	Probabilistic Dynamic Movement Primitive
PNS	Peripheral Nervous System
PP	Palm Position
RCC8	Region Connection Calculus 8
RL	Reinforcement Learning
RNN	Recurrent Neural Network
Seq-VAE	Sequential Variational Autoencoder
SNR	Signal-to-Noise Ratio
SRLs	Supernumerary Robotic Limbs
SSVEP	Steady-State Visual Evoked Potential
ST-GCN	Spatial–Temporal Graph Convolutional Networks
SVM	Support Vector Machine
TOH	Tower of Hanoi
UOLA	Unsupervised Online Learning Algorithm
VR	Virtual Reality

References

  1. Matheson, E.; Minto, R.; Zampieri, E.G.; Faccio, M.; Rosati, G. Human–robot collaboration in manufacturing applications: A review. Robotics 2019, 8, 100. [Google Scholar] [CrossRef]
  2. Semeraro, F.; Griffiths, A.; Cangelosi, A. Human–robot collaboration and machine learning: A systematic review of recent research. Robot. Comput.-Integr. Manuf. 2023, 79, 102432. [Google Scholar] [CrossRef]
  3. Arents, J.; Abolins, V.; Judvaitis, J.; Vismanis, O.; Oraby, A.; Ozols, K. Human-Robot Collaboration Trends and Safety Aspects: A Systematic Review. J. Sens. Actuator Netw. 2021, 10, 48. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Doyle, T. Integrating intention-based systems in human-robot interaction: A scoping review of sensors, algorithms, and trust. Front. Robot. AI 2023, 10, 1233328. [Google Scholar] [CrossRef] [PubMed]
  5. Schmid, A.J.; Weede, O.; Worn, H. Proactive robot task selection given a human intention estimate. In Proceedings of the RO-MAN 2007-The 16th IEEE International Symposium on Robot and Human Interactive Communication, Jeju, Republic of Korea, 26–29 August 2007; pp. 726–731. [Google Scholar]
  6. Khan, F.; Asif, S.; Webb, P. Communication components for human intention prediction–a survey. In Proceedings of the 14th International Conference on Applied Human Factors and Ergonomics (AHFE 2023), San Francisco, CA, USA, 20–24 July 2023. [Google Scholar]
  7. Hoffman, G.; Bhattacharjee, T.; Nikolaidis, S. Inferring human intent and predicting human action in human–robot collaboration. Annu. Rev. Control Robot. Auton. Syst. 2024, 7, 73–95. [Google Scholar] [CrossRef]
  8. Lin, H.-I.; Nguyen, X.-A.; Chen, W.-K. Active intention inference for robot-human collaboration. Int. J. Comput. Methods Exp. Meas. 2018, 6, 772–784. [Google Scholar] [CrossRef]
  9. Liu, C.; Hamrick, J.; Fisac, J.; Dragan, A.; Hedrick, J.; Sastry, S.; Griffiths, T. Goal Inference Improves Objective and Perceived Performance in Human-Robot Collaboration. arXiv 2018, arXiv:1802.01780. [Google Scholar] [CrossRef]
  10. Olivares-Alarcos, A.; Foix, S.; Alenya, G. On inferring intentions in shared tasks for industrial collaborative robots. Electronics 2019, 8, 1306. [Google Scholar] [CrossRef]
  11. Zhang, Y.; Ding, K.; Hui, J.; Lv, J.; Zhou, X.; Zheng, P. Human-object integrated assembly intention recognition for context-aware human-robot collaborative assembly. Adv. Eng. Inform. 2022, 54, 101792. [Google Scholar] [CrossRef]
  12. Johannsmeier, L.; Haddadin, S. A hierarchical human-robot interaction-planning framework for task allocation in collaborative industrial assembly processes. IEEE Robot. Autom. Lett. 2016, 2, 41–48. [Google Scholar] [CrossRef]
  13. Mohammadi Amin, F.; Rezayati, M.; van de Venn, H.W.; Karimpour, H. A mixed-perception approach for safe human–robot collaboration in industrial automation. Sensors 2020, 20, 6347. [Google Scholar] [CrossRef] [PubMed]
  14. Zhang, X.; Tian, S.; Liang, X.; Zheng, M.; Behdad, S. Early Prediction of Human Intention for Human–Robot Collaboration Using Transformer Network. J. Comput. Inf. Sci. Eng. 2024, 24, 051003. [Google Scholar] [CrossRef]
  15. Gomez Cubero, C.; Rehm, M. Intention recognition in human robot interaction based on eye tracking. In Proceedings of the IFIP Conference on Human-Computer Interaction, Bari, Italy, 30 August–3 September 2021; pp. 428–437. [Google Scholar]
  16. Nicora, M.L.; Ambrosetti, R.; Wiens, G.J.; Fassi, I. Human–robot collaboration in smart manufacturing: Robot reactive behavior intelligence. J. Manuf. Sci. Eng. 2021, 143, 031009. [Google Scholar] [CrossRef]
  17. Van Den Broek, M.K.; Moeslund, T.B. Ergonomic adaptation of robotic movements in human-robot collaboration. In Proceedings of the Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, Cambridge, UK, 23–26 March 2020; pp. 499–501. [Google Scholar]
  18. Gildert, N.; Millard, A.G.; Pomfret, A.; Timmis, J. The Need for Combining Implicit and Explicit Communication in Cooperative Robotic Systems. Front. Robot. AI 2018, 5, 65. [Google Scholar] [CrossRef]
  19. Che, Y.; Okamura, A.M.; Sadigh, D. Efficient and trustworthy social navigation via explicit and implicit robot–human communication. IEEE Trans. Robot. 2020, 36, 692–707. [Google Scholar] [CrossRef]
  20. Baptista, J.; Castro, A.; Gomes, M.; Amaral, P.; Santos, V.; Silva, F.; Oliveira, M. Human–Robot Collaborative Manufacturing Cell with Learning-Based Interaction Abilities. Robotics 2024, 13, 107. [Google Scholar] [CrossRef]
  21. Knepper, R.A.; Mavrogiannis, C.I.; Proft, J.; Liang, C. Implicit communication in a joint action. In Proceedings of the 2017 Acm/Ieee International Conference on Human-Robot Interaction, Vienna, Austria, 6–9 March 2017; pp. 283–292. [Google Scholar]
  22. Bogue, R. Sensors for robotic perception. Part one: Human interaction and intentions. Ind. Robot 2015, 42, 386–391. [Google Scholar] [CrossRef]
  23. Safeea, M.; Neto, P.; Béarée, R. A quest towards safe human robot collaboration. In Proceedings of the Towards Autonomous Robotic Systems: 20th Annual Conference, TAROS 2019, London, UK, 3–5 July 2019; pp. 493–495. [Google Scholar]
  24. Hoskins, G.O.; Padayachee, J.; Bright, G. Human-robot interaction: The safety challenge (an inegrated frame work for human safety). In Proceedings of the 2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA), Bloemfontein, South Africa, 28–30 January 2019; pp. 74–79. [Google Scholar]
  25. Anand, G.; Rahul, E.; Bhavani, R.R. A sensor framework for human-robot collaboration in industrial robot work-cell. In Proceedings of the 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), Kerala, India, 6–7 July 2017; pp. 715–720. [Google Scholar]
  26. Mia, M.R.; Shuford, J. Exploring the Synergy of Artificial Intelligence and Robotics in Industry 4.0 Applications. J. Artif. Intell. Gen. Sci. 2024, 1. [Google Scholar] [CrossRef]
  27. Pereira, L.M. State-of-the-art of intention recognition and its use in decision making. AI Commun. 2013, 26, 237–246. [Google Scholar]
  28. Rozo, L.; Silvério, J.; Calinon, S.; Caldwell, D.G. Exploiting interaction dynamics for learning collaborative robot behaviors. In Proceedings of the 2016 AAAI International Joint Conference on Artificial Intelligence: Interactive Machine Learning Workshop (IJCAI), New York, NY, USA, 9–11 July 2016. [Google Scholar]
  29. Görür, O.C.; Rosman, B.; Sivrikaya, F.; Albayrak, S. Social cobots: Anticipatory decision-making for collaborative robots incorporating unexpected human behaviors. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, Chicago, IL, USA, 5–8 March 2018; pp. 398–406. [Google Scholar]
  30. Görür, O.C.; Rosman, B.; Sivrikaya, F.; Albayrak, S. FABRIC: A framework for the design and evaluation of collaborative robots with extended human adaptation. ACM Trans. Hum.-Robot Interact. 2023, 12, 1–54. [Google Scholar] [CrossRef]
  31. Cipriani, G.; Bottin, M.; Rosati, G. Applications of learning algorithms to industrial robotics. In Proceedings of the The International Conference of IFToMM ITALY, Online, 9–11 September 2020; pp. 260–268. [Google Scholar]
  32. Ding, H.; Schipper, M.; Matthias, B. Collaborative behavior design of industrial robots for multiple human-robot collaboration. In Proceedings of the IEEE ISR 2013, Seoul, Republic of Korea, 24–26 October 2013; pp. 1–6. [Google Scholar]
  33. Abraham, A. Rule-Based expert systems. In Handbook of Measuring System Design; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2005. [Google Scholar]
  34. Wang, W.; Li, R.; Chen, Y.; Jia, Y. Human intention prediction in human-robot collaborative tasks. In Proceedings of the Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, Chicago, IL, USA, 5–8 March 2018; pp. 279–280. [Google Scholar]
  35. Zhao, X.; He, Y.; Chen, X.; Liu, Z. Human–robot collaborative assembly based on eye-hand and a finite state machine in a virtual environment. Appl. Sci. 2021, 11, 5754. [Google Scholar] [CrossRef]
  36. Styrud, J.; Iovino, M.; Norrlöf, M.; Björkman, M.; Smith, C. Automatic Behavior Tree Expansion with LLMs for Robotic Manipulation. arXiv 2024, arXiv:2409.13356. [Google Scholar]
  37. Olivares-Alarcos, A.; Foix, S.; Borgo, S.; Alenyà, G. OCRA–An ontology for collaborative robotics and adaptation. Comput. Ind. 2022, 138, 103627. [Google Scholar] [CrossRef]
  38. Akkaladevi, S.C.; Plasch, M.; Hofmann, M.; Pichler, A. Semantic knowledge based reasoning framework for human robot collaboration. Procedia CIRP 2021, 97, 373–378. [Google Scholar] [CrossRef]
  39. Zou, R.; Liu, Y.; Li, Y.; Chu, G.; Zhao, J.; Cai, H. A Novel Human Intention Prediction Approach Based on Fuzzy Rules through Wearable Sensing in Human–Robot Handover. Biomimetics 2023, 8, 358. [Google Scholar] [CrossRef]
  40. Cao, C.; Yang, C.; Zhang, R.; Li, S. Discovering intrinsic spatial-temporal logic rules to explain human actions. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2023; Volume 36, pp. 67948–67959. [Google Scholar]
  41. Llorens-Bonilla, B.; Asada, H.H. Control and coordination of supernumerary robotic limbs based on human motion detection and task petri net model. In Proceedings of the Dynamic Systems and Control Conference, Palo Alto, CA, USA 21–23 October 2013; p. V002T027A006. [Google Scholar]
  42. Bdiwi, M.; Al Naser, I.; Halim, J.; Bauer, S.; Eichler, P.; Ihlenfeldt, S. Towards safety4.0: A novel approach for flexible human-robot-interaction based on safety-related dynamic finite-state machine with multilayer operation modes. Front. Robot. AI 2022, 9, 1002226. [Google Scholar] [CrossRef]
  43. Awais, M.; Henrich, D. Human-robot collaboration by intention recognition using probabilistic state machines. In Proceedings of the 19th International Workshop on Robotics in Alpe-Adria-Danube Region (RAAD 2010), Budapest, Hungary, 24–26 June 2010; pp. 75–80. [Google Scholar]
  44. Iodice, F.; De Momi, E.; Ajoudani, A. Intelligent Framework for Human-Robot Collaboration: Safety, Dynamic Ergonomics, and Adaptive Decision-Making. arXiv 2025, arXiv:2503.07901. [Google Scholar] [CrossRef]
  45. Colledanchise, M.; Ögren, P. Behavior Trees in Robotics and AI: An Introduction; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  46. Palm, R.; Chadalavada, R.T.; Lilienthal, A. Fuzzy Modeling and Control for Intention Recognition in Human-robot Systems. In Proceedings of the 8th International Joint Conference on Computational Intelligence, Porto, Portugal, 9–11 November 2016. [Google Scholar]
  47. Nawaz, F.; Peng, S.; Lindemann, L.; Figueroa, N.; Matni, N. Reactive temporal logic-based planning and control for interactive robotic tasks. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 12108–12115. [Google Scholar]
  48. Wu, B.; Hu, B.; Lin, H. A learning based optimal human robot collaboration with linear temporal logic constraints. arXiv 2017, arXiv:1706.00007. [Google Scholar] [CrossRef]
  49. Zurawski, R.; Zhou, M. Petri nets and industrial applications: A tutorial. IEEE Trans. Ind. Electron. 1994, 41, 567–583. [Google Scholar] [CrossRef]
  50. Ebert, S. A model-driven approach for cobotic cells based on Petri nets. In Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, Virtual, 16–23 October 2020; pp. 1–6. [Google Scholar]
  51. Hernandez-Cruz, V.; Zhang, X.; Youcef-Toumi, K. Bayesian intention for enhanced human robot collaboration. arXiv 2024, arXiv:2410.00302. [Google Scholar] [CrossRef]
  52. Huang, Z.; Mun, Y.-J.; Li, X.; Xie, Y.; Zhong, N.; Liang, W.; Geng, J.; Chen, T.; Driggs-Campbell, K. Hierarchical intention tracking for robust human-robot collaboration in industrial assembly tasks. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 9821–9828. [Google Scholar]
  53. Qu, J.; Li, Y.; Liu, C.; Wang, W.; Fu, W. Prediction of Assembly Intent for Human-Robot Collaboration Based on Video Analytics and Hidden Markov Model. Comput. Mater. Contin. 2025, 84, 3787–3810. [Google Scholar] [CrossRef]
  54. Luo, R.C.; Mai, L. Human intention inference and on-line human hand motion prediction for human-robot collaboration. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 5958–5964. [Google Scholar]
  55. Lyu, J.; Ruppel, P.; Hendrich, N.; Li, S.; Görner, M.; Zhang, J. Efficient and collision-free human–robot collaboration based on intention and trajectory prediction. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 1853–1863. [Google Scholar] [CrossRef]
  56. Liu, T.; Wang, J.; Meng, M.Q.H. Evolving hidden Markov model based human intention learning and inference. In Proceedings of the 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), Zhuhai, China, 6–9 December 2015; pp. 206–211. [Google Scholar]
  57. Luo, R.; Hayne, R.; Berenson, D. Unsupervised early prediction of human reaching for human–robot collaboration in shared workspaces. Auton. Robot. 2018, 42, 631–648. [Google Scholar] [CrossRef]
  58. Vinanzi, S.; Goerick, C.; Cangelosi, A. Mindreading for robots: Predicting intentions via dynamical clustering of human postures. In Proceedings of the 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Oslo, Norway, 19–22 August 2019; pp. 272–277. [Google Scholar]
  59. Xiao, S.; Wang, Z.; Folkesson, J. Unsupervised robot learning to predict person motion. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 691–696. [Google Scholar]
  60. Zhang, X.; Yi, D.; Behdad, S.; Saxena, S. Unsupervised Human Activity Recognition Learning for Disassembly Tasks. IEEE Trans. Ind. Inform. 2023, 20, 785–794. [Google Scholar] [CrossRef]
  61. Fruggiero, F.; Lambiase, A.; Panagou, S.; Sabattini, L. Cognitive human modeling in collaborative robotics. Procedia Manuf. 2020, 51, 584–591. [Google Scholar] [CrossRef]
  62. Ogenyi, U.E.; Liu, J.; Yang, C.; Ju, Z.; Liu, H. Physical human–robot collaboration: Robotic systems, learning methods, collaborative strategies, sensors, and actuators. IEEE Trans. Cybern. 2019, 51, 1888–1901. [Google Scholar] [CrossRef] [PubMed]
  63. Obidat, O.; Parron, J.; Li, R.; Rodano, J.; Wang, W. Development of a teaching-learning-prediction-collaboration model for human-robot collaborative tasks. In Proceedings of the 2023 IEEE 13th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Qinhuangdao, China, 11–14 July 2023; pp. 728–733. [Google Scholar]
  64. Vinanzi, S.; Cangelosi, A.; Goerick, C. The collaborative mind: Intention reading and trust in human-robot interaction. Iscience 2021, 24, 102130. [Google Scholar] [CrossRef]
  65. Liu, R.; Chen, R.; Abuduweili, A.; Liu, C. Proactive human-robot co-assembly: Leveraging human intention prediction and robust safe control. In Proceedings of the 2023 IEEE Conference on Control Technology and Applications (CCTA), Bridgetown, Barbados, 16–18 August 2023; pp. 339–345. [Google Scholar]
  66. Rekik, K.; Gajjar, N.; Da Silva, J.; Müller, R. Predictive intention recognition using deep learning for collaborative assembly. In Proceedings of the 2024 10th International Conference on Control, Decision and Information Technologies (CoDIT), Vallette, Malta, 1–4 July 2024; pp. 1153–1158. [Google Scholar]
  67. Kamali Mohammadzadeh, A.; Alinezhad, E.; Masoud, S. Neural-Network-Driven Intention Recognition for Enhanced Human–Robot Interaction: A Virtual-Reality-Driven Approach. Machines 2025, 13, 414. [Google Scholar] [CrossRef]
  68. Keshinro, B.; Seong, Y.; Yi, S. Deep Learning-based human activity recognition using RGB images in Human-robot collaboration. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2022, 66, 1548–1553. [Google Scholar] [CrossRef]
  69. Maceira, M.; Olivares-Alarcos, A.; Alenya, G. Recurrent neural networks for inferring intentions in shared tasks for industrial collaborative robots. In Proceedings of the 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Naples, Italy, 31 August–4 September 2020; pp. 665–670. [Google Scholar]
  70. Kedia, K.; Bhardwaj, A.; Dan, P.; Choudhury, S. Interact: Transformer models for human intent prediction conditioned on robot actions. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 621–628. [Google Scholar]
  71. Dell’Oca, S.; Matteri, D.; Montini, E.; Cutrona, V.; Barut, Z.M.; Bettoni, A. Improving collaborative robotics: Insights on the impact of human intention prediction. In Proceedings of the International Workshop on Human-Friendly Robotics, Lugano, Switzerland, 30 September–1 October 2024; pp. 1–15. [Google Scholar]
  72. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  73. Miyazawa, K.; Nagai, T. Survey on multimodal transformers for robots. TechRxiv 2023. [Google Scholar] [CrossRef]
  74. Ding, P.; Zhang, J.; Zhang, P.; Lyu, Y. Large Language Model-powered Operator Intention Recognition for Human-Robot Collaboration. IFAC-PapersOnLine 2025, 59, 2280–2285. [Google Scholar] [CrossRef]
  75. Liu, H.; Gamboa, H.; Schultz, T. Sensor-Based Human Activity and Behavior Research: Where Advanced Sensing and Recognition Technologies Meet. Sensors 2023, 23, 125. [Google Scholar] [CrossRef]
  76. Xu, B.; Li, J.; Wong, Y.; Zhao, Q.; Kankanhalli, M.S. Interact as you intend: Intention-driven human-object interaction detection. IEEE Trans. Multimed. 2019, 22, 1423–1432. [Google Scholar] [CrossRef]
  77. Schreiter, T.; Rudenko, A.; Rüppel, J.V.; Magnusson, M.; Lilienthal, A.J. Multimodal Interaction and Intention Communication for Industrial Robots. arXiv 2025, arXiv:2502.17971. [Google Scholar] [CrossRef]
  78. Lei, Q.; Zhang, H.; Yang, Y.; He, Y.; Bai, Y.; Liu, S. An investigation of applications of hand gestures recognition in industrial robots. Int. J. Mech. Eng. Robot. Res. 2019, 8, 729–741. [Google Scholar] [CrossRef]
  79. Darmawan, I.; Rusydiansyah, M.H.; Purnama, I.; Fatichah, C.; Purnomo, M.H. Hand Gesture Recognition for Collaborative Robots Using Lightweight Deep Learning in Real-Time Robotic Systems. arXiv 2025, arXiv:2507.10055. [Google Scholar] [CrossRef]
  80. Shrinah, A.; Bahraini, M.S.; Khan, F.; Asif, S.; Lohse, N.; Eder, K. On the design of human-robot collaboration gestures. arXiv 2024, arXiv:2402.19058. [Google Scholar] [CrossRef]
  81. Abdulghafor, R.; Turaev, S.; Ali, M.A. Body language analysis in healthcare: An overview. Healthcare 2022, 10, 1251. [Google Scholar] [CrossRef]
  82. Kuo, C.-T.; Lin, J.-J.; Jen, K.-K.; Hsu, W.-L.; Wang, F.-C.; Tsao, T.-C.; Yen, J.-Y. Human posture transition-time detection based upon inertial measurement unit and long short-term memory neural networks. Biomimetics 2023, 8, 471. [Google Scholar] [CrossRef]
  83. Kurniawan, W.C.; Liang, Y.W.; Okumura, H.; Fukuda, O. Design of human motion detection for non-verbal collaborative robot communication cue. Artif. Life Robot. 2025, 30, 12–20. [Google Scholar] [CrossRef]
  84. Laplaza, J.; Moreno, F.; Sanfeliu, A. Enhancing robotic collaborative tasks through contextual human motion prediction and intention inference. Int. J. Soc. Robot. 2024, 17, 2077–2096. [Google Scholar] [CrossRef]
  85. Belcamino, V.; Takase, M.; Kilina, M.; Carfì, A.; Mastrogiovanni, F.; Shimada, A.; Shimizu, S. Gaze-Based Intention Recognition for Human-Robot Collaboration. In Proceedings of the 2024 International Conference on Advanced Visual Interfaces, Genoa, Italy, 3–7 June 2024. [Google Scholar]
  86. Ban, S.; Lee, Y.J.; Yu, K.J.; Chang, J.W.; Kim, J.-H.; Yeo, W.-H. Persistent human–machine interfaces for robotic arm control via gaze and eye direction tracking. Adv. Intell. Syst. 2023, 5, 2200408. [Google Scholar] [CrossRef]
  87. Losey, D.P.; McDonald, C.G.; Battaglia, E.; O’Malley, M.K. A review of intent detection, arbitration, and communication aspects of shared control for physical human–robot interaction. Appl. Mech. Rev. 2018, 70, 010804. [Google Scholar] [CrossRef]
  88. Zhou, Y.; Tang, N.; Li, Z.; Sun, H. Methodology for Human–Robot Collaborative Assembly Based on Human Skill Imitation and Learning. Machines 2025, 13, 431. [Google Scholar] [CrossRef]
  89. Savur, C.; Sahin, F. Survey on Physiological Computing in Human–Robot Collaboration. Machines 2023, 11, 536. [Google Scholar] [CrossRef]
  90. Fazli, B.; Sajadi, S.S.; Jafari, A.H.; Garosi, E.; Hosseinzadeh, S.; Zakerian, S.A.; Azam, K. EEG-Based Evaluation of Mental Workload in a Simulated Industrial Human-Robot Interaction Task. Health Scope 2025, 14, e158096. [Google Scholar] [CrossRef]
  91. Richter, B.; Putze, F.; Ivucic, G.; Brandt, M.; Schütze, C.; Reisenhofer, R.; Wrede, B.; Schultz, T. Eeg correlates of distractions and hesitations in human–robot interaction: A lablinking pilot study. Multimodal Technol. Interact. 2023, 7, 37. [Google Scholar] [CrossRef]
  92. Lyu, J.; Maye, A.; Görner, M.; Ruppel, P.; Engel, A.K.; Zhang, J. Coordinating human-robot collaboration by EEG-based human intention prediction and vigilance control. Front. Neurorobot. 2022, 16, 1068274. [Google Scholar] [CrossRef]
  93. Rani, P.; Sarkar, N.; Smith, C.A.; Kirby, L.D. Anxiety detecting robotic system–towards implicit human-robot collaboration. Robotica 2004, 22, 85–95. [Google Scholar] [CrossRef]
  94. Peternel, L.; Tsagarakis, N.; Ajoudani, A. A human–robot co-manipulation approach based on human sensorimotor information. IEEE Trans. Neural Syst. Rehabil. Eng. 2017, 25, 811–822. [Google Scholar] [CrossRef]
  95. Marić, J.; Petrović, L.; Marković, I. Human Intention Recognition in Collaborative Environments using RGB-D Camera. In Proceedings of the 2023 46th MIPRO ICT and Electronics Convention (MIPRO), Opatija, Croatia, 22–26 May 2023; pp. 350–355. [Google Scholar]
  96. Khan, S.U.; Sultana, M.; Danish, S.; Gupta, D.; Alghamdi, N.S.; Woo, S.; Lee, D.-G.; Ahn, S. Multimodal feature fusion for human activity recognition using human centric temporal transformer. Eng. Appl. Artif. Intell. 2025, 160, 111844. [Google Scholar] [CrossRef]
  97. Schlenoff, C.; Pietromartire, A.; Kootbally, Z.; Balakirsky, S.; Foufou, S. Ontology-based state representations for intention recognition in human–robot collaborative environments. Robot. Auton. Syst. 2013, 61, 1224–1234. [Google Scholar] [CrossRef]
  98. Zhang, Z.; Peng, G.; Wang, W.; Chen, Y.; Jia, Y.; Liu, S. Prediction-Based Human-Robot Collaboration in Assembly Tasks Using a Learning from Demonstration Model. Sensors 2022, 22, 4279. [Google Scholar] [CrossRef]
  99. Huang, Z.; Mun, Y.-J.; Li, X.; Xie, Y.; Zhong, N.; Liang, W.; Geng, J.; Chen, T.; Driggs-Campbell, K. Hierarchical intention tracking for robust human-robot collaboration in industrial assembly tasks. arXiv 2022, arXiv:2203.09063. [Google Scholar]
  100. Nie, Y.; Ma, X. Gaze Based Implicit Intention Inference with Historical Information of Visual Attention for Human-Robot Interaction. In Intelligent Robotics and Applications; Springer: Cham, Switzerland, 2021; pp. 293–303. [Google Scholar]
  101. Zhou, J.; Su, X.; Fu, W.; Lv, Y.; Liu, B. Enhancing intention prediction and interpretability in service robots with LLM and KG. Sci. Rep. 2024, 14, 26999. [Google Scholar] [CrossRef]
  102. Xing, X.; Maqsood, K.; Zeng, C.; Yang, C.; Yuan, S.; Li, Y. Dynamic motion primitives-based trajectory learning for physical human–robot interaction force control. IEEE Trans. Ind. Inform. 2023, 20, 1675–1686. [Google Scholar] [CrossRef]
  103. Xing, X.; Burdet, E.; Si, W.; Yang, C.; Li, Y. Impedance learning for human-guided robots in contact with unknown environments. IEEE Trans. Robot. 2023, 39, 3705–3721. [Google Scholar] [CrossRef]
Figure 1. A model of human intention recognition in industrial collaborative robotics [8].
Figure 2. Learning techniques for human intention recognition in industrial collaborative robotics.
Figure 3. Cues for human intention recognition in industrial collaborative robotics.
Figure 4. Static human hand gestures for intention recognition.
Figure 5. Human skeletal tracking for body posture and intention recognition.
Figure 6. Tracking hand and arm motions for intention recognition in collaborative robotics (A–D).
Figure 7. Tracking of gaze and eye direction for intention recognition in collaborative robotics.
Figure 8. Human intention inferred through force signals during a collaborative task.
Figure 9. Implementation of EEG for intention recognition in collaborative robotics.
Figure 10. Implementation of EMG for intention recognition in collaborative robotics.
Table 1. Rule-Based Approaches for Intention Recognition in Industrial Collaborative Robotics.

| Model | Input Data | Sensing Device | Scenario | Key Features/Architecture | Performance | Limitations | Key Findings | Ref. |
|---|---|---|---|---|---|---|---|---|
| Finite State Machines (FSM) | Hand and eye gaze data | Leap Motion sensor and Tobii Eye Tracker 4C | 6-DOF robot assembling building blocks in a virtual setting | FSM as a state transfer module | 100% accuracy; assembly time reduced by 57.6% | Delays and stability issues | Efficient eye–hand coordination with the FSM | [35] |
| Behaviour Trees (BTs) | Natural language instructions | Azure Kinect camera (object detection) | ABB YuMi robot picking and placing cubes | BT with a Large Language Model (LLM), GPT-4-1106 | 100% goal interpretation | Scalability uncertainty | High goal interpretation accuracy | [36] |
| Ontology-Based Rules | Hand pose and velocity measurements | HTC Vive Tracker | A human and a robot collaboratively fill a tray during an industrial kitting task | Web Ontology Language 2 (OWL 2) reasoner for intention inference | Maintains robustness in incongruent limit cases | Decreased expressiveness of the computational formalisation | Enhanced robotic assistance and safety via intention inference | [37] |
| Semantic Knowledge-Based Rules | Perception data | Camera | Human–robot collaborative teaching in an industrial assembly | GrakAI for semantic knowledge representation | Reduced training data requirements | Manual modelling of event rules | Better management of novel situations | [38] |
| Fuzzy Rules | Hand gestures | Wearable data glove | Human–robot handover | Fuzzy rules-based Fast Human Handover Intention Prediction (HHIP) | Average prediction accuracy of 99.6% | Prediction accuracy is not yet 100% | The method avoids visual occlusion and safety risks | [39] |
| Temporal Logic Rules | Skeletal data | Azure Kinect DK camera | Human–robot decelerator assembly | Spatial–Temporal Graph Convolutional Networks (ST-GCN) | Top-1 accuracy of 40%; mAP of 96.89% at 54.37 fps | Lower accuracy of assembly action recognition in isolation | Framework reliably deduces agent roles with high probability | [40] |
| Petri Nets | Human gestures | Wearable Inertial Measurement Units (IMUs) | Aircraft assembly task using Supernumerary Robotic Limbs (SRLs) | Coloured Petri Nets (CPNs) with graphical notations and high-level programming capabilities | Successful detection of 90% of nods | Linearly classified simple gestures only | Minimal states and perfect resource management | [41] |

Table 2. Analysis of Rule-Based Approaches for Intention Recognition in Industrial Collaborative Robotics.

| Computational Model | Advantages | Disadvantages | Representative Application Scenarios | Potential Improvements | Ref. |
|---|---|---|---|---|---|
| FSM | Effective for discrete state control; clear stages for optimisation; reduces user burden | No direct incorporation of sensory feedback; potential delay and stability issues | Collaborative assembly tasks; virtual prototyping; semiautomatic operation management | Optimisation of interaction states; incorporation of a feedback mechanism | [35] |
| BTs | Transparent and readable; enhanced reactivity; strong structure and modularity; amenable to formal verification | Limited scalability; difficulty with long-horizon planning; challenges handling missing information; requires substantial engineering effort | Collaborative object manipulation tasks (pick-and-place operations) | Automated and continuous policy updates; eliminating reflective feedback; simplified parameter selection; robust resolution of ambiguous instructions | [36] |
| Ontology-Based Rules | Higher state recognition accuracy; key enabler for human–robot safety | Difficulty with non-convex objects; high ontology update overhead in dynamic environments | Cooperative human–robot environments; industrial collaborative assembly tasks | Model more complex spatial relationships; implement a system for failure determination and replanning | [37] |
| Semantic Knowledge-Based Rules | Effective handling of novel situations; knowledge portability across robots | Requires manual rule modelling and user interaction | Human–robot collaborative teaching; user-guided systems | Support for episodic definition of novel events; development of an action generator for task execution | [38] |
| Fuzzy Rules | Faster intention prediction; mitigates sensor-related safety threats | Challenges with sensor fusion and prediction | Human–robot handover; object transfer; robot motion adjustments | Integrate multimodal sensor information to improve intention detection accuracy | [39] |
| Temporal Logic Rules | Effective reactive behaviour specification; superior long-term future action prediction | Performance degrades with sub-optimal agents; complex and exhaustive prediction search | Human movement prediction in a collaborative task | Introduce constraints on robot motions; extensions for non-holonomic robots | [40] |
| Petri Nets | Effective for modelling concurrent and deterministic processes; excellent for managing parallel tasks | Basic structure unsuitable for complex tasks; incompatible with complex human uncertainties | Specialised complex collaborative assembly tasks | Increase sensor variety and use more sophisticated models | [41] |

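To make the rule-based family in Tables 1 and 2 more concrete, the following minimal Python sketch shows how a finite state machine can map discrete sensory events to intention states during a handover. The states, events, and transition rules are illustrative assumptions introduced for this review and do not reproduce any of the cited systems.

```python
# Minimal sketch of a rule-based finite state machine (FSM) for intention
# recognition, in the spirit of the approaches in Tables 1 and 2.
# The states, events, and transitions below are illustrative assumptions.

class IntentionFSM:
    """Maps discrete sensory events to intention states via fixed rules."""

    TRANSITIONS = {
        ("idle", "hand_enters_workspace"): "approach",
        ("approach", "hand_opens"): "request_handover",
        ("approach", "hand_withdraws"): "idle",
        ("request_handover", "object_grasped"): "handover_complete",
        ("handover_complete", "hand_withdraws"): "idle",
    }

    def __init__(self):
        self.state = "idle"

    def step(self, event: str) -> str:
        # Apply the transition rule if one exists; otherwise keep the state.
        self.state = self.TRANSITIONS.get((self.state, event), self.state)
        return self.state


if __name__ == "__main__":
    fsm = IntentionFSM()
    for event in ["hand_enters_workspace", "hand_opens", "object_grasped"]:
        print(event, "->", fsm.step(event))
```

In practice, the events would be produced by perception modules such as the gesture or gaze recognisers listed in Table 1, and the rule table would be elicited from task experts rather than hard-coded as above.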
Table 3. Probabilistic Models for Intention Recognition in Industrial Collaborative Robotics.

| Model | Input Data | Sensing Device | Scenario | Key Features/Architecture | Performance | Limitations | Key Findings | Ref. |
|---|---|---|---|---|---|---|---|---|
| Bayesian Network (BN) | Head and hand orientation; hand velocities | Intel RealSense D455 depth camera | Tabletop pick-and-place task involving a UR5 robot | BN for modelling causal relationships between variables | 85% accuracy; 36% precision; F1 score = 60% | Exhibits temporary mispredictions | Achieves real-time and smooth collision avoidance | [51] |
| Particle Filter | 3D human wrist trajectories | Intel RealSense RGBD camera | Assembly of Misumi waterproof E-Model crimp wire connectors | Mutable Intention Filter (MIF) as a particle filter variant | Frame-wise accuracy of 90.4% | Single-layer filter | Zero assembly failures; fast completion time; shorter guided path | [52] |
| Hidden Markov Model (HMM) | Assembly action sequence video data | Xiaomi 11 camera | Reducer assembly process | Integration with assembly task constraints | Prediction accuracy of 90.6% | Needs further optimisation | Efficient feature extraction | [53] |
| Probabilistic Dynamic Movement Primitive (PDMP) | 3D right-hand motion trajectories | Microsoft Kinect V1 | Tabletop manipulation task | Offline stage of PDMP construction | Good adaptation and generalisation | Temporal mismatches | High similarity of predicted hand motion trajectories | [54] |
| Gaussian Mixture Model (GMM) | Human palm trajectory data | PhaseSpace Impulse X2 motion-capture system | Shared-workspace pick-and-place and assembly tasks | GMM for human intention target estimation | Accurate and robust estimates | False classifications for close targets | Generates shorter robot trajectories and faster task execution | [55] |

Table 4. Analysis of Probabilistic Models for Intention Recognition in Industrial Collaborative Robotics.

| Computational Model | Advantages | Disadvantages | Representative Application Scenarios | Potential Improvements | Ref. |
|---|---|---|---|---|---|
| BN | High accuracy and speed; clear causality between variables; good interpretability | Single-modality baselines take less time to predict | Collaborative object pick-and-place operations | Automate obstacle size handling; offline LLM integration for structure | [51] |
| Particle Filter | Effective for tracking continuously changing intentions | The base MIF tracks intentions at a single layer only | Low-level intention tracking in collaborative assembly tasks | Generalise from rule-based intentions to latent states; explore time-varying intention transition settings | [52] |
| HMM | Requires a smaller sample size for prediction; mitigates action recognition errors effectively | Untrained sequences prevent prediction | Collaborative assembly: predicting the operator's next assembly intention | Integration of assembly task constraints; improve generalisation ability and real-time performance | [53] |
| PDMP | Easy adaptability and generalisation; better flexibility in unstructured environments; low noise sensitivity | High prediction error in very early movement stages; temporal mismatches due to the use of an average duration | Online inference of human intention in collaborative tasks (e.g., object manipulation) | Reduce path attractor differences; reduce temporal mismatch; generalise the framework to arm or whole-body motion prediction | [54] |
| GMM | Estimates the target accurately early on from the predicted trajectory; low computational load for few targets | Poor initial target accuracy when based only on the observed trajectory; not suited for trajectory shape regression | Estimating human reaching goals in collaborative assembly tasks | Fuse GMM input with additional information | [55] |

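The probabilistic models in Tables 3 and 4 share a common recursive structure: predict the next intention from a transition model, then update the belief with the likelihood of the current observation. The discrete Bayes filter below illustrates this cycle; the intention set, transition matrix, and observation model are invented for demonstration and are not taken from the cited studies.

```python
# Illustrative discrete Bayes filter over a small set of intention hypotheses,
# loosely analogous to the probabilistic models in Tables 3 and 4.
# All probabilities below are assumed values for demonstration only.

import numpy as np

intentions = ["reach_part_A", "reach_part_B", "idle"]

# P(intention_t | intention_{t-1}): mild persistence of the current intention.
transition = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])

# P(observation | intention) for three coarse observations of hand direction.
observation_model = np.array([
    # towards_A  towards_B  stationary
    [0.70,       0.20,      0.10],   # reach_part_A
    [0.20,       0.70,      0.10],   # reach_part_B
    [0.15,       0.15,      0.70],   # idle
])
obs_index = {"towards_A": 0, "towards_B": 1, "stationary": 2}

belief = np.array([1 / 3, 1 / 3, 1 / 3])  # uniform prior over intentions

for obs in ["towards_A", "towards_A", "stationary"]:
    belief = transition.T @ belief                     # predict step
    belief *= observation_model[:, obs_index[obs]]     # measurement update
    belief /= belief.sum()                             # normalise
    print(obs, dict(zip(intentions, belief.round(3))))
```

The same predict–update loop underlies the particle filter and HMM entries; they differ mainly in how the belief is represented (samples versus discrete states) and in how the transition and observation models are learned.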
Table 5. Machine Learning Models for Intention Recognition in Industrial Collaborative Robotics.

| Model | Input Data | Sensing Device | Scenario | Key Features/Architecture | Performance | Limitations | Key Findings | Ref. |
|---|---|---|---|---|---|---|---|---|
| K-Nearest Neighbours (KNN) | Labelled force/torque signals | ATI Multi-Axis Force/Torque sensor Mini40-Si-20-1 | Car emblem polishing: the robot holds the object while the human polishes | KNN with Dynamic Time Warping (DTW) | F1 score = 98.14% with an inference time of 0.85 s | Bias towards the "move" intent | Validation showed evidence of user adaptation | [10] |
| Unsupervised Learning (Online Clustering) | Unlabelled human reaching trajectories | VICON system | Predicting reaching motions for robot avoidance | Two-layer Gaussian Mixture Models (GMMs) with online unsupervised learning | 99.0% (simple tasks), 93.0% (realistic scenarios) | Reduced early prediction accuracy in complex settings | Faster decision making for human motion inference | [57] |
| X-means Clustering | Human skeletal data | iCub robot eye cameras | Block-building game with the iCub robot | X-means clustering with a Hidden Semi-Markov Model (HSMM) | 100% accuracy at ~57.5% task completion, 4.49 s latency | Reliance solely on skeletal input data | Eliminates the need for handcrafted plan libraries | [58] |
| Partitioning Around Medoids (PAM) | Short trajectories of people moving | LIDAR or RGBD camera | Predicting human motion for interference avoidance | Pre-trained SVM with PAM clustering | 70% prediction accuracy | Manual selection of the number of prototypes; small dataset | Prediction accuracy effectively demonstrated in a lab environment | [59] |
| Sequential Variational Autoencoder (Seq-VAE) | Unlabelled video frames of human poses | Digital camera | Hard disk drive disassembly | Seq-VAE with a Hidden Markov Model (HMM) | 91.52% recognition accuracy; reduced annotation effort | Uncertain robustness to real-world product variability | Successfully identifies continuous complex activities | [60] |
| Inverse Reinforcement Learning (IRL) | Images of human gestures and object attributes | Kinect camera | Coffee-making and pick-and-place tasks | IRL in Markov Decision Processes | Derived globally optimal policies; outperformed frequency-based optimisation | Insufficient demonstration data | Robot predicts successive actions and suggests support | [8] |

Table 6. Analysis of Machine Learning Models for Intention Recognition in Industrial Collaborative Robotics.

| Computational Model | Advantages | Disadvantages | Representative Application Scenarios | Potential Improvements | Ref. |
|---|---|---|---|---|---|
| KNN | Conceptual and algorithmic simplicity; good performance in time-series classification | Slow inference time; computationally intensive; cannot handle non-sequential data | Inferring intentions during physical collaborative tasks via time-series classification | Incorporate diverse contextual data | [10] |
| Online Clustering | No manual labelling required; models built on the fly; reduces the influence of noisy motions | Prediction performance degrades in challenging setups | Early prediction of human intent in industrial manipulation tasks | Explore ways to generate smoother predicted trajectories; explore fast motion-planning algorithms for the robot | [57] |
| X-means Clustering | Autonomously discovers the optimal number of clusters | Untrained sequences prevent prediction | Low-level clustering of human postures in collaborative tasks | Integration with new data sources | [58] |
| PAM | Minimises general pairwise dissimilarities | Manual selection of prototypes through trial and error | Unsupervised discovery of prototypical motion patterns in industrial settings | Automate the selection of the number of prototypes; incorporate cluster statistics | [59] |
| Seq-VAE | Reduces manual annotation needs; latent space clearly separates actions; captures detailed subactions effectively; distinguishes new actions without retraining | Requires hyperparameter tuning; higher mean square error (MSE) for frame reconstruction | Unsupervised human activity recognition in disassembly tasks | Integrate advanced self-supervised learning methods | [60] |
| IRL | Effectively discovers the optimal reward function; reduces learning time significantly | Ineffective when gesture recognition errors occur or demonstration data are insufficient | Pick-and-place collaborative tasks | Consider more actions to increase the versatility of the collaboration | [8] |

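As an illustration of the time-series classification used by the KNN-with-DTW entry in Tables 5 and 6, the following sketch implements dynamic time warping and a one-nearest-neighbour rule over toy force-magnitude profiles. The training sequences and the "press"/"hold" labels are hypothetical and serve only to show the mechanics.

```python
# Minimal 1-nearest-neighbour classifier with dynamic time warping (DTW),
# echoing the KNN + DTW entry in Tables 5 and 6. The toy force profiles and
# labels are fabricated purely to illustrate the mechanics.

import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def knn_predict(query, train_sequences, train_labels):
    # 1-NN: return the label of the training sequence with the smallest DTW distance.
    distances = [dtw_distance(query, s) for s in train_sequences]
    return train_labels[int(np.argmin(distances))]


# Toy training set: force-magnitude profiles for two hypothetical intentions.
train = [np.array([0.1, 0.5, 1.0, 1.2, 1.1]),   # "press"
         np.array([0.1, 0.2, 0.2, 0.1, 0.1])]   # "hold"
labels = ["press", "hold"]

print(knn_predict(np.array([0.0, 0.4, 0.9, 1.3]), train, labels))
```

DTW lets sequences of different lengths and speeds be compared directly, which is why it pairs naturally with nearest-neighbour classification of force or motion signals.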
Table 7. Deep Learning Models for Intention Recognition in Industrial Collaborative Robotics.

| Model | Input Data | Sensing Device | Scenario | Key Features/Architecture | Performance | Limitations | Key Findings | Ref. |
|---|---|---|---|---|---|---|---|---|
| Long Short-Term Memory (LSTM) | Time-series 3D tensor of human key-point coordinates and depth values | 3D Azure Kinect cameras | Human hand destination prediction in microcontroller housing assembly | Captures long-term dependencies | ~89% validation accuracy | Limited to predefined locations | Faster and more stable convergence | [66] |
| Convolutional Neural Network (CNN) | High-resolution, multimodal data of human motion and gestures | HTC Vive Trackers and Pro Eye Arena system | Completing a series of predefined activities in a virtual manufacturing environment | Extracts local spatial patterns and fine-grained motion details | F1 score = 0.95; response time = 0.81 s | Difficulty capturing complex gait patterns | Achieved near-perfect overall accuracy | [67] |
| Convolutional LSTM | RGB images resized to 64 × 64 pixels and normalised | Stationary Kinect camera | Predicting human intentions in HRC scenarios such as furniture assembly or table tennis | Combines a CNN for spatial feature extraction with an LSTM for temporal modelling | 74.11% accuracy | Accuracy too low for practical deployment | ConvLSTM clearly outperformed alternative methods | [68] |
| Recurrent Neural Network (RNN) | 6D force/torque sensor data from shared object manipulation | ATI Multi-Axis Force/Torque sensor | Classifying intentions in industrial polishing and collaborative object handling | Processes sequential force sensor measurements; softmax classification | F1 = 0.937; response time = 0.103 s | Specific to force-based tasks; may not generalise to vision-based scenarios | Speed superior to window-based methods | [69] |
| Transformer | Human poses | OptiTrack motion capture system | Close-proximity human–robot tasks such as cabinet pick, cart place, and tabletop manipulation | Two-stage training: human–human then human–robot | Lower FDE and better pose prediction at a 1 s horizon | Limited environments per task | Outperformed marginal models in human–robot interactions | [70] |
| Neural Network (NN) | Contextual information and visual features | Work-cell camera setup | Turn-based Tower of Hanoi (TOH) game | Multi-input neural network | Test accuracy above 95% | Single use case | Efficient, consistent, and predictable performance | [71] |

Table 8. Analysis of Deep Learning Models for Intention Recognition in Industrial Collaborative Robotics.

| Computational Model | Advantages | Disadvantages | Representative Application Scenarios | Potential Improvements | Ref. |
|---|---|---|---|---|---|
| LSTM | Faster response time; effectively captures long-term dependencies in sequential data | Mispredictions due to partial hand occlusions; prediction instability due to human hesitation | Real-time human intention prediction in collaborative assembly tasks | Incorporate a wider range of situations; develop a task-planning module | [66] |
| CNN | Excellent at extracting local spatial features; robust for multi-dimensional input data | Cannot, on its own, handle tasks requiring an understanding of temporal dynamics | Classification of the seven core human activities in collaborative tasks | Combine with other architectures to better capture long-range temporal dependencies | [67] |
| ConvLSTM | High prediction accuracy; effectively captures spatial and temporal relationships in video frames | Longer training time; limited to a few actions | Industrial collaborative tasks | Expand applications to more complex problems; increase the number of action types | [68] |
| RNN | Reduces decision latency; reacts flexibly and dynamically to changes in user intention | Performance saturation | Inferring operator intentions from force sensor data in a shared task | Expand input modalities | [69] |
| Transformer | Effectively tackles the dependency between human intent and robot actions | Training requires large-scale paired human–robot interaction data | Collaborative human–robot object manipulation | Expand the dataset to cover a wider distribution of motions | [70] |
| NN | Outperforms optimal logic in predicting human actions; provides more stable performance | Uses only single-point-in-time predictions | Collaborative tasks for anticipating human moves | Test in use cases with higher task variability | [71] |

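For the deep learning family in Tables 7 and 8, a recurrent classifier typically consumes a window of key-point coordinates and outputs a distribution over intention classes. The schematic below sketches such a model in PyTorch; the layer sizes, class count, and random input batch are assumptions chosen for illustration and do not reproduce the cited architectures.

```python
# Schematic LSTM classifier for sequences of 3-D hand key-point coordinates,
# in the spirit of the LSTM entry in Tables 7 and 8. Layer sizes, class count,
# and the random input are illustrative assumptions; PyTorch is used here only
# as one possible implementation choice.

import torch
import torch.nn as nn


class IntentionLSTM(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 64, n_classes: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features); classify from the last hidden state.
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])


model = IntentionLSTM()
dummy_batch = torch.randn(8, 30, 3)      # 8 sequences of 30 time steps, 3 features
logits = model(dummy_batch)
predicted = logits.argmax(dim=1)         # predicted intention class per sequence
print(predicted.shape)                   # torch.Size([8])
```

In a deployed system the random batch would be replaced by windows of tracked key points, and the output classes would correspond to task-specific intentions such as the predefined hand destinations in [66].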
Table 9. Critical Comparative Analysis of Intention Recognition Models.

| Aspect | Rule-Based Models | Probabilistic Models | Machine Learning and Deep Learning Models |
|---|---|---|---|
| Core Principle | Explicit logic using predefined rules and symbolic reasoning | Uncertainty modelling through probability distributions | Data-driven learning through pattern extraction and neural architectures |
| Interpretability | Highly interpretable due to explicit rules and decision logic | Moderately interpretable, though probabilistic dependencies may obscure reasoning | Low interpretability; often considered a "black box" |
| Adaptability | Low: limited to programmed rules, with poor scalability to novel scenarios | Moderate: adapts to uncertainty but relies on predefined probability structures | High: learns dynamically from large datasets and adapts to changing human behaviours |
| Data Requirement | Low: operates effectively with small or no datasets | Medium: requires data to estimate probability distributions | High: needs large, diverse datasets for training and generalisation |
| Robustness to Uncertainty | Poor: fails when faced with unexpected or noisy human behaviour | Strong: models behavioural variability effectively | Very strong: handles noisy, multimodal sensory inputs robustly |
| Computational Cost | Low: simple logical processing | Medium: depends on the complexity of the algorithmic structure | High: training and intention inference require intensive computation |
| Suitability for Industrial Environments | Ideal for safety-critical systems requiring transparency | Suitable where probabilistic uncertainty must be managed | Suitable for adaptive, context-rich environments requiring predictive intelligence |
| Weaknesses | Rigid; cannot generalise beyond predefined rules | Requires accurate probabilistic modelling and assumptions | Low explainability; potential for algorithmic bias |

Table 10. Analysis of Interaction Types Observed Across Reviewed Collaborative Robotics Studies.

| Interaction Type | Description | Application Case | Key Cues for Intention Recognition |
|---|---|---|---|
| Collaborative Assembly Tasks | Sequenced, interdependent tasks between a human and a robot working together | Part placement and alignment [52]; multi-component assembly [40,53,66,68]; complex and specialised assembly [41] | Physical and contextual cues (hand trajectory, component type, and sequence of operations) |
| Object Manipulation and Handover | Transferring or collaboratively handling objects in a workspace | Pick-and-place operations [8,36,51,55,69]; handover prediction [39,57] | Physical and contextual cues (hand position, gesture, and object location) |
| Shared Physical Tasks and Co-Manipulation | Humans and robots apply force to a shared object | Polishing and inspection [10]; co-manipulation [54,69,71] | Physical cues (force/torque signals, motion dynamics) |
| Teaching and Guidance | A human instructs and demonstrates a task to the robot | Collaborative teaching [38]; learning from demonstration [98] | Physical cues (movement patterns, demonstration trajectories) |
| Cognitive and Simulation Tasks | Highly structured tasks that test cognitive frameworks | Virtual prototyping [35,67] | Physical, physiological, and contextual cues (gaze direction, cognitive state, and task state) |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
