1. Introduction
Humanoid robots are increasingly used outside controlled laboratory settings, in medical, assistive, rehabilitation, and social companionship contexts. To support good interaction in these human-centered applications, visual and auditory senses are important, but on a fundamental level the robot also needs to decipher and respond to human touch [1,2,3]. Feelings such as comfort, stress, affection, and reassurance are often best expressed through touch; therefore, tactile perception is an important factor that socially intelligent human–robot interfaces need to consider.
Although robotic sensing has advanced considerably, most currently available systems are tailored to measure physical parameters such as force, pressure, or contact position [4]. These abilities are adequate for tasks such as manipulation, grasping, and collision avoidance, but they do not extend to interpreting the emotional intent of human touch [5]. As a result, robots are usually mechanically responsive but emotionally blind, which limits their usefulness and acceptance in social and assistive settings [6].
Electronic skin (e-skin) has been identified as a way to improve touch sensing in humanoid robots by providing large-area, flexible, and high-resolution sensing [7]. Newer e-skin designs integrate several sensing modalities, such as pressure, temperature, and electrostatic sensing, so that complex stimuli can be captured. Nonetheless, fusing and interpreting such multimodal sensory information is computationally demanding. Traditional deep learning methods, especially convolutional neural networks (CNNs), usually rely on continuous-valued processing and demand large amounts of computational power and energy, which makes them poorly suited to real-time, on-board execution in embedded robots.
An alternative is neuromorphic computing, a biologically inspired approach that mimics the event-based information processing of the human nervous system. As a fundamental element of neuromorphic systems, Spiking Neural Networks (SNNs) represent information as discrete spike events, which makes processing sparse, low-latency, and energy-efficient [8,9,10]. These properties make SNNs well suited to edge AI, where real-time responsiveness and power limits are critical. Despite the potential of neuromorphic models in vision and sensory processing tasks, relatively little work has addressed emotion-aware tactile perception in humanoid robots.
This paper presents a neuromorphic AI-based multimodal electronic skin system for emotional tactile perception in humanoid robots. The proposed system combines pressure, temperature, and electrostatic sensors to record rich tactile signals that are linked to various emotional interactions [11,12,13]. Bio-inspired encoding converts the multimodal sensory data into spike trains, and a lightweight SNN decodes the spike trains to classify emotions in real time. By using event-driven neuromorphic computation, the proposed approach attains high recognition accuracy with a substantial reduction in computational energy consumption and inference latency [14].
The experimental outcomes indicate that the proposed neuromorphic e-skin can differentiate patterns of emotional touch, i.e., stress, comfort, affection, and neutral touch, robustly and in real time. The end-to-end integration of the technology into a humanoid robot demonstrates the feasibility of on-the-edge multimodal sensing and neuromorphic processing, and the potential of emotion-aware tactile intelligence on such platforms. This work is a step toward emotionally intelligent robotic systems that can interact with humans safely, intuitively, and in a socially adaptive manner.
The main contributions of this work are as follows:
A neuromorphic AI-driven multimodal e-skin integrating pressure, temperature, and electrostatic sensing for emotion-aware touch perception.
An end-to-end tactile processing pipeline combining bio-inspired spike encoding with SNN-based real-time inference.
An energy-efficient, low-latency neuromorphic framework suitable for embedded and edge-based humanoid robotic platforms.
3. Methodology
This section describes the proposed AI-driven neuromorphic multimodal e-skin system that enables humanoid robots to perceive emotion through touch. The design is end-to-end and incorporates multimodal tactile sensing, signal conditioning, neuromorphic processing, emotion classification, and real-time implementation on an edge AI processor.
3.1. Dataset Description
Since no publicly accessible dataset combines multimodal tactile sensing with associated emotional responses during human–robot touch interactions, a mixed data approach was adopted. Public datasets were used to analyze tactile signal characteristics and to standardize the definition of emotion categories, while model development and evaluation were performed on a structured multimodal tactile dataset constructed for this study. To ensure realistic tactile signal behavior, the publicly available tactile dataset recorded on the tactile skin of the iCub humanoid robot [19] was analyzed to understand spatial pressure distributions and temporal contact dynamics. Additionally, a publicly available multimodal emotion recognition dataset on Kaggle [20] was used to guide the definition and consistency of the four emotional interaction classes considered in this work. The proposed model was neither trained nor evaluated directly on datasets [19,20].
Based on these reference datasets, a controlled multimodal tactile dataset was generated to represent four emotional interaction categories: stress, neutral, comfort, and affection. The dataset contains 1200 tactile interaction samples, with 300 samples per class, where each sample corresponds to a 5 s tactile interaction sequence. The signal characteristics and temporal patterns were designed to follow distributions observed in the reference tactile datasets so that the generated data approximate realistic tactile behavior.
Each sample contains synchronized pressure, temperature, and electrostatic signal streams sampled at 200 Hz. The signals were time-aligned, segmented into fixed-duration windows, and preprocessed using noise filtering and normalization before spike encoding and SNN-based processing.
For model evaluation, the dataset was divided using stratified random sampling into 70% training, 15% validation, and 15% testing subsets, ensuring balanced class representation across all splits. No cross-validation was applied, and all reported results correspond to evaluation on the held-out test set. All performance metrics presented in this study were obtained using this structured multimodal tactile dataset.
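The stratified 70/15/15 partition described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `stratified_split`, the random seed, and the class ordering are assumptions made for the example and are not taken from the study.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices into train/val/test while keeping class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test

# 1200 samples, 300 per class, as in the dataset description.
CLASSES = ["stress", "neutral", "comfort", "affection"]
labels = [name for name in CLASSES for _ in range(300)]

train_idx, val_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(val_idx), len(test_idx))
print(test_counts)
```

With 300 samples per class, this yields 210 training, 45 validation, and 45 test samples per class, i.e., 840/180/180 samples overall.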
3.2. Multimodal e-Skin Layer
The multimodal electronic skin layer is the main interface for human touch. It incorporates a pressure sensor matrix to sense the distribution and intensity of contact force, temperature sensors to sense thermal changes related to long-lasting or affective touch, and electrostatic sensors to sense proximity and subtle pre-contact information. Combining these sensing modalities allows the system to record both the physical and the emotional features of human touch, giving a more complete picture of the tactile interaction.
3.3. Signal Conditioning and Preprocessing
Raw sensor output is first sent to the signal conditioning and preprocessing unit to improve the quality and reliability of the data. Noise filtering is applied to suppress sensor artifacts and environmental noise. The filtered signals are then normalized to ensure consistency across the sensing modalities. Feature extraction captures the salient temporal and spatial characteristics of a tactile interaction. Finally, the processed signals are converted into spike trains using bio-inspired spike encoding, which makes them compatible with neuromorphic processing.
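The filtering and normalization steps can be illustrated with a short sketch. The specific filter and normalization scheme are not stated in this section, so the centered moving-average filter, the min-max scaling, the window width, and the synthetic test signal below are all assumptions chosen for illustration; only the 200 Hz sampling rate and 5 s sequence length come from the dataset description.

```python
import math
import random

FS_HZ = 200        # sampling rate from the dataset description
DURATION_S = 5     # length of one tactile interaction sequence

def moving_average(samples, width=5):
    """Centered moving-average low-pass filter; the window shrinks at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(samples)):
        window = samples[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def min_max_normalize(samples):
    """Scale one channel to [0, 1] so that modalities share a common range."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0] * len(samples)  # flat channel: no information to scale
    return [(s - lo) / (hi - lo) for s in samples]

# Hypothetical noisy pressure channel: slow oscillation plus Gaussian noise.
rng = random.Random(0)
n_samples = FS_HZ * DURATION_S
raw_pressure = [
    math.sin(2 * math.pi * 0.5 * i / FS_HZ) + rng.gauss(0.0, 0.05)
    for i in range(n_samples)
]

pressure = min_max_normalize(moving_average(raw_pressure))
```

Each modality would be processed channel by channel in the same way before spike encoding, so that pressure, temperature, and electrostatic values all fall in the same [0, 1] range.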
3.4. Neuromorphic Processing Core
The neuromorphic processing core is implemented as a feedforward Spiking Neural Network (SNN) consisting of three layers: an input layer, one hidden layer, and an output layer. The input layer contains 192 neurons corresponding to the flattened multimodal tactile features (pressure matrix, temperature, and electrostatic signals). The hidden layer consists of 128 spiking neurons, and the output layer comprises 4 neurons representing the emotion classes: stress, neutral, comfort, and affection.
All neurons in the hidden and output layers are modeled using the Leaky Integrate-and-Fire (LIF) neuron model, which accumulates synaptic input over time and generates spikes when a threshold is reached, followed by membrane potential reset. This enables temporal integration of tactile dynamics.
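The integrate, fire, and reset behavior of an LIF neuron can be sketched in discrete time as below. The decay factor `beta`, the unit threshold, and the reset-to-zero rule are assumed values for illustration; the neuron hyperparameters used in the study are not given in this section.

```python
def lif_neuron(input_current, beta=0.9, threshold=1.0):
    """Simulate one discrete-time Leaky Integrate-and-Fire neuron.

    input_current: sequence of input values, one per time step.
    Returns a list of binary spikes of the same length.
    """
    v = 0.0  # membrane potential
    spikes = []
    for current in input_current:
        v = beta * v + current      # leak, then integrate the new input
        if v >= threshold:
            spikes.append(1)        # threshold crossed: emit a spike
            v = 0.0                 # reset membrane potential
        else:
            spikes.append(0)
    return spikes

strong = lif_neuron([0.3] * 12)    # sustained strong input
weak = lif_neuron([0.05] * 100)    # sustained weak input
print(strong)
print(sum(weak))
```

In this sketch a constant input of 0.3 makes the neuron fire every fourth step, while an input of 0.05 never reaches the threshold because the leak balances the input at a potential of 0.5. This is the mechanism by which stronger tactile drive translates into denser spiking.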
Multimodal tactile signals are converted into spike trains using rate-based encoding, where spike frequency is proportional to the normalized signal amplitude within each time window.
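One common way to realize rate-based encoding is Bernoulli sampling, in which each normalized value is used as the per-step spike probability. The sketch below assumes this scheme; the function name `rate_encode`, the number of time steps, and the seed are illustrative choices, and the exact encoder used in the study may differ.

```python
import random

def rate_encode(values, num_steps=100, seed=0):
    """Bernoulli rate coding of normalized features.

    values: feature values, expected in [0, 1] (clipped otherwise).
    Returns num_steps rows, each holding one binary spike per feature.
    """
    rng = random.Random(seed)
    probabilities = [min(1.0, max(0.0, v)) for v in values]
    return [
        [1 if rng.random() < p else 0 for p in probabilities]
        for _ in range(num_steps)
    ]

# Three hypothetical normalized amplitudes: silent, medium, saturated.
spikes = rate_encode([0.0, 0.5, 1.0], num_steps=1000)
rates = [sum(step[i] for step in spikes) / len(spikes) for i in range(3)]
print(rates)
```

The empirical firing rate of each input approaches its normalized amplitude as the number of time steps grows, which is the proportionality described above.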
The SNN is trained using surrogate gradient-based backpropagation through time (BPTT) to handle the non-differentiability of spike activation. A cross-entropy loss function is applied at the output layer, and the network is optimized using the Adam optimizer with a learning rate of 0.001. This configuration enables efficient temporal feature learning while maintaining low-latency inference suitable for edge-oriented deployment.
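The idea behind surrogate gradients can be shown in a few lines: the forward pass keeps the hard threshold, while the backward pass substitutes a smooth derivative. The sketch below uses the derivative of a fast sigmoid, which is one widely used surrogate; the surrogate function and the `slope` value are assumptions, since the specific choice is not stated in this section.

```python
def spike_forward(v, threshold=1.0):
    """Forward pass: Heaviside step on the membrane potential."""
    return 1.0 if v >= threshold else 0.0

def spike_surrogate_grad(v, threshold=1.0, slope=25.0):
    """Backward pass: fast-sigmoid derivative used in place of the
    Heaviside derivative, which is zero almost everywhere."""
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2

for v in (0.5, 0.95, 1.0, 1.05, 1.5):
    print(v, spike_forward(v), round(spike_surrogate_grad(v), 4))
```

The surrogate peaks at the threshold and decays on both sides, so neurons whose membrane potential is close to firing receive the largest weight updates during BPTT, while the spikes themselves remain binary.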
3.5. Emotion Classification Module
The output spike activity of the SNN is processed to identify the emotional state associated with the tactile stimulus. The emotion classification module assigns each interaction to one of four classes (stress, neutral, comfort, or affection) according to its spatiotemporal spike patterns. This allows stable discrimination between subtle tactile stimuli that are hard to separate with frame-based models.
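A simple readout for this step is spike-count decoding, where the output neuron that fires most often over the time window determines the class. The sketch below assumes this readout and an arbitrary ordering of the four output neurons; neither detail is specified in this section.

```python
# Assumed mapping of the four output neurons to emotion classes.
CLASSES = ("stress", "neutral", "comfort", "affection")

def decode_emotion(output_spikes):
    """Pick the class whose output neuron fired most often.

    output_spikes: list of time steps, each a list of 4 binary spikes.
    Returns (predicted class name, per-class spike counts).
    """
    counts = [sum(step[k] for step in output_spikes) for k in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=counts.__getitem__)
    return CLASSES[best], counts

# Hypothetical output activity over five time steps.
example = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]
label, counts = decode_emotion(example)
print(label, counts)
```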
3.6. Robot Behavior Generation Module
The classified emotional state is sent to the robot behavior generation module, which adjusts the robot's responses in terms of grip force, response timing, interaction intensity, and social gestures. This makes the robot's behavior emotionally appropriate and socially adaptive, which enhances safety, user comfort, and the naturalness of the interaction.
3.7. Edge AI Deployment
The entire framework is deployed on an edge AI processor to allow real-time, on-device inference. By removing the cloud dependency, the system achieves lower latency, enhanced privacy, and continuous operation. The neuromorphic SNN core enables energy-efficient execution, which makes the proposed approach suitable for long-term deployment in humanoid robot systems, as shown in Figure 1.
4. Results and Discussion
This section presents the experimental findings for the proposed neuromorphic AI-based multimodal e-skin framework. The assessment covers emotion classification performance, spike-based behavioral analysis, latency comparison, and the distribution of energy across system components, and evaluates the effectiveness and efficiency of the proposed method.
4.1. Spike-Based Tactile Response Analysis
The spike train patterns produced by the Spiking Neural Network (SNN) show that touch interactions associated with different emotions have different temporal characteristics. In the spike train comparison of stress and comfort interactions, stress-related touch produces denser, higher-frequency spike bursts with irregular temporal spacing, which are indicative of sudden and forceful touch stimulation. Conversely, comfort interactions produce sparser spike patterns with a more regular temporal structure, characteristic of gentle and persistent touch. These results indicate that neuromorphic encoding and SNN processing preserve the temporal properties of the sensory signal and allow effective differentiation of emotionally distinct touch patterns, as shown in Table 2 and Figure 2.
4.2. Emotion Classification Performance
The proposed framework was evaluated on emotion classification across four emotional categories: stress, neutral, comfort, and affection. Accuracy is high in all classes: 92% for stress, 90% for neutral, 88% for comfort, and 85% for affection. The slightly lower accuracy for affection may be explained by the overlap between the tactile features of gentle emotional contacts and those of neighboring categories. Overall, the findings indicate that emotion recognition with multimodal sensing and SNN-based processing is robust and consistent, as depicted in Figure 3.
Figure 4, which shows the classification accuracy as a three-dimensional visualization, also reflects the balanced performance across emotion classes. The similar heights of the accuracy bars suggest that the model is stable, is not biased toward any particular class, and may extend to other affective touch situations.
4.3. Comprehensive Evaluation and Robustness Analysis
To provide a more complete evaluation beyond overall accuracy, a confusion matrix was computed on the held-out test set to analyze inter-class misclassification patterns, as shown in Figure 5. The results show minor confusion between comfort and affection due to similarity in temporal tactile dynamics, while the stress and neutral classes exhibit strong separability. In addition to accuracy, class-wise precision, recall, and F1-scores were calculated. The model achieved balanced performance across all four classes, with F1-scores ranging between 0.84 and 0.92, indicating stable discrimination capability without significant class bias.
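For reference, the class-wise metrics follow directly from the confusion matrix, as the sketch below shows. The eight label pairs are a toy example constructed to include one comfort sample predicted as affection; they are not the study's test results, and the helper names are illustrative.

```python
CLASSES = ["stress", "neutral", "comfort", "affection"]

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix

def per_class_scores(matrix):
    """Return one (precision, recall, F1) tuple per class."""
    scores = []
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores.append((precision, recall, f1))
    return scores

# Toy labels only: one comfort sample is misclassified as affection.
y_true = ["stress", "stress", "neutral", "neutral",
          "comfort", "comfort", "affection", "affection"]
y_pred = ["stress", "stress", "neutral", "neutral",
          "comfort", "affection", "affection", "affection"]

matrix = confusion_matrix(y_true, y_pred, CLASSES)
scores = per_class_scores(matrix)
for name, (p, r, f1) in zip(CLASSES, scores):
    print(f"{name:10s} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In the toy example the single comfort-to-affection error lowers the recall of comfort and the precision of affection, which is the pattern expected when two classes share similar temporal dynamics.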
Robustness and generalization were assessed using a stratified 70/15/15 train–validation–test split, ensuring balanced class representation. The evaluation was conducted exclusively on the unseen test set, and controlled noise perturbation during preprocessing further improved stability against signal variability. These results confirm that the proposed SNN framework generalizes consistently within the defined tactile interaction domain.
4.4. Overall System Performance Comparison
To assess real-time performance, the inference latency of the proposed SNN-based framework was compared directly with that of a conventional CNN-based model applied to the same tactile input data. The neuromorphic SNN had a mean inference latency of about 8 ms, compared with about 38 ms for the CNN-based network. This is a significant reduction in processing delay and shows the benefit of event-driven neuromorphic computation; the proposed system is therefore well suited to time-sensitive human–robot interaction, where an immediate response is required, as shown in Figure 6.
4.5. Energy Distribution and Computational Efficiency
Analysis of energy consumption shows how the energy load is distributed across the system components. The sensor layer accounts for about 35% of the overall energy consumption because of sustained data collection across the different modalities. The edge controller accounts for about 25%, while communication and SNN processing account for about 20% each. Notably, the neuromorphic SNN processing core delivers high classification accuracy at low energy usage, owing to the sparse nature of event-driven computation compared with frame-based neural models, as demonstrated in Figure 7.
Taken together, the experimental findings indicate that the neuromorphic multimodal e-skin framework achieves high emotion recognition accuracy with low inference latency and low energy consumption. The distinct spike patterns observed for the different emotional conditions support the effectiveness of spike-based tactile encoding and neuromorphic processing. Compared with traditional deep learning methods, the SNN-based system offers clear advantages in real-time responsiveness and in suitability for deployment in embedded environments.
Energy consumption was estimated based on the average CPU power draw during inference using system-level monitoring tools. Absolute power consumption during inference averaged approximately 18.6 W for the CNN model and 12.3 W for the SNN model, resulting in lower estimated energy per inference for the event-driven SNN architecture. These values represent software-level benchmarking under controlled experimental conditions rather than specialized neuromorphic hardware measurements.
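A rough energy-per-inference figure can be derived by multiplying the average power draw by the inference latency. The sketch below combines the power values above with the approximate latencies reported in Section 4.4; it assumes that power is constant over the inference interval and that both quantities were measured on the same platform, so the results are order-of-magnitude estimates rather than reported measurements.

```python
def energy_per_inference_joules(power_watts, latency_ms):
    """Energy = average power x duration, assuming constant power draw."""
    return power_watts * latency_ms / 1000.0

# Power values from this section; latencies are the approximate
# figures from Section 4.4 ("about 38 ms" and "about 8 ms").
cnn_energy = energy_per_inference_joules(power_watts=18.6, latency_ms=38.0)
snn_energy = energy_per_inference_joules(power_watts=12.3, latency_ms=8.0)

energy_ratio = cnn_energy / snn_energy
latency_speedup = 38.0 / 8.0

print(f"CNN: {cnn_energy:.4f} J per inference")
print(f"SNN: {snn_energy:.4f} J per inference")
print(f"Energy ratio (CNN/SNN): {energy_ratio:.2f}")
print(f"Latency speedup: {latency_speedup:.2f}x")
```

Under these assumptions the estimate is about 0.71 J per inference for the CNN and about 0.10 J for the SNN, a ratio of roughly 7. Most of this difference comes from the shorter inference time (4.75 times faster) rather than from the lower power draw alone.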
4.6. Limitations and Future Work
Although the experimental results demonstrate the efficacy of the proposed system, its transition from a controlled environment to real-world deployment is currently bounded by specific data-related limitations. First, the 1200-sample multimodal tactile dataset used in this study was synthetically generated based on reference signal characteristics from the iCub tactile skin. While this approach ensures high data quality and class balance, it does not account for the stochasticity and sensor noise inherent in real human–robot physical interactions. Consequently, the model's ability to generalize to live, unconstrained scenarios remains to be established. Second, the evaluation used a fixed 70/15/15 stratified train/validation/test split. Although stratification was used to maintain class representation, the absence of k-fold cross-validation, primarily due to the high computational cost of SNN training, means that the results may be influenced by the specific data partition. Third, while the SNN demonstrated a lower power profile of 12.3 W compared to the 18.6 W CNN baseline, these results are based on system-level software monitoring rather than measurements on physical neuromorphic hardware. As such, the reported energy efficiency should be interpreted as a comparative software benchmark rather than a demonstration of neuromorphic hardware acceleration. Finally, the current benchmarking is limited to a comparison with a standard CNN baseline.
Future work will prioritize the collection of live-interaction tactile datasets involving human participants to bridge the sim-to-real gap. Additionally, we intend to implement multi-fold cross-validation and expand our benchmarking to include a wider range of state-of-the-art deep learning architectures beyond the current CNN baseline to further validate the robustness and efficiency of the proposed system.
5. Conclusions
This paper presents a multimodal neuromorphic e-skin that provides emotion-aware touch perception in humanoid robots. The proposed system enables real-time interpretation of affective touch interactions by integrating pressure, temperature, and electrostatic sensing with bio-inspired spike encoding and SNN-based processing. The experimental evaluation using a custom tactile dataset demonstrated reliable classification of stress, neutral, comfort, and affection interactions, achieving high accuracy and low inference latency. Compared with the conventional CNN-based baseline, the proposed neuromorphic model reduces computational overhead and estimated energy consumption, making it suitable for embedded and edge AI deployments. The effectiveness of event-driven neuromorphic processing for tactile emotion recognition is further supported by the distinct spike pattern separation observed across emotional states. Thus, the proposed framework offers an efficient and scalable solution for affect-aware tactile perception that can be used in the design of emotionally sensitive humanoid robots, enabling safer and more natural human–robot interactions for various social, assistive, and healthcare applications.