Article

Perceptual and Semantic Processing in Cognitive Robots

by Syed Tanweer Shah Bukhari and Wajahat Mahmood Qazi *
Intelligent Machines & Robotics, Department of Computer Science, COMSATS University Islamabad, Lahore 54000, Pakistan
* Author to whom correspondence should be addressed.
Electronics 2021, 10(18), 2216; https://doi.org/10.3390/electronics10182216
Submission received: 3 August 2021 / Revised: 6 September 2021 / Accepted: 7 September 2021 / Published: 10 September 2021
(This article belongs to the Special Issue Cognitive Robotics)

Abstract

The challenge in human–robot interaction is to build an agent that can act upon human implicit statements, where the agent is instructed to execute tasks without explicit utterance. Understanding what to do under such scenarios requires the agent to have the capability to process object grounding and affordance learning from acquired knowledge. Affordance has been the driving force for agents to construct relationships between objects, their effects, and actions, whereas grounding is effective in the understanding of spatial maps of objects present in the environment. The main contribution of this paper is to propose a methodology for the extension of object affordance and grounding, the Bloom-based cognitive cycle, and the formulation of perceptual semantics for context-based human–robot interaction. In this study, we implemented YOLOv3 to formulate visual perception and an LSTM to identify the level of the cognitive cycle in which cognitive processes are synchronized. In addition, we used semantic networks and conceptual graphs as a method to represent knowledge in the various dimensions related to the cognitive cycle. The visual perception showed an average precision of 0.78, an average recall of 0.87, and an average F1 score of 0.80, indicating an improvement in the generation of semantic networks and conceptual graphs. The similarity index used for the lingual and visual association showed promising results and improved the overall experience of human–robot interaction.

1. Introduction

This paper proposes an affordance- and grounding-based approach for the formation of perceptual semantics in robots for human–robot interaction (HRI). Perceptual semantics play a vital role in ensuring a robot understands its environment and the implications of its actions [1,2]. The challenge is to build robots with the ability to process their acquired knowledge into perception and affordance [1,2,3,4,5,6]. In this context, the significance of affordance can be rationalized by the following scenario taken from human–human interaction (HHI): if we state to another human “X” that “I am feeling thirsty” rather than “I want to drink a beverage using a red bottle”, the human “X” will be able to understand the relationship between “drink” and “thirst”. The link between these two terms is that thirst causes the desire to drink. This ability to establish a relationship between “drink” and “thirst” based on semantic analysis is called affordance. Consequently, the human “X” will offer something to drink. Let us assume we have a robot with the ability to process affordance and a similar situation arises in human–robot interaction; it is then expected that the robot may perform a similar action to the human “X” in HHI. For robots, this type of interaction is currently a challenge, although there are various contributions in this direction [5,6] with a focus on visual affordance. In the scenario presented above, visual affordance alone is not sufficient for understanding the relationship between “thirst” and “drink”. The robot also needs to ground the objects placed on the table. Object grounding is an approach that allows the robot to profile objects in the environment [6]; for example, “How many objects belong to a drinkable category?” can be answered with a response that includes the position of the object, such as “There is one drinkable object located at the left side of the table”. This challenge becomes even more complex when it is implemented for cognitive robots, because their design rationale considers factors such as internal regularization, control, and synchronization of autonomous processes through a cognitive cycle (understanding, attending, and acting) [7,8,9,10,11]. A reference cognitive cycle may consist of a variant of the phases of perception, understanding, and action [7,11]. In this study, we adopted an extended version [12] of Bloom’s taxonomy as a cognitive cycle. The reason for using the Bloom-based cognitive cycle is that it provides a map between the level of cognitive processes and the type of knowledge domain. The detailed Bloom taxonomy map is provided in a previous paper [12]. In addition, Bloom’s taxonomy uses action verbs to steer the cognitive-knowledge-domain map [12]. The control structure used in this study is shown in Figure 1, which is an extract from our previously reported work [13]. The detailed utilization of the control structure in Figure 1 is discussed in Section 4.
In this study, we propose perceptual semantics based on extended object grounding and machine perception (visual and lingual). In this regard, we performed a table-top experiment using a Universal Robot (UR5). A dataset for affordance learning comprising 7622 images was prepared for the training of a YOLOv3-based perception module (Section 4.1). A Bloom-based cognitive cycle identifier was also implemented for the identification of cognitive levels (see Section 4.2). The semantic memory was constructed from ConceptNet and WordNet and has 1.47 million nodes and 3.13 million relationships (see Section 4.3). Our analysis of the experimental results (see Section 5) suggests that perceptual learning alone is not sufficient to assess the environment; the inclusion of seed knowledge is important to understand extended affordance features (i.e., the relationship between “drink” and “thirst”). Moreover, the inclusion of a cognitive cycle identifier helps the robot to choose between “what to reply”, “what not to reply”, “when to reply”, and “what the procedure would be”. The work reported in this paper is an effort to contribute to the advancement of building robots with a better understanding of the environment.

2. Related Work

There is a growing need for robots and other intelligent agents to have safe interactions with partners, mainly human beings. In this regard, the need for perceptual semantics formulated using affordance learning and object grounding is vital for human–robot interaction (HRI) [14,15,16].
Affordance is considered to be the catalyst in establishing a relationship between accessible objects, their effects, and the actions carried out by robots [17,18]. Affordance capability can be induced in an agent through interaction, demonstration, annotation, heuristics, and trials [19]. Most of the work undertaken in object affordance learning is based on visual perception [3,20,21,22,23,24,25,26], whereas lingual cues can also provide additional advantages that can significantly improve affordance [19]. Therefore, in this study, we focused on both visual and lingual cues. In addition to visual and lingual cues, Breux et al. [16] considered ontologies based on WordNet to extract action cues and ground the relationships between objects and features (properties). This improved the results and HRI but covered only seven types of relationships (isA, hasA, prop, usedFor, on, linked-to, and homonym), which limits the agent’s recognition and understanding capabilities to the stated semantic associations. Implementations of semantic memory have also been reported in the literature [3,27,28]. Antunes et al. [27] reported the use of semantic memory for HRI and discussed the scenario of “make a sandwich” with explicit information about objects and their actions. This system [27] does not have the capability to cater to situations such as “I am feeling hungry”, in which the robot must understand that there is a need to make a sandwich. This suggests that such a semantic memory is used more like a knowledge repository.
Object grounding based on either lingual or visual perception is used to profile the object in the environment [6,29]. The grounding is mapped in terms of exact and relative location(s) of the object(s), e.g., “left-of” [16]. Oliveira et al. [3,4] discussed the importance of semantic memory for HRI and the interaction-based acquisition of knowledge. The mentioned system uses object grounding without incorporating object affordance; therefore, it is unable to process feature-based lingual cues. The object grounding technique used in this paper is similar to that introduced in the Robustness by Autonomous Competence Enhancement (RACE) project [1]. The table-top setup is represented as a grid having nine positions, i.e., center, right, left, top, bottom, bottom-left, bottom-right, top-right, and top-left (see Section 4.4).
An agent by design has a control structure that can be as simple as a sense–act loop [30] or as complex as a cognitive architecture [8,13,31]. These control structures may be a collection of memories, learning mechanisms, and other mental faculties, depending on the architectural complexity [8,13]. Systems with semantic memories are those that fall within the domain of cognitive architecture [8,13,31]. The processes in these control structures are regularized, controlled, and synchronized through a cognitive cycle. A reference cognitive cycle consists of perception, understanding, and action [7,8,9,10]. A limitation of these cognitive cycles is that they do not provide explicit guidelines to map the degree of processing onto various knowledge levels and dimensions [7,8,9,10]. This makes it challenging for cognitive agents to select appropriate cognitive processes and knowledge dimensions from lingual cues. Bloom’s taxonomy provides a method to map lingual cues to cognitive levels and knowledge dimensions [12], but it has not yet been used as a cognitive cycle.
The analysis of the current work suggests that a significant improvement over the state-of-the-art can be made by increasing the number of semantic relationships, combining supervised and heuristic approaches for acquiring affordance, and formulating object grounding. We propose a semantic memory consisting of 53 types of relationships, with 1.47 million nodes and 3.13 million relationships, to enhance the benefits of affordance learning (see Section 4.3).
The control structures and design of existing systems build a strong case for the inclusion of an architecture having semantic memory, perception, and other required modalities [3,16,21,22,27,28,32,33,34,35,36,37]. We used the minimalistic design of the previously reported Nature-inspired Humanoid Cognitive Architecture for Self-awareness and Consciousness (NiHA) (see Figure 1) [13].
We incorporated Bloom’s taxonomy in a standard cognitive cycle to identify the cognitive process and knowledge dimensionality based on the identification of action verbs from lingual cues (see Section 4.2). A detailed comparison of the state-of-the-art with the proposed method is summarized in Table 1.

3. Problem Formulation

This study proposes a method for human–robot interaction (HRI). For this purpose, constructing the semantic memory S_m of an agent from atoms of knowledge (Atom) is an essential first step. The atom of knowledge is generated from visual and auditory sensory stimuli. The human-centered environment consists of household items (objects) present on the table.
Let the items that exist in the workspace be represented as I = {i_1, i_2, i_3, ..., i_n}, and the properties of an item be represented as I_p = {name, affordance, location, direction}. The affordance of an item is defined as I_p_affordance = {callable, cleanable, drinkable, edible, playable, readable, writable}, and the location I_p_location: Item → (x, y) gives the coordinates of the item with respect to the visual frame. The direction of the items is given with respect to the center of the visual frame as I_p_direction: Item × Item → [center, right, left, top, bottom, bottom-left, bottom-right, top-right, top-left].
Let the auditory stream be based on m words W = {N, Adj, V}. A word in W can be recognized as a noun node N = {NN, NNS, NNP, NNPS}, an adjective node Adj = {JJ, JJR, JJS}, or a verb node V = {VB, VBD, VBP, VBN, VBG, VBZ}. The nouns, adjectives, and verbs are checked against an a priori knowledge base of concepts and features. The concepts are defined as C = {c_1, c_2, c_3, ..., c_k}, and the features as F = {f_1, f_2, f_3, ..., f_k}. The atom of knowledge is represented as Atom = {c, c, c, f}. The semantic memory of the system is the collection of k atoms of knowledge, S_m = {Atom_1, Atom_2, Atom_3, ..., Atom_k}. Let the cognitive cycle based on Bloom’s revised taxonomy be selected as a cognitive level Cog_level = [Perception, Understanding/Comprehension, Execution/Control, Post-Execution Analysis, Evaluation, Synthesis]. The knowledge dimension Knowledge_Dimension = [Factual, Conceptual, Procedural, Meta-Cognition] can be selected based on the action verbs proposed in the revised Bloom’s taxonomy. The two-dimensional array, i.e., matrix, can be represented as BCog_matrix = Cog_level × Knowledge_Dimension. The selected cognitive cycle is an instance of this matrix, BCog_cycle = (Cog_level_i, Knowledge_Dimension_j).
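For readers who prefer code to set notation, the sketch below restates the same formulation as plain Python data structures. It is illustrative only; the class and field names (Item, Atom, BCOG_MATRIX, etc.) are our own and do not come from the authors’ implementation.

```python
# Minimal sketch (not from the paper's codebase): the problem-formulation sets
# expressed as plain Python data structures.
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

AFFORDANCES = ["callable", "cleanable", "drinkable", "edible", "playable", "readable", "writable"]
DIRECTIONS = ["center", "right", "left", "top", "bottom",
              "bottom-left", "bottom-right", "top-right", "top-left"]

@dataclass
class Item:
    name: str                      # e.g., "bottle"
    affordance: str                # one of AFFORDANCES
    location: Tuple[float, float]  # (x, y) in the visual frame
    direction: str                 # one of DIRECTIONS, relative to the frame center

@dataclass
class Atom:
    concepts: Tuple[str, ...]      # concept nodes, e.g., ("thirst", "drink")
    feature: str                   # associated feature, e.g., "drinkable"

COG_LEVELS = ["Perception", "Understanding/Comprehension", "Execution/Control",
              "Post-Execution Analysis", "Evaluation", "Synthesis"]
KNOWLEDGE_DIMENSIONS = ["Factual", "Conceptual", "Procedural", "Meta-Cognition"]

# BCog_matrix = Cog_level x Knowledge_Dimension; a selected cognitive cycle is one cell of it.
BCOG_MATRIX: List[Tuple[str, str]] = list(product(COG_LEVELS, KNOWLEDGE_DIMENSIONS))
```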

4. Methodology

This section explains the methodology for the development of the artifacts highlighted in the problem formulation. These artifacts include perception (i.e., visual and lingual), working memory (i.e., object grounding and semantic similarity analysis), and the construction of semantic memory (seed knowledge and explicit knowledge). The lingual perception is further divided into knowledge representation, a cognitive cycle identifier, and a natural language processing module. The core architecture is depicted in Figure 2.

4.1. Visual Perception

The visual perception module operates on multiple levels: the first level performs affordance learning, and the next performs item name identification.
Affordance Learning: The affordance module is trained on a dataset [40] consisting of objects commonly used in the household. The 30 items chosen to date can be categorized as callable, cleanable, drinkable, edible, playable, readable, writable, and wearable [6]. A total of 8538 images were taken with a Samsung Galaxy 7 camera. The system (see Figure 2) was trained to recognize seven classes, i.e., callable, cleanable, drinkable, edible, playable, readable, and writable. The total number of images used for training was 7622, excluding the wearable category. The system was trained using YOLOv3 [41] to identify the items placed in the table-top setup, with 18,000 iterations and an average loss of 0.1176 (see Appendix A, Figure A1). The architecture of YOLOv3 with its configuration is shown in Figure 3. The detailed configuration of the training pipeline is presented in Table A1 in Appendix A.
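A minimal sketch of how such a trained Darknet-format YOLOv3 model could be queried with OpenCV’s DNN module is given below. The configuration/weight file names and the 0.5 confidence threshold are assumptions made for illustration; only the seven affordance classes and the 608 × 608 input size follow from the text and Table A1.

```python
# Hedged sketch: loading a Darknet-format YOLOv3 affordance model with OpenCV's DNN
# module. File names and the 0.5 confidence threshold are assumptions, not values
# reported in the paper.
import cv2
import numpy as np

AFFORDANCE_CLASSES = ["callable", "cleanable", "drinkable", "edible",
                      "playable", "readable", "writable"]

net = cv2.dnn.readNetFromDarknet("yolov3-affordance.cfg", "yolov3-affordance.weights")
output_layers = net.getUnconnectedOutLayersNames()

def detect_affordances(frame, conf_threshold=0.5):
    """Return (class_name, confidence, box) tuples for one table-top frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (608, 608), swapRB=True, crop=False)
    net.setInput(blob)
    detections = []
    for output in net.forward(output_layers):
        for row in output:                      # [cx, cy, bw, bh, objectness, class scores...]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence < conf_threshold:
                continue
            cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
            box = (int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))
            detections.append((AFFORDANCE_CLASSES[class_id], confidence, box))
    return detections
```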
Item Name Identification: The items classified based on affordance learning are further assigned names, e.g., a drinkable item as a bottle or a cup. For this purpose, a pre-trained YOLOv3 classifier [42] was used to identify the names of commonly used household items. This module uses the position determined by the YOLOv3 classifier to localize the detected object in the table-top setup. The system returns the item set as I = {i_1, i_2, i_3, ..., i_n}, and the properties of the items are classified as I_p = {name, affordance, location, direction}.
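The merging of the two detection passes could, for instance, look like the hedged sketch below, where an affordance box adopts the name of the COCO detection whose centre it contains. Both detectors are passed in as callables returning (label, confidence, box) tuples; the actual association rule used by the authors is not specified, so this is only one plausible reading.

```python
# Illustrative sketch of merging the affordance pass with the COCO name pass.
# `detect_affordances` and `detect_coco_objects` are assumed callables returning
# (label, confidence, (x, y, w, h)) tuples for a frame.
def center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def contains(box, point):
    x, y, w, h = box
    px, py = point
    return x <= px <= x + w and y <= py <= y + h

def build_items(frame, detect_affordances, detect_coco_objects):
    items = []
    coco = detect_coco_objects(frame)
    for affordance, _, a_box in detect_affordances(frame):
        # Take the name of the first COCO detection whose centre lies in the affordance box.
        name = next((n for n, _, c_box in coco if contains(a_box, center(c_box))), "unknown")
        items.append({"name": name, "affordance": affordance,
                      "location": center(a_box), "direction": None})  # direction is set during grounding
    return items
```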

4.2. Lingual Perception

We developed a rule-based chatbot for the acquisition of perceptual semantics from the auditory stream. The co-worker (i.e., the human) communicates with the robot through a speech engine based on the Google Speech-to-Text API. The stream is then sent to the Natural Language Processing module for tokenization, part-of-speech tagging, named entity tagging, and basic dependency tagging. Further processing is done in the Knowledge Representation module for the formation of the conceptual graph and semantic network, whereas the Cognitive Cycle Identifier module is used for the classification of cognitive processes in the cycle.
Natural Language Processing: The Natural Language Processing module consists of four submodules: Tokenization, Part of Speech (POS) tagger, Named Entity (NE) tagger, and Basic Dependency (BD) (see Figure 4). The input stream (sentence) is tokenized in the Tokenization module and further tagged using the Part of Speech (POS) tagger. The stream is then classified into nouns N = {NN, NNS, NNP, NNPS}, adjectives Adj = {JJ, JJR, JJS}, and verbs V = {VB, VBD, VBP, VBN, VBG, VBZ} using NLTK (Natural Language Toolkit). Details about the POS tags can be found in [43]. Furthermore, CoreNLP is used to identify BD and NE tags for the formulation of atom of knowledge elements (concepts, relations, and features).
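A minimal sketch of this tokenization and POS-tagging step with NLTK is shown below (NE and BD tagging via CoreNLP is omitted); the function name and the example sentence are our own illustrations.

```python
# Minimal sketch of the tokenization/POS step with NLTK (requires the 'punkt' and
# 'averaged_perceptron_tagger' resources). CoreNLP-based NE/BD tagging is omitted.
import nltk

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
VERB_TAGS = {"VB", "VBD", "VBP", "VBN", "VBG", "VBZ"}

def pos_sets(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    nouns = [w for w, t in tagged if t in NOUN_TAGS]
    adjectives = [w for w, t in tagged if t in ADJ_TAGS]
    verbs = [w for w, t in tagged if t in VERB_TAGS]
    return nouns, adjectives, verbs

# pos_sets("I am feeling thirsty") typically yields
# nouns [], adjectives ['thirsty'], verbs ['am', 'feeling'].
```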
Knowledge Representation: The Knowledge Representation module consists of Triplet Extraction, Semantic Association, and Atom of Knowledge (see Figure 5). Knowledge is constructed after the NLP module has processed the stream. The Knowledge Representation module extracts the triplets (i.e., predicate, object, and subject) from the processed sentences. The predicate is extracted from the previously processed verb set V, whereas the subject is extracted from the noun set N. The last element of the triplet is assigned based on the last member of the adjective set Adj or the noun set. The association between concepts is created using an a priori knowledge base by searching the concept nodes for similarities based on relationship types such as “InstanceOf”, “IsA”, “PartOf”, “DerivedFrom”, “Synonym”, “CreatedBy”, “HasProperty”, “UsedFor”, “HasA”, “FormOf”, and “RelatedTo”. Based on these associations, the atom of knowledge is constructed.
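The sketch below illustrates this triplet convention under the simplifications described above; the selection rules (first noun as subject, last verb as predicate, last adjective or noun as the remaining slot) are our reading of the text, not the authors’ exact code.

```python
# Illustrative triplet extraction from the POS sets produced by the NLP module.
# The relation list mirrors the semantic-association types named in the text.
SEMANTIC_RELATIONS = ["InstanceOf", "IsA", "PartOf", "DerivedFrom", "Synonym",
                      "CreatedBy", "HasProperty", "UsedFor", "HasA", "FormOf", "RelatedTo"]

def extract_triplet(nouns, adjectives, verbs):
    subject = nouns[0] if nouns else None
    predicate = verbs[-1] if verbs else None
    obj = adjectives[-1] if adjectives else (nouns[-1] if len(nouns) > 1 else None)
    return subject, predicate, obj

# "I am feeling thirsty" -> (None, 'feeling', 'thirsty'); the a priori knowledge base
# is then searched for concepts associated with 'thirsty' (e.g., via RelatedTo/UsedFor).
```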
Cognitive Cycle Identifier: The sentence-based sensory stimuli are evaluated in the Bloom’s taxonomy-based cognitive module. The module is built on a system trained with Long Short-Term Memory (LSTM) on an improvised dataset based on Yahya’s model, with 300 epochs and a cost of 1.903 × 10−6 [44]. The cognitive level is determined as Cog_level = [Perception, Understanding/Comprehension, Execution/Control, Post-Execution Analysis, Evaluation, Synthesis]. The cognitive levels are the dataset classes. The stream is then tokenized and parsed using the Natural Language Toolkit (NLTK). The knowledge domain is further classified based on the action verbs of Bloom’s revised taxonomy [12] to determine the instance of BCog_matrix. The instance then initiates the designated cognitive process applicable to the potential knowledge dimension and action.
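A hedged Keras sketch of such an LSTM classifier is given below; the vocabulary size, embedding dimension, and layer width are assumptions, and only the use of an LSTM and the six cognitive-level classes follow from the text.

```python
# Hedged sketch of a Bloom-level LSTM classifier. Hyperparameters are assumptions;
# only the six output classes and the use of an LSTM follow the paper.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

COG_LEVELS = ["Perception", "Understanding/Comprehension", "Execution/Control",
              "Post-Execution Analysis", "Evaluation", "Synthesis"]

model = Sequential([
    Embedding(input_dim=5000, output_dim=64),   # token ids -> dense vectors
    LSTM(64),
    Dense(len(COG_LEVELS), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

def predict_level(token_ids):
    """token_ids: list of integer word indices for one tokenized query."""
    probs = model.predict(np.array([token_ids]), verbose=0)[0]
    return COG_LEVELS[int(np.argmax(probs))]
```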

4.3. Semantic Memory

Semantic memory is constituted in both an a priori and an a posteriori manner. The seed knowledge is developed from ConceptNet and WordNet, whereas the posterior knowledge is constructed as the agent interacts with the environment and is stored in the Explicit Knowledge repository.
Seed Knowledge: Seed knowledge is constituted from atoms of knowledge drawn from WordNet and ConceptNet. The knowledge base has 1.47 million nodes and 3.13 million relationships (53 relationship types, e.g., AlsoSee, Antonym, AtLocation, Attribute, CapableOf, Cause, Causes, etc.). The nodes consist of 117,659 Synset (WordNet) nodes, 1.33 million Concept (ConceptNet) nodes, and 157,300 Lemma nodes. The Lemma nodes are extracted from Concept nodes based on “root words”. These nodes are partially or fully matchable with Synset nodes. The semantic memory-based seed (tacit) knowledge is represented as S_m = {Atom_1, Atom_2, Atom_3, ..., Atom_k} and the atoms as Atom = {concept, concept, concept, feature}. The transformation of the ConceptNet and WordNet ontologies to the proposed seed knowledge, i.e., semantic memory, can be seen in Table 2 and Table 3.
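As an illustration of how WordNet relations can be mapped onto the Table 3 edge types, the sketch below builds a small graph with NLTK and networkx; the graph library and helper names are our choices, not the storage backend actually used for the 1.47-million-node semantic memory.

```python
# Simplified sketch of seed-knowledge construction from WordNet via NLTK, following
# the Table 3 mapping (hypernym/hyponym -> IsA, holonym -> PartOf). The real semantic
# memory also ingests ConceptNet; here a networkx graph stands in for that store.
import networkx as nx
from nltk.corpus import wordnet as wn   # requires the 'wordnet' corpus

def seed_from_wordnet(word, graph=None):
    graph = graph if graph is not None else nx.MultiDiGraph()
    for synset in wn.synsets(word):
        graph.add_node(synset.name(), kind="Synset")
        for hyper in synset.hypernyms():
            graph.add_edge(synset.name(), hyper.name(), relation="IsA")
        for holo in synset.member_holonyms() + synset.part_holonyms():
            graph.add_edge(synset.name(), holo.name(), relation="PartOf")
        for lemma in synset.lemmas():
            graph.add_node(lemma.name(), kind="Lemma")
            graph.add_edge(lemma.name(), synset.name(), relation="FormOf")
    return graph

g = seed_from_wordnet("bottle")
print(g.number_of_nodes(), g.number_of_edges())
```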
Explicit Knowledge: Explicit knowledge is constructed based on the semantic network [45] and conceptual graph [46] drawn from working memory. These graphs are constructed in the Knowledge Representation module.

4.4. Working Memory

Working memory acts as an executive control in the proposed system, whose primary responsibility is to formulate object grounding and semantic similarity analysis.
Object Grounding: The localization is further used to determine the accessibility coordinates of the robotic arm. We started with the simplest approach by dividing the table-top setup into a 3 × 3 grid as shown in Figure 6.
The grid is divided into several directions as defined by I_direction: Item × Item → [center, right, left, top, bottom, bottom-left, bottom-right, top-right, top-left]. This approach works for determining the exact position of an item; however, we also want to know the relative positions of the items. The grid is therefore further described with respect to a reference point, i.e., the center, as shown in Figure 7.
This reference point consideration is further extended to position the item relative to others as shown in Figure 8.
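A minimal sketch of this grounding step is given below: an item’s bounding-box centre is mapped to one of the nine grid cells, and the relative direction between two items is derived from their centre offsets. Splitting the frame into exact thirds is an assumption made for illustration.

```python
# Minimal sketch of the 3 x 3 grounding grid and relative directions.
def grid_cell(cx, cy, frame_w, frame_h):
    col = min(int(3 * cx / frame_w), 2)   # 0 = left, 1 = centre column, 2 = right
    row = min(int(3 * cy / frame_h), 2)   # 0 = top,  1 = centre row,    2 = bottom
    vertical = ["top", "", "bottom"][row]
    horizontal = ["left", "", "right"][col]
    return "center" if not (vertical or horizontal) else f"{vertical}-{horizontal}".strip("-")

def relative_direction(item_a, item_b):
    """Direction of item_a with respect to item_b, from their (x, y) centres (image coordinates)."""
    dx = item_a["location"][0] - item_b["location"][0]
    dy = item_a["location"][1] - item_b["location"][1]
    horizontal = "right" if dx > 0 else "left" if dx < 0 else ""
    vertical = "bottom" if dy > 0 else "top" if dy < 0 else ""   # y grows downward in images
    return "center" if not (vertical or horizontal) else f"{vertical}-{horizontal}".strip("-")
```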
Semantic Analysis: The semantic similarity is evaluated between the atoms of knowledge constructed from the words W coming from Lingual Perception and the atoms of knowledge constructed from I_p_affordance coming from Visual Perception.
S(Atom_W, Atom_Ip_affordance) = |Atom_W ∩ Atom_Ip_affordance| / |Atom_W ∪ Atom_Ip_affordance|
A higher S score indicates greater similarity between Atom_W and Atom_Ip_affordance.
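Written out in Python, the similarity reduces to a Jaccard index over the two atom sets (here simplified to sets of concept/feature labels):

```python
# The Jaccard-style similarity above, for two atoms represented as sets of labels.
def jaccard_similarity(atom_w, atom_affordance):
    atom_w, atom_affordance = set(atom_w), set(atom_affordance)
    if not atom_w and not atom_affordance:
        return 0.0
    return len(atom_w & atom_affordance) / len(atom_w | atom_affordance)

# e.g., jaccard_similarity({"thirst", "drink", "liquid"}, {"drink", "liquid", "bottle"}) == 0.5
```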

5. Results

To validate the proposed methodology, we conducted multiple experiments. The experimental results are based on human collaborator interaction with the agent. In the first phase, visual perception analysis is discussed, and the subsequent phases are based on object grounding, cognitive, and semantic analysis.

5.1. Visual Perception

We trained the agent on YOLOv3 and tested it to validate the proposed methods on 160 video frames comprising a collection of 783 different household objects (see Figure 9 for a subset of video frames). The categorization of objects placed on the table-top scenarios is based on callable, cleanable, drinkable, edible, playable, readable, and writable affordance classes. Each frame contains an average of nine objects placed on various areas of the table to identify and relate the location with spoken commands.
The results are shown as confusion matrices in Figure 10. The results indicate that for the cleanable affordance, the misclassifications were mostly callable and writable. This happens in the case of the duster (cleanable) and the sponge (cleanable) because their shapes are similar to that of a cellphone (callable). In some cases, the yellow sponge was misclassified as a sticky note (writable). Furthermore, toys (playable) were misclassified as drinkable objects in 12 instances due to geometric similarities. Moreover, performance metrics were calculated for affordance recognition and can be seen in Table 4.
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Table 4 contains the false positive, false negative, true positive, and true negative counts. Based on these counts, precision, recall, and F1 score are calculated for all seven affordance classes. The results indicate that precision is good for the cleanable class but its recall is low, whereas callable has good recall but low precision. Moreover, the F1 score of callable is the lowest among the classes. The affordance learning is compared with the current state-of-the-art in Table 5. Furthermore, the objects were classified using a pre-trained COCO model for object grounding and knowledge representation. The knowledge is represented using both a conceptual graph and a semantic network. The conceptual graph is used for further action selection, and the semantic network becomes part of the semantic memory.
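The Table 4 metrics can be reproduced directly from the raw confusion counts, as in the short check below (the playable, callable, and cleanable rows are used as worked examples):

```python
# Reproducing the Table 4 metrics from the raw confusion counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)

print(prf(65, 28, 16))   # playable  -> (0.699, 0.802, 0.747)
print(prf(28, 47, 0))    # callable  -> (0.373, 1.0, 0.544)
print(prf(191, 4, 101))  # cleanable -> (0.979, 0.654, 0.784)
```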

5.2. Lingual Perception and Object Grounding

This section is based on the object grounding results and the formation of the conceptual graph and semantic network. To display the formulation of the conceptual graph, one of the previously discussed video frames is used (Figure 9a). In this phase, the information extracted from the video frames is used to determine the affordance of an object and its position with respect to the center of the frame and the positions of other objects. This information is further transformed, using the COCO model, into sentences such as “The cellphone is located at the bottom left side of the table” (see Figure 11). For this purpose, two types of graphs were generated, i.e., a conceptual graph (CG) and a semantic network (SN). A CG is generated separately for each instance in the frame. The CG is composed of two node types, i.e., conceptNodes (cellphone, located, side, table, bottom, left) and relationNodes (object, at, attr, of), whereas an empty relation is represented as “Link” (see Figure 11a). This type of graph helps the agent to check the dependency factor. In the case of “The frog is located on the left side of the table”, the nodes are slightly different, i.e., conceptNodes (frog, located, side, table, left) and relationNodes (agent, at, attr, of) (see Figure 11c). The two examples have distinct relationNodes: “cellphone” is represented as an “object”, whereas “frog” is represented as an “agent”. This information helps in understanding the nature of the item, its role, and its placement.
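A sketch of how such a conceptual graph could be assembled with networkx is shown below; the node kinds and the “Link” label follow the text, but the exact edge wiring of Figure 11a is our illustrative guess, not a reproduction of the figure.

```python
# Illustrative construction of a cellphone conceptual graph: conceptNodes and
# relationNodes are both graph nodes, distinguished by a 'kind' attribute, and
# empty relations are labelled "Link" as in the text. The edge list is assumed.
import networkx as nx

def conceptual_graph(concepts, relations, edges):
    g = nx.DiGraph()
    for c in concepts:
        g.add_node(c, kind="conceptNode")
    for r in relations:
        g.add_node(r, kind="relationNode")
    for src, dst, label in edges:
        g.add_edge(src, dst, label=label or "Link")
    return g

cg = conceptual_graph(
    concepts=["cellphone", "located", "side", "table", "bottom", "left"],
    relations=["object", "at", "attr", "of"],
    edges=[("cellphone", "object", None), ("object", "located", None),
           ("located", "at", None), ("at", "side", None),
           ("side", "attr", None), ("attr", "bottom", None), ("attr", "left", None),
           ("side", "of", None), ("of", "table", None)],
)
```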
The semantic network (SN) for a video frame (Figure 9a) is constructed to be stored in the semantic memory for future processing (see Figure 12). SN is composed of the ConceptNode and relationship in the form of edges. The edges for “LocatedAt” and “LocatedOn” indicate the path towards the position of the item, whereas “NEXT” is an empty relationship that points towards the succeeding node(s).
If the item in the frame is not recognized based on affordance (see Figure 9a), i.e., “Rubik’s cube” (as playable) and “pen” (as writeable), then the agent will not be able to ground the position and direction of an item. The grounding is formulated after the affordance recognition in the form of sentences and then as a conceptual graph (see Figure 11) and a semantic network (see Figure 12).

5.3. Cognitive Cycle Identifier

This section is based on the results from the Bloom-based Cognitive Cycle Identifier. In this phase, the verbal sensory stimuli are analyzed for action selection, e.g., “How many items are present on the table?”, “How many objects belong to a drinkable category?”, “Which object is used to reduce the hunger?”, and “Which item is used to reduce the intensity of thirst?”. The action verbs are further analyzed to identify the “cognitive domain”, as described in Bloom’s revised taxonomy [12]. After the identification of the “cognitive domain”, the agent chooses its actions among “Blob Detection and Counting”, “Affordance Recognition”, and “Jaccard Semantic Similarity” (see Figure 13, Figure 14, Figure 15, Figure 16, Figure 17 and Figure 18). The results shown here are encouraging and represent an important step towards an advancement in perceptual semantics in cognitive robots.
The Universal Robot (UR5)-based demonstrations can be accessed through Table 6.

6. Conclusions

In this work, we proposed perceptual and semantic processing for human–robot interaction in the agent. The contribution of the proposed work is the extension of affordance learning, Bloom’s taxonomy as a cognitive cycle, object grounding, and perceptual semantics. The experiments were conducted on the agent using 160 video frames with household objects in a table-top scenario and human cues that contained implicit instructions. The results suggest that the overall HRI experience was improved due to the proposed method and the agent was able to address implicit lingual cues (see Table 6).

Author Contributions

Conceptualization, W.M.Q.; Methodology, S.T.S.B. and W.M.Q.; Supervision, W.M.Q.; Validation, S.T.S.B.; Writing—original draft, S.T.S.B.; Writing—review & editing, W.M.Q. Both authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Services Syndicate Private Limited for providing access to Universal Robot (UR5) for experimentations.

Acknowledgments

The authors acknowledge the support of Services Syndicate Private Limited for providing access to the Universal Robot (UR5) for experimentation.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Layer configurations.
| No. | Layer Type | Layer | Filters | Concatenation | Size/Stride (dil) | Output |
| 0 | Convolutional | conv | 32 | | 3 × 3 / 1 | 608 × 608 × 32 |
| 1 | | conv | 64 | | 3 × 3 / 2 | 304 × 304 × 64 |
| 2 | | conv | 32 | | 1 × 1 / 1 | 304 × 304 × 32 |
| 3 | | conv | 64 | | 3 × 3 / 1 | 304 × 304 × 64 |
| 4 | Residual | Shortcut Layer | | | | 304 × 304 × 64 |
| 5 | Convolutional | conv | 128 | | 3 × 3 / 2 | 152 × 152 × 128 |
| 2 × | | conv | 64 | | 1 × 1 / 1 | 152 × 152 × 64 |
| | | conv | 128 | | 3 × 3 / 1 | 152 × 152 × 128 |
| 11 | Residual | Shortcut Layer | | | | 152 × 152 × 128 |
| 12 | Convolutional | conv | 256 | | 3 × 3 / 2 | 76 × 76 × 256 |
| 8 × | | conv | 128 | | 1 × 1 / 1 | 76 × 76 × 128 |
| | | conv | 256 | | 3 × 3 / 1 | 76 × 76 × 256 |
| 36 | Residual | Shortcut Layer | | | | 76 × 76 × 256 |
| 37 | Convolutional | conv | 512 | | 3 × 3 / 2 | 38 × 38 × 512 |
| 8 × | | conv | 256 | | 1 × 1 / 1 | 38 × 38 × 256 |
| | | conv | 512 | | 3 × 3 / 1 | 38 × 38 × 512 |
| 61 | Residual | Shortcut Layer | | | | 38 × 38 × 512 |
| 62 | Convolutional | conv | 1024 | | 3 × 3 / 2 | 19 × 19 × 1024 |
| 4 × | | conv | 512 | | 1 × 1 / 1 | 19 × 19 × 512 |
| | | conv | 1024 | | 3 × 3 / 1 | 19 × 19 × 1024 |
| 74 | Residual | Shortcut Layer | | | | 19 × 19 × 1024 |
| 3 × | Convolutional | conv | 512 | | 1 × 1 / 1 | 19 × 19 × 512 |
| 80 | | conv | 1024 | | 3 × 3 / 1 | 19 × 19 × 1024 |
| 81 | | conv | 39 | | 1 × 1 / 1 | 19 × 19 × 39 |
| 82 | Detection | yolo | | | | |
| 83 | | route 79 -> | | | | |
| 84 | Convolutional | conv | 256 | | 1 × 1 / 1 | 19 × 19 × 256 |
| 85 | Upsampling | upsample | | | 2× | 38 × 38 × 256 |
| 86 | | route: 85 -> 61 | | 85, 61 | | 38 × 38 × 768 |
| 3 × | Convolutional | conv | 256 | | 1 × 1 / 1 | 38 × 38 × 256 |
| 92 | | conv | 512 | | 3 × 3 / 1 | 38 × 38 × 512 |
| 93 | | conv | 39 | | 1 × 1 / 1 | 38 × 38 × 39 |
| 94 | Detection | yolo | | | | |
| 95 | | route 91 -> | | | | |
| 96 | Convolutional | conv | 128 | | 1 × 1 / 1 | 38 × 38 × 128 |
| 97 | Upsampling | upsample | | | 2× | 76 × 76 × 128 |
| 98 | | route: 97 -> 36 | | 97, 36 | | 76 × 76 × 384 |
| 3 × | Convolutional | conv | 128 | | 1 × 1 / 1 | 76 × 76 × 128 |
| 104 | | conv | 256 | | 3 × 3 / 1 | 76 × 76 × 256 |
| 105 | | conv | 39 | | 1 × 1 / 1 | 76 × 76 × 39 |
| 106 | Detection | yolo | | | | |
Figure A1. Affordance training iterations graph.

References

  1. Dubba, K.S.R.; Oliveira, M.R.d.; Lim, G.H.; Kasaei, H.; Lopes, L.S.; Tome, A.; Cohn, A.G. Grounding Language in Perception for Scene Conceptualization in Autonomous Robots. In Proceedings of the AAAI 2014 Spring Symposium, Palo Alto, CA, USA, 24–26 March 2014. [Google Scholar]
  2. Kotseruba, I.; Tsotsos, J.K. 40 years of cognitive architectures: Core cognitive abilities and practical applications. Artif. Intell. Rev. 2020, 53, 17–94. [Google Scholar] [CrossRef] [Green Version]
  3. Oliveira, M.; Lopes, L.S.; Lim, G.H.; Kasaei, S.H.; Tomé, A.M.; Chauhan, A. 3D object perception and perceptual learning in the RACE project. Robot. Auton. Syst. 2016, 75, 614–626. [Google Scholar] [CrossRef]
  4. Oliveira, M.; Lim, G.H.; Lopes, L.S.; Kasaei, S.H.; Tomé, A.M.; Chauhan, A. A perceptual memory system for grounding semantic representations in intelligent service robots. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2014; pp. 2216–2223. [Google Scholar]
  5. Lopes, M.; Melo, F.S.; Montesano, L. Affordance-based imitation learning in robots. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 30 November 2006; IEEE: New York, NY, USA, 2007; pp. 1015–1021. [Google Scholar]
  6. Mi, J.; Tang, S.; Deng, Z.; Goerner, M.; Zhang, J. Object affordance based multimodal fusion for natural Human-Robot interaction. Cogn. Syst. Res. 2019, 54, 128–137. [Google Scholar] [CrossRef]
  7. Sowa, J.F. The Cognitive Cycle. In Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), Lodz, Poland, 13–16 September 2015; IEEE: New York, NY, USA, 2015; Volume 5, pp. 11–16. [Google Scholar]
  8. McCall, R.J. Fundamental Motivation and Perception for a Systems-Level Cognitive Architecture. Ph.D. Thesis, The University of Memphis, Memphis, TN, USA, 2014. [Google Scholar]
  9. Paraense, A.L.; Raizer, K.; de Paula, S.M.; Rohmer, E.; Gudwin, R.R. The cognitive systems toolkit and the CST reference cognitive architecture. Biol. Inspired Cogn. Archit. 2016, 17, 32–48. [Google Scholar] [CrossRef]
  10. Blanco, B.; Fajardo, J.O.; Liberal, F. Design of Cognitive Cycles in 5G Networks. In Collaboration in A Hyperconnected World; Springer Science and Business Media LLC: London, UK, 2016; pp. 697–708. [Google Scholar]
  11. Madl, T.; Baars, B.J.; Franklin, S. The Timing of the Cognitive Cycle. PLoS ONE 2011, 6, e14803. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Krathwohl, D.R. A Revision of Bloom’s Taxonomy: An Overview. Theory Pract. 2002, 41, 213–264. [Google Scholar]
  13. Qazi, W.M.; Bukhari, S.T.S.; Ware, J.A.; Athar, A. NiHA: A Conscious Agent. In Proceedings of the COGNITIVE 2018, The Tenth International Conference on Advanced Cognitive Technologies and Applications, Barcelona, Spain, 18–22 February 2018; pp. 78–87. [Google Scholar]
  14. Marques, H.G. Architectures for Embodied Imagination. Neurocomputing 2009, 72, 743–759. [Google Scholar] [CrossRef]
  15. Samsonovich, A.V. On a roadmap for the BICA Challenge. Biol. Inspired Cogn. Archit. 2012, 1, 100–107. [Google Scholar] [CrossRef]
  16. Breux, Y.; Druon, S.; Zapata, R. From Perception to Semantics: An Environment Representation Model Based on Human-Robot Interactions. In Proceedings of the 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Nanjing and Tai’an, China, 27–31 August 2018; IEEE: New York, NY, USA, 2018; pp. 672–677. [Google Scholar] [CrossRef]
  17. Bornstein, M.H.; Gibson, J.J. The Ecological Approach to Visual Perception. J. Aesthet. Art Crit. 1980, 39, 203. [Google Scholar] [CrossRef]
  18. Cruz, F.; Magg, S.; Weber, C.; Wermter, S. Training Agents With Interactive Reinforcement Learning and Contextual Affordances. IEEE Trans. Cogn. Dev. Syst. 2016, 8, 271–284. [Google Scholar] [CrossRef]
  19. Min, H.; Yi, C.; Luo, R.; Zhu, J.; Bi, S. Affordance Research in Developmental Robotics: A Survey. IEEE Trans. Cogn. Dev. Syst. 2016, 8, 237–255. [Google Scholar] [CrossRef]
  20. Kjellström, H.; Romero, J.; Kragić, D. Visual object-action recognition: Inferring object affordances from human demonstration. Comput. Vis. Image Underst. 2011, 115, 81–90. [Google Scholar] [CrossRef]
  21. Thomaz, A.L.; Cakmak, M. Learning about objects with human teachers. In Proceedings of the 2009 4th ACM/IEEE International Conference on Human-Robot Interaction (HRI), San Diego, CA, USA, 11–13 March 2009; IEEE: New York, NY, USA, 2009; pp. 15–22. [Google Scholar]
  22. Wang, C.; Hindriks, K.V.; Babuška, R. Robot learning and use of affordances in goal-directed tasks. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2013; pp. 2288–2294. [Google Scholar]
  23. Nguyen, A.; Kanoulas, D.; Muratore, L.; Caldwell, D.G.; Tsagarakis, N.G. Translating Videos to Commands for Robotic Manipulation with Deep Recurrent Neural Networks. 2017. Available online: https://www.researchgate.net/publication/320180040_Translating_Videos_to_Commands_for_Robotic_Manipulation_with_Deep_Recurrent_Neural_Networks (accessed on 17 September 2018).
  24. Myers, A.; Teo, C.L.; Fermuller, C.; Aloimonos, Y. Affordance detection of tool parts from geometric features. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; IEEE: New York, NY, USA, 2015; pp. 1374–1381. [Google Scholar]
  25. Moldovan, B.; Raedt, L.D. Occluded object search by relational affordances. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 7 June 2014; IEEE: New York, NY, USA, 2014; pp. 169–174. [Google Scholar]
  26. Nguyen, A.; Kanoulas, D.; Caldwell, D.G.; Tsagarakis, N.G. Object-based affordances detection with Convolutional Neural Networks and dense Conditional Random Fields. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; pp. 5908–5915. [Google Scholar]
  27. Antunes, A.; Jamone, L.; Saponaro, G.; Bernardino, A.; Ventura, R. From human instructions to robot actions: Formulation of goals, affordances and probabilistic planning. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA); Institute of Electrical and Electronics Engineers (IEEE), Stockholm, Sweden, 16–21 May 2016; IEEE: New York, NY, USA, 2016; pp. 5449–5454. [Google Scholar]
  28. Tenorth, M.; Beetz, M. Representations for robot knowledge in the KnowRob framework. Artif. Intell. 2017, 247, 151–169. [Google Scholar] [CrossRef]
  29. Roy, D.; Hsiao, K.-Y.; Mavridis, N. Mental Imagery for a Conversational Robot. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2004, 34, 1374–1383. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  30. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson Education, Inc.: Upper Saddle River, NJ, USA, 1994. [Google Scholar]
  31. Madl, T.; Franklin, S.; Chen, K.; Trappl, R. A computational cognitive framework of spatial memory in brains and robots. Cogn. Syst. Res. 2018, 47, 147–172. [Google Scholar] [CrossRef] [Green Version]
  32. Shaw, D.B. Robots as Art and Automation. Sci. Cult. 2018, 27, 283–295. [Google Scholar] [CrossRef]
  33. Victores, J.G. Robot Imagination System; Universidad Carlos III de Madrid: Madrid, Spain, 2014. [Google Scholar]
  34. Diana, M.; De La Croix, J.-P.; Egerstedt, M. Deformable-medium affordances for interacting with multi-robot systems. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2013; pp. 5252–5257. [Google Scholar]
  35. Fallon, M.; Kuindersma, S.; Karumanchi, S.; Antone, M.; Schneider, T.; Dai, H.; D’Arpino, C.P.; Deits, R.; DiCicco, M.; Fourie, D.; et al. An Architecture for Online Affordance-based Perception and Whole-body Planning. J. Field Robot. 2014, 32, 229–254. [Google Scholar] [CrossRef] [Green Version]
  36. Sun, Y.; Ren, S.; Lin, Y. Object–object interaction affordance learning. Robot. Auton. Syst. 2014, 62, 487–496. [Google Scholar] [CrossRef]
  37. Hart, S.; Dinh, P.; Hambuchen, K. The Affordance Template ROS package for robot task programming. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 26 2015; IEEE: New York, NY, USA, 2015; pp. 6227–6234. [Google Scholar]
  38. Gago, J.J.; Victores, J.G.; Balaguer, C. Sign Language Representation by TEO Humanoid Robot: End-User Interest, Comprehension and Satisfaction. Electronics 2019, 8, 57. [Google Scholar] [CrossRef] [Green Version]
  39. Pandey, A.K.; Alami, R. Affordance graph: A framework to encode perspective taking and effort based affordances for day-to-day human-robot interaction. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems; Institute of Electrical and Electronics Engineers (IEEE), Tokyo, Japan, 3–7 November 2013; IEEE: New York, NY, USA, 2013; pp. 2180–2187. [Google Scholar]
  40. Bukhari, S.T.S.; Qazi, W.M.; Intelligent Machines & Robotics Group, COMSATS University Islamabad, Lahore Campus. Affordance Dataset. 2019. Available online: https://github.com/stsbukhari/Dataset-Affordance (accessed on 8 September 2021).
  41. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
  42. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  43. Taylor, A.; Marcus, M.; Santorini, B. The Penn Treebank: An Overview. Treebanks 2003, 20, 5–22. [Google Scholar]
  44. Yahya, A.A.; Osman, A.; Taleb, A.; Alattab, A.A. Analyzing the Cognitive Level of Classroom Questions Using Machine Learning Techniques. Procedia-Soc. Behav. Sci. 2013, 97, 587–595. [Google Scholar] [CrossRef] [Green Version]
  45. Sowa, J.F. Semantic Networks. In Encyclopedia of Cognitive Science; American Cancer Society: Chicago, IL, USA, 2006. [Google Scholar]
  46. Sowa, J.F. Conceptual graphs as a universal knowledge representation. Comput. Math. Appl. 1992, 23, 75–93. [Google Scholar] [CrossRef] [Green Version]
  47. Do, T.-T.; Nguyen, A.; Reid, I. AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; IEEE: New York, NY, USA, 2018; pp. 1–5. [Google Scholar]
  48. Myers, A. From Form to Function: Detecting the Affordance of Tool Parts using Geometric Features and Material Cues. Ph.D. Thesis, University of Maryland, College Park, MD, USA, 2016. [Google Scholar]
  49. Jiang, Y.; Koppula, H.; Saxena, A.; Saxena, A. Hallucinated Humans as the Hidden Context for Labeling 3D Scenes. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 18–20 June 1996; IEEE: New York, NY, USA, 2013; pp. 2993–3000. [Google Scholar]
  50. Koppula, H.S.; Jain, A.; Saxena, A. Anticipatory Planning for Human-Robot Teams. In Experimental Robotics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 453–470. [Google Scholar]
  51. Baleia, J.; Santana, P.; Barata, J. On Exploiting Haptic Cues for Self-Supervised Learning of Depth-Based Robot Navigation Affordances. J. Intell. Robot. Syst. 2015, 80, 455–474. [Google Scholar] [CrossRef] [Green Version]
  52. Chu, F.-J.; Xu, R.; Vela, P.A. Learning Affordance Segmentation for Real-World Robotic Manipulation via Synthetic Images. IEEE Robot. Autom. Lett. 2019, 4, 1140–1147. [Google Scholar] [CrossRef]
Figure 1. NiHA’s minimal architecture for a cognitive robot [13].
Figure 2. System overview.
Figure 3. Affordance learning: YOLOv3 architecture.
Figure 4. Natural language processing module.
Figure 5. Knowledge representation module.
Figure 6. Image grid (3 × 3).
Figure 7. Grid reference point.
Figure 8. Relative position of objects (items).
Figure 9. Affordance recognition results of 9 frames out of 160 in total.
Figure 10. Affordance recognition confusion matrices.
Figure 11. Conceptual graphs.
Figure 12. Semantic network.
Figure 13. (a) Original frame, (b) blob detection and counting for query, (c) query.
Figure 14. (a) Original frame, (b) object affordance results for query, (c) query.
Figure 15. (a) Similarity score, (b) semantic network for query, (c) query.
Figure 16. (a) Similarity score, (b) semantic network for query, (c) query.
Figure 17. (a) Similarity score, (b) semantic network for query, (c) query.
Figure 18. (a) Similarity score, (b) semantic network for query, (c) query.
Table 1. State-of-the-art comparison with the proposed method.
| Work | Platform | Task | Perception | Data Source | Control Structure | Grounding | Affordance Dataset | Knowledge Base/Ontology | Evaluation Metric/Method |
| [20] | N/A | Object Manipulation | Visual | Demonstration | No | No | 6 Categories/330 Views | No | Accuracy |
| [3] | PR2 | Object Manipulation | Visual | Interaction | RACE | – | 10 Categories/339 Views | Semantic Memory | Accuracy |
| [33] | TEO | Object Manipulation | Visual | Labels | Robot Imagination System | Yes | Geometric Shapes | No | Token Test |
| [38] | TEO | Object Manipulation | Visual | Labels | – | Yes | 10 Classes/30 Sign Symbols | No | Accuracy |
| [34] | Khepera III | Navigation | Visual | Labels | Multi-robot Control System | No | N/A | No | N/A |
| [39] | PR2 | Action Prediction | Visuo-Spatial | Interaction | – | N/A | N/A | Graph | N/A |
| [35] | Atlas | Manipulation/Navigation | Visual | Labels | Yes | No | 8 Classes | N/A | N/A |
| [27] | iCub | Object Manipulation | Visual | Heuristic | Yes | Yes | N/A | Semantic Memory | N/A |
| [21] | Bioloid | Object Manipulation | Visual | Labels | C5M | No | 5 Classes/4 Affordance Classes | N/A | Accuracy |
| [22] | NAO | Action Prediction/Navigation | Visual | Labels/Trial & Error | Yes | No | 8 Action Classes | N/A | Accuracy |
| [18] | iCub | Object Manipulation | Visual | Heuristic | No | Yes | N/A | N/A | Accuracy |
| [23] | N/A | Action Prediction | Visual | Heuristic/Labels | No | No | 9 Classes/10 Object Categories | N/A | Weighted F Measure |
| [24] | N/A | Object Manipulation | Visual | Labels | No | No | 7 Classes/105 Objects | N/A | Recognition Accuracy |
| [36] | Fanuc | Object Manipulation | Visual | Labels | Yes | No | 13 Classes | N/A | Accuracy |
| [25] | N/A | Action Prediction | Visual | Labels | No | No | 13 Classes/6 Action Affordances | N/A | Accuracy |
| [37] | Valkyrie | Object Manipulation | Visual | Labels | Yes | No | N/A | N/A | N/A |
| [28] | PR2 | Object Manipulation | Visual | Labels | Yes | No | N/A | KnowRob | N/A |
| [26] | Walk-man | Action Prediction | Visual | Labels | No | No | 10 Object Classes/9 Affordance Classes | No | Weighted F Measure |
| [16] | Wheeled Robot | Object Manipulation | Visual/Auditory | Labels | Yes | Yes | N/A | Knowledge Graph | N/A |
| Ours | UR5 | Grasping/Object Manipulation | Visual/Auditory | Labels/Heuristic | NiHA | Yes | 7 Affordance Classes | Semantic Memory | F1 Measure/Semantic Similarity |
Table 2. ConceptNet to semantic memory node and edge transformation detail.
ConceptNet to Semantic Memory
| Items | Original Terms | Attached to | Adopted Terms | Attached to |
| Unit of Knowledge | Edge or Assertion | ConceptNet | Concept Node | Graph-Based Ontology |
| Attributes Fields | – | Assertions | Properties | Nodes/Edges |
| Attribute_1 | Uri | Assertion | conceptUri | Node |
| Attribute_2 | rel | Assertion | RelationShip Type | Edge |
| Attribute_3 | start (Concept) | Assertion | Concept Node | Node |
| Attribute_4 | end (Concept) | Assertion | Concept Node | Node |
| Attribute_5 | weight | Assertion | weight | Edge |
| Attribute_6 | sources | Assertion | – | – |
| Attribute_7 | license | Assertion | – | – |
| Attribute_8 | dataset | Assertion | dataset | Edge |
| Attribute_9 | surfaceText | Assertion | Name | Node |
| – | – | – | pos (extracted from Uri) | Node |
| – | – | – | Id (extracted from Uri) | Node |
| – | – | – | <id> (Graph Index) | Node/Edge |
Table 3. WordNet to semantic memory nodes and edge transformation detail.
WordNet to Semantic Memory
| Items | Adopted Terms | Attached to |
| Hyponym | IsA | Edge |
| Hypernym | IsA | Edge |
| Member Holonym | PartOf | Edge |
| Substance Holonym | PartOf | Edge |
| Part Holonym | PartOf | Edge |
| Member Meronym | PartOf | Edge |
| Substance Meronym | PartOf | Edge |
| Part Meronym | PartOf | Edge |
| Topic Domain | Domain | Edge |
| Region Domain | Domain | Edge |
| Usage Domain | Domain | Edge |
| Attribute | Attribute | Edge |
| Entailment | Entailment | Edge |
| Causes | Causes | Edge |
| Also See | AlsoSee | Edge |
| Verb Group | VerbGroup | Edge |
| Similar To | SimilarTo | Edge |
Table 4. Performance metrics: precision, recall, F1 score.
| Affordance Class | True Positive | False Positive | False Negative | True Negative | Precision | Recall | F1 Score |
| Playable | 65 | 28 | 16 | 674 | 0.699 | 0.802 | 0.747 |
| Readable | 80 | 4 | 0 | 699 | 0.952 | 1.000 | 0.976 |
| Writeable | 81 | 53 | 5 | 644 | 0.604 | 0.942 | 0.736 |
| Callable | 28 | 47 | 0 | 708 | 0.373 | 1.000 | 0.544 |
| Cleanable | 191 | 4 | 101 | 487 | 0.979 | 0.654 | 0.784 |
| Drinkable | 89 | 14 | 17 | 663 | 0.864 | 0.840 | 0.852 |
| Edible | 98 | 1 | 12 | 672 | 0.990 | 0.891 | 0.938 |
| Average | | | | | 0.78 | 0.87 | 0.80 |
Table 5. State-of-the-art comparison with proposed affordance learning.
| Work | Affordance/Objects | Robotic Task | Size of Dataset | Evaluation Metrics |
| [47] | 9 Classes/10 Object Categories | Action Prediction | 8835 RGB Images | Weighted F Score = 73.35 |
| [48] | 7 Classes/105 Objects | Object Manipulation | 30,000 RGB-D Image Pairs | Recognition Accuracy = 95.0% |
| [6] | 7 Classes/42 Objects | Object Grasp | 8960 RGB Images | Recognition Accuracy = 100% |
| [49] | 28 Homes/24 Offices/17 Classes | Action Prediction | 550 RGB-D Views | Max Precision = 88.40 |
| [50] | 17 Classes | Action Prediction | 250 RGB-D Videos | Time Saving Accuracy |
| [51] | 9 Objects | Object Manipulation | RGB-D Images | Confidence Level |
| [52] | 7 Classes/17 Categories/105 Objects | Object Manipulation | 28k+ RGB-D Images | Weighted F Measure |
| Ours | 7 Classes/26 Objects (Originally 8 Classes/30 Objects) | Object Grasp | 7622 (Originally 8538) RGB Images | Average F Score = 0.80 |
Table 6. Links to demonstration videos.
| Human Cues | Video |
| I am feeling thirsty | https://youtu.be/A16Q0Od7vg4 (accessed on 8 September 2021) |
| I am hungry, I need something to eat. | https://youtu.be/YJe9CCo1z-M (accessed on 8 September 2021) |
| Give me anything to play a video game. | https://youtu.be/R46WCwMzryc (accessed on 8 September 2021) |
| I am hungry (unsuccessful) | https://youtu.be/f2vJswBkpZs (accessed on 8 September 2021) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
