Electronics
  • Article
  • Open Access

10 September 2021

Perceptual and Semantic Processing in Cognitive Robots

Intelligent Machines & Robotics, Department of Computer Science, COMSATS University Islamabad, Lahore 54000, Pakistan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Cognitive Robotics

Abstract

The challenge in human–robot interaction is to build an agent that can act upon implicit human statements, i.e., instructions to execute tasks without explicit utterances. Understanding what to do in such scenarios requires the agent to be able to ground objects and learn affordances from its acquired knowledge. Affordance has been the driving force for agents to construct relationships between objects, their effects, and actions, whereas grounding is effective for understanding the spatial map of objects present in the environment. The main contribution of this paper is a methodology for the extension of object affordance and grounding, a Bloom-based cognitive cycle, and the formulation of perceptual semantics for context-based human–robot interaction. In this study, we implemented YOLOv3 to formulate visual perception and an LSTM to identify the level of the cognitive cycle, as cognitive processes are synchronized within the cycle. In addition, we used semantic networks and conceptual graphs to represent knowledge along the dimensions related to the cognitive cycle. The visual perception achieved an average precision of 0.78, an average recall of 0.87, and an average F1 score of 0.80, which supported the generation of semantic networks and conceptual graphs. The similarity index used for the lingual and visual association showed promising results and improved the overall experience of human–robot interaction.

1. Introduction

This paper proposes an affordance- and grounding-based approach for the formation of perceptual semantics in robots for human–robot interaction (HRI). Perceptual semantics play a vital role in ensuring that a robot understands its environment and the implications of its actions [1,2]. The challenge is to build robots with the ability to process their acquired knowledge into perception and affordance [1,2,3,4,5,6]. In this context, the significance of affordance can be illustrated by the following scenario taken from human–human interaction (HHI): if we state to another human “X” that “I am feeling thirsty” rather than “I want to drink a beverage using a red bottle”, the human “X” will be able to understand the relationship between “drink” and “thirst”. The link between these two terms is “thirst causes the desire to drink”. This ability to establish a relationship between “drink” and “thirst” based on semantic analysis is called affordance. Consequently, the human “X” will offer something to drink. Let us assume we have a robot with the ability to process affordance and a similar situation arises in human–robot interaction; it is then expected that the robot may perform a similar action to the human “X” in HHI. For robots, this type of interaction is currently a challenge, although there are various contributions in this direction [5,6] that focus on visual affordance.
In the scenario presented above, visual affordance alone is not sufficient to understand the relationship between “thirst” and “drink”. The robot also needs to ground the objects placed on the table. Object grounding is an approach that allows the robot to profile objects in the environment [6]; for example, the question “How many objects belong to a drinkable category?” is answered with a response that includes the position of the object, such as “There is one drinkable object located at the left side of the table”. This challenge becomes even more complex when it is implemented for cognitive robots, because their design rationale considers factors such as internal regularization, control, and synchronization of autonomous processes through a cognitive cycle (understanding, attending, and acting) [7,8,9,10,11]. A reference cognitive cycle may consist of a variant of the phases of perception, understanding, and action [7,11]. In this study, we adopted an extended version [12] of Bloom’s taxonomy as the cognitive cycle. The reason for using the Bloom-based cognitive cycle is that it provides a map between the level of cognitive processes and the type of knowledge domain; the detailed Bloom taxonomy map is provided in a previous paper [12]. In addition, Bloom’s taxonomy uses action verbs to steer the cognitive–knowledge-domain map [12]. The control structure used in this study is shown in Figure 1, which is an extract from our previously reported work [13]. The detailed utilization of the control structure in Figure 1 is discussed in Section 4.
Figure 1. NiHA’s minimal architecture for a cognitive robot [13].
In this study, we propose perceptual semantics based on extended object grounding and machine perception (visual and lingual). In this regard, we performed a table-top experiment using a Universal Robot (UR5). A dataset for affordance learning comprising 7622 images (Section 4.1) was prepared for the training of a YOLOv3-based perception module (Section 4.1). A Bloom-based cognitive cycle identifier was also implemented for the identification of cognitive levels (see Section 4.2). The semantic memory was constructed from ConceptNet and WordNet, comprising 1.47 million nodes and 3.13 million relationships (see Section 4.3). Our analysis of the experimental results (see Section 5) suggests that perceptual learning alone is not sufficient to make sense of the environment; the inclusion of seed knowledge is important to understand the extended affordance features (i.e., the relationship between “drink” and “thirst”). Moreover, the inclusion of a cognitive cycle identifier helps the robot to choose between “what to reply”, “what not to reply”, “when to reply”, and “what the procedure would be”. The work reported in this paper is an effort to contribute to the advancement of building robots with a better understanding of their environment.

3. Problem Formulation

This study proposes a method for human–robot interaction (HRI). For this purpose, constructing the semantic memory $S_m$ of an agent from atoms of knowledge ($Atom$) is an essential first step. The atoms of knowledge are generated from visual and auditory sensory stimuli. The human-centered environment consists of household items (objects) present on the table.
Let the items that exist in the workspace be represented as $I = \{i_1, i_2, i_3, \ldots, i_n\}$, and the properties of an item be represented as $I_p = \{name, affordance, location, direction\}$. The affordance of an item is defined as $I_p^{affordance} = \{callable, cleanable, drinkable, edible, playable, readable, writable\}$, and the location $I_p^{location}$ gives the parameters of the item with respect to the visual frame. The direction of an item is expressed with respect to the center of the visual frame as $I_p^{direction}: Item \times Item \rightarrow [center, right, left, top, bottom, bottom\text{-}left, bottom\text{-}right, top\text{-}right, top\text{-}left]$.
Let the auditory stream consist of $m$ words, where each word $W$ belongs to $\{N, Adj, V\}$. A word can be recognized as a noun node $N = \{NN, NNS, NNP, NNPS\}$, an adjective node $Adj = \{JJ, JJR, JJS\}$, or a verb node $V = \{VB, VBD, VBP, VBN, VBG, VBZ\}$. The nouns, adjectives, and verbs are checked against an a priori knowledge base of concepts and features. The concepts are defined as $C = \{c_1, c_2, c_3, \ldots, c_k\}$ and the features as $F = \{f_1, f_2, f_3, \ldots, f_k\}$. The atom of knowledge is represented as $Atom = \{c, c, c, f\}$, and the semantic memory of the system is the collection of $k$ atoms of knowledge, $S_m = \{Atom_1, Atom_2, Atom_3, \ldots, Atom_k\}$. Let the cognitive cycle based on Bloom’s revised taxonomy select a cognitive level $Cog_{level} = [Perception,\ Understanding\text{-}Comprehension,\ Execution\text{-}Control,\ Post\text{-}Execution\ Analysis,\ Evaluation,\ Synthesis]$. The knowledge dimension $Knowledge_{Dimension} = [Factual, Conceptual, Procedural, MetaCognition]$ is selected based on the action verbs proposed in Bloom’s revised taxonomy. The two-dimensional array (matrix) is represented as $B_{Cog_{matrix}} = Cog_{level} \times Knowledge_{Dimension}$, and the selected cognitive cycle is an instance of this matrix, $B_{Cog_{cycle}} = (Cog_{level_i}, Knowledge_{Dimension_i})$.
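To make the formulation concrete, the following Python sketch encodes the item properties $I_p$, the atom of knowledge, and the Bloom cognitive matrix $B_{Cog_{matrix}}$. It is an illustrative fragment under our own assumptions: the class and field names (e.g., ItemProperties, AtomOfKnowledge) and the example values are ours, not part of the implementation described in this paper.

```python
from dataclasses import dataclass
from itertools import product

AFFORDANCES = ["callable", "cleanable", "drinkable", "edible",
               "playable", "readable", "writable"]
DIRECTIONS = ["center", "right", "left", "top", "bottom",
              "bottom-left", "bottom-right", "top-right", "top-left"]
COG_LEVELS = ["Perception", "Understanding-Comprehension", "Execution-Control",
              "Post-Execution Analysis", "Evaluation", "Synthesis"]
KNOWLEDGE_DIMENSIONS = ["Factual", "Conceptual", "Procedural", "MetaCognition"]

@dataclass
class ItemProperties:
    """Properties I_p of an item i in the workspace."""
    name: str
    affordance: str      # one of AFFORDANCES
    location: tuple      # coordinates in the visual frame
    direction: str       # one of DIRECTIONS, relative to the frame center

@dataclass
class AtomOfKnowledge:
    """Atom = {c, c, c, f}: three concepts and one feature."""
    concepts: tuple
    feature: str

# B_Cog_matrix = Cog_level x Knowledge_Dimension; a selected cognitive cycle is one cell.
B_COG_MATRIX = list(product(COG_LEVELS, KNOWLEDGE_DIMENSIONS))

# Illustrative instances only.
bottle = ItemProperties("bottle", "drinkable", (120, 340), "left")
semantic_memory = [AtomOfKnowledge(("thirst", "causes desire", "drink"), "drinkable")]
```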

4. Methodology

This section explains the methodology for the development of the artifacts highlighted in the problem formulation. These artifacts include perception (i.e., visual and lingual), working memory (i.e., object grounding and semantic similarity analysis), and the construction of semantic memory (seed knowledge and explicit knowledge). The lingual perception is further divided into a knowledge representation module, a cognitive cycle identifier, and a natural language processing module. The core architecture is depicted in Figure 2.
Figure 2. System overview.

4.1. Visual Perception

The visual perception module operates on multiple levels: the first performs affordance learning and the second performs item name identification.
Affordance Learning: The affordance module is trained on a dataset [40] consisting of objects commonly used in the household. The 30 items chosen to date can be categorized as callable, cleanable, drinkable, edible, playable, readable, writable, and wearable [6]. A total of 8538 images were taken with a Samsung Galaxy 7 camera. The system (see Figure 2) was trained to recognize seven classes, i.e., callable, cleanable, drinkable, edible, playable, readable, and writable; 7622 images were used for training, excluding the wearable category. The system was trained with YOLOv3 [41] to identify the items placed in the table-top setup, using 18,000 iterations with an average loss of 0.1176 (see Appendix A, Figure A1). The architecture of YOLOv3 with its configuration is shown in Figure 3, and the detailed configuration of the training pipeline is presented in Table A1 in Appendix A.
Figure 3. Affordance learning: YOLOv3 architecture.
Item Name Identification: The items classified by affordance learning are further assigned names, e.g., a Drinkable item is named as a Bottle or Cup. For this purpose, a pre-trained YOLOv3 classifier [42] is used to identify the names of commonly used household items. This module uses the position of the item determined by the YOLOv3 classifier to localize the detected object in the table-top setup. The system returns the item set $I = \{i_1, i_2, i_3, \ldots, i_n\}$, and the properties of the items are classified as $I_p = \{name, affordance, location, direction\}$.
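To make the two detection passes concrete, the sketch below runs a Darknet-format YOLOv3 model through OpenCV’s DNN module: one pass with affordance weights and one with a COCO-pretrained model for item names. The file names (yolov3-affordance.cfg, coco.names, etc.) and the 0.5 confidence threshold are placeholders rather than the exact training artefacts used in this study; the 608 × 608 input size follows Table A1, and non-maximum suppression is omitted for brevity.

```python
import cv2
import numpy as np

AFFORDANCE_CLASSES = ["callable", "cleanable", "drinkable", "edible",
                      "playable", "readable", "writable"]

def detect(image, cfg_path, weights_path, class_names, conf_thresh=0.5):
    """Run a single YOLOv3 pass and return (label, confidence, (cx, cy, w, h)) tuples."""
    net = cv2.dnn.readNetFromDarknet(cfg_path, weights_path)
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (608, 608), swapRB=True, crop=False)
    net.setInput(blob)
    detections = []
    for output in net.forward(net.getUnconnectedOutLayersNames()):
        for row in output:                  # row = [x, y, w, h, objectness, class scores ...]
            scores = row[5:]
            cls = int(np.argmax(scores))
            conf = float(scores[cls])
            if conf > conf_thresh:
                cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
                detections.append((class_names[cls], conf, (cx, cy, bw, bh)))
    return detections

frame = cv2.imread("tabletop_frame.jpg")    # placeholder input frame
affordances = detect(frame, "yolov3-affordance.cfg", "yolov3-affordance.weights",
                     AFFORDANCE_CLASSES)
coco_names = open("coco.names").read().splitlines()
item_names = detect(frame, "yolov3.cfg", "yolov3.weights", coco_names)
```

The affordance label and the item name obtained for the same bounding box can then be merged into the item properties $I_p$ described above.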

4.2. Lingual Perception

We developed a rule-based chatbot for the acquisition of perceptual semantics from the auditory stream. The co-worker (i.e., the human) communicates with the robot through a speech engine based on the Google Speech-to-Text API. The stream is then sent to the Natural Language Processing module for tokenization, part-of-speech tagging, named entity tagging, and basic dependency tagging. Further processing is carried out in the Knowledge Representation module for the formation of the conceptual graph and semantic network, whereas the Cognitive Cycle Identifier module is used for the classification of cognitive processes in the cycle.
Natural Language Processing: The Natural Language Processing module consists of four submodules: Tokenization, Part-of-Speech (POS) tagger, Named Entity (NE) tagger, and Basic Dependency (BD) (see Figure 4). The input stream (sentence) is tokenized in the Tokenization module and then tagged by the POS tagger. The stream is classified into nouns $N = \{NN, NNS, NNP, NNPS\}$, adjectives $Adj = \{JJ, JJR, JJS\}$, and verbs $V = \{VB, VBD, VBP, VBN, VBG, VBZ\}$ using NLTK (Natural Language Toolkit); details about the POS tags can be found in [43]. Furthermore, CoreNLP is used to identify BD and NE tags for the formulation of the atom-of-knowledge elements (concepts, relations, and features). A minimal sketch of this stage is given below Figure 4.
Figure 4. Natural language processing module.
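The following NLTK sketch illustrates the tokenization and POS filtering step; the tag sets mirror $N$, $Adj$, and $V$ from Section 3, while the CoreNLP dependency and named-entity stages are omitted here. The example sentence and the expected output are illustrative only.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS  = {"JJ", "JJR", "JJS"}
VERB_TAGS = {"VB", "VBD", "VBP", "VBN", "VBG", "VBZ"}

def classify_stream(sentence):
    """Tokenize, POS-tag, and split the stream into noun, adjective, and verb sets."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)            # Penn Treebank tags [43]
    nouns = [w for w, t in tagged if t in NOUN_TAGS]
    adjs  = [w for w, t in tagged if t in ADJ_TAGS]
    verbs = [w for w, t in tagged if t in VERB_TAGS]
    return nouns, adjs, verbs

print(classify_stream("Which item is used to reduce the intensity of thirst?"))
# e.g. (['item', 'intensity', 'thirst'], [], ['is', 'used', 'reduce'])
```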
Knowledge Representation: The Knowledge Representation module consists of Triplet Extraction, Semantic Association, and Atom of Knowledge (see Figure 5). Knowledge is constructed after the NLP module has processed the stream. The Knowledge Representation module extracts the triplets (i.e., subject, predicate, and object) from the processed sentences. The predicate is extracted from the previously processed verb set $V$, whereas the subject is extracted from the noun set $N$; the remaining element of the triplet is assigned from the adjective set $Adj$ or a noun. The association between concepts is created using an a priori knowledge base by searching the concept nodes for similarities based on relationship types such as “InstanceOf”, “IsA”, “PartOf”, “DerivedFrom”, “Synonym”, “CreatedBy”, “HasProperty”, “UsedFor”, “HasA”, “FormOf”, and “RelatedTo”. Based on these associations, the atom of knowledge is constructed (an illustrative sketch is given below Figure 5).
Figure 5. Knowledge representation module.
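The fragment below sketches one simple way such triplets and atoms could be assembled from the tagged word sets. The heuristics (first noun as subject, first verb as predicate, last adjective or remaining noun as object), the relation whitelist, and the knowledge-base format are our own reading of the description above, not the exact implementation.

```python
RELATION_TYPES = {"InstanceOf", "IsA", "PartOf", "DerivedFrom", "Synonym",
                  "CreatedBy", "HasProperty", "UsedFor", "HasA", "FormOf", "RelatedTo"}

def extract_triplet(nouns, adjs, verbs):
    """Build a (subject, predicate, object) triplet from the POS-filtered word sets."""
    subject = nouns[0] if nouns else None
    predicate = verbs[0] if verbs else None
    obj = adjs[-1] if adjs else (nouns[-1] if len(nouns) > 1 else None)
    return subject, predicate, obj

def associate(triplet, knowledge_base):
    """Keep a priori (head, relation, tail) facts whose relation is whitelisted
    and which touch the subject or object of the triplet."""
    subject, _, obj = triplet
    return [(h, r, t) for (h, r, t) in knowledge_base
            if r in RELATION_TYPES and {h, t} & {subject, obj}]

def build_atom(triplet, associations):
    """Atom of knowledge: the triplet concepts plus a feature drawn from the associations."""
    feature = associations[0][1] if associations else None
    return {"concepts": triplet, "feature": feature}
```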
Cognitive Cycle Identifier: The sentence-based sensory stimuli are evaluated by the Bloom’s-taxonomy-based cognitive module. The module is built on a Long Short-Term Memory (LSTM) network trained on an adapted dataset based on Yahya et al.’s model [44], using 300 epochs and reaching a cost of 1.903 × 10⁻⁶. The cognitive level is determined as $Cog_{level} = [Perception,\ Understanding\text{-}Comprehension,\ Execution\text{-}Control,\ Post\text{-}Execution\ Analysis,\ Evaluation,\ Synthesis]$; the cognitive levels are the dataset classes. The stream is tokenized and parsed using the Natural Language Toolkit (NLTK). The knowledge domain is then classified based on the action verbs of Bloom’s revised taxonomy [12] to determine the instance of $B_{Cog_{matrix}}$. This instance then initiates the designated cognitive process applicable to the identified knowledge dimension and action.
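A hedged Keras sketch of such a sentence-level classifier is given below. The vocabulary size, sequence length, layer widths, and the two toy training sentences are placeholders and do not reproduce the trained module or its dataset; only the six output classes and the 300 training epochs follow the description above.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

COG_LEVELS = ["Perception", "Understanding-Comprehension", "Execution-Control",
              "Post-Execution Analysis", "Evaluation", "Synthesis"]

# Toy training data; the real module uses a dataset adapted from Yahya et al. [44].
sentences = ["How many items are present on the table?",
             "Which item is used to reduce the intensity of thirst?"]
labels = np.array([0, 1])                       # indices into COG_LEVELS

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences)
x = pad_sequences(tokenizer.texts_to_sequences(sentences), maxlen=20)

model = Sequential([
    Embedding(input_dim=5000, output_dim=64),   # learned word embeddings
    LSTM(64),                                   # sequence encoder
    Dense(len(COG_LEVELS), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=300, verbose=0)

query = pad_sequences(
    tokenizer.texts_to_sequences(["How many objects belong to a drinkable category?"]), maxlen=20)
print(COG_LEVELS[int(np.argmax(model.predict(query)))])
```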

4.3. Semantic Memory

Semantic memory is constituted in both an a priori and an a posteriori manner. The seed knowledge is developed from ConceptNet and WordNet, whereas the posterior knowledge is constructed as the agent interacts with the environment and is stored in the Explicit Knowledge repository.
Seed Knowledge: Seed knowledge is constituted from atoms of knowledge drawn from WordNet and ConceptNet. The knowledge base has 1.47 million nodes and 3.13 million relationships (53 relationship types, such as AlsoSee, Antonym, AtLocation, Attribute, CapableOf, Cause, Causes, etc.). The nodes consist of 117,659 Synset nodes (WordNet), 1.33 million Concept nodes (ConceptNet), and 157,300 Lemma nodes. The Lemma nodes are extracted from Concept nodes based on their “root words” and are partially or fully matchable with Synset nodes. The semantic-memory-based seed (tacit) knowledge is represented as $S_m = \{Atom_1, Atom_2, Atom_3, \ldots, Atom_k\}$, with atoms $Atom = \{concept, concept, concept, feature\}$. The transformation of the ConceptNet and WordNet ontologies into the proposed seed knowledge, i.e., the semantic memory, can be seen in Table 2 and Table 3; an illustrative sketch follows the tables.
Table 2. ConceptNet to semantic memory node and edge transformation detail.
Table 3. WordNet to semantic memory nodes and edge transformation detail.
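The sketch below shows, under our own simplifying assumptions, how WordNet synsets and lemmas could be poured into a node-and-edge store of this kind. We use networkx purely for illustration in place of whatever graph backend holds the 1.47-million-node memory, and the HasLemma edge label and the synset cap are ours; only the IsA-style hypernym edges and the Synset/Lemma node kinds follow the description above.

```python
import networkx as nx
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def build_seed_knowledge(max_synsets=1000):
    """Transform a slice of WordNet into semantic-memory nodes and edges."""
    g = nx.MultiDiGraph()
    for i, syn in enumerate(wn.all_synsets()):
        if i >= max_synsets:                    # keep the sketch small
            break
        g.add_node(syn.name(), kind="Synset")
        for lemma in syn.lemmas():              # Lemma nodes from the "root words"
            g.add_node(lemma.name(), kind="Lemma")
            g.add_edge(syn.name(), lemma.name(), relation="HasLemma")
        for hyper in syn.hypernyms():           # IsA edges from hypernymy
            g.add_edge(syn.name(), hyper.name(), relation="IsA")
    return g

seed = build_seed_knowledge()
print(seed.number_of_nodes(), seed.number_of_edges())
```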
Explicit Knowledge: Explicit knowledge is constructed from the semantic network [45] and the conceptual graph [46] drawn from working memory. These graphs are constructed in the Knowledge Representation module.

4.4. Working Memory

Working memory acts as the executive control in the proposed system; its primary responsibility is to perform object grounding and semantic similarity analysis.
Object Grounding: The localization obtained from visual perception is further used to determine the accessibility coordinates for the robotic arm. We started with the simplest approach by dividing the table-top setup into a 3 × 3 grid, as shown in Figure 6.
Figure 6. Image grid (3 × 3).
The grid is divided into the directions defined by $I_p^{direction}: Item \times Item \rightarrow [center, right, left, top, bottom, bottom\text{-}left, bottom\text{-}right, top\text{-}right, top\text{-}left]$. This approach is sufficient to determine the exact position of an item; however, we also want to know the relative positions of the items. The grid is therefore further described with respect to a reference point, i.e., the center, as shown in Figure 7.
Figure 7. Grid reference point.
This reference point consideration is further extended to position the item relative to others as shown in Figure 8.
Figure 8. Relative position of objects (items).
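A minimal sketch of this grounding step is given below. The thresholds simply split the frame into the 3 × 3 grid of Figure 6, and the relative-position wording and function names are ours rather than the exact implementation.

```python
def grid_direction(cx, cy, frame_w, frame_h):
    """Map a bounding-box center (cx, cy) to one of the nine grid cells of Figure 6."""
    col = "left" if cx < frame_w / 3 else ("right" if cx > 2 * frame_w / 3 else "")
    row = "top" if cy < frame_h / 3 else ("bottom" if cy > 2 * frame_h / 3 else "")
    return (row + "-" + col).strip("-") or "center"    # e.g. "bottom-left", "top", "center"

def relative_position(item_a, item_b):
    """Describe item_a relative to item_b using their center coordinates (cf. Figure 8)."""
    (ax, ay), (bx, by) = item_a, item_b
    horiz = "right" if ax > bx else ("left" if ax < bx else "")
    vert = "below" if ay > by else ("above" if ay < by else "")   # image y grows downwards
    return " and ".join(p for p in (vert, horiz) if p) or "at the same position"

print(grid_direction(120, 520, 608, 608))         # -> "bottom-left"
print(relative_position((120, 520), (300, 300)))  # -> "below and left"
```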
Semantic Analysis: The semantic similarity is evaluated between the atoms of knowledge constructed from the words $W$ coming from Lingual Perception and the atoms of knowledge constructed from $I_p^{affordance}$ coming from Visual Perception.
$$S(Atom_W,\ Atom_{I_p^{affordance}}) = \frac{\left| Atom_W \cap Atom_{I_p^{affordance}} \right|}{\left| Atom_W \cup Atom_{I_p^{affordance}} \right|}$$
Higher values of $S$ indicate greater similarity between $Atom_W$ and $Atom_{I_p^{affordance}}$.
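Treating each atom as a set of its concepts and features, the score reduces to a Jaccard index and can be computed directly; the helper below and its example sets are illustrative only.

```python
def jaccard_similarity(atom_w, atom_affordance):
    """Jaccard index between two atoms of knowledge, each given as a set of concepts/features."""
    a, b = set(atom_w), set(atom_affordance)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# e.g. the lingual atom for "thirst" against the visual atom for a drinkable item
print(jaccard_similarity({"thirst", "drink", "liquid"},
                         {"bottle", "drink", "liquid", "drinkable"}))  # -> 0.4
```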

5. Results

To validate the proposed methodology, we conducted multiple experiments. The experimental results are based on a human collaborator’s interaction with the agent. In the first phase, the visual perception analysis is discussed; the subsequent phases cover object grounding, the cognitive cycle identifier, and semantic analysis.

5.1. Visual Perception

We trained the agent on YOLOv3 and tested it on 160 video frames comprising a collection of 783 household objects to validate the proposed methods (see Figure 9 for a subset of the video frames). The objects placed in the table-top scenarios are categorized into the callable, cleanable, drinkable, edible, playable, readable, and writable affordance classes. Each frame contains an average of nine objects placed in various areas of the table, so that the agent can identify the objects and relate their locations to the spoken commands.
Figure 9. Affordance recognition results of 9 frames out of 160 in total.
The results are shown as confusion matrices in Figure 10. They indicate that, for the cleanable affordance, the incorrect predictions were mostly callable and writable. This happens for the duster (cleanable) and the sponge (cleanable) because their shapes are similar to that of a cellphone (callable); in some cases, the yellow sponge was misclassified as a sticky note (writable). Furthermore, the toys (playable) were misclassified as drinkable objects in 12 instances due to geometric similarities. The performance metrics calculated for affordance recognition can be seen in Table 4.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

$$F1\ \text{Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
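For completeness, the per-class computation reduces to a few lines; the counts in the example call are made-up placeholders, not the values reported in Table 4.

```python
def classification_metrics(tp, fp, fn):
    """Per-class precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(classification_metrics(tp=70, fp=20, fn=10))  # illustrative counts -> approx. (0.78, 0.88, 0.82)
```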
Figure 10. Affordance recognition–confusion matrices.
Table 4. Performance metrics: precision, recall, F1 score.
Table 4 contains the false positive, false negative, true positive, and true negative counts. Based on these counts, precision, recall, and F1 score are calculated for all seven affordance classes. The results indicate that precision is high for the cleanable class but its recall is low, whereas callable has high recall but low precision; moreover, the F1 score of callable is the lowest among the classes. The affordance learning is compared with the current state of the art in Table 5. Furthermore, the objects were classified using a pre-trained COCO model for object grounding and knowledge representation. The knowledge is represented using both a conceptual graph and a semantic network: the conceptual graph is used for further action selection, and the semantic network becomes part of the semantic memory.
Table 5. State-of-the-art comparison with proposed affordance learning.

5.2. Lingual Perception and Object Grounding

This section presents the object grounding results and the formation of the conceptual graph and semantic network. To illustrate the formulation of the conceptual graph, one of the previously discussed video frames is used (Figure 9a). In this phase, the information extracted from the video frames is used to determine the affordance of an object and its position with respect to the center of the frame and to the other objects. This information is further transformed, using the COCO model, into sentences such as “The cellphone is located at the bottom left side of the table” (see Figure 11). For this purpose, two types of graphs were generated, i.e., a conceptual graph (CG) and a semantic network (SN). A CG is generated separately for each instance in the frame. The CG is composed of two node types, i.e., conceptNodes (cellphone, located, side, table, bottom, left) and relationNodes (object, at, attr, of), whereas an empty relation is represented as “Link” (see Figure 11a). This type of graph helps the agent to check the dependency factor. In the case of “The frog is located on the left side of the table”, the nodes are slightly different, i.e., conceptNodes (frog, located, side, table, left) and relationNodes (agent, at, attr, of) (see Figure 11c). The two examples each have one distinct relationNode: “cellphone” is linked through an “object” relation, whereas “frog” is linked through an “agent” relation. This information helps the agent understand the nature of the item, its role, and its placement (a construction sketch follows Figure 11).
Figure 11. Conceptual graphs.
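The sketch below builds a conceptual graph of the kind shown in Figure 11a with networkx. The conceptNode/relationNode typing and the “Link” convention follow the description above, while the triples are hard-coded here for brevity; the parsing that would produce them, and the exact edge layout of the figure, are not reproduced.

```python
import networkx as nx

def conceptual_graph(triples):
    """Build a bipartite conceptual graph from (concept, relation, concept) triples.

    Empty relations are represented by the generic "Link" relationNode."""
    g = nx.DiGraph()
    for head, relation, tail in triples:
        rel = relation or "Link"
        rel_id = f"{rel}:{head}->{tail}"            # keep relation instances distinct
        g.add_node(head, kind="conceptNode")
        g.add_node(tail, kind="conceptNode")
        g.add_node(rel_id, kind="relationNode", label=rel)
        g.add_edge(head, rel_id)
        g.add_edge(rel_id, tail)
    return g

# "The cellphone is located at the bottom left side of the table" (illustrative triples)
cg = conceptual_graph([
    ("cellphone", "object", "located"),
    ("located", "at", "side"),
    ("side", "attr", "bottom"),
    ("side", "attr", "left"),
    ("side", "of", "table"),
])
print(cg.number_of_nodes(), cg.number_of_edges())
```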
The semantic network (SN) for a video frame (Figure 9a) is constructed and stored in the semantic memory for future processing (see Figure 12). The SN is composed of ConceptNodes and relationships in the form of edges. The “LocatedAt” and “LocatedOn” edges indicate the path towards the position of the item, whereas “NEXT” is an empty relationship that points towards the succeeding node(s).
Figure 12. Semantic network.
If an item in the frame is not recognized by its affordance, e.g., the “Rubik’s cube” (playable) and the “pen” (writable) in Figure 9a, then the agent is not able to ground the position and direction of that item. The grounding is formulated after affordance recognition, first in the form of sentences and then as a conceptual graph (see Figure 11) and a semantic network (see Figure 12).

5.3. Cognitive Cycle Identifier

This section presents the results of the Bloom-based Cognitive Cycle Identifier. In this phase, the verbal sensory stimuli are analyzed for action selection, e.g., “How many items are present on the table?”, “How many objects belong to a drinkable category?”, “Which object is used to reduce the hunger?”, and “Which item is used to reduce the intensity of thirst?”. The action verbs are further assessed to identify the “cognitive domain”, as described in Bloom’s revised taxonomy [12]. After the identification of the “cognitive domain”, the agent chooses its actions as “Blob Detection and Counting”, “Affordance Recognition”, or “Jaccard Semantic Similarity” (see Figure 13, Figure 14, Figure 15, Figure 16, Figure 17 and Figure 18). The results shown here are encouraging and represent an important step towards advancing perceptual semantics in cognitive robots.
Figure 13. (a) Original frame, (b) blob detection and counting for query, (c) query.
Figure 14. (a) Original frame, (b) object affordance results for query, (c) query.
Figure 15. (a) Similarity score, (b) semantic network for query, (c) query.
Figure 16. (a) Similarity score, (b) semantic network for query, (c) query.
Figure 17. (a) Similarity score, (b) semantic network for query, (c) query.
Figure 18. (a) Similarity score, (b) semantic network for query, (c) query.
The Universal Robot (UR5)-based demonstrations can be accessed through Table 6.
Table 6. Links to demonstration videos.

6. Conclusions

In this work, we proposed perceptual and semantic processing for human–robot interaction in an agent. The contributions of the proposed work are the extension of affordance learning, the use of Bloom’s taxonomy as a cognitive cycle, object grounding, and perceptual semantics. The experiments were conducted on the agent using 160 video frames with household objects in a table-top scenario and human cues that contained implicit instructions. The results suggest that the overall HRI experience was improved by the proposed method and that the agent was able to address implicit lingual cues (see Table 6).

Author Contributions

Conceptualization, W.M.Q.; Methodology, S.T.S.B. and W.M.Q.; Supervision, W.M.Q.; Validation, S.T.S.B.; Writing—original draft, S.T.S.B.; Writing—review & editing, W.M.Q. Both authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Services Syndicate Private Limited, which provided access to the Universal Robot (UR5) for experimentation.

Acknowledgments

The authors acknowledge the support of Services Syndicate Private Limited for providing access to the Universal Robot (UR5) for experimentation.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Layer configurations.
Layer | Type | Filters | Concatenation | Size/Stride (dil.) | Output
0 | Convolutional | conv 32 | | 3 × 3 / 1 | 608 × 608 × 32
1 | Convolutional | conv 64 | | 3 × 3 / 2 | 304 × 304 × 64
2 | Convolutional | conv 32 | | 1 × 1 / 1 | 304 × 304 × 32
3 | Convolutional | conv 64 | | 3 × 3 / 1 | 304 × 304 × 64
4 | Residual (Shortcut Layer) | | | | 304 × 304 × 64
5 | Convolutional | conv 128 | | 3 × 3 / 2 | 152 × 152 × 128
2 × | Convolutional | conv 64 | | 1 × 1 / 1 | 152 × 152 × 64
 | Convolutional | conv 128 | | 3 × 3 / 1 | 152 × 152 × 128
11 | Residual (Shortcut Layer) | | | | 152 × 152 × 128
12 | Convolutional | conv 256 | | 3 × 3 / 2 | 76 × 76 × 256
8 × | Convolutional | conv 128 | | 1 × 1 / 1 | 76 × 76 × 128
 | Convolutional | conv 256 | | 3 × 3 / 1 | 76 × 76 × 256
36 | Residual (Shortcut Layer) | | | | 76 × 76 × 256
37 | Convolutional | conv 512 | | 3 × 3 / 2 | 38 × 38 × 512
8 × | Convolutional | conv 256 | | 1 × 1 / 1 | 38 × 38 × 256
 | Convolutional | conv 512 | | 3 × 3 / 1 | 38 × 38 × 512
61 | Residual (Shortcut Layer) | | | | 38 × 38 × 512
62 | Convolutional | conv 1024 | | 3 × 3 / 2 | 19 × 19 × 1024
4 × | Convolutional | conv 512 | | 1 × 1 / 1 | 19 × 19 × 512
 | Convolutional | conv 1024 | | 3 × 3 / 1 | 19 × 19 × 1024
74 | Residual (Shortcut Layer) | | | | 19 × 19 × 1024
3 × | Convolutional | conv 512 | | 1 × 1 / 1 | 19 × 19 × 512
80 | Convolutional | conv 1024 | | 3 × 3 / 1 | 19 × 19 × 1024
81 | Convolutional | conv 39 | | 1 × 1 / 1 | 19 × 19 × 39
82 | Detection | yolo | | |
83 | Route | | 79 | |
84 | Convolutional | conv 256 | | 1 × 1 / 1 | 19 × 19 × 256
85 | Upsampling | upsample | | 2 × | 38 × 38 × 256
86 | Route | | 85, 61 | | 38 × 38 × 768
3 × | Convolutional | conv 256 | | 1 × 1 / 1 | 38 × 38 × 256
92 | Convolutional | conv 512 | | 3 × 3 / 1 | 38 × 38 × 512
93 | Convolutional | conv 39 | | 1 × 1 / 1 | 38 × 38 × 39
94 | Detection | yolo | | |
95 | Route | | 91 | |
96 | Convolutional | conv 128 | | 1 × 1 / 1 | 38 × 38 × 128
97 | Upsampling | upsample | | 2 × | 76 × 76 × 128
98 | Route | | 97, 36 | | 76 × 76 × 384
3 × | Convolutional | conv 128 | | 1 × 1 / 1 | 76 × 76 × 128
104 | Convolutional | conv 256 | | 3 × 3 / 1 | 76 × 76 × 256
105 | Convolutional | conv 39 | | 1 × 1 / 1 | 76 × 76 × 39
106 | Detection | yolo | | |
Figure A1. Affordance training iterations graph.

References

  1. Dubba, K.S.R.; Oliveira, M.R.d.; Lim, G.H.; Kasaei, H.; Lopes, L.S.; Tome, A.; Cohn, A.G. Grounding Language in Perception for Scene Conceptualization in Autonomous Robots. In Proceedings of the AAAI 2014 Spring Symposium, Palo Alto, CA, USA, 24–26 March 2014. [Google Scholar]
  2. Kotseruba, I.; Tsotsos, J.K. 40 years of cognitive architectures: Core cognitive abilities and practical applications. Artif. Intell. Rev. 2020, 53, 17–94. [Google Scholar] [CrossRef] [Green Version]
  3. Oliveira, M.; Lopes, L.S.; Lim, G.H.; Kasaei, S.H.; Tomé, A.M.; Chauhan, A. 3D object perception and perceptual learning in the RACE project. Robot. Auton. Syst. 2016, 75, 614–626. [Google Scholar] [CrossRef]
  4. Oliveira, M.; Lim, G.H.; Lopes, L.S.; Kasaei, S.H.; Tomé, A.M.; Chauhan, A. A perceptual memory system for grounding semantic representations in intelligent service robots. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2014; pp. 2216–2223. [Google Scholar]
  5. Lopes, M.; Melo, F.S.; Montesano, L. Affordance-based imitation learning in robots. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 30 November 2006; IEEE: New York, NY, USA, 2007; pp. 1015–1021. [Google Scholar]
  6. Mi, J.; Tang, S.; Deng, Z.; Goerner, M.; Zhang, J. Object affordance based multimodal fusion for natural Human-Robot interaction. Cogn. Syst. Res. 2019, 54, 128–137. [Google Scholar] [CrossRef]
  7. Sowa, J.F. The Cognitive Cycle. In Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), Lodz, Poland, 13–16 September 2015; IEEE: New York, NY, USA, 2015; Volume 5, pp. 11–16. [Google Scholar]
  8. McCall, R.J. Fundamental Motivation and Perception for a Systems-Level Cognitive Architecture. Ph.D. Thesis, The University of Memphis, Memphis, TN, USA, 2014. [Google Scholar]
  9. Paraense, A.L.; Raizer, K.; de Paula, S.M.; Rohmer, E.; Gudwin, R.R. The cognitive systems toolkit and the CST reference cognitive architecture. Biol. Inspired Cogn. Archit. 2016, 17, 32–48. [Google Scholar] [CrossRef]
  10. Blanco, B.; Fajardo, J.O.; Liberal, F. Design of Cognitive Cycles in 5G Networks. In Collaboration in A Hyperconnected World; Springer Science and Business Media LLC: London, UK, 2016; pp. 697–708. [Google Scholar]
  11. Madl, T.; Baars, B.J.; Franklin, S. The Timing of the Cognitive Cycle. PLoS ONE 2011, 6, e14803. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Krathwohl, D.R. A Revision of Bloom’s Taxonomy: An Overview. Theory Pract. 2002, 41, 213–264. [Google Scholar]
  13. Qazi, W.M.; Bukhari, S.T.S.; Ware, J.A.; Athar, A. NiHA: A Conscious Agent. In Proceedings of the COGNITIVE 2018, The Tenth International Conference on Advanced Cognitive Technologies and Applications, Barcelona, Spain, 18–22 February 2018; pp. 78–87. [Google Scholar]
  14. Marques, H.G. Architectures for Embodied Imagination. Neurocomputing 2009, 72, 743–759. [Google Scholar] [CrossRef]
  15. Samsonovich, A.V. On a roadmap for the BICA Challenge. Biol. Inspired Cogn. Archit. 2012, 1, 100–107. [Google Scholar] [CrossRef]
  16. Breux, Y.; Druon, S.; Zapata, R. From Perception to Semantics: An Environment Representation Model Based on Human-Robot Interactions. In Proceedings of the 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Nanjing and Tai’an, China, 27–31 August 2018; IEEE: New York, NY, USA, 2018; pp. 672–677. [Google Scholar] [CrossRef]
  17. Bornstein, M.H.; Gibson, J.J. The Ecological Approach to Visual Perception. J. Aesthet. Art Crit. 1980, 39, 203. [Google Scholar] [CrossRef]
  18. Cruz, F.; Magg, S.; Weber, C.; Wermter, S. Training Agents With Interactive Reinforcement Learning and Contextual Affordances. IEEE Trans. Cogn. Dev. Syst. 2016, 8, 271–284. [Google Scholar] [CrossRef]
  19. Min, H.; Yi, C.; Luo, R.; Zhu, J.; Bi, S. Affordance Research in Developmental Robotics: A Survey. IEEE Trans. Cogn. Dev. Syst. 2016, 8, 237–255. [Google Scholar] [CrossRef]
  20. Kjellström, H.; Romero, J.; Kragić, D. Visual object-action recognition: Inferring object affordances from human demonstration. Comput. Vis. Image Underst. 2011, 115, 81–90. [Google Scholar] [CrossRef]
  21. Thomaz, A.L.; Cakmak, M. Learning about objects with human teachers. In Proceedings of the 2009 4th ACM/IEEE International Conference on Human-Robot Interaction (HRI), San Diego, CA, USA, 11–13 March 2009; IEEE: New York, NY, USA, 2009; pp. 15–22. [Google Scholar]
  22. Wang, C.; Hindriks, K.V.; Babuška, R. Robot learning and use of affordances in goal-directed tasks. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2013; pp. 2288–2294. [Google Scholar]
  23. Nguyen, A.; Kanoulas, D.; Muratore, L.; Caldwell, D.G.; Tsagarakis, N.G. Translating Videos to Commands for Robotic Manipulation with Deep Recurrent Neural Networks. 2017. Available online: https://www.researchgate.net/publication/320180040_Translating_Videos_to_Commands_for_Robotic_Manipulation_with_Deep_Recurrent_Neural_Networks (accessed on 17 September 2018).
  24. Myers, A.; Teo, C.L.; Fermuller, C.; Aloimonos, Y. Affordance detection of tool parts from geometric features. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; IEEE: New York, NY, USA, 2015; pp. 1374–1381. [Google Scholar]
  25. Moldovan, B.; Raedt, L.D. Occluded object search by relational affordances. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 7 June 2014; IEEE: New York, NY, USA, 2014; pp. 169–174. [Google Scholar]
  26. Nguyen, A.; Kanoulas, D.; Caldwell, D.G.; Tsagarakis, N.G. Object-based affordances detection with Convolutional Neural Networks and dense Conditional Random Fields. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; pp. 5908–5915. [Google Scholar]
  27. Antunes, A.; Jamone, L.; Saponaro, G.; Bernardino, A.; Ventura, R. From human instructions to robot actions: Formulation of goals, affordances and probabilistic planning. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA); Institute of Electrical and Electronics Engineers (IEEE), Stockholm, Sweden, 16–21 May 2016; IEEE: New York, NY, USA, 2016; pp. 5449–5454. [Google Scholar]
  28. Tenorth, M.; Beetz, M. Representations for robot knowledge in the KnowRob framework. Artif. Intell. 2017, 247, 151–169. [Google Scholar] [CrossRef]
  29. Roy, D.; Hsiao, K.-Y.; Mavridis, N. Mental Imagery for a Conversational Robot. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2004, 34, 1374–1383. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  30. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson Education, Inc.: Upper Saddle River, NJ, USA, 1994. [Google Scholar]
  31. Madl, T.; Franklin, S.; Chen, K.; Trappl, R. A computational cognitive framework of spatial memory in brains and robots. Cogn. Syst. Res. 2018, 47, 147–172. [Google Scholar] [CrossRef] [Green Version]
  32. Shaw, D.B. Robots as Art and Automation. Sci. Cult. 2018, 27, 283–295. [Google Scholar] [CrossRef]
  33. Victores, J.G. Robot Imagination System; Universidad Carlos III de Madrid: Madrid, Spain, 2014. [Google Scholar]
  34. Diana, M.; De La Croix, J.-P.; Egerstedt, M. Deformable-medium affordances for interacting with multi-robot systems. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; Institute of Electrical and Electronics Engineers (IEEE): New York, NY, USA, 2013; pp. 5252–5257. [Google Scholar]
  35. Fallon, M.; Kuindersma, S.; Karumanchi, S.; Antone, M.; Schneider, T.; Dai, H.; D’Arpino, C.P.; Deits, R.; DiCicco, M.; Fourie, D.; et al. An Architecture for Online Affordance-based Perception and Whole-body Planning. J. Field Robot. 2014, 32, 229–254. [Google Scholar] [CrossRef] [Green Version]
  36. Sun, Y.; Ren, S.; Lin, Y. Object–object interaction affordance learning. Robot. Auton. Syst. 2014, 62, 487–496. [Google Scholar] [CrossRef]
  37. Hart, S.; Dinh, P.; Hambuchen, K. The Affordance Template ROS package for robot task programming. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 26 2015; IEEE: New York, NY, USA, 2015; pp. 6227–6234. [Google Scholar]
  38. Gago, J.J.; Victores, J.G.; Balaguer, C. Sign Language Representation by TEO Humanoid Robot: End-User Interest, Comprehension and Satisfaction. Electronics 2019, 8, 57. [Google Scholar] [CrossRef] [Green Version]
  39. Pandey, A.K.; Alami, R. Affordance graph: A framework to encode perspective taking and effort based affordances for day-to-day human-robot interaction. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems; Institute of Electrical and Electronics Engineers (IEEE), Tokyo, Japan, 3–7 November 2013; IEEE: New York, NY, USA, 2013; pp. 2180–2187. [Google Scholar]
  40. Bukhari, S.T.S.; Qazi, W.M.; Intelligent Machines & Robotics Group, COMSATS University Islamabad, Lahore Campus. Affordance Dataset. 2019. Available online: https://github.com/stsbukhari/Dataset-Affordance (accessed on 8 September 2021).
  41. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
  42. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  43. Taylor, A.; Marcus, M.; Santorini, B. The Penn Treebank: An Overview. Treebanks 2003, 20, 5–22. [Google Scholar]
  44. Yahya, A.A.; Osman, A.; Taleb, A.; Alattab, A.A. Analyzing the Cognitive Level of Classroom Questions Using Machine Learning Techniques. Procedia-Soc. Behav. Sci. 2013, 97, 587–595. [Google Scholar] [CrossRef] [Green Version]
  45. Sowa, J.F. Semantic Networks. In Encyclopedia of Cognitive Science; American Cancer Society: Chicago, IL, USA, 2006. [Google Scholar]
  46. Sowa, J.F. Conceptual graphs as a universal knowledge representation. Comput. Math. Appl. 1992, 23, 75–93. [Google Scholar] [CrossRef] [Green Version]
  47. Do, T.-T.; Nguyen, A.; Reid, I. AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; IEEE: New York, NY, USA, 2018; pp. 1–5. [Google Scholar]
  48. Myers, A. From Form to Function: Detecting the Affordance of Tool Parts using Geometric Features and Material Cues. Ph.D. Thesis, University of Maryland, College Park, MD, USA, 2016. [Google Scholar]
  49. Jiang, Y.; Koppula, H.; Saxena, A. Hallucinated Humans as the Hidden Context for Labeling 3D Scenes. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; IEEE: New York, NY, USA, 2013; pp. 2993–3000. [Google Scholar]
  50. Koppula, H.S.; Jain, A.; Saxena, A. Anticipatory Planning for Human-Robot Teams. In Experimental Robotics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 453–470. [Google Scholar]
  51. Baleia, J.; Santana, P.; Barata, J. On Exploiting Haptic Cues for Self-Supervised Learning of Depth-Based Robot Navigation Affordances. J. Intell. Robot. Syst. 2015, 80, 455–474. [Google Scholar] [CrossRef] [Green Version]
  52. Chu, F.-J.; Xu, R.; Vela, P.A. Learning Affordance Segmentation for Real-World Robotic Manipulation via Synthetic Images. IEEE Robot. Autom. Lett. 2019, 4, 1140–1147. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
