Perceptual and Semantic Processing in Cognitive Robots

Abstract: The challenge in human–robot interaction is to build an agent that can act upon implicit human statements, i.e., an agent instructed to execute tasks without explicit utterances. Understanding what to do in such scenarios requires the agent to perform object grounding and affordance learning from acquired knowledge. Affordance is the driving force for agents to construct relationships between objects, their effects, and actions, whereas grounding is effective for understanding the spatial map of objects present in the environment. The main contribution of this paper is a methodology for the extension of object affordance and grounding, a Bloom-based cognitive cycle, and the formulation of perceptual semantics for context-based human–robot interaction. In this study, we implemented YOLOv3 to formulate visual perception and an LSTM to identify the level of the cognitive cycle, as cognitive processes synchronized in the cognitive cycle. In addition, we used semantic networks and conceptual graphs to represent knowledge in the various dimensions related to the cognitive cycle. The visual perception showed an average precision of 0.78, an average recall of 0.87, and an average F1 score of 0.80, indicating an improvement in the generation of semantic networks and conceptual graphs. The similarity index used for the lingual and visual association showed promising results and improved the overall experience of human–robot interaction.


Introduction
This paper proposes an affordance- and grounding-based approach for the formation of perceptual semantics in robots for human-robot interaction (HRI). Perceptual semantics play a vital role in ensuring a robot understands its environment and the implications of its actions [1,2]. The challenge is to build robots with the ability to process their acquired knowledge into perception and affordance [1][2][3][4][5][6]. In this context, the significance of affordance can be rationalized by the following scenario taken from human-human interaction (HHI): if we state to another human "X" that "I am feeling thirsty" rather than "I want to drink beverage using a red bottle", the human "X" will be able to understand the relationship between "drink" and "thirst". The link between these two terms is "thirst causes the desire to drink". This ability to establish a relationship between "drink" and "thirst" based on semantic analysis is called affordance. Consequently, the human "X" will offer something to drink. Let us assume we have a robot with the ability to process affordance and a similar situation arises in human-robot interaction; it is then expected that the robot may perform a similar action to the human "X" in HHI. For robots, this type of interaction is currently a challenge, although there are various contributions in this direction [5,6] with a focus on visual affordance. In the scenario presented above, visual affordance alone is not sufficient for understanding the relationship between "thirst" and "drink". The robot also needs to ground the objects placed on the table. Object grounding is an approach that allows the robot to profile objects in the environment [6]; e.g., the question "How many objects belong to a drinkable category?" will be answered with a response giving the position of the object: "There is one drinkable object located at the left side of the table".
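To make the "thirst"-to-"drink" reasoning concrete, the following is a minimal sketch of an affordance lookup over a hand-made semantic network. The network contents, the relation names, and the `affordance_for` helper are invented for illustration and are not the paper's implementation, which draws on a far larger ConceptNet/WordNet-based memory.

```python
from collections import deque

# Toy semantic network; edges are (relation, target) pairs. Purely
# illustrative -- the paper's memory is built from ConceptNet/WordNet.
NETWORK = {
    "thirst": [("CausesDesire", "drink")],
    "drink": [("RequiresObject", "drinkable")],
    "bottle": [("HasAffordance", "drinkable")],
    "cup": [("HasAffordance", "drinkable")],
}

def affordance_for(state, network=NETWORK):
    """Follow relations from an internal state (e.g., 'thirst') to an
    affordance category the robot can then ground visually."""
    seen, queue = set(), deque([state])
    while queue:
        node = queue.popleft()
        if node in {"drinkable", "edible"}:  # known affordance classes
            return node
        for _, target in network.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return None
```

Given such a lookup, an utterance like "I am feeling thirsty" can be resolved to the drinkable category, which object grounding then maps to a location on the table.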
This challenge becomes even more complex when it is implemented for cognitive robots, because their design rationale considers factors such as internal regularization, control, and synchronization of autonomous processes through a cognitive cycle (understanding, attending, and acting) [7][8][9][10][11]. A reference cognitive cycle may consist of a variant of the phases of perception, understanding, and action [7,11]. In this study, we adopted an extended version [12] of Bloom's taxonomy as a cognitive cycle. The reason for using the Bloom-based cognitive cycle is that it provides a map between the level of cognitive processes and the type of knowledge domain. The detailed Bloom taxonomy map is provided in a previous paper [12]. In addition, Bloom uses the action verbs to steer the cognitive-knowledge-domain map [12]. The control structure used in this study is shown in Figure 1, which is an extract from our previously reported work [13]. The detailed utilization of the control structure in Figure 1 is discussed in Section 4.
In this study, we proposed perceptual semantics based on extended-object grounding and machine perception (visual and lingual). In this regard, we performed a table-top experiment using a Universal Robot (UR5). A dataset for affordance learning comprising 7622 images (Section 4.1) was prepared for the training of a YOLOv3-based perception module (Section 4.1). A Bloom-based cognitive cycle identifier was also implemented for the identification of cognitive levels (see Section 4.2). The semantic memory was constructed from ConceptNet and WordNet having 1.47 million nodes and 3.13 million relationships (see Section 4.3). Our analysis of the experimental data/results (see Section 5) suggests that perceptual learning alone is not sufficient to access the environment; the inclusion of seed knowledge is important to understand the extended affordance features (i.e., the relationship between "drink" and "thirst"). Moreover, the inclusion of a cognitive cycle identifier helps the robot to choose between "what to reply", "what not to reply", "when to reply", and "what would be the procedure". The work reported in this paper is an effort to contribute to the advancement of building robots with a better understanding of the environment.

Related Work
There is a growing need for robots and other intelligent agents to have safe interactions with partners, mainly human beings. In this regard, the need for perceptual semantics formulated using affordance learning and object grounding is vital for human-robot interaction (HRI) [14][15][16].
Affordance is considered to be the catalyst in establishing a relationship between accessible objects, their effects, and actions carried out by robots [17,18]. Affordance capability can be induced in an agent through interaction, demonstration, annotation, heuristics, and trials [19]. Most of the work undertaken in object affordance learning is based on visual perception [3,[20][21][22][23][24][25][26], whereas lingual cues can also provide additional advantages that significantly improve affordance [19]. Therefore, in this study, we focused on both visual and lingual cues. In addition to visual and lingual cues, Breux et al. [16] considered ontologies based on WordNet to extract action cues and ground the relationships between objects and features (properties). This improved the results and HRI but covered only seven types of relationships (isA, hasA, prop, usedFor, on, linked-to, and homonym), which limits the agent's recognition and understanding capabilities to the stated semantic associations. Implementations of semantic memory have also been reported in the literature [3,27,28]. Antunes et al. [27] reported the use of semantic memory for HRI and discussed a "make a sandwich" scenario with explicit information about objects and their actions. This system [27] does not have the capability to cater to situations such as "I am feeling hungry", in which the robot must infer the need to make the sandwich; the semantic memory there acts more like a knowledge repository.
Object grounding based on either lingual or visual perception is used to profile objects in the environment [6,29]. The grounding is mapped in terms of exact and relative location(s) of the object(s), i.e., "left-of" [16]. Oliveira et al. [3,4] discussed the importance of semantic memory for HRI and interaction-based acquisition of knowledge. That system uses object grounding without incorporating object affordance; therefore, it is unable to process feature-based lingual cues. The object grounding techniques used in this paper are similar to those introduced in the Robustness by Autonomous Competence Enhancement (RACE) project [1]. The table-top setup is represented as a grid having nine positions, i.e., center, right, left, top, bottom, bottom-left, bottom-right, top-right, and top-left (see Section 4.4).
An agent by design has a control structure that can be as simple as a sense-act loop [30] or as complex as a cognitive architecture [8,13,31]. These control structures may be a collection of memories, learning mechanisms, and other mental faculties depending on the architectural complexity [8,13]. Systems with semantic memories are those that fall within the domain of cognitive architecture [8,13,31]. The processes in these control structures are regularized, controlled, and synchronized through a cognitive cycle. A reference cognitive cycle consists of perception, understanding, and action [7][8][9][10]. A limitation of these cognitive cycles is that they do not provide explicit guidelines to map the degree of processing onto the various knowledge levels and dimensions [7][8][9][10]. This poses a challenge for cognitive agents in selecting the appropriate cognitive processes and knowledge dimensions from lingual cues. Bloom's taxonomy provides a method to map lingual cues to cognitive levels and knowledge dimensions [12], but it has not yet been used as a cognitive cycle.
The analysis of the current work suggests that significant improvement in the state-of-the-art can be made by increasing the number of semantic relationships, combining supervised and heuristic approaches for acquiring affordance, and formulating object grounding. We propose a semantic memory consisting of 53 types of relationships, with 1.47 million nodes and 3.13 million relationships, to enhance the benefits of affordance learning (see Section 4.3).
We incorporated Bloom's taxonomy into a standard cognitive cycle to identify the cognitive process and knowledge dimensionality based on the identification of action verbs from lingual cues (see Section 4.2). A detailed comparison of the state-of-the-art with the proposed method is summarized in Table 1.

Problem Formulation
This study proposes a method for human-robot interaction (HRI). For this purpose, constructing the semantic memory S_m for an agent from atoms of knowledge (Atom) is an essential first step. Atoms of knowledge are generated from visual and auditory sensory stimuli. The human-centered environment consists of household items (objects) present on the table.
Let the items that exist in the workspace be represented as I = {i_1, i_2, i_3, ..., i_n}, and the properties of an item as I_p = {name, affordance, location, direction}. The affordance of an item is defined as I_p^affordance = {callable, cleanable, drinkable, edible, playable, readable, writable}; the location I_p^location : item → R × R gives the item's coordinates with respect to the visual frame; and the direction I_p^direction expresses the item's position relative to the center of the visual frame. Let the auditory stream consist of m words W = {N, A_dj, V}. A word can be recognized as a noun node N = {NN, NNS, NNP, NNPS}, an adjective node A_dj = {JJ, JJR, JJS}, or a verb node V = {VB, VBD, VBP, VBN, VBG, VBZ}. The nouns, adjectives, and verbs are checked against an a priori knowledge base of concepts and features. The Bloom cognitive map is represented as a two-dimensional matrix BCog_matrix = Cog_level × Knowledge_Dimension, and the selected cognitive cycle is an instance of this matrix, BCog_cycle = (Cog_level_i, Knowledge_Dimension_i).
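The partition of tagged words into the sets N, A_dj, and V follows directly from the Penn Treebank tag sets above; a minimal sketch is shown below, where the `partition_words` helper is illustrative rather than the paper's actual NLP module.

```python
# Penn Treebank tag sets from the problem formulation.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS  = {"JJ", "JJR", "JJS"}
VERB_TAGS = {"VB", "VBD", "VBP", "VBN", "VBG", "VBZ"}

def partition_words(tagged):
    """Split (word, POS-tag) pairs into the noun, adjective, and verb
    sets N, A_dj, and V used downstream for triplet extraction."""
    n, adj, v = [], [], []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            n.append(word)
        elif tag in ADJ_TAGS:
            adj.append(word)
        elif tag in VERB_TAGS:
            v.append(word)
    return n, adj, v
```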

Methodology
This section explains the methodology for the development of the artifacts highlighted in the problem formulation. These artifacts include perception (i.e., visual and lingual), working memory (i.e., object grounding and semantic similarity analysis), and the construction of semantic memory (seed knowledge and explicit knowledge). The lingual perception is further divided into knowledge representation, a cognitive cycle identifier, and a natural language processing module. The core architecture is depicted in Figure 2.

Visual Perception
The visual perception module is based on multiple levels. The first level is based on affordance learning and the next is based on item name identification.
Affordance Learning: The affordance module is trained on a dataset [40] consisting of objects commonly used in the household. The 30 items chosen to date can be categorized as callable, cleanable, drinkable, edible, playable, readable, writable, and wearable [6]. A total of 8538 images were taken with a Samsung Galaxy 7 camera. The system (see Figure 2) was trained to recognize seven classes, i.e., callable, cleanable, drinkable, edible, playable, readable, and writable. The total number of images used for training purposes was 7622, excluding the wearable category. The system was trained using YOLOv3 [41] to identify the items placed in the table-top setup, with 18,000 iterations and an average loss of 0.1176 (see Appendix A, Figure A1). The architecture of YOLOv3 with its configuration is shown in Figure 3. The detailed configuration of the training pipeline is presented in Table A1 in Appendix A.

Item Name Identification: The items classified based on affordance learning are further assigned names, i.e., Drinkable as Bottle or Cup. For this purpose, a pre-trained YOLOv3 classifier [42] was used to identify the names of commonly used household items. This module uses the position of the image determined by the YOLOv3 classifier to localize the detected object in the table-top setup. The system returns the item set as I = {i_1, i_2, i_3, ..., i_n}, and the properties of the items are classified as I_p = {name, affordance, location, direction}.

Lingual Perception
We developed a rule-based chatbot for the acquisition of perceptual semantics from the auditory stream. The co-worker (i.e., human) communicates with the robot through a speech engine based on the Google Speech-to-Text API. The stream is then sent to the Natural Language Processing module for tokenization, part-of-speech tagging, named-entity tagging, and basic dependency tagging. Details about the POS tags can be found in [43]. Furthermore, CoreNLP is used to identify BD and NE tags for the formulation of atom-of-knowledge elements (concepts, relations, and features).

Knowledge Representation: The Knowledge Representation module consists of Triplet Extraction, Semantic Association, and Atom of Knowledge (see Figure 5). Knowledge is constructed after the NLP module has processed the stream. The module extracts triplets (i.e., predicate, object, and subject) from the processed sentences. The predicate is extracted from the previously processed verb set V, and the subject from the noun set N. The last element of the triplet is assigned from the last member of the adjective set A_dj or, failing that, the noun set. The association between concepts is created using an a priori knowledge base by searching the concept nodes for similarities based on relationship types such as "InstanceOf", "IsA", "PartOf", "DerivedFrom", "Synonym", "CreatedBy", "HasProperty", "UsedFor", "HasA", "FormOf", and "RelatedTo". Based on these associations, the atom of knowledge is constructed.
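The triplet-extraction rule described above can be sketched as follows; the `extract_triplet` helper and its exact fallback behavior are assumptions for illustration, not the paper's implementation.

```python
def extract_triplet(nouns, adjectives, verbs):
    """Form a (predicate, subject, attribute) triplet: the predicate
    comes from the verb set V, the subject from the noun set N, and the
    last element from the last adjective in A_dj (or the last noun when
    no adjective is present)."""
    predicate = verbs[0] if verbs else None
    subject = nouns[0] if nouns else None
    attribute = adjectives[-1] if adjectives else (nouns[-1] if nouns else None)
    return predicate, subject, attribute
```

For "I want to drink beverage using a red bottle", the sets ({"bottle"}, {"red"}, {"drink"}) would yield the triplet (drink, bottle, red), which is then associated with concept nodes in the knowledge base.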
Cognitive Cycle Identifier: The cognitive levels are the dataset classes. The stream is tokenized and parsed using the Natural Language Toolkit (NLTK). The knowledge domain is further classified based on the action verbs of Bloom's revised taxonomy [12] to determine the instance of BCog_matrix. The instance then initiates the designated cognitive process applicable to the potential knowledge dimension and action.
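A keyword-based stand-in for this classification step is sketched below. The verb lists here are a small assumed subset of Bloom's revised-taxonomy action verbs; the paper itself uses the full verb map together with an LSTM classifier.

```python
# A few action verbs per Bloom level -- an assumed subset for
# illustration; the paper uses the full revised-taxonomy verb lists.
BLOOM_VERBS = {
    "remember": {"count", "list", "name", "locate"},
    "understand": {"identify", "classify", "describe"},
    "apply": {"use", "execute", "reduce"},
}

def cognitive_level(utterance):
    """Return the first Bloom level whose action verb appears in the
    lingual cue, or None when no action verb matches."""
    tokens = set(utterance.lower().split())
    for level, verbs in BLOOM_VERBS.items():
        if verbs & tokens:
            return level
    return None
```

The identified level then selects the row of BCog_matrix, i.e., which cognitive process and knowledge dimension to activate.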


Semantic Memory
Semantic Memory is constituted in an a priori and an a posteriori manner. The seed knowledge is developed from ConceptNet and WordNet, whereas the posterior knowledge is constructed when the agent interacts with the environment and stored in the Explicit Knowledge repository.
Seed Knowledge: Seed knowledge is constituted from atoms of knowledge drawn from WordNet and ConceptNet. The knowledge base has 1.47 million nodes and 3.13 million relationships (53 relationship types, e.g., AlsoSee, Antonym, AtLocation, Attribute, CapableOf, Cause, Causes, etc.). The nodes consist of 117,659 Synset nodes (WordNet), 1.33 million Concept nodes (ConceptNet), and 157,300 Lemma nodes. The Lemma nodes are extracted from Concept nodes based on "root words" and are partially or fully matchable with Synset nodes. The semantic memory-based seed (tacit) knowledge is represented as S_m = {Atom_1, Atom_2, Atom_3, ..., Atom_k}, and atoms as Atom = {(concept, concept), (concept, feature)}. The transformation of the ConceptNet and WordNet ontologies into the proposed seed knowledge, i.e., semantic memory, can be seen in Tables 2 and 3.
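One way to picture S_m is as a set of relation triples; the miniature `seed_memory` and the `related` helper below are illustrative assumptions (the actual memory holds roughly 1.47 million nodes and 3.13 million relationships and is stored in a graph database, not Python sets).

```python
# An atom of knowledge as a (head, relation, tail) triple; S_m is then
# simply a set of such atoms. Relation names follow the ConceptNet-style
# types listed above.

def seed_memory():
    """A miniature stand-in for S_m, for illustration only."""
    return {
        ("bottle", "IsA", "container"),
        ("bottle", "UsedFor", "drinking"),
        ("cup", "HasProperty", "drinkable"),
    }

def related(memory, concept):
    """All (relation, other-concept) pairs touching a concept, in
    either direction of the edge."""
    out = set()
    for head, rel, tail in memory:
        if head == concept:
            out.add((rel, tail))
        elif tail == concept:
            out.add((rel, head))
    return out
```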
Explicit Knowledge: Explicit Knowledge is constructed based on the semantic network [45] and conceptual graph [46] drawn from working memory. These graphs are constructed in the Knowledge Representation module.

Working Memory
Working memory acts as an executive control in the proposed system, whose primary responsibility is to formulate object grounding and semantic similarity analysis.
Object Grounding: The localization is further used to determine the accessibility coordinates for the robotic arm. We started with the simplest approach by dividing the table-top setup into a 3 × 3 grid, as shown in Figure 6. The grid cells correspond to the directions defined in I_p^direction. This approach is workable for determining the exact position of an item; however, we also want to know the relative positions of the items. The grid is therefore further described based on a reference point, i.e., the center, as shown in Figure 7. This reference-point consideration is then extended to position each item relative to the others, as shown in Figure 8.
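A minimal sketch of mapping a detection's pixel coordinates to one of the nine grid cells is shown below; the `grid_direction` helper and its equal-thirds partition are assumptions for illustration.

```python
def grid_direction(x, y, width, height):
    """Map a pixel coordinate to one of the nine table-top grid cells
    (center, left, right, top, bottom, and the four corners), assuming
    an equal 3x3 partition of the visual frame."""
    col = min(int(3 * x / width), 2)   # 0 = left, 1 = center, 2 = right
    row = min(int(3 * y / height), 2)  # 0 = top, 1 = center, 2 = bottom
    horiz = ("left", "", "right")[col]
    vert = ("top", "", "bottom")[row]
    if not horiz and not vert:
        return "center"
    return f"{vert}-{horiz}" if horiz and vert else (vert or horiz)
```

Relative positions ("left-of") can then be derived by comparing two items' cells with respect to the center reference point.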

Semantic Analysis:
The semantic similarity between atoms of knowledge constructed from the words W coming from Lingual Perception and atoms of knowledge constructed from I_p^affordance coming from Visual Perception is evaluated. The maximum similarity scores S indicate the best match between the lingual atoms Atom_W and the visual atoms Atom_(I_p^affordance).
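Since the action-selection stage names "Jaccard Semantic Similarity", a plain Jaccard index over the two atoms' term sets is a natural reading; the sketch below is an assumption of that form, not necessarily the paper's exact scoring function.

```python
def jaccard(atom_w, atom_v):
    """Jaccard index |A ∩ B| / |A ∪ B| between the term set of a
    lingual atom and the term set of a visual atom."""
    a, b = set(atom_w), set(atom_v)
    return len(a & b) / len(a | b) if a | b else 0.0
```

The visual atom with the highest score against the lingual atom would then be selected as the grounding target.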

Results
To validate the proposed methodology, we conducted multiple experiments. The experimental results are based on a human collaborator's interaction with the agent. In the first phase, the visual perception analysis is discussed; the subsequent phases cover object grounding, cognitive analysis, and semantic analysis.

Visual Perception
We trained the agent on YOLOv3 and tested it on 160 video frames comprising a collection of 783 different household objects to validate the proposed methods (see Figure 9 for a subset of the video frames). The categorization of objects placed in the table-top scenarios is based on the callable, cleanable, drinkable, edible, playable, readable, and writable affordance classes. Each frame contains an average of nine objects placed in various areas of the table so that their locations can be identified and related to spoken commands. The results are shown as confusion matrices in Figure 10. They indicate that for the cleanable affordance the negative predictions were mostly callable and writable. This happens in the case of the duster (cleanable) and sponge (cleanable) because their shapes are similar to that of a cellphone (callable). In some cases, the yellow sponge was misclassified as a sticky note (writable). Furthermore, the toys (playable) were misclassified as drinkable objects in 12 instances due to geometric similarities. The performance metrics calculated for affordance recognition can be seen in Table 4, which contains the false positive, false negative, true positive, and true negative counts. Based on these parameters, the precision, recall, and F1 score are calculated for all seven affordance classes.
The results indicate that precision is high for the cleanable class but its recall is low, whereas the callable class has high recall but low precision. Moreover, the F1 score of the callable class is the lowest among all the classes. The affordance learning results are compared with the current state-of-the-art in Table 5. Furthermore, the objects were classified using a pre-trained COCO model for object grounding and knowledge representation. The knowledge is represented using both a conceptual graph and a semantic network; the conceptual graph is used for subsequent action selection, and the semantic network becomes part of the semantic memory.
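The per-class metrics in Table 4 follow the standard definitions; a short sketch of the computation from the confusion-matrix counts is given below (the counts used in the test are made up, not taken from Table 4).

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts, as computed per affordance class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```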

Lingual Perception and Object Grounding
This section is based on the object grounding results and the formation of the conceptual graph and semantic network. To illustrate the formulation of the conceptual graph, one of the previously discussed video frames is used (Figure 9a). In this phase, the information extracted from the video frames is used to express the affordance of an object and its position with respect to the center of the frame and to the other objects. This information is further transformed using the COCO model as "The cellphone is located at the bottom left side of the table" (see Figure 11). For this purpose, two types of graphs were generated, i.e., a conceptual graph (CG) and a semantic network (SN). A CG is generated separately for each instance in the frame. The CG is composed of two node types, i.e., conceptNode (cellphone, located, side, table, bottom, left) and relationNode (object, at, attr, of), whereas an empty relation is represented as "Link" (see Figure 11a). This type of graph helps the agent check the dependency factor. In the case of "The frog is located on the left side of the table" the nodes are slightly different, i.e., conceptNode (frog, located, side, table, left) and relationNode (agent, at, attr, of) (see Figure 11c). The two examples have one distinct relationNode each: "cellphone" is represented as an "object", whereas "frog" is represented as an "agent". This information helps in understanding the nature of an item, its role, and its placement. Further examples in Figure 11 include (e) "The minion is located at the central position of the table", (f) "The cloth is located on the right side of the table", and (g) "The marker is located at the bottom-right side of the table".
Figure 11. Conceptual graphs.
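Constructing the conceptNode/relationNode pair for a grounding sentence can be sketched as follows; the `conceptual_graph` helper and its dictionary representation are assumptions for illustration, mirroring the node lists shown in Figure 11.

```python
def conceptual_graph(subject, role, position_words):
    """Build the conceptNode/relationNode lists for a grounding
    sentence such as 'The cellphone is located at the bottom left side
    of the table'; role is 'object' or 'agent', as in Figure 11."""
    concept_nodes = [subject, "located", "side", "table", *position_words]
    relation_nodes = [role, "at", "attr", "of"]
    return {"conceptNode": concept_nodes, "relationNode": relation_nodes}
```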
The semantic network (SN) for a video frame (Figure 9a) is constructed to be stored in the semantic memory for future processing (see Figure 12). SN is composed of the ConceptNode and relationship in the form of edges. The edges for "LocatedAt" and "LocatedOn" indicate the path towards the position of the item, whereas "NEXT" is an empty relationship that points towards the succeeding node(s).
If an item in the frame is not recognized based on affordance (see Figure 9a), e.g., the "Rubik's cube" (as playable) and "pen" (as writable), then the agent is not able to ground the position and direction of that item. The grounding is formulated after affordance recognition, first in the form of sentences and then as a conceptual graph (see Figure 11) and a semantic network (see Figure 12).
Figure 12. Semantic network.

Cognitive Cycle Identifier
This section is based on the results from the Bloom-based Cognitive Cycle Identifier. In this phase, the verbal sensory stimuli are analyzed for action selection, e.g., "How many items are present on the table?", "How many objects belong to a drinkable category?", "Which object is used to reduce the hunger?", and "Which item is used to reduce the intensity of thirst?". The action verbs are then assessed to identify the "cognitive domain", as described in Bloom's revised taxonomy [12]. After the identification of the "cognitive domain", the agent chooses its action from "Blob Detection and Counting", "Affordance Recognition", and "Jaccard Semantic Similarity". The Universal Robot (UR5)-based demonstrations can be accessed through Table 6. The results shown here are encouraging and represent an important step towards advancing perceptual semantics in cognitive robots.


Conclusions
In this work, we proposed perceptual and semantic processing for human-robot interaction in a cognitive agent. The contributions of the proposed work are the extension of affordance learning, Bloom's taxonomy as a cognitive cycle, object grounding, and perceptual semantics. The experiments were conducted on the agent using 160 video frames with household objects in a table-top scenario and human cues that contained implicit instructions. The results suggest that the overall HRI experience was improved by the proposed method and that the agent was able to address implicit lingual cues (see Table 6).
Figure A1. Affordance training iterations graph.