Towards Automated Semantic Explainability of Multimedia Feature Graphs

Multimedia feature graphs are employed to represent features of images, video, audio, or text. Various techniques exist to extract such features from multimedia objects. In this paper, we describe the extension of such a feature graph to represent the meaning of such multimedia features and introduce a formal context-free PS-grammar (Phrase Structure grammar) to automatically generate human-understandable natural language expressions based on such features. To achieve this, we define a semantic extension to syntactic multimedia feature graphs and introduce a set of production rules for phrases of natural language English expressions. This explainability, which is founded on a semantic model, provides the opportunity to represent any multimedia feature in a human-readable and human-understandable form, which largely closes the gap between the technical representation of such features and their semantics. We show how this explainability can be formally defined and demonstrate the corresponding implementation based on our generic multimedia analysis framework. Furthermore, we show how this semantic extension can be employed to increase the effectiveness in precision and recall experiments.


Introduction and Motivation
Bridging the semantic gap has been a research goal for many years. Narrowing down the gap between detected features from multimedia assets (i.e., images, video, audio, text, and social media) and their semantic representation has led to numerous investigations and research in the field of Multimedia Information Retrieval (MMIR) [1]. With the extraction of MMIR supporting features and the integration of these features into data models, internal representations of these features are created. MMIR applications process these representations, e.g., w.r.t. indexing, retrieval, and querying, and employ them to analyze relationships between or within assets. In particular, the topic of querying becomes highly important, as, e.g., every single minute 147,000 photos are uploaded to Facebook, 41.6 million WhatsApp messages are sent, and 347,000 stories are posted on Instagram [2], while users still expect highly accurate query results. Due to higher resolutions of images and video, the Level-Of-Detail (LOD) in multimedia assets has increased significantly. Current professional cameras such as the Sony α7R IV 35 are equipped with a resolution of 61.0 megapixels [3], and smartphones such as the Xiaomi Redmi Note 10 Pro even push that boundary to 108 megapixels [4]. This high LOD is also reflected by other multimedia types, e.g., text, where news agencies maintain huge archives of textual information, enriched by user comments, web information, or social media [5].
The figures given above demonstrate that it has become more and more important to semantically understand MMIR supporting features to increase the precision and recall of quasi-semantic MMIR querying. The Semantic Web [6] and all its related technologies and concepts are a sound basis for knowledge representation, reasoning, inferencing, and truth maintenance. To bridge the gap between detected features and their semantic representation, a machine-readable and formal approach is required, as well as a human-understandable explanation of the corresponding processing steps. However, the requirement of a high LOD, the increasing number of assets, and the demand for fast and accurate semantic querying contradict each other and lead to further challenges in the area of explainability in a human-understandable way.
In this paper, we present a solution for the automated explainability of MMIR processing steps in the form of human-understandable natural language texts based on a semantic modeling, which also supports inferencing and reasoning.

State-of-the-Art and Related Work
This section gives an overview of current techniques and standards for semantic indexing and retrieval and discusses related work. We introduce the Multimedia Feature Graph (MMFG) and the Graph Code concept, also discussing semantic analysis and intelligent information retrieval methods.
The Generic Multimedia Analysis Framework (GMAF) produces a Multimedia Feature Graph (MMFG), which is defined in [7] and represents various integrated multimedia features. Within an application, these MMFGs are typically represented as a collection. For the remainder of this paper, some internal structures, sets, and definitions of the MMFG are relevant (see also [7,22]):
• $FVT_{MMFG} = \{ft_1, \ldots, ft_n\}$ is the feature term vocabulary, i.e., the set of feature term labels, which represent detected features. Elements of $FVT_{MMFG}$ are represented by Nodes (n) in the MMFG graph structure.
• $FVT_{Coll} = \bigcup_{i=1}^{n} FVT_{MMFG_i}$ is the feature term vocabulary of the collection of MMFGs within an MMIR application.
• $FRT_{MMFG} = \{cn, s, sr\}$ is the set of feature relationship types of an MMFG, where cn represents the "child" relationship, s represents the "synonym" relationship, and sr represents the "spatial" relationship between feature vocabulary terms. Elements of $FRT_{MMFG}$ are represented by links between Nodes in the MMFG graph structure.
As the integration of multimedia features within MMFGs produces a much higher level-of-detail (LOD), effective and efficient algorithms are required to process these feature graphs. Hence, we introduced Graph Codes, which are a 2D projection of MMFGs, together with corresponding algorithms that perform calculations based on matrices instead of graph traversal and also support the higher LOD of MMFGs [7]. For these Graph Codes, we introduced a metric for similarity calculation and the mathematical concepts of the indexing and retrieval algorithms, including a detailed evaluation regarding performance, precision, and recall. Figure 1 briefly summarizes this concept as a foundation for subsequent sections of this paper. Figure 1a shows a snippet of an exemplary MMFG visualized in a graph editing tool [23]. Figure 1b illustrates a part of the MMFG in an object diagram including various node and relationship types, which can be represented as a Graph Code table (see Figure 1c) based on the graph's valuation matrix. A Graph Code's matrix representation is shown in Figure 1d, where the correspondence to mathematical matrix calculations is obvious. It is notable for multimedia that, due to current object detection algorithms, MMFGs and their corresponding Graph Codes contain semantic information (e.g., "is a"), spatial information (e.g., "above"), and also temporal information (by the temporal ordering of sub-collections of Graph Codes). To calculate MMIR results based on Graph Codes, a Graph Code Metric is defined, which can be applied for similarity algorithms. In general, every detected feature can be regarded as a multimedia indexing term. The indexing term of any relevant feature thus becomes part of the vocabulary of the overall retrieval index. In MMIR, these terms typically have structural and/or semantic relationships to each other and represent the basis for semantic query construction and result representation.
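The 2D projection described above can be illustrated with a minimal Python sketch that encodes a small feature graph as a matrix indexed by its vocabulary terms. The function name `encode_graph_code`, the integer codes for the relationship types, and the convention of marking present terms on the diagonal are illustrative assumptions; the actual encoding is the one defined in [7].

```python
# Hypothetical integer codes for relationship types (assumption, not taken from [7]).
REL_CODES = {"cn": 2, "s": 3, "sr": 4}
NODE_CODE = 1  # assumed marker for "term is present", placed on the diagonal


def encode_graph_code(nodes, edges):
    """Project a feature graph onto a matrix indexed by its vocabulary terms.

    nodes: list of feature term labels
    edges: list of (source, relationship_type, target) triples
    Returns (dictionary, matrix); matrix[i][j] holds the relationship code
    from term i to term j, or 0 if no relationship exists.
    """
    dictionary = sorted(set(nodes))
    index = {term: i for i, term in enumerate(dictionary)}
    size = len(dictionary)
    matrix = [[0] * size for _ in range(size)]
    for term in dictionary:
        matrix[index[term]][index[term]] = NODE_CODE
    for source, rel_type, target in edges:
        matrix[index[source]][index[target]] = REL_CODES[rel_type]
    return dictionary, matrix


# Exemplary graph: person -> head -> hat, and "hat above head".
dictionary, matrix = encode_graph_code(
    ["person", "head", "hat"],
    [("person", "cn", "head"), ("head", "cn", "hat"), ("hat", "sr", "head")],
)
```

Once the graph is in this form, comparisons between two assets can operate on matrix cells rather than on graph traversal, which is the motivation given for Graph Codes.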
In [7], we already defined a metric for similarity of Graph Codes, which is a triple $M_{GC} = (M_F, M_{FR}, M_{RT})$ containing a feature-metric $M_F$ based on the vocabulary, a feature-relationship-metric $M_{FR}$ based on the possible relationships, and a feature-relationship-type-metric $M_{RT}$ based on the actual relationship types. This metric can be applied for result representation and has to be considered when constructing corresponding queries. Querying and result presentation based on Graph Codes is discussed in [24].
Current Graph Code Query Construction technologies employ structured query languages (e.g., SQL, OQL, XML-Query), including Visual Query Languages (VQLs) and Natural Language Querying (NL) [25]. A Meaning Driven Data Query Language (MD-DQL) [25,26] can support query construction by system-made suggestions of natural language based terms. In the field of Natural Language Processing (NLP), there have been several approaches to automatically translate natural language into structured queries, e.g., NLP to SPARQL processing [27,28]. Typically, results of these kinds of queries are represented in the form of ranked lists. All these query construction methodologies require semantic modeling.
Semantic Representation is covered by the concepts and standards defined by the Semantic Web [29], where the manual, semiautomatic, and automated generation of annotations is defined. The basis for these representations and annotations is a set of domain-oriented vocabulary terms. Once a basic "subclass" relationship is introduced between vocabulary terms, taxonomies can be built, which structure these terms in the form of class/subclass relationships. Typically, taxonomies also contain a consistent set of predefined textual labels and synonyms. Thesauri can be used to model further relationships (e.g., "broader", "narrower") for additional structuring, scoping, and increased expressiveness. The Resource Description Framework (RDF) [6] can serve as a foundation. It covers the description of any kind of resource by employing XML syntax. The Resource Description Framework Schema (RDFS) provides domain-specific extension points and a standardized model of exchanging RDF documents. As RDF is based on XML, it can automatically be represented in the form of a graph model, which provides the opportunity to employ a mapping to the MMFG on a structural level. RDF techniques such as publishing or linking with a shared data model can act as a base layer for other technologies [6,29-31]. Finally, ontologies describe arbitrary relationships between taxonomy terms (now called concepts) going beyond the hierarchical taxonomy structure. OWL [29] represents these concepts and relationships as classes and properties [32]. Once such a well-defined formal semantic model is in place, reasoning and inference algorithms can also be applied to such semantic representations [33].
However, to clearly define how the concepts of an ontology can be combined automatically (e.g., during automatic inferencing), a well-defined grammar [34] is required. Based on such grammars, an algorithmic implementation can distinguish between valid and invalid statements of a given formal or natural language. According to [34], a grammar G = (V, T, P, S) for a language L is defined by the tuple of vocabulary terms V, the list of terminal symbols T, which terminate valid sentences of L, production rules P, which describe valid combinations of vocabulary terms and a set of starting symbols S for sentences of L.
In [35], PS-grammars (Phrase Structure grammars) are employed as a specialized form to generate language terms by production rules, in which the left side of the rule is replaced by the right side. If, e.g., α → β is a production rule in P, and φ, ρ are literals in V, then φαρ → φβρ is a direct replacement. Rules of PS-grammars (PSG) are further detailed by four types, 0: unlimited PS-rule, 1: context-sensitive PS-rule, 2: context-free PS-rule, 3: regular PS-rule, which denote systematic restrictions of the production rules. These restrictions lead to a hierarchy of formal languages and the corresponding calculation and validation complexities (see Table 1).

Table 1. PS-grammar (Phrase Structure grammar) hierarchy of formal languages according to [35].

Typically, when defining grammars, the set V will contain additional classes to structure the possible production rules (typically defined as Chomsky rules [34]), e.g., classes to describe Nominal Phrases (NP), Verbal Phrases (VP), Prepositional Phrases (PP), or other word types such as Adjectives (ADJ), and their location in validly produced sentences [35]. In many cases, grammars are designed such that V ∩ T = ∅. As an example, the sentence "The hat is above the head" can be represented by the context-free grammar $G_{en} = (V_{en}, T_{en}, P_{en}, S_{en})$ for simple English sentences:
• $V_{en} = \{S_{en}, NP, VP, V, N, DET, PR\}$ represents the variables of the grammar.
• $T_{en} = \{the, hat, is, above, head\}$ is the set of terminal symbols, with $V_{en} \cap T_{en} = \emptyset$.
• $P_{en}$ is the set of production rules of this grammar.

The production rules of PS-grammars can be visualized in the form of so-called PS-trees [35]. Figure 2 shows such a PS-tree with the sentence "The hat is above the head" based on the exemplary MMFG of Figure 1b. In this example, the use of Nominal Phrases (NP), Verbal Phrases (VP), Prepositional Phrases (PP), and the Start Symbol S is also illustrated. For example:
• The NP "The hat" consists of the determiner "The" and the noun "hat".
• The NP "the head" is built by the determiner "the" and the noun "head".
• The PP "above the head" is constructed by the preposition "above" and the NP "the head".
• The VP "is above the head" consists of the verb "is" and the PP "above the head".
• The starting symbol for this sentence is constructed by the NP "The hat" and the VP "is above the head".

By applying these production rules, well-formed sentences can be both constructed and analyzed. The introduction of a grammar is also a prerequisite for reasoning [35], where additional semantic features are generated by calculating inferences and conclusions [36]. A formal grammar for MMFGs and Graph Codes, together with the corresponding concepts based on this grammar, is defined in Section 3 (modeling) and will be the basis for constructing Semantic MMFGs (SMMFG).
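The phrase structure described in the bullet list above can be sketched in Python as follows. The rule set is an assumption inferred from that description (S → NP VP, NP → DET N, VP → V PP, PP → PR NP, plus one lexical rule per word); PP is used as an additional variable because the PS-tree description relies on it, and all words are handled in lower case. The function names `generate` and `recognize` are illustrative.

```python
# Production rules inferred from the PS-tree description (assumption).
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "PP"]],
    "PP": [["PR", "NP"]],
    "DET": [["the"]],
    "N": [["hat"], ["head"]],
    "V": [["is"]],
    "PR": [["above"]],
}


def generate(symbol="S"):
    """Enumerate all word sequences derivable from `symbol`."""
    if symbol not in RULES:  # terminal symbol
        return [[symbol]]
    results = []
    for alternative in RULES[symbol]:
        partial = [[]]
        for child in alternative:
            partial = [p + s for p in partial for s in generate(child)]
        results.extend(partial)
    return results


def recognize(tokens, symbol="S"):
    """Return True if `tokens` can be derived from `symbol`."""
    def match(sym, pos):
        # Yield every token position reachable after deriving `sym` from `pos`.
        if sym not in RULES:
            if pos < len(tokens) and tokens[pos] == sym:
                yield pos + 1
            return
        for alternative in RULES[sym]:
            positions = [pos]
            for child in alternative:
                positions = [end for start in positions for end in match(child, start)]
            yield from positions

    return any(end == len(tokens) for end in match(symbol, 0))


sentences = [" ".join(words) for words in generate()]
```

Under these assumed rules, `generate()` covers the construction of well-formed sentences and `recognize()` covers their analysis; a word order such as "the hat above is the head" is rejected.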
To manage and maintain such semantic information based on MMFGs, several more technical Knowledge Representation and Processing Systems have been specified and introduced. For example, the W3C introduced the Simple Knowledge Organization System (SKOS) [37] as a standard way to represent knowledge organization systems with RDF [6]. Reasoning (i.e., the automated extension and maintenance of information), Truth Maintenance (i.e., the automated calculation of information validity), and Inference Systems (i.e., deducing new information based on logical rules) also contribute to improving the semantic information of an MMIR collection [33,38,39].
One common approach for Automated Reasoning and Inferencing is the concept of Non-monotonic Reasoning [38], which is based on justified beliefs and reasonable assumptions. Typically, so-called Default Logics are employed to represent knowledge in the form of rules. For example, the rule A : B / C is intended to state that "if A is believed, and each b ∈ B can be consistently believed, then C should be believed". Here, A is the prerequisite and B is the set of justifications [33,38]. For the calculation and representation of Default Logic, two major approaches serve as a foundation and are named after their inventors Reiter [40] and Poole [41]. Both approaches result in the common concept of Knowledge Extensions, which represent the set of possible rules that are assumed (or calculated) to be believed.
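As a rough illustration of how a single default rule of this form can be applied, the following Python sketch adds the conclusion to a set of beliefs only if the prerequisite is believed and no justification is directly contradicted. The function name `apply_default`, the string encoding of literals, and the naive consistency check (looking only for an explicit negation rather than logical entailment) are simplifying assumptions and do not reproduce the approaches of [40] or [41].

```python
def apply_default(beliefs, prerequisite, justifications, conclusion):
    """Apply one default rule (prerequisite : justifications / conclusion).

    Literals are plain strings; a negated literal carries the prefix "not ".
    Returns a new belief set; the input set is left unchanged.
    """
    def negate(literal):
        return literal[4:] if literal.startswith("not ") else "not " + literal

    if prerequisite not in beliefs:
        return set(beliefs)  # rule is not applicable
    if any(negate(b) in beliefs for b in justifications):
        return set(beliefs)  # a justification cannot be consistently believed
    return set(beliefs) | {conclusion}


# Hypothetical feature statements used purely for illustration.
derived = apply_default(
    {"hat above head"}, "hat above head", ["person wears hat"], "person wears hat"
)
blocked = apply_default(
    {"hat above head", "not person wears hat"},
    "hat above head", ["person wears hat"], "person wears hat",
)
```

In the first call the conclusion is added as a derived statement; in the second call the contradicting belief blocks the rule, which is the non-monotonic behavior described above.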
Semantic Querying and Retrieval can be performed by employing, e.g., SPARQL [42], which is a standardized query language that operates on RDF, RDFS, or OWL representations and also supports the inclusion of semantic features. Semantic reasoning, which is applied to the underlying semantic representation of an ontology, is often called "intelligent retrieval". This means that automatic reasoning can derive new semantic feature annotations from existing feature representations. This further means that newly derived features have not been detected, but are actually derived by means of reasoning. Thus, the multimedia object is annotated with additional features that are not a result of extraction or detection, but derived by logical and semantic reasoning. The resulting feature index is extended, and retrieval results will be more accurate where additional features are attributed to MMFGs automatically [36]. In the remainder of this paper, we will call the originally detected or extracted features Internal Features ($F_{Int}$), and the features that have been derived by reasoning External Features ($F_{Ext}$).
However, to follow this systematic approach, typically a substantial manual effort is required to map syntactic feature representations to semantic representations. Further, the introduction of rules, a basic logic, and truth maintenance criteria must be performed manually. Hence, in Section 3 we define a formal, standardized, and automated approach for the integration of these systems.
In summary, we can state that current technologies provide a sufficient set of appropriate algorithms, tools, and concepts for semantic modeling, representation, indexing, and retrieval. However, concepts such as RDF, RDFS, OWL, SPARQL, or the Semantic Web mostly rely on graph-based semantic representation structures and are thus subject to constraints regarding efficiency similar to those of syntactic feature graphs. The introduction of grammars provides a standardized way of constructing and analyzing well-formed sentences. As Graph Codes provide a set of algorithms to significantly increase effectiveness, LOD, and efficiency for graph-based IR algorithms, we now present their application and extension into Semantic Graph Codes and the corresponding processing algorithms.

Modeling and Design
In this section, we define and introduce several semantic extensions of syntactic MMFGs, which serve as a basis for optimized Semantic Graph Code processing and the corresponding application of semantic concepts such as annotation, topic modeling, reasoning, or inferencing. We also introduce Explainable Semantic Graph Codes, providing a human-readable representation of multimedia feature graphs.
The MMFG has been designed to represent MMIR features on a purely syntactical basis, containing only Internal Features $F_{Int}$. To support semantic extensions for MMFGs, we apply Semantic Web concepts [29]. In addition, we introduce a context-free PS-grammar $G_{MMFG}$ for the construction of human-readable, i.e., valid natural language textual phrases and statements based on MMFG features. This enables the construction of a formal semantic representation on the one hand, and establishes the basis for natural language textual explanations on the other hand. This combination leads to a well-defined semantic representation of MMFGs, the Semantic Multimedia Feature Graph (SMMFG), and to the Explainable Semantic Multimedia Feature Graph (ESMMFG). In particular, ESMMFGs can serve as a basis for inferencing and thus support the production of additional External Features $F_{Ext}$. The employment of a PS-grammar in addition to the semantic extension has the advantage that the representation of such an MMFG is not only machine-readable, but also "human-readable", i.e., the representation supports transparency, explainability, and reproducibility for humans.
The structure of this section follows the logical sequence of extensions from simple MMFGs to semantic SMMFGs and explainable ESMMFGs. Hence, in Section 3.1, we discuss the initial formal foundation of the chosen approach. In Section 3.2, annotations for MMFGs are introduced, which are then employed to define the semantic extension in the form of SMMFGs in Section 3.3. Finally, explainability is introduced by the definition of a PS-grammar resulting in ESMMFGs in Section 3.4.

Formal Representation of an MMFG
As shown in the state-of-the-art discussion (see Section 2), MMFGs are purely syntactical structures based on a data model, which forms a multimedia feature graph with nodes and edges. The formal model of such a syntactic MMFG representation is shown in Figure 3a. Here, and in subsequent sections, we further detail and extend the exemplary MMFG from Section 2, which serves as an exemplary syntactic representation of a simple MMFG (see Figure 3b). To support the formal representation of such an MMFG with a formal language (i.e., grammar), an additional element, the root node $S_{MMFG}$, must be defined, which acts as a starting symbol for valid formal language expressions (see Section 1). This is shown in Figure 3b. The initial grammar $G_{MMFG} = (V_{MMFG}, T_{MMFG}, P_{MMFG}, S_{MMFG})$ comprises the following elements:
• $V_{MMFG} = FRT \cup FVT$ is the set of variables.
• $T_{MMFG}$ is a set of textual labels (LBL) for elements in FRT and FVT.
• $P_{MMFG}$ is the set of production rules, which produce sentences based on FRT and FVT.

When such an initial grammar is applied to the exemplary MMFG of Figure 3c, formal language expressions such as "person has-child head. head has-child hat. hat is-above head." can be produced based on feature vocabulary terms and feature relationship terms. All these sentences are built on the pattern node-relationship-node. Although this initial grammar could be employed for a basic representation of syntactical MMFGs, further extensions must be constructed, particularly with respect to supporting higher-level semantics and improving human readability. This initial grammar leaves the following open challenges:
• The initial grammar does not yet represent the syntactic structure of MMFGs.
• The initial grammar is not a context-free PS-grammar.
• The initial grammar for MMFGs does not reflect the structural elements of MMFGs and their corresponding production rules. For example, the structural element cn (i.e., child node) should be transformed into the grammatical structure "has a child named", represented by several textual labels. Another example would be the spatial relationship sr : above, which should be represented by a set of textual labels forming the phrase "is above of".
• The initial grammar does not provide a semantic description of the syntactic relationships.
However, this initial definition illustrates the overall approach of representing MMFG structures with a formal language grammar with an approximation approach. Based on this, we now formally model semantic and explainable MMFGs and hence introduce several extensions to MMFGs and the corresponding grammars.
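The node-relationship-node pattern of the initial grammar can be sketched in a few lines of Python. The label mapping `RELATION_LABELS`, the key `"sr:above"`, and the function name `initial_expressions` are illustrative assumptions; the labels "has-child" and "is-above" are taken from the example expressions in this section.

```python
# Assumed textual labels (LBL) for feature relationship types.
RELATION_LABELS = {"cn": "has-child", "sr:above": "is-above"}


def initial_expressions(edges):
    """Produce one node-relationship-node sentence per MMFG edge."""
    return " ".join(
        f"{source} {RELATION_LABELS[rel]} {target}." for source, rel, target in edges
    )


edges = [
    ("person", "cn", "head"),
    ("head", "cn", "hat"),
    ("hat", "sr:above", "head"),
]
text = initial_expressions(edges)
```

The output depends entirely on the chosen labels, which illustrates why the initial grammar alone does not yet yield natural language.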

Enabling Annotation of Formal MMFG Representations
Modeling a formal representation for MMFGs, we initially apply an annotation pattern to support the linking of external semantic annotations with a special node type Annotation Anchor (aa) to represent a link to an external semantic annotation. Such annotation anchors can be linked to syntactic MMFG nodes or syntactic relationships with a special relationship type, the Annotation Relationship (ar). To support this, the representation of all MMFG relationships has also been extended by corresponding annotation anchors to allow their semantic annotation. Thus, any syntactic MMFG node or relationship can be linked with an annotation relationship to an annotation anchor. This means that any syntactic resource within an MMFG can be semantically annotated.
The formal representation of such a basic MMFG including annotation anchors is shown in Figure 4a and is further detailed in the following sections. Figure 4b shows the extension of an exemplary MMFG by Annotation Anchors (aa) and Annotation Relationships (ar). To clearly distinguish between a feature and its textual representation, the introduced variable LBL is employed, which allows the production of human-understandable textual representations of MMIR features. With these extensions, MMFG nodes and relationships can initially be linked to elements of existing semantic representations of vocabularies, taxonomies, or ontologies, and their corresponding machine-readable representations in the Semantic Web. The grammar $G_{MMFG}$ can be extended to support the construction of formal language expressions including the annotation pattern by adding the elements ar and aa to the set $V_{MMFG}$, so that $V_{MMFG} = FRT \cup FVT \cup \{ar, aa\}$, and by refining the production rules with the following additions: $\{FVT \to ar,\ FRT \to ar,\ ar \to aa,\ aa \to LBL\}$. Now, the grammar $G_{MMFG}$ supports the construction of additional language expressions such as: "person is-annotated-with-the-semantic-concept rdf:person. has-child is-annotated-with-the-semantic-concept rdf:related. head is-annotated-with-the-semantic-concept rdf:head". This annotation pattern will now be further employed as a basis for the semantic representation of MMFGs.
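The annotation pattern can be pictured as a chain from an MMFG element over an annotation relationship (ar) to an annotation anchor (aa) and finally to a label (LBL). The following Python sketch follows this chain for the example above; the class `AnnotationAnchor`, the dictionary `annotations`, and the function `annotation_expressions` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationAnchor:
    """Annotation anchor (aa): a link to an external semantic annotation."""
    concept: str  # e.g., "rdf:person", as used in the example expressions


# Annotation relationships (ar): MMFG node or relationship -> annotation anchor.
annotations = {
    "person": AnnotationAnchor("rdf:person"),
    "has-child": AnnotationAnchor("rdf:related"),
    "head": AnnotationAnchor("rdf:head"),
}


def annotation_expressions(annotated_elements):
    """Follow element -> ar -> aa -> LBL and emit one sentence per element."""
    return " ".join(
        f"{element} is-annotated-with-the-semantic-concept {anchor.concept}."
        for element, anchor in annotated_elements.items()
    )


annotation_text = annotation_expressions(annotations)
```

Note that both nodes ("person", "head") and relationships ("has-child") appear as keys, reflecting that any syntactic resource of an MMFG can be annotated.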

Semantic Annotation of Formal MMFG Representations
The introduction of annotation anchors and annotation relationships is a purely syntactic extension; however, this syntactic structure must be annotated semantically. Hence, we now introduce a semantic annotation, which means that each syntactical element of an MMFG will be annotated with semantics, and the purely syntactic MMFG becomes a semantically annotated MMFG, a Semantic Multimedia Feature Graph (SMMFG). For such an SMMFG, we define the following additional elements or structures:
• In an MMFG, each node and each relation is linked by an Annotation Relationship (ar) to an Annotation Anchor (aa), which now represents a Semantic Node Representation (snr) or a Semantic Relation Representation (srr). An Annotation Anchor (aa) is a URI for the node or relation it represents and is used to link these MMFG nodes or relations to semantic node representations and semantic relationship representations.
• In an SMMFG, the hasName relation links Semantic Node Representations with Semantic Feature Vocabulary Terms and Semantic Relation Representations with Semantic Relationship Vocabulary Terms. Each srr is linked via the hasDomainNode and hasRangeNode relations to the corresponding snr's.

Figure 5a shows how Annotation Anchors can now be linked with snr and srr to semantic concepts described in the sets $SFVT_{SMMFG}$ and $SRVT_{SMMFG}$. As already outlined, the semantic representation of the syntax of an MMFG contains relationships itself. For example, each srr has two relationships, hasDomainNode and hasRangeNode, which link to corresponding snr elements. They form the 1st level semantic annotation of an MMFG and thus a Semantic Multimedia Feature Graph (SMMFG). The 2nd level semantic annotation is modeled by the elements of sfvt and srvt, which represent the semantic information of the feature vocabulary terms. Figure 5b shows the semantic extension applied to the above example.
Based on these syntactic extensions, the formal grammar $G_{MMFG}$ can also be extended, resulting in a formal grammar $G_{SMMFG}$ for the representation of valid formal language expressions of SMMFG structures:
• $V_{SMMFG} = V_{MMFG} \cup SNR \cup SRR \cup SFVT \cup SRVT$ extends the variables by the set of semantic representations of descriptions of MMFG nodes and relationships (see also Figure 5).
• $T_{SMMFG}$ is an extension to $T_{MMFG}$ and includes additional textual descriptions of the semantic relationships: "hasName", "hasDomainNode", "hasRangeNode", "describes".
• $P_{SMMFG}$ extends the production rules $P_{MMFG}$.

The grammar $G_{SMMFG}$ supports the construction of additional language expressions such as "the-semantic-concept rdf:person hasName person. the-semantic-concept rdf:head hasName head. there-is-a-semantic-relationship-between rdf:person and rdf:hat which hasDomainNode rdf:person and hasRangeNode rdf:head. the-semantic-relationship between rdf:person and rdf:head is-described-by rdf:relation and hasName rdf:related".
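To make the roles of snr, srr, hasName, hasDomainNode, and hasRangeNode more concrete, the following Python sketch models them as small data classes and renders expressions in the style of the examples above. The class names, field names, and helper functions are hypothetical and chosen for illustration; they are not the data model of the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticNode:
    """Semantic Node Representation (snr)."""
    uri: str   # annotation anchor URI, e.g., "rdf:person"
    name: str  # hasName -> semantic feature vocabulary term


@dataclass(frozen=True)
class SemanticRelation:
    """Semantic Relation Representation (srr)."""
    domain_node: SemanticNode  # hasDomainNode
    range_node: SemanticNode   # hasRangeNode
    described_by: str          # concept describing the relationship
    name: str                  # hasName -> semantic relationship vocabulary term


def describe_node(node):
    return f"the-semantic-concept {node.uri} hasName {node.name}."


def describe_relation(rel):
    return (
        f"there-is-a-semantic-relationship-between {rel.domain_node.uri} "
        f"and {rel.range_node.uri} which hasDomainNode {rel.domain_node.uri} "
        f"and hasRangeNode {rel.range_node.uri}."
    )


person = SemanticNode("rdf:person", "person")
head = SemanticNode("rdf:head", "head")
has_child = SemanticRelation(person, head, "rdf:relation", "rdf:related")
```

Calling `describe_node(person)` and `describe_relation(has_child)` yields machine-oriented expressions of the kind shown above, which motivates the human-readable grammar introduced next.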
Although these sentences describe further details of an SMMFG in a formally correct way and increase the readability for machines, human-readability decreases due to the mixture of syntactic, semantic, and structural labels. This means that, until now, human-readability depends on the selection of adequately readable and understandable textual labels. To eliminate this dependency, an automated construction of human-readable textual expressions must be achieved. In summary, we have so far defined a formal way to represent syntactic and semantic elements of MMFGs and SMMFGs by introducing the formal grammars $G_{MMFG}$ and $G_{SMMFG}$, with which not only the syntactic structure of an MMFG, but also the semantic enrichment of SMMFGs, can be represented by formal language expressions. To achieve our final goal of human-understandable, i.e., natural language expressions, we introduce a PS-grammar in the next section, which transforms these machine-readable expressions into human-readable expressions.

Explainable SMMFGs
In Section 2 we introduced the concept of context-free PS-grammars. This type of grammar is typically employed for the production of natural languages consisting of e.g., Nominal Phrases (NP), Verbal Phrases (VP), nouns (N), verbs (V), adjectives (A), determiners (DET). The grammar G en = (V en , T en , P en , S en ) from Section 2 can be employed to produce valid English sentences (as illustrated in Figure 2).
Based on the grammars $G_{en}$, $G_{MMFG}$, and $G_{SMMFG}$, we can now define such a context-free PS-grammar $G_{ESMMFG}$, which formally transforms any MMFG or SMMFG into human-readable (i.e., explainable) natural language expressions. MMFGs or SMMFGs that are extended in this way become explainable and will be called Explainable Semantic Multimedia Feature Graphs (ESMMFG) in the remainder of this paper. Figure 6 shows the introduction of the PS-grammar and the corresponding schema for the syntactic ESMMFG representation. Formally, we define $G_{ESMMFG} = (V_{ESMMFG}, T_{ESMMFG}, P_{ESMMFG}, S_{ESMMFG})$ as follows:
• The variables $V_{ESMMFG}$ are based on the variables $V_{en}$ of the English grammar $G_{en}$ and additionally include the variables of the previously defined grammars: $V_{ESMMFG} = V_{en} \cup V_{MMFG} \cup V_{SMMFG}$. The set thus represents the union of the variables defining the English grammar (i.e., $V_{en} = \{NP, VP, V, N, DET, PR\}$), the syntactic elements of an MMFG (i.e., $V_{MMFG} = FRT \cup FVT \cup \{ar, aa\}$), and the semantic enrichment (i.e., $V_{SMMFG} = V_{MMFG} \cup SNR \cup SRR \cup SFVT \cup SRVT$).
• $T_{ESMMFG}$, with $T_{ESMMFG} \cap V_{ESMMFG} = \emptyset$, is the set of terminal symbols and is represented by the labels LBL, which can be regarded as any English word of type noun, verb, determiner, adjective, or preposition. The production of these words is based on the semantic feature and semantic relationship vocabulary. The order in which such LBLs can be arranged to formulate valid expressions is given by the production rules.
• $P_{ESMMFG}$ is the set of production rules and defines how the MMFG and SMMFG structures can be formally transformed into valid natural language expressions. $P_{ESMMFG}$ also contains the simple production rules previously defined in $P_{MMFG}$ and $P_{SMMFG}$; however, the phrase structure of $P_{ESMMFG}$ leads to various additional and refining elements.
• $S_{ESMMFG}$ is the starting symbol for any valid expression. This means that any natural language representation of an MMFG or SMMFG starts with the processing of the root element; however, as the root node of an MMFG is a node itself, $G_{ESMMFG}$ can also be employed to produce expressions of subgraphs of an MMFG or SMMFG.
The application of the production rules $P_{ESMMFG}$ is shown in Figures 7-9. For readability, some of the production rules representing mostly internal structures have been omitted. These examples show that the expressiveness of ESMMFGs has increased significantly with the introduction of $G_{ESMMFG}$ and that natural language sentences can now be built formally based on syntactic and semantic structures of MMFGs.
It is important to note that any natural language expression that is generated based on $G_{ESMMFG}$ is true with respect to its content (i.e., correct), as it purely represents the original multimedia features in a formal, but human-readable way. In addition, $G_{ESMMFG}$ provides unlimited options for the production of valid natural language expressions due to its underlying phrase structure. This means that any multimedia feature can now be represented as a natural language, human-readable text. It is up to the application to define which phrases should be used, which level of abstraction should be applied, or which subset of MMFG nodes has to be selected for the natural language representation. Examples of the application of $G_{ESMMFG}$ to multimedia features of various domains are given in Section 5 (evaluation).
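The transformation from a feature relationship to a natural language sentence can be sketched in Python as shown below. The sketch reuses the phrase structure of the simple English grammar from Section 2 (S → NP VP, NP → DET N, VP → V PP, PP → PR NP) as an assumption; it does not reproduce the full rule set $P_{ESMMFG}$. The mapping `RELATION_PHRASES`, the key `"sr:above"`, and all function names are hypothetical.

```python
# Assumed mapping from a relationship type to its (verb, preposition) labels.
RELATION_PHRASES = {"sr:above": ("is", "above")}


def noun_phrase(noun, determiner="the"):
    """NP -> DET N"""
    return [determiner, noun]


def prepositional_phrase(preposition, noun):
    """PP -> PR NP"""
    return [preposition] + noun_phrase(noun)


def verbal_phrase(verb, preposition, noun):
    """VP -> V PP"""
    return [verb] + prepositional_phrase(preposition, noun)


def sentence(source, relation, target):
    """S -> NP VP, rendered as a capitalized sentence with a full stop."""
    verb, preposition = RELATION_PHRASES[relation]
    words = noun_phrase(source) + verbal_phrase(verb, preposition, target)
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


explanation = sentence("hat", "sr:above", "head")
```

Because the sentence is assembled only from labels of detected features and relationships, it states nothing beyond what the underlying graph contains; which relationships are verbalized, and at which level of abstraction, remains a choice of the application.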
The formal definition of $G_{ESMMFG}$ furthermore guarantees that any element of an MMFG (i.e., any detected multimedia feature) can be structurally and semantically represented. It also ensures that the semantic information of any multimedia feature can be mapped to semantic systems, interpreted, and employed for inferencing and reasoning. Furthermore, any MMFG can now be represented as a syntactically correct and human-readable text, which further supports automatic processing by employing a selection of numerous text-based algorithms, e.g., for argument extraction. However, the generated text is highly dependent on the construction of phrases based on the detected (or calculated) MMIR features in the original MMFG. As this textual representation might differ depending on the MMIR processing step (e.g., the explanation of query construction, result presentation, or the ranking of an element in the result list), different strategies for the construction of phrases also need to be employed. This is reflected by introducing various subclasses for the corresponding processing steps, as illustrated by the implementation samples (see Section 4).
Currently, the order of the constructed sentences is based on the order of nodes in the original MMFG data structure. This will produce good results for text-based multimedia documents, as the order of the explaining texts will follow the document structure; however, for other multimedia types (e.g., images), the order of the descriptions of detected objects will be random. This can be subject to further improvements in future work. It should also be mentioned that the presented concept is a purely mathematical approach to calculating explaining texts for multimedia features, without any need for machine learning as employed, e.g., in deep LSTM language modeling [44].
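As a minimal sketch of how phrase-structure rules can turn feature relationships into sentences, the following Java fragment applies three textbook-style rules (S → NP VP, NP → Det N, VP → V NP) to a list of relationships. The class `PhraseSketch`, the record `Triple`, and the rules themselves are assumptions made for illustration only; they are far simpler than the production rules P_ESMMFG and are not taken from the GMAF code base. As described above, the sentence order simply follows the order of the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified illustration; not the rule set P_ESMMFG.
class PhraseSketch {

    // Assumed shape of one relationship between two feature vocabulary terms.
    record Triple(String subject, String relation, String object) {}

    // NP -> Det N (terms are assumed to be non-empty).
    static String nounPhrase(String term, boolean definite) {
        if (definite) {
            return "the " + term;
        }
        boolean vowel = "aeiou".indexOf(Character.toLowerCase(term.charAt(0))) >= 0;
        return (vowel ? "an " : "a ") + term;
    }

    // VP -> V NP
    static String verbPhrase(String relation, String object) {
        return relation + " " + nounPhrase(object, false);
    }

    // S -> NP VP, capitalized and terminated with a full stop.
    static String sentence(Triple t) {
        String s = nounPhrase(t.subject(), true) + " " + verbPhrase(t.relation(), t.object());
        return Character.toUpperCase(s.charAt(0)) + s.substring(1) + ".";
    }

    // Sentences are emitted in the order of the input list.
    static List<String> explain(List<Triple> triples) {
        List<String> sentences = new ArrayList<>();
        for (Triple t : triples) {
            sentences.add(sentence(t));
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Triple> triples = List.of(
                new Triple("person", "wears", "hat"),
                new Triple("person", "holds", "umbrella"));
        explain(triples).forEach(System.out::println);
    }
}
```

Because every sentence is derived only from the relationships passed in, the generated text cannot state anything that is not present in the underlying features.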
In this section, we outlined how an MMFG can be semantically represented and extended by a Semantic Multimedia Feature Graph. We defined how explainability and transparency can be introduced to syntactic data structures based on MMIR features, resulting in explainable multimedia feature graphs. To evaluate the full potential of this semantic extension, we apply this concept to the Generic Multimedia Analysis Framework (GMAF), in which MMFGs, and now also SMMFGs and ESMMFGs, can be processed by employing Graph Codes, which are particularly optimized for MMIR calculations. Hence, in the following subsection, we briefly show how the concept of Graph Codes can be semantically represented and extended by applying the algorithms of Graph Codes to Semantic Multimedia Feature Graphs.

Semantic Graph Codes
Graph Codes are a 2D representation of MMFGs, which are computationally optimized, particularly when employed for MMIR. They are calculated from MMFGs by employing an encoding function f_enc, which transforms an MMFG into the Graph Code structure. In the following subsections, we define how Graph Codes can be transformed into Semantic Graph Codes with corresponding operations.
Until now, the dictionary of a Graph Code (GC) has been based on the feature vocabulary terms of detected features (i.e., textual labels describing detected features) of the MMFG. The dictionary of Semantic Graph Codes (SGC) is based on the semantic representation of the meaning of such feature vocabulary terms, i.e., unique labels of elements of SNR. Thus, in an SGC, unique identifiers are used to represent the dictionary. However, the mapping from a feature vocabulary term to such an identifier can be ambiguous.
For example, the feature vocabulary term Jaguar of a Graph Code GC_Jaguar could be connected to the semantic node representation for an animal, but it could also be connected to the semantic node representation for a car. To solve this problem, a function sq(vt_i) is introduced to determine the unambiguous semantic representation of the meaning of a feature vocabulary term. sq(vt_i) performs a semantic query for each vt_i to the semantic model and receives either a single unique result snr_i ∈ SNR representing the meaning of vt_i, or a list of possible results (e.g., the Jaguar-animal or the Jaguar-car semantic nodes). To identify the correct element in this list of possible results, we apply the Graph Code Similarity algorithm to GC_Jaguar and each element of the result list. To do this, for each element of the result list, we perform an additional query on the semantic model (e.g., for the Jaguar-car) and represent the result as an MMFG, which is then transformed into a Graph Code (GC_Result_i, GC_Result_j). GC_Jaguar will not just contain the detected feature vocabulary term Jaguar, but also additional information. As Graph Codes support a high Level-Of-Detail (LOD) and are generated by recursive processing, they will also contain numerous additionally detected feature vocabulary terms, such as wheel, road, window, ... or whiskers, teeth, fur, .... Our query to the semantic model also returns relationships of the Jaguar to semantic node representations of some of these detected feature vocabulary terms. Thus, if the Jaguar in our MMFG is a car, the result of the semantic query for the Jaguar-car will be more similar to GC_Jaguar than the result of the Jaguar-animal query. In addition, topic modeling can be applied to further optimize the selection of semantic query results as unambiguous semantic node representations.
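The selection step described above can be sketched in Java as follows. Note the substitution: instead of the Graph Code Similarity algorithm, this sketch uses a plain vocabulary overlap ratio as a stand-in, and it represents each Graph Code only by its set of vocabulary terms. The class name, the method names, and the candidate identifiers `jaguar-car` and `jaguar-animal` are hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a vocabulary overlap ratio stands in for the
// Graph Code similarity metric, which is not reproduced here.
class DisambiguationSketch {

    // Share of the candidate's related terms that also occur in the asset.
    static double overlap(Set<String> assetTerms, Set<String> candidateTerms) {
        if (candidateTerms.isEmpty()) {
            return 0.0;
        }
        long shared = candidateTerms.stream().filter(assetTerms::contains).count();
        return (double) shared / candidateTerms.size();
    }

    // Returns the candidate ID with the highest overlap, or null if there
    // are no candidates. Ties keep the first candidate encountered.
    static String selectCandidate(Set<String> assetTerms,
                                  Map<String, Set<String>> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
            double score = overlap(assetTerms, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Terms detected in the asset (assumed example values).
        Set<String> asset = Set.of("Jaguar", "wheel", "road", "window");
        // Terms related to each candidate in the semantic model (assumed).
        Map<String, Set<String>> candidates = Map.of(
                "jaguar-car", Set.of("wheel", "road", "engine"),
                "jaguar-animal", Set.of("whiskers", "teeth", "fur"));
        System.out.println(selectCandidate(asset, candidates));
    }
}
```

With these example values, the car candidate shares two of its three related terms with the asset, while the animal candidate shares none, so the car candidate is selected.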
Summarizing this, the construction of Semantic Graph Codes utilizes a semantic query sq(vt_i) for each feature vocabulary term vt_i to identify the unambiguous semantic node representation snr_i. To construct SGCs, the encoding function f_enc has to be modified to employ sq(vt_i) when calculating the dictionary for SMMFGs. In addition to the calculation of IDs, f_enc will also eliminate synonym and relationship nodes from an SGC dictionary, as they are represented in the corresponding semantic model. So, f_enc will return an empty value for MMFG nodes of the type Synonym.
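A minimal Java sketch of this dictionary construction is given below. It is not the modified f_enc itself: the types `Node` and `NodeType`, the identifiers `snr:1` and `snr:2`, and the example terms are hypothetical, and sq is passed in as a plain function. One further assumption is labeled in the code: terms for which sq returns no result are kept under their original label.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of building an SGC dictionary; not the GMAF f_enc.
class SgcDictionarySketch {

    enum NodeType { FEATURE, SYNONYM }

    // Assumed minimal shape of an MMFG node.
    record Node(String term, NodeType type) {}

    static List<String> dictionary(List<Node> nodes,
                                   Function<String, Optional<String>> sq) {
        List<String> dict = new ArrayList<>();
        for (Node n : nodes) {
            // Synonym nodes yield no dictionary entry, as they are already
            // represented in the semantic model.
            if (n.type() == NodeType.SYNONYM) {
                continue;
            }
            // Assumption: a term without a semantic match keeps its label.
            String entry = sq.apply(n.term()).orElse(n.term());
            if (!dict.contains(entry)) {
                dict.add(entry);
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        Map<String, String> ids = Map.of("Person", "snr:1", "Hat", "snr:2");
        List<Node> nodes = List.of(
                new Node("Person", NodeType.FEATURE),
                new Node("Individual", NodeType.SYNONYM),
                new Node("Human Being", NodeType.SYNONYM),
                new Node("Hat", NodeType.FEATURE));
        System.out.println(dictionary(nodes, t -> Optional.ofNullable(ids.get(t))));
    }
}
```

The two synonym nodes disappear from the resulting dictionary, which illustrates why a Semantic Graph Code can be smaller than the Graph Code it is derived from.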
For further illustration of our example (see Figure 1c), we define the function sq(vt_i) in a way that it returns the following values for the vocabulary terms of this example (see Table 2). For the vocabulary terms "Individual" and "Human Being", which represent MMFG nodes of type Synonym, the function sq(vt_i) does not return any value, as these relationships are already represented by the semantic model and hence do not need to be repeated in each individual Semantic Graph Code. Applying sq(vt_i) and f_enc to our example results in a compressed Semantic Graph Code SGC_ex (see Table 3).
Further concepts of Graph Codes, such as the calculation of similarity, recommendations, querying, or result presentation, remain unchanged. However, it should be noted that Semantic Graph Codes lead to a further compression of the Graph Code matrix, as synonyms or common knowledge can be removed from the Graph Code because they are already represented in the external semantic system. Furthermore, new knowledge that exists in the external system can be employed for Semantic Graph Codes, serving as a basis for inferencing and reasoning. As the initial construction of Graph Codes is purely based on the detected feature vocabulary terms of a given multimedia object, the Graph Code vocabulary is typically very small. For Semantic Graph Codes, these feature vocabulary terms are translated into semantic IDs, or even removed when they exist in the general semantic model of the application, which leads to a further compression, but also requires a representation of the overall semantic model (i.e., the ontology or taxonomy). This model can be represented in SKOS, which is queried at runtime to identify such vocabulary terms. It is also possible to represent the complete semantic knowledge in the form of a Semantic Graph Code; however, this would lead to a very high number of feature vocabulary terms (i.e., any term in the ontology) and to very large Semantic Graph Codes. This approach is not recommended, as Graph Codes are optimized for indexing and not for knowledge representation.

Summary
In this section, we discussed our approach to formally defining natural language expressions from multimedia feature graphs. We showed how a context-free PS-grammar can be built to formally generate human-readable English sentences, which allows us to close the gap between the technical representation of a multimedia feature and a human-understandable representation of the meaning of such a feature. We also showed how semantics can be introduced to the GMAF in the form of Semantic Graph Codes. Based on this modeling, we provide further details of the implementation in the next section.

Implementation
The basis for the implementation of the semantic extensions and concepts discussed is the current GMAF prototype, which is written in Java and has a Java Swing UI attached. The prototype, including the presented code samples of this section, is available at [9] and is frequently updated according to the ongoing progress of this research. Following the Factory Design Pattern [45], the GMAF has been extended to utilize external semantic representation frameworks. In this section, we discuss exemplary implementation details. First of all, in Section 4.1, the representation of SMMFGs by RDF and RDFS is given.

Semantic Extension of the GMAF
For each MMIR application based on the GMAF, an external semantic framework can be attached by implementing (and configuring) an adapter class, which is defined by the interface SemanticExtension (see Figure 10). The detected vocabulary term is then passed to the external semantic framework. Whenever Semantic Graph Codes must be calculated in the GMAF (i.e., for indexing, retrieval, querying, or filtering), this extension is called to provide information from or to an external semantic representation. In our prototype, we built two implementations for this. One serves as a reference for the connection of external semantic representations and is implemented to employ the Semantic Web (class SemWebExtension). The other serves as an internal default implementation to illustrate and validate the concepts of Section 3, in particular relevance and topic calculation and inferencing (class DefaultExtension).
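To make the adapter idea more concrete, the following Java sketch shows one possible shape of such an extension point together with a small factory. Everything here is an assumption for illustration: the method `query`, the stand-in class `InMemoryExtension`, the configuration keys `in-memory` and `none`, and the `ex:` identifiers are hypothetical and do not reproduce the interface shown in Figure 10 or the classes of the prototype.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an extension point with a factory; the actual
// GMAF interface and its method signatures may differ.
class SemanticExtensionSketch {

    // Assumed contract: map a vocabulary term to candidate semantic IDs.
    interface SemanticExtension {
        List<String> query(String vocabularyTerm);
    }

    // Stand-in for an internal implementation backed by a fixed map.
    static final class InMemoryExtension implements SemanticExtension {
        private final Map<String, List<String>> model;

        InMemoryExtension(Map<String, List<String>> model) {
            this.model = model;
        }

        @Override
        public List<String> query(String vocabularyTerm) {
            return model.getOrDefault(vocabularyTerm, List.of());
        }
    }

    // Factory: selects an adapter by its configured name.
    static SemanticExtension create(String configuredName) {
        return switch (configuredName) {
            case "in-memory" -> new InMemoryExtension(
                    Map.of("Jaguar", List.of("ex:jaguar-animal", "ex:jaguar-car")));
            case "none" -> term -> List.of();
            default -> throw new IllegalArgumentException(
                    "Unknown semantic extension: " + configuredName);
        };
    }

    public static void main(String[] args) {
        SemanticExtension ext = create("in-memory");
        System.out.println(ext.query("Jaguar"));
    }
}
```

Keeping the callers dependent only on the interface means that an adapter for an external system can be swapped in by configuration, without changing the code that calculates Semantic Graph Codes.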

Semantic Representation
As an external semantic representation, several tools, databases, or services can be applied. In our prototypical implementation, we chose Wikidata [46,47], as it not only serves as a basis for many other MMIR applications, but also provides a fully functioning SPARQL interface [42], which can be utilized for semantic querying. Hence, attaching Wikidata to the GMAF is straightforward, as illustrated in Figure 11.
Figure 11. Querying an external semantic system for unique IDs with SPARQL.
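As a rough sketch of what such a lookup could look like, the following Java fragment builds a SPARQL query that searches for items by their English label and wraps it into a request URI for the public Wikidata query endpoint. The query text is an assumption and is not the query of Figure 11; the class and method names are hypothetical. The fragment only constructs the request and does not send it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: builds, but does not send, a label lookup request.
class SparqlQuerySketch {

    static final String ENDPOINT = "https://query.wikidata.org/sparql";

    // Assumed query: find up to ten items with the given English label.
    static String labelQuery(String term) {
        // Escape backslashes and double quotes inside the string literal.
        String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
        return "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
                + "SELECT ?item WHERE { ?item rdfs:label \"" + escaped + "\"@en . } "
                + "LIMIT 10";
    }

    // URL-encodes the query and requests JSON results.
    static URI requestUri(String term) {
        String encoded = URLEncoder.encode(labelQuery(term), StandardCharsets.UTF_8);
        return URI.create(ENDPOINT + "?query=" + encoded + "&format=json");
    }

    public static void main(String[] args) {
        System.out.println(labelQuery("Jaguar"));
        System.out.println(requestUri("Jaguar"));
    }
}
```

A label such as Jaguar typically matches more than one item, which is exactly the situation in which the disambiguation of candidates becomes necessary.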
As Wikidata already supports SPARQL, GMAF SPARQL queries can simply be forwarded and fused with already existing Query Graph Codes. The resulting Semantic Graph Codes are displayed in the GMAF UI as shown in Figure 12 and can be applied to all kinds of queries (including Manual Querying, Query By Example, and Query Refinement) [24] and to result presentation. As illustrated in this subsection, connecting external semantic representation systems to the GMAF is quite straightforward. In the case of external systems, the effectiveness of semantic results in terms of precision and recall mainly depends on the external system. In our evaluation, we wanted to compare our internal default representation and the algorithms for topic modeling and intelligent information retrieval with these external systems, as semantic extensions are handled transparently within the GMAF (i.e., the GMAF itself does not perform any enrichment or modification). Hence, in the next subsection, we discuss the implementation of the defaults, which will also be employed later for evaluation purposes.
Summarizing this, we showed in this section that the semantic extension to the GMAF can be implemented based on an interface extension point, which generically provides access to external systems. If no such external system is available, the internal default algorithms for semantic extensions can serve as a good alternative, as Graph Codes and their metrics provide a well-defined mathematical model for intelligent information retrieval. An evaluation of our implementation including experiments is given in Section 5.

Explainability
For the implementation of Explainable Semantic Multimedia Feature Graphs, we apply the design patterns Interpreter, Composite, and Facade. The Composite pattern is employed to recursively process ESMMFG nodes and to construct the final human-readable text. The Interpreter pattern is chosen to represent the phrase structure of the underlying grammar, where each existing element (e.g., NP, VP, FVT) is encapsulated by a subclass that is responsible for the correct construction of valid expressions. Finally, the Facade pattern serves as a wrapper and provides a simple API with which explanations of a given MMFG can be generated. A simple call of this module takes two parameters, levelOfDetail and languageLevel. The parameter levelOfDetail defines the number of recursions that should be applied for the generation of natural language text. This directly corresponds to the level of detail of the detected MMIR features. The parameter languageLevel defines the style of the produced natural language text. Currently, there is a selection of simple, medium, and complex. Furthermore, for each step of the MMIR process, a different PS-Tree can be constructed. For example, the method produceComparisonPSTree calculates a PS-Tree with phrases that explain why esmmfg1 has been ranked before esmmfg2 in a result list. The method produceResultPSTree calculates a PS-Tree with phrases that explain why an element is part of the result list. Further, the method produceQueryPSTree constructs a PS-Tree with phrases that outline which query has been calculated, e.g., from a given keyword list or from a query by example pattern. The solution has been implemented in an extendable way, so that further subclasses of LanguageModel can be employed to refine, extend, or newly define human-understandable natural language phrases. Figure 13 shows the result for a given image with the settings levelOfDetail = 2 and languageLevel = Simple. Figure 13 also shows the integration of explainability into the GMAF user interface.
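The role of the recursion depth can be illustrated with a small Java sketch of a composite structure and a facade-style entry point. The names `ExplainSketch`, `Feature`, and `explain` are hypothetical, the sentence template is a simplification, and only the levelOfDetail parameter is modeled here; the languageLevel parameter and the PS-Tree methods of the prototype are not reproduced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: recursion depth limits how deep features are explained.
class ExplainSketch {

    // Composite node: the relation to its parent, its term, and its children.
    record Feature(String relation, String term, List<Feature> children) {}

    // Facade-style entry point: explains the graph down to levelOfDetail.
    static List<String> explain(Feature root, int levelOfDetail) {
        List<String> sentences = new ArrayList<>();
        describe(root, levelOfDetail, sentences);
        return sentences;
    }

    // Emits one sentence per child, then descends one level deeper.
    private static void describe(Feature parent, int depth, List<String> out) {
        if (depth <= 0) {
            return;
        }
        for (Feature child : parent.children()) {
            // Simplified template; always uses the article "a".
            out.add("The " + parent.term() + " " + child.relation()
                    + " a " + child.term() + ".");
        }
        for (Feature child : parent.children()) {
            describe(child, depth - 1, out);
        }
    }

    public static void main(String[] args) {
        Feature hat = new Feature("wears", "hat", List.of());
        Feature person = new Feature("shows", "person", List.of(hat));
        Feature dog = new Feature("shows", "dog", List.of());
        Feature image = new Feature("", "image", List.of(person, dog));

        explain(image, 1).forEach(System.out::println);
        System.out.println("---");
        explain(image, 2).forEach(System.out::println);
    }
}
```

With a depth of 1, only the objects directly attached to the image are described; a depth of 2 additionally describes the features of those objects.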

Evaluation
In this section, we discuss the evaluation of our concepts and algorithms. In previous work [7], we evaluated Graph Code retrieval against existing graph-traversal-based algorithms and were able to prove that its efficiency and effectiveness are superior to graph-based solutions. In the first part of this section, we extend this previous evaluation with experiments on Semantic Graph Codes using images from the Flickr30k and DIV2K datasets. In the second part of this evaluation section, we chose to employ the text sample dataset of the 2021 TREC conference's News evaluation [48] with 600,000 full-text articles from the Washington Post [5] to illustrate semantic retrieval and inference.

Semantic Retrieval
To determine the effectiveness of the Graph Code algorithm, we selected five test queries from the annotations of a random set of 1000 Flickr30k images and calculated values for precision and recall for these [7]. When Graph Codes are transformed into Semantic Graph Codes, the same evaluation employs synonyms and "is-a" relations of the external (or internal) semantic model. In the following experiment, we compared our previous results with results based on external and attached internal semantic models. Table 4 shows the measured results for the queries, where rel denotes the relevant results and sel the matching results. The columns Precision P_B, Recall R_B, and F1-score F1_B ($F1 = \frac{2 \cdot P_B \cdot R_B}{P_B + R_B}$) contain values for the basis experiment without any semantic extension. The values in columns P_I, R_I, and F1_I are calculated for the internal semantic analysis, and the values in columns P_E, R_E, and F1_E are calculated when using an external semantic model, in this case the Wikidata extension. The findings of this experiment can be summarized as follows:
• Any semantic enrichment increases the values for precision and recall (summarized by their F1 value) by 18% (see bold in Table 4).
• An additional 4% increase can be achieved when an external semantic system is connected.
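For reference, the three measures can be computed as in the following short Java sketch. The counts used in `main` are made-up example values and are not taken from Table 4; the parameter `truePositives` (the number of selected results that are also relevant) is an assumption about how the counts are obtained.

```java
// Hypothetical helper for precision, recall, and F1; example values only.
class RetrievalMetricsSketch {

    // Precision: share of selected results that are relevant.
    static double precision(int truePositives, int sel) {
        return sel == 0 ? 0.0 : (double) truePositives / sel;
    }

    // Recall: share of relevant results that were selected.
    static double recall(int truePositives, int rel) {
        return rel == 0 ? 0.0 : (double) truePositives / rel;
    }

    // F1: harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int rel = 16;            // relevant results (assumed example value)
        int sel = 10;            // selected results (assumed example value)
        int truePositives = 8;   // selected and relevant (assumed example value)

        double p = precision(truePositives, sel);
        double r = recall(truePositives, rel);
        System.out.printf("P=%.2f R=%.2f F1=%.3f%n", p, r, f1(p, r));
    }
}
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a convenient single value for comparing the three configurations.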
These effectiveness results currently apply to an image dataset [49]; however, in many MMIR applications, the effectiveness of text retrieval is also important. Hence, in the next subsection, we discuss the evaluation of our algorithms based on text datasets.

Text Retrieval and Inference
For the evaluation of text retrieval, we employed the TREC2020 dataset of the Washington Post News Archive [5,48] and followed the evaluation criteria of the TREC2021 challenge, which is based on Similarity (Top-10) and Recommendations (Top-10). For both tasks, the calculation of a semantic model of each text is required, in our case an MMFG and the corresponding SGC, including the application of a metric. As discussed in [24], similarity can be calculated by applying M_F based on semantic vocabulary terms, and recommendations are calculated by applying M_RT. For this evaluation, we measured effectiveness based on the published results of previous years. In the first test scenario, we applied a standard "Bag-Of-Words" algorithm without any semantic enhancement. The second test scenario then employs the full semantic analysis and features described in Section 3, but does not yet include reasoning and inferencing. This is added in the third experiment. In the fourth scenario, we attached an external framework (Wikidata) and compared all the results to the TREC reference results. We measured the values P_Sim and R_Sim as Precision and Recall of Similarity (i.e., whether the retrieved documents are in the Top-10 list), P_Rec and R_Rec as Precision and Recall for the Recommendation Top-10, and the corresponding F1_Sim and F1_Rec values.
This experiment shows that the introduction of semantics to text analysis provides an increase in effectiveness of 150% (see Table 5). A difference between the external and the internal implementation exists, but it is not very significant and is highly dependent on the dataset and the external system. A more detailed evaluation of external systems can provide further insight; however, in the context of this paper, we are able to prove that the concepts of Section 3 are also valid for text retrieval and provide a significant increase in effectiveness.

Explainability
In addition to these results in the area of efficiency and effectiveness, explainability has also been evaluated. For this evaluation, we generated various texts describing a number of typical MMIR asset types and their corresponding features. An exemplary text is shown in Figure 13.
Thus, in this evaluation section, we were able to show that semantic enrichment of MMIR applications provides a significant increase in retrieval effectiveness. We also showed that explainability offers considerable potential. In particular, in combination with Semantic Graph Codes, topic modeling, and intelligent information retrieval, improvements of up to 150% can be achieved. Finally, in the next section, we summarize our results.

Conclusions and Future Work
In this paper, we discussed concepts and algorithms for narrowing the semantic gap [1], i.e., the gap between detected features in multimedia objects and the meaning of these features. We introduced a well-defined semantic representation of the MMFG and enhanced the concept of Graph Codes to fully support semantic querying, filtering, reasoning, and inferencing. Both external and internal implementations of the semantic model can be attached to the GMAF, which opens a wide range of extensions to existing MMIR applications and standards. In addition, we showed that our internal default algorithms are also highly effective and can serve as a solid basis for further implementations. In our evaluation of both image- and text-based datasets, the results of our Semantic Graph Code algorithms and the corresponding concepts give evidence for our modeling approach. Hence, Semantic Graph Codes are an effective and efficient foundation for automated reasoning and inferencing for any MMIR application.
However, there are still some remaining challenges: the presentation of inferencing conflicts to the user, the implementation of additional integration with existing SKOS systems, further implementation of language models, and the evaluation of our implementation with further datasets. These challenges will be part of our ongoing and future work, which is also frequently updated in our GitHub repository [9].

Data Availability Statement:
The data presented in this study are openly available in [9].

Conflicts of Interest:
The authors declare no conflict of interest.