This section discusses the evaluation of the POC. Following the structure of the previous sections, selected experiments are presented for each of the problem areas; they outline the overall improvement of MMIR achieved by employing Smart MMIR approaches. First, an evaluation of the integrability of the Smart MMIR components is given in
Section 5.2. Then, experiments in the area of scalability are presented in
Section 5.3, and finally, in
Section 5.4, results in the area of explainability are presented.
5.1. Soundness
The introduction of soundness provides additional insight and further expressiveness to users, which can be regarded as a major improvement of explainability in MMIR applications. Hence, the following discussion presents further experiments and their results, which demonstrate the benefits of soundness in various application areas.
The detection of security-relevant traffic scenes is one major task in the area of
automotive and autonomous driving. The introduction of
soundness can contribute to this task by comparing the actual traffic scene to expected or uncritical and secure traffic scenes. One major advantage of this is that the calculation of
soundness reduces to simple matrix operations, which can be performed very fast, even in real time; this is highly important in the area of autonomous driving. In the following experiment, we investigated if and how
soundness can be employed to assess whether the behavior of cyclists can be regarded as safe or whether a higher risk of injury has to be expected in the case of an accident. For this purpose, we took legal texts as
sound input, which define the recommendations for safe cycling (such as wearing a helmet) and created a
graph code of this text. Then, a set of images was processed with the GMAF to also calculate the corresponding
graph codes. The images were taken from Adobe Stock [
35] (see
Figure 12).
The graph code of the legal text contained vocabulary terms and relationships that described, for example, that wearing a helmet is safe, that handling a smartphone while riding is not safe, etc. In total, it had 132 vocabulary terms and the corresponding relationships. For this experiment, we did not use the intersection of the legal-text graph code and the image graph codes, as this would lead to a loss of relevant safety parameters. Instead, we decided to keep all 132 vocabulary terms and relationships as input for the calculation of
soundness. In total, 250 images were processed in this way. The results show that no image fully complies with all vocabulary terms and relationships, and thus none provides a perfectly sound result. This was, of course, expected, as the legal texts and their transformation into
graph codes as well as the object detection algorithms employed within the GMAF produced slightly different levels of features. Even after a semantic analysis based on
, there was no perfectly sound result. However, the experiment shows that most images of the chosen dataset produce a
soundness value of 0.7–0.8 (see the example images shown in
Figure 12a). Some images show a significantly lower value, as shown in
Figure 12b with
and
Figure 12c with
. A visual examination shows that images with lower soundness values contain indicators of safety violations, such as not wearing a helmet or handling a smartphone while cycling.
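To illustrate why such a comparison reduces to plain matrix lookups, the following minimal sketch scores a candidate graph code against a sound graph code. The scoring rule (share of non-zero cells of the sound graph code that the candidate reproduces), the function name `soundness`, and the numeric relationship codes are assumptions made for this illustration; they are not the definition used by the GMAF. Note that the loop runs over all terms of the sound graph code rather than over the intersection, so terms missing from the candidate lower the score.

```python
# Hypothetical soundness score: NOT the GMAF definition, only an illustration.
def soundness(sound_terms, sound_matrix, cand_terms, cand_matrix):
    """Share of non-zero cells of the sound graph code reproduced by the candidate."""
    position = {term: k for k, term in enumerate(cand_terms)}
    total = matched = 0
    for i, row_term in enumerate(sound_terms):
        for j, col_term in enumerate(sound_terms):
            expected = sound_matrix[i][j]
            if expected == 0:
                continue  # empty cell: nothing is prescribed here
            total += 1
            if row_term in position and col_term in position:
                if cand_matrix[position[row_term]][position[col_term]] == expected:
                    matched += 1
    return matched / total if total else 0.0


# Toy sound graph code: diagonal 1 = term present, 2 = assumed code for "wears".
sound_terms = ["cyclist", "helmet"]
sound_matrix = [[1, 2],
                [0, 1]]

# Candidate A: cyclist wearing a helmet on a road.
a_terms = ["cyclist", "helmet", "road"]
a_matrix = [[1, 2, 0],
            [0, 1, 0],
            [0, 0, 1]]

# Candidate B: cyclist holding a smartphone (3 = assumed code for "holds").
b_terms = ["cyclist", "smartphone"]
b_matrix = [[1, 3],
            [0, 1]]

score_a = soundness(sound_terms, sound_matrix, a_terms, a_matrix)
score_b = soundness(sound_terms, sound_matrix, b_terms, b_matrix)
print(score_a, score_b)
```

In this toy setting, candidate A reproduces every prescribed cell, while candidate B only reproduces the presence of the cyclist and therefore receives a lower score.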
Another area where
soundness can support MMIR processes is the area of
news and fake news. As an underlying dataset, we selected the text archive of the Washington Post [
36], which is also part of the reference datasets of the TREC conference [
37] and contains about 750,000 articles in machine-readable JSON format (see
Figure 13a). These articles were processed into
graph codes (see
Figure 13b).
Based on these prerequisites, we conducted two experiments. First, the
soundness between two articles in the same topic area was calculated. Second, the
soundness parameter was employed to identify contradicting documents within the same topic area. Both experiments require articles from a similar topic area, since comparing, for example, sports articles with articles on international politics would not be meaningful. As a starting point, we selected an article about “Coyotes in Maryland” that was also used at the TREC 2021 conference (see
Figure 14).
Based on this starting point, different datasets were selected for both experiments, and soundness was calculated between the base article and each element of the datasets. For the first experiment, a similarity search (based on
) was performed to define the dataset. For the second experiment, a search for recommendations (i.e., loosely related articles) based on
was performed to define the dataset. The expectation was that similar articles would mostly be
sound, whereas the recommendations could also contain contradicting elements. In this manner, we selected 25 documents for each experiment, the results of which are shown in
Table 1.
In the first row of
Table 1, the input document (see
Figure 14) with Doc-Id “c23f5d3face1” is processed and, as expected, achieves the highest possible values for similarity, recommendation, and soundness. In the remainder of
Table 1, the other documents of the 25 selected items and the corresponding processing values are shown. The last row in the table with Doc-Id “fake news” contains an article that was rewritten based on the original text (see
Figure 14) with the narrative “As birds have moved into the area other animals such as coyotes have been driven out. This can lead to the downturn of the number of other animals killed by the birds. While birds are natural predators, which get rid of coyotes, they also have an impact by attacking people and their pets”. In essence, the terms “coyote”, “bird”, and “other animals” were swapped to produce a fake news article.
The results for
soundness in this experiment show that
soundness is independent of similarity and recommendation. Furthermore, they indicate that soundness can be employed for fake news detection, as the value for the manually produced fake article is significantly lower than the values for the other articles. We assume that a combination of all
Graph Code metrics and
soundness will deliver the best fake detection results. This will be further elaborated as part of future work. However, even this experiment shows that soundness can provide a highly relevant measure. Furthermore, it is important to highlight that the calculation of soundness reduces to simple matrix operations, which can be processed easily, efficiently, and even in parallel. This will be shown in the experiments in
Section 5.3. Additionally,
feature fusion can serve as a further means of compressing the
graph codes for processing. This is shown in the next section.
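The observation that a rewritten article receives a clearly lower soundness value than the other articles in the same topic area could be turned into a simple screening step. The sketch below flags documents whose soundness lies far below the median of the candidate set. The function name `flag_low_soundness`, the cutoff rule, and all scores are hypothetical placeholders chosen for illustration; they are not values from Table 1.

```python
from statistics import median


def flag_low_soundness(scores, ratio=0.5):
    """Return the ids whose soundness is below `ratio` times the set's median."""
    cutoff = ratio * median(scores.values())
    return sorted(doc_id for doc_id, value in scores.items() if value < cutoff)


# Placeholder scores for illustration only (not measured results).
scores = {"doc-a": 0.82, "doc-b": 0.78, "doc-c": 0.80, "rewritten": 0.21}
flagged = flag_low_soundness(scores)
print(flagged)
```

A relative cutoff is used here because absolute soundness values can be expected to depend on the topic area and the vocabulary; whether the median is a suitable reference would need to be validated on real data.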
5.3. Scalability Area
In the area of scalability, several quantitative experiments were conducted to further refine and detail the set of experiments already shown in [
2].
Figure 16 shows the corresponding results. The details of this extended evaluation are given in
Table 2 and
Table 3 based on the number of input images
c, the number of calculated MMFG nodes
n, the corresponding edge number
e, and the Neo4J runtime with
(i.e., Neo4J compares up to three links between nodes for similarity). The
and
columns show the runtime of the corresponding GMAF implementation. The evaluation of scalability is shown in
Table 4 based on
n nodes,
i GMAF instances, the number
a of multimedia objects per instance, and the runtime
t for the execution of the experiment. Furthermore, in
Table 5, the overall runtime based on the number of physical servers for horizontal scaling
and the number of instances per physical server
is evaluated. Finally,
Table 6 shows the parallelization (i.e., vertical scaling) based on CPU and GPU implementations of the
graph code algorithms.
In
Figure 16a, a comparison of the runtime of a similarity search based on graphs (blue) and
graph codes (red) is shown. For the graph calculations, a standard Neo4J database [
38] was employed and the calculated MMFGs were inserted. On the GMAF side, a standard Java implementation of the above-mentioned metrics was employed for this comparison. Both variants were executed on the same machine. The results of this experiment show that
graph codes scale better (linear vs. polynomial or exponential) than graph-based algorithms. In this experiment, a speedup of factor 20 was achieved; however, the switch to linear complexity is, of course, even more important than the absolute numbers.
Figure 16b shows the results of a runtime measurement for a horizontal distribution of GMAF instances, which perform
graph code-based operations. This also shows that the overall runtime of query processing can be reduced significantly by adding further nodes to a GMAF setup. The optimal number of nodes for this particular experiment is between 8 and 10 and leads to an improvement of the overall processing time by a factor of about 8 (8 nodes with a processing time of 81 s vs. 1 node with a processing time of 635 s). For this experiment, large collections containing 750,000 elements were employed to obtain reliable results for the possible speedup.
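The idea of horizontal distribution can be sketched as follows: the collection is split into shards, each instance ranks its own shard, and the partial results are merged. This is a minimal sketch under stated assumptions, not the GMAF implementation: threads stand in for separate GMAF instances, and `partition`, `shard_top_k`, `distributed_top_k`, and the toy `score` function are hypothetical names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def partition(items, instances):
    """Split items into `instances` round-robin shards of roughly equal size."""
    return [items[k::instances] for k in range(instances)]


def shard_top_k(shard, score, k):
    """Rank a single shard and keep its k best (score, item) pairs."""
    return heapq.nlargest(k, ((score(item), item) for item in shard))


def distributed_top_k(items, score, k, instances):
    """Query all shards concurrently and merge the partial rankings."""
    shards = partition(items, instances)
    with ThreadPoolExecutor(max_workers=instances) as pool:
        partials = list(pool.map(lambda shard: shard_top_k(shard, score, k), shards))
    candidates = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, candidates)


# Toy collection and toy metric: items closer to 500 count as more similar.
items = list(range(1000))
score = lambda item: -abs(item - 500)

single = distributed_top_k(items, score, 3, instances=1)
spread = distributed_top_k(items, score, 3, instances=8)
print([item for _, item in spread])
```

Because every shard returns its own top k entries, the merged ranking is identical to the ranking computed by a single instance; only the work per instance shrinks. In CPython, threads do not speed up CPU-bound work, so a real deployment would run the shards in separate processes or on separate servers.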
Figure 16c shows both the result values and a diagram of an experiment for vertical scaling on different hardware. In particular, the CUDA implementation for NVIDIA GPUs was evaluated here. This experiment showed that a significant improvement can also be achieved within a single GMAF instance by enabling parallel processing. In this example, a speedup of factor 40 was measured, which is limited only by the number of parallel processing units on the GPU. If, theoretically, the whole collection fits into the GPU memory, any MMIR processing can be performed in a single step, producing results immediately.
Depending on the application, these three scaling methods can be flexibly combined and integrated with each other. If the speedups measured in these experiments are combined, the overall processing time can be reduced by a factor of 6400, the product of the three individual speedups. This means that when the previous processing of an MMIR request takes, for example, 6400 s (i.e., about one hour and 47 min), the same request can be resolved with Smart MMIR in a single second.
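The combined factor can be made explicit by multiplying the three speedups reported above (factor 20 for graph codes vs. graphs, a factor of about 8 for horizontal scaling, and factor 40 for the GPU implementation). This assumes that the three effects compose multiplicatively, which the individual experiments do not verify; the symbols below are introduced only for this illustration.

```latex
\begin{align*}
S_{\text{total}} &= S_{\text{graph code}} \cdot S_{\text{horizontal}} \cdot S_{\text{vertical}}
                  \approx 20 \cdot 8 \cdot 40 = 6400, \\
t_{\text{Smart MMIR}} &= \frac{t_{\text{previous}}}{S_{\text{total}}}
                       = \frac{6400\,\text{s}}{6400} = 1\,\text{s}.
\end{align*}
```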