Author Contributions
Conceptualization, R.S., G.T., P.T.; methodology, R.S., G.T., P.T.; software, R.S.; formal analysis, R.S., G.T., P.T.; writing—original draft preparation, R.S.; writing—review and editing, G.T., P.T.; visualization, R.S.; supervision, G.T., P.T.; project administration, P.T.; funding acquisition, P.T. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Difference between spatio-temporal and semantic predictive coding. The top part of the figure demonstrates how spatio-temporal features are utilized for predicting the next frame in video sequences characterized by smooth dynamics. In the bottom part, the proposed scheme predicts the next element in sequences of challenging handwritten digits and letters, as well as potentially more abstract concepts, such as the next item to buy or movie to watch in a recommendation system.
Figure 2.
A brief illustration of the proposed framework. The generator is provided with an image sequence of ordered alphanumerics (digits or letters). The generated output is then propagated to the discriminator and the arbiter. The role of the discriminator is to distinguish between images originating from the generator and images from the initial dataset, resulting in an adversarial loss. The role of the arbiter is the categorization of the generator’s output based on its semantic meaning. Because the arbiter has already been competently trained as a classifier of digits (or letters, alternatively), its weights need no further updates; the backpropagated gradients are used exclusively to optimize the generator.
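The training signal described in the caption above can be sketched as a composite generator loss. This is an illustrative sketch, not the paper's exact formulation: the weighting factor `lam` and the non-saturating adversarial term are assumptions introduced here for clarity.

```python
import math

def cross_entropy(probs, target_idx):
    """Classification loss the (frozen) arbiter assigns to a generated image,
    given its predicted class probabilities and the semantically correct class."""
    return -math.log(probs[target_idx])

def generator_objective(d_prob_fake, arbiter_probs, target_idx, lam=1.0):
    """Combined generator loss: an adversarial term (fool the discriminator)
    plus the arbiter's classification term (produce the correct next symbol).
    Only the generator's weights are updated with this loss; the arbiter stays
    frozen and merely backpropagates gradients into the generator."""
    adv = -math.log(d_prob_fake)  # non-saturating adversarial term (assumption)
    sem = cross_entropy(arbiter_probs, target_idx)
    return adv + lam * sem
```

A perfect generation (discriminator fully fooled, arbiter fully confident in the correct class) drives both terms, and hence the total loss, to zero.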
Figure 3.
Provided that we feed the generator with a specific sequence of associated symbols (the digits ‘5’, ‘6’, and ‘7’ in this example), we expect in return a visual response (the digit ‘8’) that will contextually match the given input. The digits in the figure have been obtained from the MNIST dataset.
Figure 4.
Block diagram of the generator’s functionality.
Figure 5.
Block diagram of the discriminator’s and the arbiter’s functionalities.
Figure 6.
Evolution of the generation operation as the number of training iterations increases.
Figure 7.
Indicative generative results during inference. In each case, the smaller upper symbols represent the input of the generator (unseen new data), while the larger single symbol below corresponds to the predicted outcome. For example, in the first case, we feed the generator with the letters (‘h’, ‘i’, ‘j’, ‘k’) and, as a result, obtain the letter ‘l’, which is clearly correct.
Figure 8.
Illustrating the quality of two different chained predictions, at inference, for the digits’ dataset. In each case, the smaller upper digits represent the input of the generator, while the larger single digit corresponds to the predicted outcome. In the first link of each chain, the input exclusively consists of samples originating from the test dataset. In the last link of each chain, the input exclusively consists of samples that G generated in the previous links.
Figure 9.
Illustrating the quality of two different chained predictions, at inference, for the letters’ dataset. In each case, the smaller upper letters represent the input of the generator, while the larger single letter corresponds to the predicted outcome. In the first link of each chain, the input exclusively consists of samples originating from the test dataset. In the last link of each chain, the input exclusively consists of samples that G generated in the previous links.
Figure 10.
A single-input chained prediction during inference. The generator receives an input image of the digit ‘7’ from the test set and correctly produces the next digit ‘8’ as its response. Subsequently, the resulting image of ‘8’ is fed back to G as a new input, yielding the digit ‘9’ as the next output, and so forth.
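The chained-inference procedure, in which each prediction re-enters the generator's input window, can be sketched as follows. Here `toy_generator` is a hypothetical stand-in that treats symbols as integers and is not the paper's model; the sliding-window update is the point of the sketch.

```python
def chained_prediction(generator, seed, n_links):
    """Run the generator for n_links steps, feeding each prediction back
    as part of the next input window (sliding-window chaining)."""
    window = list(seed)
    predictions = []
    for _ in range(n_links):
        nxt = generator(window)
        predictions.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest element, append the prediction
    return predictions

# Toy stand-in for G: symbols as integers, "next" is the last element + 1.
toy_generator = lambda window: window[-1] + 1
print(chained_prediction(toy_generator, [5, 6, 7], 3))  # → [8, 9, 10]
```

The same loop covers the single-input case of the figure: a seed of length one (e.g., `[7]`) simply yields `[8, 9, …]` link by link.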
Figure 11.
Characteristic examples of generated symbols (digits) during inference with the adoption of the Arbitrated Generative Adversarial Network (A-GAN) loss versus the utilization of the -GAN loss. In both cases, we contrast images from the highest performing models in terms of the derived accuracy.
Figure 12.
Characteristic examples of generated symbols (letters) during inference with the adoption of the A-GAN loss versus the utilization of the -GAN loss. In both cases, we contrast images from the highest performing models in terms of the derived accuracy.
Table 1.
Inference accuracy results in the chained prediction scenario for a highly capable model and for a model exhibiting considerable performance deterioration from link to link. The results correspond to the digits’ dataset.
| | 1st in Chain | 2nd in Chain | 3rd in Chain | 4th in Chain | 5th in Chain |
|---|---|---|---|---|---|
| Potent Model | 99.50% | 99.34% | 99.36% | 99.06% | 98.74% |
| Underperforming Model | 97.52% | 92.22% | 83.74% | 73.46% | 61.01% |
Table 2.
Inference accuracy results in the chained prediction scenario for a highly capable model and for a model exhibiting considerable performance deterioration from link to link. The results correspond to the letters’ dataset.
| | 1st in Chain | 2nd in Chain | 3rd in Chain | 4th in Chain | 5th in Chain |
|---|---|---|---|---|---|
| Potent Model | 95.84% | 94.88% | 93.08% | 92.36% | 91.28% |
| Underperforming Model | 88.18% | 85.90% | 80.08% | 72.29% | 60.71% |
Table 3.
Inference accuracy results in the chained prediction scenario for different cardinalities (values of t) of the training/testing input sequence. The results correspond to the digits’ dataset.
| | 1st in Chain | 2nd in Chain | 3rd in Chain | 4th in Chain | 5th in Chain |
|---|---|---|---|---|---|
| 4 input frames | 99.50% | 99.34% | 99.36% | 99.06% | 98.74% |
| 3 input frames | 99.06% | 98.68% | 98.04% | 97.90% | 96.18% |
| 2 input frames | 98.08% | 97.46% | 96.86% | 95.46% | 92.92% |
| 1 input frame | 97.44% | 96.20% | 95.08% | 93.16% | 90.34% |
Table 4.
Inference accuracy results in the chained prediction scenario for different cardinalities (values of t) of the training/testing input sequence. The results correspond to the letters’ dataset.
| | 1st in Chain | 2nd in Chain | 3rd in Chain | 4th in Chain | 5th in Chain |
|---|---|---|---|---|---|
| 4 input frames | 95.84% | 94.88% | 93.08% | 92.36% | 91.28% |
| 3 input frames | 94.54% | 92.82% | 91.45% | 89.72% | 87.13% |
| 2 input frames | 91.10% | 87.40% | 82.20% | 79.24% | 73.68% |
| 1 input frame | 85.50% | 76.12% | 67.24% | 60.62% | 54.02% |
Table 5.
Comparing the performance of various GAN models at inference, when using the arbiter’s classification loss versus the utilization of a pixel error-based loss, namely the loss. In bold, the best performing model of each corresponding comparison.
| | 1st in Chain | 2nd in Chain | 3rd in Chain | 4th in Chain | 5th in Chain |
|---|---|---|---|---|---|
| A-GAN best case—digits | 99.50% | 99.34% | 99.36% | 99.06% | 98.74% |
| -GAN best case—digits | 99.26% | 99.34% | 99.20% | 99.04% | 98.71% |
| A-GAN best case—letters | 95.84% | 94.88% | 93.08% | 92.36% | 91.28% |
| -GAN best case—letters | 83.23% | 82.58% | 82.16% | 81.24% | 80.84% |