# Summarization of Videos with the Signature Transform

## Abstract


## 1. Introduction and Problem Statement

## 2. Signature Transform

### 2.1. RMSE and MAE Signature and Log-Signature

**Definition 1.**

### 2.2. Summarization of Videos with RMSE Signature

- ${\overline{S}}_{*}$: Element-wise mean Signature Transform of the target summary of the corresponding video;
- $\overline{S}$: Element-wise mean Signature Transform of a uniform random sample of the corresponding video;
- $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$: Root mean squared error between the spectra of $\overline{S}$ and ${\overline{S}}_{*}$ with the same summary length. For the computation of the standard deviation and mean, this value is calculated ten times, resampling $\overline{S}$ each time;
- $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$: Root mean squared error between the spectra of two independent uniform random samples $\overline{S}$ with the same summary length. For the computation of the standard deviation and mean, this value is calculated ten times, resampling both samples each time;
- $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{n}$: Baseline based on the Signature Transform. It corresponds to $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ where ${\overline{S}}_{*}$ is replaced by a fixed uniform random sample, denoted ${\overline{S}}_{u}$. This procedure is repeated $n$ times, and the candidate with the minimum standard deviation, ${\overline{S}}_{{u}_{min}}$, is proposed as the summary;
- $\mathrm{std}(\cdot)$: Standard deviation.
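As a concrete illustration, the quantities above can be sketched in a few lines of NumPy. The depth-2 truncated signature below is computed directly from path increments via Chen's relation rather than with a dedicated package such as Signatory; the frame representation (each frame as a short stream of feature vectors) and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def signature_depth2(path):
    """Truncated signature (depth 2) of a piecewise-linear path.

    path: (N, d) array of points. Returns the concatenation of the
    level-1 (d terms) and level-2 (d*d terms) signature coefficients.
    """
    s1 = np.zeros(path.shape[1])
    s2 = np.zeros((path.shape[1], path.shape[1]))
    for dx in np.diff(path, axis=0):
        # Chen's relation for appending one linear segment
        s2 += np.outer(s1, dx) + 0.5 * np.outer(dx, dx)
        s1 += dx
    return np.concatenate([s1, s2.ravel()])

def mean_signature(frames):
    """Element-wise mean of the per-frame signatures (S-bar)."""
    return np.mean([signature_depth2(f) for f in frames], axis=0)

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

def rmse_summary_vs_random(video, summary_idx, n_repeats=10, seed=0):
    """Std and mean of RMSE(S-bar, S-bar_*) over repeated uniform samples.

    video: list of per-frame streams (each an (N, d) array -- an
    assumed stand-in for the actual frame representation).
    """
    rng = np.random.default_rng(seed)
    s_star = mean_signature([video[k] for k in summary_idx])
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(len(video), size=len(summary_idx), replace=False)
        scores.append(rmse(mean_signature([video[k] for k in idx]), s_star))
    return float(np.std(scores)), float(np.mean(scores))
```

The reported std and mean then follow directly from the list of repeated scores; $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ is obtained the same way with the target summary replaced by a second random sample.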

## 3. Summarization of Videos via Text-Conditioned Object Detection

## 4. Experiments: Dataset and Metrics

### 4.1. Assessment of the Metrics

- The proposed metrics demonstrate that human evaluators can perform above average during the task, effectively capturing the dominant harmonic frequencies present in the video.
- Another crucial aspect to emphasize is that the metrics are able to evaluate human annotators with fair criteria and identify which subjects are creating competitive summaries.
- Moreover, the observations from this study indicate that the metrics serve as a reliable proxy for evaluating summaries without the need for annotated data, as they correlate strongly with human annotations.

- Annotations with lower standard deviations offer a better harmonic representation of the overall video;
- Annotations with higher standard deviations suggest that important harmonic components are missing from the given summary;
- The metrics make it simple to identify annotated summaries that may need to be relabeled for improved accuracy.
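Operationally, this screening reduces to comparing each annotator's standard deviation against the reference standard deviation of random uniform samples at the same summary length. The helper below is a minimal sketch under that reading; the function name and the relabel criterion (std above the random-sample reference) are our assumptions.

```python
def screen_annotations(target_std, random_std):
    """Rank human-annotated summaries by std of RMSE(S-bar, S-bar_*).

    target_std: dict user -> std of RMSE(S-bar, S-bar_*) over repeated
        uniform samples (lower = better harmonic coverage).
    random_std: dict user -> std of RMSE(S-bar, S-bar) at the same
        summary length, used as the reference level.
    Returns the most competitive annotator and the sorted list of users
    whose summaries exceed the reference and may need relabelling.
    """
    best = min(target_std, key=target_std.get)
    relabel = sorted(u for u in target_std if target_std[u] > random_std[u])
    return best, relabel
```

With the V11 figures from Table 3, for example, user 2 (std 13,673 against a reference of 15,479) is selected as the most competitive summary, and no summary is flagged for relabelling.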

- Content-based: the Signature Transform is a content-based approach that captures the salient features of the video data. This means that the proposed measure does not rely on manual annotations or subjective human ratings, which can be time-consuming and prone to biases.
- Robustness: the Signature Transform is a robust feature extraction technique that can handle different types of data, including videos with varying frame rates, resolutions, and durations. This means that the proposed measure can be applied to a wide range of video datasets without the need for pre-processing or normalization.
- Efficiency: the Signature Transform is a computationally efficient approach that can be applied to large-scale datasets. This means that the proposed measure can be used to evaluate the effectiveness of visual summaries quickly and accurately.
- Flexibility: the Signature Transform can be applied to different types of visual summaries, including keyframe-based and shot-based summaries. This means that the proposed measure can be used to evaluate different types of visual summaries and compare their effectiveness.
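For completeness, the unsupervised baseline $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{n}$ of Section 2.2 can be sketched as follows. The per-frame descriptor (averaged element-wise here, standing in for the element-wise mean Signature Transform) and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

def signature_baseline(features, m, n=10, repeats=10, seed=0):
    """RMSE(S-bar, S-bar_u_min)|_n: among n candidate uniform random
    samples of m frames, keep the one whose repeated RMSE against fresh
    uniform samples has the smallest standard deviation.

    features: (num_frames, d) per-frame descriptors standing in for the
    per-frame signature terms (an assumed simplification).
    """
    rng = np.random.default_rng(seed)
    num_frames = len(features)

    def mean_sig(idx):
        return features[idx].mean(axis=0)  # element-wise mean "signature"

    best_idx, best_std = None, np.inf
    for _ in range(n):
        cand = np.sort(rng.choice(num_frames, size=m, replace=False))
        scores = [rmse(mean_sig(rng.choice(num_frames, size=m, replace=False)),
                       mean_sig(cand))
                  for _ in range(repeats)]
        if np.std(scores) < best_std:
            best_idx, best_std = cand, float(np.std(scores))
    return best_idx, best_std
```

The returned frame indices form the proposed summary; $n=10$ and $n=20$ correspond to the two baselines reported in the tables.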

### 4.2. Evaluation

## 5. Conclusions and Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

- DNN: Deep Neural Networks
- AMT: Amazon Mechanical Turk
- RMSE: Root Mean Squared Error
- MAE: Mean Absolute Error
- VLM: Visual Language Models
- LLM: Large Language Models
- GAN: Generative Adversarial Networks
- CLIP: Contrastive Language–Image Pre-training
- LSTM: Long Short-Term Memory
- RNN: Recurrent Neural Network
- NLP: Natural Language Processing
- CPU: Central Processing Unit
- MOS: Mean Opinion Score

## References

1. de Avila, S.E.F.; Lopes, A.; da Luz, A., Jr.; de Albuquerque Araújo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognit. Lett. **2011**, 32, 56–68.
2. Gygli, M.; Grabner, H.; Gool, L.V. Video summarization by learning submodular mixtures of objectives. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
3. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014.
4. Kanehira, A.; Gool, L.V.; Ushiku, Y.; Harada, T. Viewpoint-aware video summarization. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
5. Liang, G.; Lv, Y.; Li, S.; Zhang, S.; Zhang, Y. Video summarization with a convolutional attentive adversarial network. Pattern Recognit. **2022**, 131, 108840.
6. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5179–5187.
7. Zhu, W.; Lu, J.; Han, Y.; Zhou, J. Learning multiscale hierarchical attention for video summarization. Pattern Recognit. **2022**, 122, 108312.
8. Ngo, C.-W.; Ma, Y.-F.; Zhang, H.-J. Automatic video summarization by graph modeling. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003.
9. Fajtl, J.; Sokeh, H.S.; Argyriou, V.; Monekosso, D.; Remagnino, P. Summarizing videos with attention. In Proceedings of the Computer Vision—ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018.
10. Zhu, W.; Lu, J.; Li, J.; Zhou, J. DSNet: A flexible detect-to-summarize network for video summarization. IEEE Trans. Image Process. **2020**, 30, 948–962.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
12. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780.
13. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
14. de Curtò, J.; de Zarzà, I.; Yan, H.; Calafate, C.T. On the applicability of the Hadamard as an input modulator for problems of classification. Softw. Impacts **2022**, 13, 100325.
15. de Zarzà, I.; de Curtò, J.; Calafate, C.T. Detection of glaucoma using three-stage training with EfficientNet. Intell. Syst. Appl. **2022**, 16, 200140.
16. Dwivedi, K.; Bonner, M.F.; Cichy, R.M.; Roig, G. Unveiling functions of the visual cortex using task-specific deep neural networks. PLoS Comput. Biol. **2021**, 17, e100926.
17. Dwivedi, K.; Roig, G.; Kembhavi, A.; Mottaghi, R. What do navigation agents learn about their environment? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10276–10285.
18. Rakshit, S.; Tamboli, D.; Meshram, P.S.; Banerjee, B.; Roig, G.; Chaudhuri, S. Multi-source open-set deep adversarial domain adaptation. In Proceedings of the Computer Vision—ECCV: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 735–750.
19. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI; Springer: Cham, Switzerland, 2015.
20. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Le, Q.V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
21. Thao, H.; Balamurali, B.; Herremans, D.; Roig, G. AttendAffectNet: Self-attention based networks for predicting affective responses from movies. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 8719–8726.
22. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial LSTM networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
23. Zhang, K.; Chao, W.-L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the Computer Vision–ECCV: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016.
24. Zhao, B.; Li, X.; Lu, X. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017.
25. Rochan, M.; Ye, L.; Wang, Y. Video summarization using fully convolutional sequence networks. In Proceedings of the Computer Vision–ECCV: 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018.
26. Yuan, L.; Tay, F.E.; Li, P.; Zhou, L.; Feng, J. Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
27. Zhang, K.; Grauman, K.; Sha, F. Retrospective encoders for video summarization. In Proceedings of the Computer Vision–ECCV: 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018.
28. Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the Association for the Advancement of Artificial Intelligence Conference (AAAI), New Orleans, LA, USA, 2–7 February 2018.
29. Narasimhan, M.; Rohrbach, A.; Darrell, T. CLIP-It! Language-guided video summarization. Adv. Neural Inf. Process. Syst. **2021**, 34, 13988–14000.
30. Plummer, B.A.; Brown, M.; Lazebnik, S. Enhancing video summarization via vision-language embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
31. Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkilä, J. Rethinking the evaluation of video summaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
32. de Curtò, J.; de Zarzà, I.; Yan, H.; Calafate, C.T. Signature and Log-signature for the study of empirical distributions generated with GANs. arXiv **2022**, arXiv:2203.03226.
33. Lyons, T. Rough paths, signatures and the modelling of functions on streams. arXiv **2014**, arXiv:1405.4537.
34. Bonnier, P.; Kidger, P.; Arribas, I.P.; Salvi, C.; Lyons, T. Deep signature transforms. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
35. Chevyrev, I.; Kormilitzin, A. A primer on the signature method in machine learning. arXiv **2016**, arXiv:1603.03788.
36. Kidger, P.; Lyons, T. Signatory: Differentiable computations of the signature and logsignature transforms, on both CPU and GPU. arXiv **2020**, arXiv:2001.00706.
37. Liao, S.; Lyons, T.J.; Yang, W.; Ni, H. Learning stochastic differential equations using RNN with log signature features. arXiv **2019**, arXiv:1908.0828.
38. Morrill, J.; Kidger, P.; Salvi, C.; Foster, J.; Lyons, T.J. Neural CDEs for long time series via the log-ODE method. arXiv **2021**, arXiv:2009.08295.
39. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. arXiv **2022**, arXiv:2204.14198.
40. Gu, X.; Lin, T.-Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv **2022**, arXiv:2104.13921.
41. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv **2022**, arXiv:2206.07682.
42. de Curtò, J.; de Zarzà, I.; Calafate, C.T. Semantic scene understanding with large language models on unmanned aerial vehicles. Drones **2023**, 7, 114.
43. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. arXiv **2021**, arXiv:2103.00020.
44. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8821–8831.
45. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
46. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv **2022**, arXiv:2205.11487.
47. Cui, Y.; Niekum, S.; Gupta, A.; Kumar, V.; Rajeswaran, A. Can foundation models perform zero-shot task specification for robot manipulation? In Proceedings of the Learning for Dynamics and Control Conference, Palo Alto, CA, USA, 23–24 June 2022.
48. Nair, S.; Rajeswaran, A.; Kumar, V.; Finn, C.; Gupta, A. R3M: A universal visual representation for robot manipulation. arXiv **2022**, arXiv:2203.12601.
49. Zeng, A.; Florence, P.; Tompson, J.; Welker, S.; Chien, J.; Attarian, M.; Armstrong, T.; Krasin, I.; Duong, D.; Wahid, A.; et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Proceedings of the Conference on Robot Learning, Online, 15–18 November 2020.
50. Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv **2022**, arXiv:2201.07207.
51. Zeng, A.; Attarian, M.; Ichter, B.; Choromanski, K.; Wong, A.; Welker, S.; Tombari, F.; Purohit, A.; Ryoo, M.; Sindhwani, V.; et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv **2022**, arXiv:2204.00598.
52. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv **2021**, arXiv:2010.11929.
53. Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection with vision transformers. arXiv **2022**, arXiv:2205.06230.
54. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python, 1st ed.; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2009.

**Figure 1.** Conceptual plot of the standard deviation and mean of $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ for two given 12-frame summaries (our method and a counterexample) of a randomly picked YouTube video, illustrating how to select a proper summary according to the proposed metric.

**Figure 3.** Comparison of the distribution of selected frames for a subset of videos (Tides, Sulfur Hexafluoride, Centre of Gravity, and Bubbles) using the method based on text-conditioned object detection and the baselines using the Signature Transform.

**Figure 4.** Summarization of videos using the baseline based on the Signature Transform in comparison to the summarization using text-conditioned object detection. $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$, $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ and ${\overline{S}}_{*}$ summaries for two videos of the introduced dataset. The best summary among the three, according to the metric, is highlighted.

**Figure 5.** Plot with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ standard deviation and mean.

**Figure 6.** Plot with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ standard deviation and mean.

**Figure 7.** Error bar plot with the mean and standard deviation for each human-annotated summary of the subset of 20 videos from [1]. Sampling rate: 1 frame per second.

**Figure 8.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V11, Table 3. Sampling rate: 1 frame per second. Highlighted values in the table correspond to the lowest standard deviation.

**Figure 9.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V19, Table 3. Sampling rate: 1 frame per second. Highlighted frames can increase the accuracy of the summary annotated by user 5. Highlighted values in the table correspond to the lowest standard deviation.

**Figure 10.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V75, Table 4. Sampling rate: 1 frame per second. Highlighted values in the table correspond to the lowest standard deviation.

**Figure 11.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V76, Table 4. Sampling rate: 1 frame per second. Highlighted frames can increase the accuracy of the summary annotated by user 3. Highlighted values in the table correspond to the lowest standard deviation.

**Table 1.** Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ correspond to the baselines based on the Signature Transform using 10 and 20 random samples, respectively. Highlighted results in blue/brown correspond to values better than std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$). Yellow values indicate when std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$) is lower than std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$).

| Video | # Frames | Length | Summary # Frames (%) | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{*})$ Std | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{*})$ Mean | $\mathrm{RMSE}(\overline{S},\overline{S})$ Std | $\mathrm{RMSE}(\overline{S},\overline{S})$ Mean | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{u_{min}})\vert_{10}$ Std | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{u_{min}})\vert_{10}$ Mean | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{u_{min}})\vert_{20}$ Std | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{u_{min}})\vert_{20}$ Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Tides | 159 | 10 m 29 s | 35 (22%) | 13,663 | 202,388 | 14,838 | 155,986 | 8859 | 157,455 | 7312 | 167,480 |
| Sulfur Hexafluoride | 230 | 15 m 12 s | 47 (20%) | 22,727 | 217,935 | 22,607 | 179,409 | 7194 | 161,995 | 7722 | 173,490 |
| Centre of Gravity | 155 | 10 m 14 s | 33 (21%) | 12,333 | 181,460 | 16,404 | 168,824 | 8481 | 160,779 | 12,416 | 175,971 |
| Bubbles | 174 | 11 m 30 s | 35 (20%) | 23,127 | 201,553 | 16,806 | 185,702 | 7461 | 194,993 | 5711 | 175,176 |
| Airplanes | 158 | 10 m 24 s | 22 (14%) | 19,964 | 215,688 | 23,591 | 231,539 | 8417 | 227,391 | 10,235 | 233,020 |
| Protons | 174 | 11 m 30 s | 25 (14%) | 29,853 | 252,224 | 20,186 | 262,434 | 12,835 | 251,907 | 11,542 | 250,512 |
| Hydrophobic | 168 | 11 m 06 s | 29 (17%) | 15,016 | 251,671 | 25,835 | 248,548 | 11,973 | 250,131 | 13,917 | 245,761 |
| States of Matter | 332 | 22 m 03 s | 78 (23%) | 16,249 | 156,408 | 9709 | 130,064 | 6630 | 115,454 | 5340 | 121,028 |
| Spool Racer | 332 | 22 m 02 s | 90 (27%) | 15,903 | 142,520 | 11,883 | 136,147 | 7054 | 137,621 | 8112 | 151,888 |
| Paper Airplane | 332 | 22 m 03 s | 29 (9%) | 20,642 | 235,639 | 11,829 | 221,220 | 5400 | 224,718 | 9385 | 177,448 |
| Loudest Sound | 332 | 22 m 01 s | 93 (28%) | 16,898 | 179,963 | 8304 | 148,885 | 7884 | 138,561 | 4355 | 147,016 |
| Lightning | 332 | 22 m 01 s | 70 (21%) | 15,237 | 169,338 | 21,862 | 162,849 | 9300 | 177,008 | 7494 | 153,797 |
| Light Challenge | 332 | 22 m 02 s | 82 (25%) | 12,566 | 152,488 | 10,546 | 126,117 | 5490 | 139,700 | 4874 | 129,044 |
| Hot Air Balloon | 332 | 22 m 01 s | 98 (30%) | 8620 | 150,366 | 5417 | 144,634 | 3516 | 137,141 | 4165 | 138,453 |
| Hoop Glider | 332 | 22 m 01 s | 82 (25%) | 6419 | 148,065 | 6752 | 132,544 | 4051 | 133,897 | 4966 | 133,894 |
| Drag Race | 332 | 22 m 03 s | 73 (22%) | 9384 | 135,228 | 8931 | 125,264 | 4375 | 122,615 | 4645 | 129,851 |
| All about Balance | 332 | 22 m 03 s | 59 (18%) | 14,023 | 182,063 | 14,238 | 182,179 | 7801 | 176,219 | 6914 | 167,727 |
| Air Pressure | 332 | 22 m 03 s | 65 (20%) | 10,123 | 166,342 | 18,314 | 151,664 | 6386 | 145,897 | 4602 | 148,232 |
| Friction and Momentum | 162 | 10 m 42 s | 28 (17%) | 18,754 | 217,403 | 22,443 | 218,203 | 13,348 | 202,288 | 12,238 | 205,680 |
| Electricity | 162 | 10 m 41 s | 30 (19%) | 24,376 | 298,238 | 22,885 | 279,820 | 16,889 | 268,932 | 10,263 | 270,619 |
| Catapult | 169 | 11 m 11 s | 27 (16%) | 26,413 | 271,643 | 31,265 | 214,727 | 15,158 | 203,290 | 10,222 | 188,008 |
| Carbonation and More | 165 | 10 m 53 s | 40 (24%) | 18,977 | 237,142 | 18,107 | 226,044 | 12,130 | 234,278 | 11,884 | 214,149 |
| Carbon Dioxide | 162 | 10 m 41 s | 38 (23%) | 25,862 | 245,415 | 18,806 | 217,270 | 13,838 | 207,828 | 7760 | 211,504 |
| Bridge | 164 | 10 m 51 s | 21 (13%) | 25,839 | 269,412 | 26,038 | 271,551 | 10,761 | 263,747 | 13,038 | 264,532 |
| Bread Experiment | 337 | 22 m 22 s | 59 (18%) | 15,099 | 189,086 | 8575 | 146,771 | 5542 | 153,224 | 5691 | 156,230 |
| Balloon Power | 337 | 22 m 22 s | 53 (16%) | 14,075 | 157,542 | 29,415 | 147,710 | 7741 | 128,920 | 7351 | 134,545 |
| Attraction and Forces | 654 | 43 m 30 s | 81 (12%) | 5955 | 107,097 | 7486 | 102,965 | 3701 | 96,266 | 2093 | 99,271 |
| Puzzles | 209 | 13 m 48 s | 46 (22%) | 11,258 | 185,502 | 19,012 | 196,762 | 14,620 | 199,556 | 14,622 | 197,064 |
| Average | 264 | 17 m 30 s | 52 (20%) | 14/28 (50%) | | | | 28/28 (100%) | | 28/28 (100%) | |

**Table 2.** Descriptive statistics for a set of videos with varying numbers of frames per summary with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ (brown) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (yellow).

| Video | # Frames | Summary # Frames (%) | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{u_{min}})\vert_{10}$ Std | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{u_{min}})\vert_{10}$ Mean | $\mathrm{RMSE}(\overline{S},\overline{S})$ Std | $\mathrm{RMSE}(\overline{S},\overline{S})$ Mean |
|---|---|---|---|---|---|---|
| Tides | 159 | 8 (5%) | 22,786 | 422,026 | 54,067 | 390,483 |
| | | 16 (10%) | 12,851 | 254,984 | 37,713 | 263,881 |
| | | 24 (15%) | 9423 | 202,925 | 17,935 | 224,797 |
| | | 32 (20%) | 9074 | 183,933 | 15,700 | 186,621 |
| | | 40 (25%) | 4782 | 158,183 | 13,903 | 159,452 |
| Sulfur Hexafluoride | 230 | 12 (5%) | 30,325 | 452,134 | 68,212 | 362,061 |
| | | 23 (10%) | 12,701 | 281,425 | 39,872 | 246,967 |
| | | 35 (15%) | 12,034 | 228,530 | 20,846 | 201,740 |
| | | 46 (20%) | 9241 | 190,985 | 28,621 | 175,440 |
| | | 58 (25%) | 7914 | 161,618 | 9021 | 152,310 |
| Centre of Gravity | 155 | 8 (5%) | 48,787 | 406,502 | 49,234 | 369,648 |
| | | 16 (10%) | 22,163 | 252,841 | 21,974 | 276,366 |
| | | 24 (15%) | 8050 | 212,893 | 26,776 | 229,959 |
| | | 31 (20%) | 10,963 | 180,953 | 35,813 | 184,437 |
| | | 39 (25%) | 2528 | 164,666 | 16,259 | 163,007 |
| Bubbles | 174 | 9 (5%) | 24,538 | 401,406 | 37,816 | 397,470 |
| | | 18 (10%) | 11,669 | 272,430 | 49,740 | 276,152 |
| | | 27 (15%) | 12,965 | 213,336 | 19,125 | 215,961 |
| | | 35 (20%) | 10,331 | 190,639 | 13,792 | 183,984 |
| | | 44 (25%) | 7625 | 173,009 | 9427 | 162,091 |

**Table 3.** Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). Lower is better. Sampling rate: 1 frame per second. Dataset in [1], videos from V11 to V20. Highlighted results in blue/yellow correspond to the lowest values, either std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$) or std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$), respectively.

| Video | # Frames | User | # Frames User | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{*})$ Std | $\mathrm{RMSE}(\overline{S},{\overline{S}}_{*})$ Mean | $\mathrm{RMSE}(\overline{S},\overline{S})$ Std | $\mathrm{RMSE}(\overline{S},\overline{S})$ Mean |
|---|---|---|---|---|---|---|---|
| V11 | 48 | 1 | 10 | 26,644 | 171,106 | 46,655 | 151,483 |
| | | 2 | 12 | 13,673 | 202,172 | 15,479 | 155,481 |
| | | 3 | 10 | 29,857 | 213,880 | 51,590 | 182,327 |
| | | 4 | 9 | 21,192 | 236,959 | 52,982 | 196,303 |
| | | 5 | 8 | 31,627 | 254,336 | 52,925 | 193,520 |
| V12 | 59 | 1 | 11 | 15,497 | 436,723 | 46,551 | 252,142 |
| | | 2 | 17 | 18,927 | 359,562 | 24,665 | 177,286 |
| | | 3 | 15 | 26,071 | 342,161 | 31,703 | 180,066 |
| | | 4 | 11 | 25,330 | 429,272 | 82,323 | 242,627 |
| | | 5 | 14 | 34,479 | 348,834 | 39,199 | 188,417 |
| V13 | 59 | 1 | 19 | 12,238 | 187,001 | 24,649 | 114,155 |
| | | 2 | 9 | 25,267 | 287,479 | 34,635 | 166,495 |
| | | 3 | 18 | 7790 | 187,346 | 21,203 | 126,432 |
| | | 4 | 14 | 9544 | 222,496 | 25,553 | 140,508 |
| | | 5 | 18 | 12,298 | 198,349 | 27,138 | 124,386 |
| V14 | 59 | 1 | 9 | 32,739 | 302,118 | 51,770 | 183,978 |
| | | 2 | 16 | 20,249 | 219,068 | 44,235 | 141,927 |
| | | 3 | 17 | 24,345 | 222,559 | 35,235 | 113,806 |
| | | 4 | 10 | 20,498 | 244,509 | 27,548 | 155,515 |
| | | 5 | 16 | 26,561 | 200,139 | 32,840 | 143,384 |
| V15 | 57 | 1 | 12 | 14,454 | 237,551 | 51,812 | 207,845 |
| | | 2 | 11 | 20,018 | 301,650 | 46,590 | 209,491 |
| | | 3 | 13 | 13,192 | 261,014 | 42,337 | 171,810 |
| | | 4 | 13 | 36,408 | 305,376 | 30,041 | 179,442 |
| | | 5 | 14 | 44,931 | 261,859 | 54,428 | 180,145 |
| V16 | 70 | 1 | 9 | 35,722 | 449,758 | 95,662 | 376,411 |
| | | 2 | 9 | 86,863 | 425,107 | 65,626 | 328,563 |
| | | 3 | 12 | 41,260 | 388,869 | 43,186 | 340,133 |
| | | 4 | 9 | 51,299 | 447,523 | 65,698 | 375,162 |
| | | 5 | 13 | 42,200 | 369,517 | 52,316 | 302,677 |
| V17 | 59 | 1 | 12 | 17,668 | 324,562 | 36,166 | 242,235 |
| | | 2 | 13 | 26,203 | 262,895 | 32,930 | 243,366 |
| | | 3 | 18 | 10,957 | 250,543 | 30,660 | 177,779 |
| | | 4 | 12 | 19,956 | 300,390 | 20,252 | 223,791 |
| | | 5 | 16 | 12,611 | 297,707 | 28,433 | 207,258 |
| V18 | 50 | 1 | 13 | 35,152 | 501,230 | 74,454 | 260,574 |
| | | 2 | 14 | 40,896 | 559,244 | 70,863 | 274,572 |
| | | 3 | 14 | 46,791 | 540,747 | 39,899 | 246,964 |
| | | 4 | 10 | 33,309 | 541,490 | 56,012 | 329,343 |
| | | 5 | 14 | 30,663 | 420,924 | 72,998 | 308,756 |
| V19 | 65 | 1 | 15 | 6114 | 186,893 | 16,695 | 119,136 |
| | | 2 | 20 | 6701 | 225,075 | 6899 | 103,517 |
| | | 3 | 20 | 5339 | 167,085 | 8834 | 103,752 |
| | | 4 | 13 | 8462 | 185,452 | 12,020 | 129,608 |
| | | 5 | 6 | 23,992 | 275,155 | 32,512 | 208,629 |
| V20 | 61 | 1 | 15 | 23,716 | 627,121 | 52,711 | 540,857 |
| | | 2 | 12 | 19,933 | 707,823 | 86,586 | 609,589 |
| | | 3 | 9 | 52,818 | 787,188 | 93,656 | 747,199 |
| | | 4 | 11 | 43,598 | 688,065 | 68,016 | 617,091 |
| | | 5 | 11 | 31,058 | 695,905 | 69,077 | 618,156 |

**Table 4.** Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). Lower is better. Sampling rate: 1 frame per second. Dataset in [1], videos from V71 to V80. Highlighted values correspond to the lowest standard deviation.

| Video (YouTube) | # Frames | User | # Frames User | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ Std | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ Mean | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ Std | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ Mean | Plot (Std, Std) |
|---|---|---|---|---|---|---|---|---|
| V71 | 277 | 1 | 18 | 16,916 | 319,975 | 35,173 | 330,114 | |
| | | 2 | 18 | 23,314 | 315,996 | 48,511 | 339,793 | |
| | | 3 | 20 | 38,384 | 293,853 | 50,766 | 345,021 | |
| | | 4 | 17 | 32,270 | 310,193 | 32,411 | 359,049 | |
| | | 5 | 18 | 41,753 | 329,353 | 59,688 | 334,337 | |
| V72 | 536 | 1 | 18 | 15,842 | 187,019 | 32,676 | 194,820 | |
| | | 2 | 16 | 25,427 | 211,466 | 33,363 | 202,442 | |
| | | 3 | 16 | 18,684 | 196,149 | 45,453 | 217,699 | |
| | | 4 | 18 | 21,112 | 205,421 | 19,122 | 177,117 | |
| | | 5 | 18 | 27,718 | 206,335 | 29,057 | 205,808 | |
| V73 | 201 | 1 | 11 | 64,802 | 538,239 | 116,284 | 484,970 | |
| | | 2 | 7 | 153,682 | 1,068,305 | 211,124 | 704,655 | |
| | | 3 | 8 | 113,805 | 661,992 | 135,899 | 653,041 | |
| | | 4 | 8 | 83,387 | 856,406 | 248,619 | 689,301 | |
| | | 5 | 7 | 111,767 | 899,150 | 241,947 | 794,828 | |
| V74 | 293 | 1 | 17 | 25,780 | 282,200 | 29,674 | 309,051 | |
| | | 2 | 16 | 18,954 | 273,776 | 51,670 | 331,322 | |
| | | 3 | 15 | 36,714 | 322,833 | 24,961 | 335,618 | |
| | | 4 | 13 | 41,327 | 363,665 | 55,543 | 369,875 | |
| | | 5 | 16 | 30,798 | 289,135 | 38,881 | 353,928 | |
| V75 | 383 | 1 | 14 | 42,736 | 254,385 | 25,959 | 282,877 | |
| | | 2 | 13 | 41,632 | 263,431 | 39,826 | 337,124 | |
| | | 3 | 10 | 59,083 | 315,531 | 39,925 | 330,766 | |
| | | 4 | 17 | 37,954 | 227,411 | 28,843 | 250,314 | |
| | | 5 | 12 | 49,908 | 278,966 | 63,236 | 312,366 | |
| V76 | 89 | 1 | 6 | 64,097 | 440,825 | 93,524 | 422,565 | |
| | | 2 | 4 | 53,727 | 536,138 | 123,009 | 464,922 | |
| | | 3 | 1 | 566,208 | 843,799 | 485,614 | 878,793 | |
| | | 4 | 6 | 40,356 | 382,643 | 78,354 | 424,418 | |
| | | 5 | 6 | 39,194 | 395,906 | 60,916 | 401,751 | |
| V77 | 168 | 1 | 12 | 24,546 | 302,076 | 47,095 | 366,748 | |
| | | 2 | 9 | 52,176 | 339,285 | 61,880 | 385,056 | |
| | | 3 | 9 | 61,623 | 355,883 | 54,390 | 413,118 | |
| | | 4 | 10 | 39,765 | 349,207 | 90,313 | 400,379 | |
| | | 5 | 7 | 70,562 | 440,656 | 90,468 | 451,833 | |
| V78 | 310 | 1 | 13 | 65,238 | 706,978 | 96,368 | 770,000 | |
| | | 2 | 14 | 100,771 | 672,121 | 112,412 | 807,250 | |
| | | 3 | 3 | 410,792 | 1,593,229 | 203,589 | 1,882,757 | |
| | | 4 | 9 | 149,063 | 839,743 | 213,286 | 1,061,204 | |
| | | 5 | 23 | 40,178 | 466,571 | 73,228 | 614,140 | |
| V79 | 49 | 1 | 7 | 56,918 | 831,057 | 124,249 | 835,575 | |
| | | 2 | 8 | 56,569 | 793,831 | 60,657 | 859,241 | |
| | | 3 | 6 | 85,973 | 925,025 | 104,621 | 990,479 | |
| | | 4 | 5 | 158,480 | 1,093,141 | 179,902 | 1,099,105 | |
| | | 5 | 6 | 87,104 | 873,950 | 131,597 | 895,318 | |
| V80 | 159 | 1 | 18 | 66,585 | 529,875 | 67,019 | 572,836 | |
| | | 2 | 17 | 66,367 | 527,930 | 59,432 | 602,819 | |
| | | 3 | 13 | 29,459 | 579,078 | 84,101 | 726,883 | |
| | | 4 | 12 | 43,740 | 643,016 | 87,688 | 685,117 | |
| | | 5 | 14 | 89,016 | 553,274 | 94,849 | 649,317 | |

**Table 5.** VSUMM [1] compared against the baseline based on the Signature Transform for the first 20 videos of the dataset crawled from YouTube. Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ correspond to the baselines based on the Signature Transform using 10 and 20 random samples, respectively. Highlighted results are better than std($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$) and correspond to the lowest standard deviation, as described in Table 1. Sampling rate: 1 frame per second.

| Video | # Frames | # Frames (VSUMM) | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ Std | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ Mean | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ Std | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ Mean | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ Std | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ Mean | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ Std | $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| V11 | 48 | 11 | 25,981 | 185,959 | 37,907 | 175,031 | 16,343 | 148,128 | 18,343 | 159,157 |
| V12 | 59 | 13 | 56,274 | 313,156 | 41,613 | 205,004 | 17,770 | 181,533 | 11,665 | 206,951 |
| V13 | 59 | 19 | 7018 | 184,865 | 15,319 | 120,307 | 10,578 | 110,258 | 6655 | 134,846 |
| V14 | 59 | 8 | 21,415 | 281,969 | 39,412 | 171,935 | 19,069 | 157,531 | 10,104 | 180,199 |
| V15 | 57 | 10 | 20,159 | 271,197 | 46,041 | 219,182 | 27,536 | 192,667 | 27,765 | 218,787 |
| V16 | 70 | 9 | 65,997 | 513,440 | 84,667 | 428,025 | 38,088 | 283,324 | 30,235 | 446,068 |
| V17 | 59 | 15 | 10,697 | 255,666 | 41,831 | 197,136 | 17,625 | 197,944 | 19,102 | 227,646 |
| V18 | 50 | 14 | 42,731 | 449,324 | 51,635 | 230,695 | 33,525 | 261,288 | 30,179 | 242,746 |
| V19 | 65 | 16 | 3891 | 235,797 | 5739 | 121,766 | 5883 | 116,245 | 4582 | 111,766 |
| V20 | 61 | 9 | 43,864 | 796,448 | 39,035 | 733,547 | 28,460 | 684,546 | 39,414 | 644,681 |
| V71 | 277 | 17 | 20,840 | 383,945 | 43,176 | 341,779 | 14,908 | 352,365 | 20,657 | 327,732 |
| V72 | 536 | 12 | 61,886 | 233,649 | 48,603 | 252,688 | 17,604 | 276,631 | 18,966 | 248,489 |
| V73 | 201 | 10 | 40,261 | 717,107 | 156,051 | 533,457 | 64,344 | 681,064 | 38,361 | 711,039 |
| V74 | 293 | 17 | 26,274 | 270,374 | 36,674 | 334,265 | 17,622 | 354,621 | 17,486 | 330,606 |
| V75 | 383 | 10 | 37,516 | 272,804 | 38,026 | 366,510 | 23,163 | 339,078 | 21,295 | 360,216 |
| V76 | 89 | 7 | 36,084 | 353,323 | 114,266 | 377,699 | 31,131 | 335,958 | 34,724 | 405,954 |
| V77 | 168 | 9 | 26,653 | 361,516 | 67,134 | 422,612 | 33,214 | 407,085 | 27,562 | 480,795 |
| V78 | 310 | 13 | 95,305 | 831,043 | 127,705 | 823,938 | 33,903 | 980,397 | 36,361 | 951,784 |
| V79 | 49 | 7 | 67,052 | 965,267 | 101,325 | 878,917 | 42,513 | 818,629 | 47,401 | 885,023 |
| V80 | 159 | 15 | 48,115 | 613,702 | 118,428 | 644,529 | 43,411 | 589,256 | 37,487 | 808,984 |
| Average | 153 | 12 | 17/20 (85%) | | | | 19/20 (95%) | | 19/20 (95%) | |
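The baseline $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{n}$ reported above can be sketched in a few lines: draw n uniform random candidate summaries, score each by the standard deviation of its RMSE against repeated fresh uniform samples, and keep the candidate with the lowest standard deviation. The sketch below is a minimal stand-in, not the authors' implementation: it uses a hand-rolled depth-2 signature of each frame treated as a piecewise-linear path of pixel rows (the paper's feature pipeline and signature depth may differ), and all function names (`sig_depth2`, `mean_signature`, `baseline_candidate`) are hypothetical.

```python
import numpy as np

def sig_depth2(path):
    """Depth-2 signature of a piecewise-linear path (T x d array),
    flattened to a vector of length d + d*d."""
    dx = np.diff(path, axis=0)            # increments, (T-1) x d
    s1 = dx.sum(axis=0)                   # level 1: total increment
    cum = np.cumsum(dx, axis=0) - dx      # increments strictly before each step
    s2 = cum.T @ dx + 0.5 * (dx.T @ dx)   # level 2: discrete iterated integrals
    return np.concatenate([s1, s2.ravel()])

def mean_signature(frames):
    """Element-wise mean of per-frame signatures; each frame (H x W)
    is treated as a path of H points in W dimensions (toy stand-in)."""
    return np.mean([sig_depth2(f) for f in frames], axis=0)

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

def baseline_candidate(video, target_len, n=10, repeats=10, rng=None):
    """Sketch of RMSE(S, S_u_min)|_n: among n uniform random candidate
    summaries, return the one whose RMSE against `repeats` fresh uniform
    samples has the lowest standard deviation."""
    rng = np.random.default_rng(rng)
    T = len(video)
    best_idx, best_std = None, np.inf
    for _ in range(n):
        idx = np.sort(rng.choice(T, size=target_len, replace=False))
        s_u = mean_signature(video[idx])          # fixed candidate signature
        scores = [rmse(mean_signature(
                      video[np.sort(rng.choice(T, size=target_len,
                                               replace=False))]), s_u)
                  for _ in range(repeats)]
        if np.std(scores) < best_std:
            best_std, best_idx = np.std(scores), idx
    return best_idx, best_std
```

A quick sanity check on `sig_depth2` is Chen's shuffle identity, $S^{(2)}_{ij} + S^{(2)}_{ji} = S^{(1)}_i S^{(1)}_j$, which the discrete formula above satisfies exactly.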

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

de Curtò, J.; de Zarzà, I.; Roig, G.; Calafate, C.T.
Summarization of Videos with the Signature Transform. *Electronics* **2023**, *12*, 1735.
https://doi.org/10.3390/electronics12071735
