Article
Peer-Review Record

Advanced Techniques for Geospatial Referencing in Online Media Repositories

Future Internet 2024, 16(3), 87; https://doi.org/10.3390/fi16030087
by Dominik Warch, Patrick Stellbauer and Pascal Neis *
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 17 January 2024 / Revised: 16 February 2024 / Accepted: 29 February 2024 / Published: 1 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Good and relevant work.
I would add more explanation of why each specific tool was used.

This is done for NER, but not for the other components.

Author Response

Thank you for your comment! We have added short sections in chapters "3.2 Analyzing the Visible Image" and "3.3 Analyzing the Text in the Visible Image" explaining the choice of our specific tools, libraries, and models.

Reviewer 2 Report

Comments and Suggestions for Authors

I actually do not understand the idea behind this paper. It was submitted as an original paper, but all the Authors did was compile methods for NER and for landmark recognition. In my opinion, there is nothing new or original in this paper. The Authors created a dataset, but applied methods that are generally known. I think the title does not reflect the article's content. Maybe changing the type of the paper would be appropriate? At this stage, it looks like a student's work for some kind of university project.

I have also detected some minor issues while reading the paper:

1. Fig. 2 should be placed right after the paragraph that ends in line 165.

2. There is an Authors' comment that was left, I suppose by mistake, in line 222.

3. In Tab. 1, precision, recall, and F1 score are presented with two decimal places, while in Tab. 2 these measures are presented with three. Why? Can this be unified?

Author Response

We appreciate your comments and the opportunity to clarify the original contributions of our work. While we acknowledge the utilization of well-established methods for Named Entity Recognition (NER), landmark recognition, and text recognition, the originality of our paper lies in the novel application and integration of these methods within the specific context of georeferencing video media libraries, a domain that has not been extensively explored in recent literature, as discussed in our Related Works chapter.
The methods we employed may be known, but their combination and application to this specific challenge demonstrate originality in problem-solving and contribute to advancing the field. Our work provides a step forward in practical applications, which could be of great interest to both academic and industry practitioners; we have also clarified our goals and intentions in our introductory chapter.
Regarding the title, we aimed to concisely encapsulate the essence of our work. We hope this explanation addresses your concerns, and we are open to further elaborating on any aspect should there be a need.

We have also addressed the minor issues you mentioned; thank you for pointing them out!

Reviewer 3 Report

Comments and Suggestions for Authors

The paper titled "Advanced Techniques for Geospatial Referencing in Online Media Repositories" by Dominik Warch, Patrick Stellbauer, and Pascal Neis, published in the "Future Internet" journal, explores innovative methods to extract and geocode geospatial references from video content using Artificial Intelligence (AI). The focus is on enhancing the search capabilities in video media libraries, which traditionally rely on basic metadata and are limited in their machine-readability.

 

The study introduces a multimodal methodology combining computer vision, natural language processing, and geospatial analysis. The authors analyze content from the ARD Mediathek media library, using techniques like image and text analysis with machine learning models and audio and subtitle processing using state-of-the-art linguistic models. The research addresses challenges like model interpretability and the complexity of geospatial data extraction.

 

Key findings include the potential for significantly advancing the precision of spatial data analysis within video content, which could enrich media libraries with more navigable and contextually rich content. This has implications for user engagement, targeted services, and applications in urban planning and cultural heritage.

 

The methodology involves several stages:

1. **Data Acquisition**: Collecting data from the ARD Mediathek using an API.

2. **Analyzing the Visible Image**: Using a machine learning model trained on the Google Landmarks Dataset v2 for landmark recognition.

3. **Analyzing the Text in the Visible Image**: Employing Tesseract OCR engine for text extraction and natural language processing methods for analyzing the text.

4. **Analyzing the Audio and Subtitles**: Converting audio to text using OpenAI's Whisper and analyzing it with named entity recognition (NER) and geocoding tools.
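The four stages above could be sketched as the following pipeline skeleton. This is purely illustrative: all function names, signatures, and return structures are hypothetical assumptions for exposition, not the authors' actual implementation.

```python
# Illustrative sketch of the four-stage georeferencing pipeline.
# Every function here is a placeholder stub; in the paper, the stages are
# backed by the ARD Mediathek API, a Google Landmarks Dataset v2 model,
# Tesseract OCR, and OpenAI's Whisper plus NER/geocoding tools.

def acquire_video(video_id):
    # Stage 1 (Data Acquisition): fetch frames and the audio track.
    return {"frames": [f"{video_id}_frame_{i}" for i in range(3)],
            "audio": f"{video_id}_audio"}

def recognize_landmarks(frames):
    # Stage 2 (Visible Image): a landmark-recognition model would score
    # each frame; empty candidate lists stand in for model output.
    return [{"frame": f, "landmarks": []} for f in frames]

def read_visible_text(frames):
    # Stage 3 (Text in the Visible Image): OCR extracts on-screen text,
    # which NER then scans for place names.
    return [{"frame": f, "text": "", "entities": []} for f in frames]

def transcribe_and_geocode(audio):
    # Stage 4 (Audio and Subtitles): speech-to-text, then NER and
    # geocoding of the recognized place names.
    return {"transcript": "", "locations": []}

def georeference(video_id):
    # Run all stages and collect their partial results side by side.
    data = acquire_video(video_id)
    return {
        "landmarks": recognize_landmarks(data["frames"]),
        "ocr": read_visible_text(data["frames"]),
        "speech": transcribe_and_geocode(data["audio"]),
    }

result = georeference("demo")
```

Note that in this sketch the three analysis strands produce separate result sets rather than a fused location estimate, which mirrors the structure Reviewer 4 questions below.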

 

The study evaluates different NER tools for their effectiveness in extracting location information and geocoding accuracy. It presents a detailed analysis of the performance of various methodologies, including challenges faced in accurately identifying landmarks and text in video frames due to issues like motion blur, variable lighting, and the complexity of scenes.

 

The paper concludes with the recognition of the mixed results in enhancing the utility of video media libraries through this multimodal methodology. It acknowledges the potential for future improvements, especially in the areas of landmark recognition, temporal geospatial analysis, and the development of an integrated location scoring system. The study highlights the promise of AI in geospatial analysis while also noting the limitations and challenges that need to be addressed in future research.

To enhance the quality and impact of the paper "Advanced Techniques for Geospatial Referencing in Online Media Repositories," several improvements can be considered:

 

1. **Expanded Dataset Analysis**: Broadening the scope of data sources beyond the ARD Mediathek media library to include diverse video repositories could provide a more comprehensive understanding of the methodology's effectiveness across different types of content and metadata structures.

 

2. **Comparative Studies with Existing Methods**: Including a comparison with other existing geospatial referencing techniques would offer a clearer perspective on the advantages and limitations of the proposed methodology.

 

3. **In-Depth Error Analysis**: A more detailed examination of the errors and challenges encountered, particularly in landmark recognition and OCR, could offer insights into specific areas needing refinement.

 

4. **Improved Model Interpretability**: Efforts to enhance the interpretability of AI models used in the study, such as using explainable AI techniques, could provide a better understanding of why certain errors occur and how to address them.

 

5. **User Experience Studies**: Conducting user experience studies to gauge the practical usability and effectiveness of the enhanced search functionalities in real-world scenarios can provide valuable feedback for further improvements.

 

6. **Algorithm and Model Optimization**: The paper could benefit from a deeper exploration into optimizing the algorithms and models used, especially in handling complex scenes, diverse languages, and varying video qualities.

 

7. **Temporal and Contextual Analysis**: Incorporating more sophisticated temporal and contextual analysis could improve the accuracy of geospatial data extraction, especially in videos where location references are spread across different time frames.

 

8. **Handling of Diverse Video Formats**: Addressing the challenges posed by different video formats, resolutions, and quality, which can significantly impact the performance of computer vision and OCR techniques.

 

9. **Scalability and Efficiency**: Discussing the scalability and computational efficiency of the proposed methods for large-scale implementation, which is crucial for real-time processing in practical applications.

 

10. **Enhanced Visualizations and Case Studies**: Including more detailed visualizations and case studies demonstrating the application of the methodology in various scenarios could make the paper more engaging and relatable to a broader audience.

 

11. **Broader Implications and Applications**: Elaborating on the potential implications and applications of the methodology in fields beyond urban planning and cultural heritage, such as emergency response, environmental monitoring, and tourism.

 

12. **Future Work and Potential Developments**: Providing a clearer roadmap for future work, including potential technological developments and collaborations with industry partners, can offer a vision for how this research can evolve.

 

Implementing these improvements would not only strengthen the current research but also provide a more robust framework for future studies in the field.

 

Comments on the Quality of English Language

The quality of English in the paper "Advanced Techniques for Geospatial Referencing in Online Media Repositories" is generally good, with clear and structured sentences, appropriate technical vocabulary, and coherent presentation of ideas. However, like any academic paper, there is always room for improvement. Here are some suggestions:

 

1. **Clarity and Conciseness**: Some sentences are quite lengthy and complex, which might challenge the reader's comprehension. Simplifying and breaking down complex sentences can enhance clarity.

 

2. **Consistency in Terminology**: Ensure consistent use of technical terms and phrases throughout the paper. This consistency helps in maintaining a clear narrative.

 

3. **Active vs. Passive Voice**: While passive voice is common in scientific writing, occasional use of active voice can make the text more engaging and direct.

 

4. **Transitions and Flow**: Improving transitions between sections and within paragraphs can enhance the flow of the paper, making it easier for readers to follow the argument or narrative.

 

5. **Grammar and Syntax**: Although the grammar is generally good, a thorough review to catch any subtle errors in verb tense, subject-verb agreement, and sentence structure would be beneficial.

 

6. **Avoiding Repetition**: Check for and avoid unnecessary repetition of words or phrases, which can be redundant and detract from the readability.

 

7. **Technical Jargon**: While technical terms are necessary, ensuring they are well-explained or defined at their first occurrence helps in making the paper accessible to a broader audience, including those not specialized in the field.

 

8. **Proofreading and Editing**: A final round of meticulous proofreading can help catch minor errors and inconsistencies that might have been overlooked.

 

Overall, the language quality is quite robust, but attention to these details can elevate the overall readability and professionalism of the paper.

Author Response

Thank you very much for your detailed review!

Regarding 1 (Expanded Dataset Analysis):

We appreciate your suggestion to broaden the scope of data sources. While incorporating a diverse range of video repositories could indeed enhance our understanding of the methodology's effectiveness, the focus on ARD Mediathek was a deliberate choice for this study due to its accessibility and the rich metadata it provides. This targeted approach allowed us to develop and refine our multimodal methodology in a controlled environment, setting the stage for future research to extend our analysis to additional media libraries.

Regarding 2 (Comparative Studies with Existing Methods):

Your point on comparative studies is well-taken. We have included focused comparisons with NER tools and provided alternatives for OCR and Landmark Recognition models that are relevant to our study's context. A broader comparative analysis with other geospatial referencing techniques, while valuable, is beyond the scope of this paper but is certainly an area for future research if not already referenced in the Related Works chapter.

Regarding 3 (In-Depth Error Analysis):

The detailed case examples provided in the paper underscore the challenges encountered in our methodology, particularly in landmark recognition and OCR. We acknowledge the importance of a thorough investigation, which is planned for subsequent work. The scope of this article, however, centers on establishing a baseline for our methodology, which we will build upon with an in-depth error analysis in future studies.

Regarding 4 (Improved Model Interpretability):

Improving model interpretability is an important aspect of AI research. For the current study, however, our focus was to establish the efficacy of our approach; interpretability is therefore unfortunately not the focus of our present research.

Regarding 5 (User Experience Studies):

Conducting user experience studies is indeed a valuable recommendation for assessing the practical impact of our methodology. While such studies are outside the scope of this article, they represent a critical component of our future work where the usability and effectiveness of our approach will be evaluated in real-world scenarios.

Regarding 6 (Algorithm and Model Optimization):

We recognize the importance of optimizing the algorithms and models used in our study. The current paper establishes the foundational work, and as we progress, optimization will be crucial, especially in handling complex scenes and diverse languages. This is an area we are keen to address in our continued research.

Regarding 7 (Temporal and Contextual Analysis):

Your suggestions regarding temporal and contextual analysis are on point, thank you! The article briefly introduces this concept, but based on your feedback, we have expanded this part in the chapter Conclusion and Future Work.

Regarding 8 (Handling of Diverse Video Formats):

We already discuss our approach to image normalization for Landmark Recognition and image preprocessing for OCR. Based on your feedback, we have further clarified this approach in the Methodology chapter.

Regarding 9 (Scalability and Efficiency):

The scalability and computational efficiency of our methodology, as you highlighted, are crucial for practical applications. We have elaborated on these aspects in the Methodology chapter, emphasizing the concurrent and optimized nature of our workflow suitable for large-scale video media libraries.

Regarding 10 (Enhanced Visualizations and Case Studies):
While we agree that detailed visualizations and case studies could enhance the paper, we must note that the scope of this article focuses on the foundational methodology. However, we will consider including such enrichments in future publications as our research progresses and more data becomes available.

Regarding 11 (Broader Implications and Applications):
The potential implications and applications of our methodology in fields like emergency response and environmental monitoring are indeed broad and significant. We have included a paragraph in the Conclusion and Future Work chapter to highlight these broader applications and the opportunities they present.

Regarding 12 (Future Work and Potential Developments):
We appreciate your recommendation for a detailed roadmap for future work. While the specific developments are beyond the scope of this paper, we have sharpened the Future Work chapter to reflect our enthusiasm for pursuing technological advancements and potential collaborations with industry partners.

Again, thank you for your constructive and detailed review. Unfortunately, some of your points are beyond the scope of the current article, but we intend to consider them in our future work!

Reviewer 4 Report

Comments and Suggestions for Authors

The authors state that the “study presents a novel multimodal methodology that utilizes advances in Artificial Intelligence, including neural networks, computer vision, and natural language processing, to extract and geocode geospatial references from videos”.

My concern with this paper is that it uses a combination of three methodologies, namely computer vision techniques, natural language processing, and geospatial analysis, but, to my understanding, not by synthesizing the results of each methodology; they are applied separately, as implied by the discussion at the end of each section. Hence, my question regarding the methodology adopted by the authors is this: if one method, e.g., NLP, has given better results than computer vision, do they, at the end of the process, filter out the errors (false positives and false negatives) of the method with the worst results (computer vision in this case) that have been correctly identified by the best one (NLP in this example)? Do they thereby reach an overall assessment of the pipeline at large, or are they just highlighting the higher efficiency of one method compared with another?

Another concern has to do more with the scope of the geospatial references that they deal with. In the introduction, they mention that they deal with the geospatial content of the video; does this mean that they want to extract what places/spaces the video refers to, where the video is located, or both, without making any distinction between these kinds of information? This is a matter of semantics to my understanding, since a video may be shot showing a person in a café in Stephansplatz, Vienna, Austria, where the street signs and the name of the café are visible, while the person may narrate their early childhood years in Linz. In this case, how do they treat this information? I am asking this because from the computer vision techniques they may get a correct reference to the location of the video shooting, while from the NLP techniques they may also get a correct reference to what is narrated in the video, which to me refers to the concept of video content. If their interest is only the first (the video location), then the information extracted by the second technique may be treated as erroneous, and vice versa, while if they target both cases of geospatial references, it remains unclear to me how they treat this in their results.

Nevertheless, the results the authors show are neither very successful nor promising. The paper uses techniques with different levels of maturity and hence with different levels of efficacy; NLP is by far the more advanced and well-established method for NER and extraction. The fact is that scholars belonging to the community of geographic information scientists and experts rely on advances in other fields such as computer science, AI, and NLP to handle issues such as the ones presented in the paper, i.e., the geocoding of geospatial references in videos. Therefore, we are in that sense somewhat reliant, when acquiring results on research issues such as the ones presented above, on the efficacy and power of methods and algorithms emanating from research in other disciplines.

This, as a general comment, does not imply that we should not take advantage of advances in other disciplines, on the contrary. Furthermore, a bigger necessity is to highlight the complexity of geospatial data and information that may oblige methods and techniques of AI such as the ones presented herein to fall short of expectations when references of geospatial nature are to be handled. Although the idea to combine different techniques coming from AI to extract geospatial references is appealing, maybe some particular methodologies are premature. On the other hand, I am not convinced as to what extent this is actually a combination of techniques or rather it is an analysis of the dataset and of the results of each methodology in isolation without integration and evaluation of the results in a constructive and synthesized way.

 

Author Response

> My concern with this paper is that it uses a combination of three methodologies, namely computer vision techniques, natural language processing, and geospatial analysis, but, to my understanding, not by synthesizing the results of each methodology; they are applied separately, as implied by the discussion at the end of each section. Hence, my question regarding the methodology adopted by the authors is this: if one method, e.g., NLP, has given better results than computer vision, do they, at the end of the process, filter out the errors (false positives and false negatives) of the method with the worst results (computer vision in this case) that have been correctly identified by the best one (NLP in this example)? Do they thereby reach an overall assessment of the pipeline at large, or are they just highlighting the higher efficiency of one method compared with another?

Thank you for your valuable feedback regarding the integration of the methodologies applied in our research. We agree that a detailed interpretation and synthesis of the results are crucial to the paper's contribution. As you rightly pointed out, the individual results from computer vision, NLP, and geospatial analysis have been discussed in their respective sections. The proposed 'location scoring system' in the Future Work chapter is indeed intended to merge these partial results into a comprehensive assessment. This system will allow us to filter out inaccuracies by cross-validating findings across methods, thus refining the overall result. For now, this is only intended as the next step in our research.

 

> Another concern has to do more with the scope of the geospatial references that they deal with. In the introduction, they mention that they deal with the geospatial content of the video; does this mean that they want to extract what places/spaces the video refers to, where the video is located, or both, without making any distinction between these kinds of information? This is a matter of semantics to my understanding, since a video may be shot showing a person in a café in Stephansplatz, Vienna, Austria, where the street signs and the name of the café are visible, while the person may narrate their early childhood years in Linz. In this case, how do they treat this information? I am asking this because from the computer vision techniques they may get a correct reference to the location of the video shooting, while from the NLP techniques they may also get a correct reference to what is narrated in the video, which to me refers to the concept of video content. If their interest is only the first (the video location), then the information extracted by the second technique may be treated as erroneous, and vice versa, while if they target both cases of geospatial references, it remains unclear to me how they treat this in their results.

We acknowledge the complexity that arises when different methods yield correct but distinct location references within the same frame. The scoring system should also consider these multifaceted results to ensure that our synthesis of geospatial data accounts for the inherent variability and richness of video content, rather than oversimplifying to a single 'correct' location. This nuanced approach will be essential for the analysis that our research seeks to provide, but, as mentioned before, it is not in the scope of the current article.

 

> Nevertheless, the results the authors show are neither very successful nor promising. The paper uses techniques with different levels of maturity and hence with different levels of efficacy; NLP is by far the more advanced and well-established method for NER and extraction. The fact is that scholars belonging to the community of geographic information scientists and experts rely on advances in other fields such as computer science, AI, and NLP to handle issues such as the ones presented in the paper, i.e., the geocoding of geospatial references in videos. Therefore, we are in that sense somewhat reliant, when acquiring results on research issues such as the ones presented above, on the efficacy and power of methods and algorithms emanating from research in other disciplines.

Thank you for your insights regarding the varied maturity levels of the methodologies we employed in our research. We agree that NLP is a more mature method for NER and extraction. The integration of these different methods reflects our interdisciplinary approach, drawing from the latest advancements in computer science, AI, and NLP. This is indeed indicative of the collaborative nature of geographic information science, which often relies on the evolution of these fields to enrich its own methodologies. We added a paragraph in the introductory chapter to acknowledge this reliance on neighboring disciplines.

 

> This, as a general comment, does not imply that we should not take advantage of advances in other disciplines, on the contrary. Furthermore, a bigger necessity is to highlight the complexity of geospatial data and information that may oblige methods and techniques of AI such as the ones presented herein to fall short of expectations when references of geospatial nature are to be handled. Although the idea to combine different techniques coming from AI to extract geospatial references is appealing, maybe some particular methodologies are premature. On the other hand, I am not convinced as to what extent this is actually a combination of techniques or rather it is an analysis of the dataset and of the results of each methodology in isolation without integration and evaluation of the results in a constructive and synthesized way.

We appreciate your comment on the necessity of highlighting the complexity of geospatial data and its processing through AI methodologies. Your observation correctly points out that while the potential of combining different (AI) techniques for extracting geospatial references is promising, some methods may still be in the early stages of their application to this particular field. We understand your concern regarding the extent to which these techniques have been synthesized in our study. As this article establishes a baseline for future studies and publications, we intend to incorporate the 'location scoring system' and apply it to a much bigger dataset.

Thank you again for your detailed and constructive review. This was the most valuable review for us, as it dealt with discipline-specific details that we had not previously given sufficient consideration!

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Authors improved the quality of the paper and responded to my concerns, I do not have any more questions.

Reviewer 3 Report

Comments and Suggestions for Authors

The Authors made the required corrections.

Comments on the Quality of English Language

Just a final proofread is needed.

Reviewer 4 Report

Comments and Suggestions for Authors

I am pretty satisfied that the authors have responded not only to my comments but also to those of the other reviewers in an adequate and coherent manner, both in their replies addressed to the reviewers and throughout the paper.

The introduction is more detailed and section 3.3 includes alternatives for analyzing text in videos and provides justification on the application that was ultimately selected.

Finally, the conclusions section makes a better critique of the results and gives a better account of the future research the authors intend to pursue. Although I do not think that the results per se provide a groundbreaking basis, location extraction from videos is an open and very complex field of research, so it is worth the effort of pursuing such an endeavor.

 

Many issues are still not treated in this paper, but since this is ongoing research, the manuscript, as it stands, is worth publishing, keeping in mind that the approach and results presented therein stem from early research stages.
