Peer-Review Record

ConvGRU Hybrid Model Based on Neural Ordinary Differential Equations for Continuous Dynamics Video Object Detection

Electronics 2025, 14(10), 2033; https://doi.org/10.3390/electronics14102033
by Linbo Qian 1,2, Shanlin Sun 2,3,* and Shike Long 2,3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 27 March 2025 / Revised: 10 May 2025 / Accepted: 12 May 2025 / Published: 16 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This article proposes a ConvGRU hybrid model based on Neural Ordinary Differential Equations for continuous-dynamics video object detection. The problem is interesting, and the experiments show that the model achieves better performance. However, I still have a few small questions:

1) This study proposes a hybrid model that integrates Neural Ordinary Differential Equations (Neural ODEs) with Convolutional Gated Recurrent Units (ConvGRU) to achieve continuous dynamics in object detection for video data. The experiments show that it works, but I have a question about the hidden state transitions: how are the hidden states defined, and how are they collected in a real application? More details are expected.

2) How do the authors define the words "dynamics" and "continuous" in video detection tasks? What is the real meaning of these two words? As we know, most video sequences are dynamic and continuous, so how should these concepts be understood and explained?

3) In Table 3, the detection accuracy table, we can see that the proposed model achieves a better mAP, but the authors do not provide data on computational cost. How is the balance between mAP and computation maintained? I advise the authors to list more data to support the main idea.

4) The authors build their video object detection solution on networks and convolutions, but as stated in the article, feature extraction plays an important role in this kind of task. I advise the authors to add more discussion of traditional methods for extracting multiple features, with references to the literature in this field, for example https://doi.org/10.3390/rs17020193. More references are needed to make the discussion more complete.
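Regarding the hidden-state question in point 1, the general ODE-RNN pattern that Neural-ODE/GRU hybrids follow can be sketched as below. This is a toy illustration with random weights, not the authors' implementation: the hidden state evolves continuously between frames by integrating a learned dynamics function, and is corrected by a discrete GRU-style update when a frame is observed. All names (`f`, `ode_evolve`, `gru_update`) and the tiny dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # toy hidden-state size

# Toy dynamics function f(h); in a real model this is a learned network.
W_f = rng.standard_normal((H, H)) * 0.1
def f(h):
    return np.tanh(W_f @ h)

def ode_evolve(h, dt, steps=10):
    """Euler-integrate dh/dt = f(h) over an inter-frame gap of length dt."""
    step = dt / steps
    for _ in range(steps):
        h = h + step * f(h)
    return h

# Toy GRU-style update applied when a frame is observed.
W_z = rng.standard_normal((H, 2 * H)) * 0.1
W_c = rng.standard_normal((H, 2 * H)) * 0.1
def gru_update(h, x):
    z = 1.0 / (1.0 + np.exp(-(W_z @ np.concatenate([h, x]))))  # update gate
    c = np.tanh(W_c @ np.concatenate([h, x]))                  # candidate state
    return (1 - z) * h + z * c

# Frames arrive at (possibly irregular) timestamps; the ODE bridges the gaps.
h = np.zeros(H)
t_prev = 0.0
for t, frame_feat in [(0.1, rng.standard_normal(H)),
                      (0.4, rng.standard_normal(H)),
                      (0.5, rng.standard_normal(H))]:
    h = ode_evolve(h, t - t_prev)  # continuous dynamics between frames
    h = gru_update(h, frame_feat)  # discrete correction at the frame
    t_prev = t
print(h.shape)  # → (8,)
```

In this pattern the "hidden state" is simply the recurrent feature map carried across frames; it is initialized to zeros and collected by running the recurrence over the incoming frame stream.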

Comments on the Quality of English Language

I think the authors should ask native English speakers for help polishing the text.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents a method for detecting objects in a video sequence. The solution uses a proprietary approach to sequence processing based on enriched GRU gates and a two-stage feature extraction method that includes attention mechanisms. However, I have found some minor issues in the manuscript, which I point out below with reference to specific sections.

Introduction

In the first paragraph, it is worth expanding on the topic of object detection right away to bring the area closer to the reader. Starting directly with a literature review can be too hard for a reader less experienced in the subject.

Proposed method

The descriptions of steps 1-3 do not quite match the labels in Figure 3, in particular for stage 2. It is worth unifying them, which would help the reader build an intuition for this approach at the beginning of the section, which is the main axis of the article.

Feature Pyramid Network - Up

Figure 4: It is not clear what the individual tensor sizes at the different levels of the pyramid mean. Please describe the entire process in more detail; at this point, the description of this stage is very general and at a much lower level of detail than the previous section.

Convolutional Block Attention Module (CBAM)

Figure 5: Not all objects in this figure are labeled. As it stands, the figure is not fully readable and does not reflect the essence of the method.

It would be good to show here (or in the results) what the attention maps resulting from this operation look like and how they affect the final detections.
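For context, the CBAM operation under discussion applies channel attention followed by spatial attention to a feature map; the resulting spatial attention map is exactly what could be visualized as the reviewer suggests. Below is a minimal numpy sketch of the standard CBAM scheme with random toy weights (not the paper's trained module); for brevity the 7x7 spatial convolution of the original CBAM is replaced by a per-location 2-channel linear map.

```python
import numpy as np

def cbam(x, reduction=2, seed=0):
    """Minimal CBAM sketch: channel attention, then spatial attention.
    x: feature map of shape (C, H, W). Weights are random (toy)."""
    C, H, W = x.shape
    rng = np.random.default_rng(seed)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

    # Channel attention: avg- and max-pool over space, shared MLP, sum, sigmoid.
    W1 = rng.standard_normal((C // reduction, C)) * 0.1
    W2 = rng.standard_normal((C, C // reduction)) * 0.1
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0)
    ch_att = sigmoid(mlp(x.mean(axis=(1, 2))) + mlp(x.max(axis=(1, 2))))  # (C,)
    x = x * ch_att[:, None, None]

    # Spatial attention: avg/max over channels, then a toy 2-channel
    # linear map standing in for CBAM's 7x7 convolution.
    w = rng.standard_normal(2) * 0.1
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])    # (2, H, W)
    sp_att = sigmoid(np.tensordot(w, stacked, axes=1))     # (H, W) attention map
    return x * sp_att[None, :, :]

y = cbam(np.random.default_rng(1).standard_normal((4, 5, 5)))
print(y.shape)  # → (4, 5, 5)
```

Visualizing `sp_att` as a heatmap over the input frame is one straightforward way to address this comment.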

Experimental results

The description of the dataset used to test the proposed method should be enriched with additional information on the classes of detected objects and the general nature of the images in terms of lighting conditions, background, etc.

It is not clear from the first description of the dataset whether the same data subset was used to train the proposed method.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This is a fascinating and scientifically sound study with broad applications. The manuscript is well written and scientifically appealing. It requires a few minor points to be addressed.

(1) Add equation numbers for Precision, Recall, and mAP formulations

(2) Add reference source(s) for equation (6).
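For reference, the standard formulations the reviewer asks to have numbered are the usual detection metrics (assuming the common TP/FP/FN notation; the paper's exact variants may differ):

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP}, \\
\mathrm{Recall} &= \frac{TP}{TP + FN}, \\
\mathrm{AP} &= \int_0^1 p(r)\,dr, \qquad
\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}_i,
\end{align}
```

where \(p(r)\) is precision as a function of recall and \(N\) is the number of object classes.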

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors
  1. The experiments cannot support the authors' conclusions about the proposed algorithm's performance. A formal academic article should cover at least three independent datasets. In particular, as a well-known, open-source object detection dataset, ImageNet should be used to validate the algorithm.
  2. The KITTI dataset was mainly designed for 3D object detection in autonomous driving. It is not very suitable as a benchmark for general object detection algorithms; otherwise, the authors should state that their algorithm mainly targets the autonomous driving domain. Another point is that KITTI contains objects in 8 classes: 'Car', 'Van', 'Truck', 'Pedestrian', 'Person (sitting)', 'Cyclist', 'Tram', and 'Misc', but the authors only list results for Car, Cyclist, and Pedestrian. What about the others? The authors should add results for the remaining object classes.

  3. The experiments only compare the results with RMf-SSD, proposed in 2018, and SqueezeDet, proposed in 2017. Both of these algorithms are really too old. In fact, the authors do NOT compare their proposal with methods published in the last three years, such as YOLOv11. The article's novelty needs more support. I advise the authors to add at least five more methods, at least three of them from within the last three years.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors
  1. In the response letter, the authors list their plans for further experiments. This is reasonable, but I advise the authors to address these plans and their purpose in the article's discussion, or in a 'future work' subsection.
  2. The authors should also state the reason why they do not cover other datasets. I do not think "we don't have much time before the deadline" is a good justification. I advise adding more technical explanations to support this decision in the final article text.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
