Peer-Review Record

ConvGRU Hybrid Model Based on Neural Ordinary Differential Equations for Continuous Dynamics Video Object Detection

Electronics 2025, 14(10), 2033; https://doi.org/10.3390/electronics14102033
by Linbo Qian 1,2, Shanlin Sun 2,3,* and Shike Long 2,3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 27 March 2025 / Revised: 10 May 2025 / Accepted: 12 May 2025 / Published: 16 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This article proposes a ConvGRU hybrid model based on Neural Ordinary Differential Equations for continuous-dynamics video object detection. The problem is interesting, and the experiments show that the model achieves better performance. However, I still have a few small questions:

1) This study proposes a hybrid model that integrates Neural Ordinary Differential Equations (Neural ODEs) with Convolutional Gated Recurrent Units (ConvGRU) to achieve continuous dynamics in object detection for video data. The experiments show that it works, but I have a question about the hidden state transitions: how are the hidden states defined, and how are they collected in a real application? More details are expected.

2) How do the authors define the words "dynamics" and "continuous" in video detection tasks? What is the real meaning of these two words? As we know, most video sequences are dynamic and continuous, so how should these concepts be understood and explained?

3) In Table 3, the detection accuracy table, we can see that the proposed model achieves a better mAP, but the authors do not provide data on computational cost. How is the balance between mAP and computation maintained? I advise the authors to list more data to support the main idea.

4) The authors build their video object detection solution on networks and convolutions, but as stated in the article, feature extraction plays an important role in this kind of task. I advise the authors to add more discussion of traditional methods for extracting multiple features, with references to the literature in this field, for example https://doi.org/10.3390/rs17020193. More references are needed to make the discussion more complete.
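Regarding the hidden-state question in point 1, the general ODE-RNN pattern that Neural-ODE/GRU hybrids follow can be sketched as below. This is a toy illustration with random weights, not the authors' implementation: the hidden state evolves continuously between frames by integrating a learned dynamics function, and is corrected by a discrete GRU-style update when a frame is observed. All names (`f`, `ode_evolve`, `gru_update`) and the tiny dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # toy hidden-state size

# Toy dynamics function f(h); in a real model this is a learned network.
W_f = rng.standard_normal((H, H)) * 0.1
def f(h):
    return np.tanh(W_f @ h)

def ode_evolve(h, dt, steps=10):
    """Euler-integrate dh/dt = f(h) over an inter-frame gap of length dt."""
    step = dt / steps
    for _ in range(steps):
        h = h + step * f(h)
    return h

# Toy GRU-style update applied when a frame is observed.
W_z = rng.standard_normal((H, 2 * H)) * 0.1
W_c = rng.standard_normal((H, 2 * H)) * 0.1
def gru_update(h, x):
    z = 1.0 / (1.0 + np.exp(-(W_z @ np.concatenate([h, x]))))  # update gate
    c = np.tanh(W_c @ np.concatenate([h, x]))                  # candidate state
    return (1 - z) * h + z * c

# Frames arrive at (possibly irregular) timestamps; the ODE bridges the gaps.
h = np.zeros(H)
t_prev = 0.0
for t, frame_feat in [(0.1, rng.standard_normal(H)),
                      (0.4, rng.standard_normal(H)),
                      (0.5, rng.standard_normal(H))]:
    h = ode_evolve(h, t - t_prev)  # continuous dynamics between frames
    h = gru_update(h, frame_feat)  # discrete correction at the frame
    t_prev = t
print(h.shape)  # → (8,)
```

In this pattern the "hidden state" is simply the recurrent feature map carried across frames; it is initialized to zeros and collected by running the recurrence over the incoming frame stream.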

Comments on the Quality of English Language

I think the authors should ask native English speakers for help polishing the text.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents a method for detecting objects in a video sequence. The solution uses a proprietary approach to sequence processing based on enriched GRU gates and a two-stage feature extraction method that includes attention mechanisms. However, I have found some minor issues in the manuscript, which I point out below with reference to specific sections.

Introduction

In the first paragraph, it is worth expanding on the topic of object detection right away to bring the area closer to the reader. Starting directly with a literature review can be too hard for a reader less experienced in the subject.

Proposed method

The descriptions of steps 1-3 do not quite match the labels in Figure 3, in particular for stage 2. It is worth unifying them, which would help the reader build an intuition for this approach at the beginning of the section, which is the main axis of the article.

Feature Pyramid Network - Up

Figure 4: It is not clear what the individual tensor sizes at the different levels of the pyramid mean. Please describe the entire process in more detail; at this point, the description of this stage is very general and at a much lower level of detail than the previous section.

Convolutional Block Attention Module (CBAM)

Figure 5: Not all objects in this figure are labeled. As it stands, the figure is not fully readable and does not reflect the essence of the method.

It would be good to show here (or in the results) what the attention maps resulting from this operation look like and how they affect the final detections.
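For context, the CBAM operation under discussion applies channel attention followed by spatial attention to a feature map; the resulting spatial attention map is exactly what could be visualized as the reviewer suggests. Below is a minimal numpy sketch of the standard CBAM scheme with random toy weights (not the paper's trained module); for brevity the 7x7 spatial convolution of the original CBAM is replaced by a per-location 2-channel linear map.

```python
import numpy as np

def cbam(x, reduction=2, seed=0):
    """Minimal CBAM sketch: channel attention, then spatial attention.
    x: feature map of shape (C, H, W). Weights are random (toy)."""
    C, H, W = x.shape
    rng = np.random.default_rng(seed)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

    # Channel attention: avg- and max-pool over space, shared MLP, sum, sigmoid.
    W1 = rng.standard_normal((C // reduction, C)) * 0.1
    W2 = rng.standard_normal((C, C // reduction)) * 0.1
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0)
    ch_att = sigmoid(mlp(x.mean(axis=(1, 2))) + mlp(x.max(axis=(1, 2))))  # (C,)
    x = x * ch_att[:, None, None]

    # Spatial attention: avg/max over channels, then a toy 2-channel
    # linear map standing in for CBAM's 7x7 convolution.
    w = rng.standard_normal(2) * 0.1
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])    # (2, H, W)
    sp_att = sigmoid(np.tensordot(w, stacked, axes=1))     # (H, W) attention map
    return x * sp_att[None, :, :]

y = cbam(np.random.default_rng(1).standard_normal((4, 5, 5)))
print(y.shape)  # → (4, 5, 5)
```

Visualizing `sp_att` as a heatmap over the input frame is one straightforward way to address this comment.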

Experimental results

The description of the dataset used to test the proposed method should be enriched with additional information on the classes of detected objects and the general nature of the images in terms of lighting conditions, background, etc.

It is not clear from the first description of the dataset whether the same data subset was used to train the proposed method.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This is a fascinating and scientifically sound study with broad applications. The manuscript is well written and scientifically appealing. It requires a few minor points to be addressed.

(1) Add equation numbers for Precision, Recall, and mAP formulations

(2) Add reference source(s) for equation (6).
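For reference, the standard formulations the reviewer asks to have numbered are the usual detection metrics (assuming the common TP/FP/FN notation; the paper's exact variants may differ):

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP}, \\
\mathrm{Recall} &= \frac{TP}{TP + FN}, \\
\mathrm{AP} &= \int_0^1 p(r)\,dr, \qquad
\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}_i,
\end{align}
```

where \(p(r)\) is precision as a function of recall and \(N\) is the number of object classes.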

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors
  1. The experiments cannot support the authors' conclusions about the proposed algorithm's performance. A formal academic article should cover at least three independent datasets. In particular, as a well-known, open-source object detection dataset, ImageNet should be used to validate the algorithm.
  2. The KITTI dataset was mainly designed for 3D object detection in autonomous driving. It is not very suitable as a benchmark for general object detection algorithms; otherwise, the authors should state that their algorithm mainly targets the autonomous driving domain. Another point is that KITTI contains objects in 8 classes: 'Car', 'Van', 'Truck', 'Pedestrian', 'Person (sitting)', 'Cyclist', 'Tram', and 'Misc', but the authors only list results for Car, Cyclist, and Pedestrian. What about the others? The authors should add results for the remaining object classes.

  3. The experiments only compare the results with RMf-SSD, proposed in 2018, and SqueezeDet, proposed in 2017. Both of these algorithms are really too old. In fact, the authors do NOT compare their proposal with methods published in the last three years, such as YOLOv11. The article's novelty needs more support. I advise the authors to add at least five more methods, at least three of them from within the last three years.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors
  1. In the response letter, the authors list their plans for further experiments. This is reasonable, but I advise the authors to address these plans and their purpose in the article's discussion, or in a 'future work' subsection.
  2. The authors should also state the reason why they do not cover other datasets. I do not think "we don't have much time before the deadline" is a good justification. I advise adding more technical explanations to support this decision in the final article text.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
