Quantization of Faster R-CNN
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The article proposes a generic Faster R-CNN quantization algorithm; the implementation is open source and compatible with the PyTorch ecosystem. The solution reduces model size by 67.2% and detection time by 50.4% while keeping accuracy on the test data within an 8.2% error margin with a standard deviation of ±3.4%. It also allows the visualization of model errors by extracting the internal activation graph of the model, supporting a more efficient understanding of its behavior. The work has certain application value, but there are the following problems.
- It is somewhat unreasonable for the authors to cite the same references multiple times.
- This article proposes a general Faster R-CNN quantization algorithm, claiming that model size is reduced by 67.2% and detection time by 50.4%. How were these measured?
- What is the theoretical basis for proposing a general Faster R-CNN quantization algorithm?
- Figures 6, 7 and 8 are not clear enough.
- For safety-critical applications, the interpretability and robustness of the proposed general Faster R-CNN quantization algorithm are crucial. How are these demonstrated?
- It is proposed that the general Faster R-CNN quantization algorithm should be compared with existing models to demonstrate its advantages.
Author Response
Dear Reviewer
Thank you for your overall positive assessment of the manuscript; we answer your questions below.
The PyTorch ecosystem is quite large, and suitable alternative references are hard to find, which is why the same references are cited a few times in different contexts. We will fix this.
There are several ways to measure the memory requirements of a model. PyTorch's CUDA functions can report the memory used; alternatively, one can use the size of the saved model or the number and bit depth of its parameters. We used the first method with batch sizes of 1, 2, 4, 32, and 64.
The inference time covers only the prediction step. Overall, the time required was reduced by 50.4% under INT8 quantization.
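For illustration, a minimal sketch of such a measurement in PyTorch is shown below; it is not the exact script used in the article, and the function and variable names are ours. It assumes a torchvision-style detection model that accepts a list of image tensors (for CPU-only quantized inference, the CUDA calls would simply be omitted).

```python
import time
import torch

def measure_memory_and_time(model, images, device="cuda"):
    """Peak GPU memory (bytes) and pure prediction time (s) for one batch."""
    model.to(device).eval()
    images = [img.to(device) for img in images]   # torchvision detectors take a list of tensors

    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    with torch.no_grad():
        model(images)                              # prediction only, no pre/post-processing
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start

    peak_memory = torch.cuda.max_memory_allocated(device)
    return peak_memory, elapsed
```

The same routine can be run for each batch size (1, 2, 4, 32, 64) and for both the original and the quantized model to compare memory use and prediction time.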
The vision and fully connected parts of the model were quantized, so its size was reduced by 67.2%; if every part could have been quantized, 75% would have been achieved. RoI pooling is similar to max pooling, only more complicated; that is why Faster R-CNN models are not usually quantized, but most of their parts can be quantized, as we did, and the results prove it. We also made sure that the visualization part is not Faster R-CNN specific: it works for any simple CNN and for YOLO models, and it now also supports the ONNX format.
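As an illustration of partial quantization (not the exact pipeline of the article), the sketch below applies PyTorch's built-in dynamic INT8 quantization to the fully connected layers of a torchvision Faster R-CNN and compares the serialized model sizes; quantizing the convolutional backbone would additionally require static quantization with a calibration pass.

```python
import io
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn().eval()

# Dynamically quantize only the fully connected (Linear) layers to INT8.
# RoI pooling and the rest of the detection head stay in FP32.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def model_size_bytes(m):
    """Serialized size of the model's state dict, in bytes."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes

print(model_size_bytes(model), model_size_bytes(quantized))
```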
We are glad that you analyzed Figures 6, 7, and 8. To be precise, they show the activation differences between the two models. Since the absolute difference is not necessarily informative on its own, we also depicted the relative differences. Only the first layer is visible in the first figure, while all convolutional layers are visible in the other two.
This behavior shows that lower-level convolutions produce larger, more detailed maps because they examine smaller areas. Due to the properties of matrix multiplication, bright (close to 1) regions such as the sky produce high activations for almost any filter, even though no object class is present there; this is corrected by subsequent convolutions and biases.
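A minimal sketch of how such activation maps can be extracted with forward hooks and compared between the original and the quantized model is shown below; it is illustrative only (names and structure are ours) and not the visualization code of the article.

```python
import torch

def capture_activations(model, image, layer_type=torch.nn.Conv2d):
    """Run one image through the model and collect the outputs of every layer
    of the given type, keyed by the module name."""
    activations, hooks = {}, []
    for name, module in model.named_modules():
        if isinstance(module, layer_type):
            def hook(mod, inp, out, name=name):
                activations[name] = out.detach()
            hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model([image])        # torchvision detection models take a list of image tensors
    for h in hooks:
        h.remove()
    return activations

def activation_differences(acts_ref, acts_quant, eps=1e-6):
    """Per-layer absolute and relative differences between two activation sets."""
    diffs = {}
    for name, a in acts_ref.items():
        b = acts_quant[name]
        if b.is_quantized:
            b = b.dequantize()
        abs_diff = (a - b).abs()
        rel_diff = abs_diff / (a.abs() + eps)
        diffs[name] = (abs_diff, rel_diff)
    return diffs
```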
Maintaining accuracy under quantization is key to many of its uses. Faster R-CNN is a large model and, as can be seen in the table, its accuracy was not degraded by quantization. Although not included in the article, overparameterized large language models can run even under INT4 with acceptable precision. If the problem is not too stochastic and the model is overparameterized and well trained, accuracy should not deteriorate, although verifying this is exactly what this system is for. A much smaller model trained on the same dataset deteriorated much more under quantization, although with 16-bit quantization it lost only 4% accuracy compared to the large model. If the accuracy of the large model does not deteriorate at all under quantization, it is worth considering training a much smaller model with knowledge transfer or retraining, because this allows faster inference.
The last question was:
"It is proposed that the general faster R-CNN quantization algorithm should be compared with the existing model to reflect its advantages."
The previously mentioned Figures 6, 7, and 8 are themselves a comparison of the original and the quantized model, as are the hexbin plots. We used two metrics, but any comparison function could be applied. The outputs of the original model effectively serve as the baseline for comparison. In most cases the quantized outputs are very close to the original model's; we focused on where they differ the most. (In the heat maps, even red often does not represent a large difference; it is only the largest relative to the other parts.)
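A minimal illustration of such a hexbin comparison is sketched below, with synthetic data standing in for the paired per-detection outputs of the FP32 and INT8 models; it is not the plotting code of the article.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for paired per-detection outputs (e.g. confidence scores)
# of the original and the quantized model.
rng = np.random.default_rng(0)
fp32_out = rng.uniform(0.0, 1.0, 5000)
int8_out = np.clip(fp32_out + rng.normal(0.0, 0.02, 5000), 0.0, 1.0)

plt.hexbin(fp32_out, int8_out, gridsize=40, mincnt=1)
plt.plot([0, 1], [0, 1], "r--", linewidth=1)   # identity line: perfect agreement
plt.xlabel("Original (FP32) output")
plt.ylabel("Quantized (INT8) output")
plt.colorbar(label="count")
plt.show()
```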
Thank you for the detailed analysis; we will upload the improved version soon.
Best regards:
Authors
Reviewer 2 Report
Comments and Suggestions for Authors
See attached PDF.
Comments for author File:
Comments.pdf
Author Response
Dear Reviewer
Thank you for your letter; it was very kind. We are glad that you are satisfied overall. Answering your question about future work: the approach already works with ONNX, with YOLO models, and with classic CNN models.
Although not included in the article, large language models also run under INT4, and their performance did not deteriorate at all due to quantization; we believe the problem is not stochastic enough even for Faster R-CNN, so the model is overparameterized. The same trend was also observed with YOLO models of different sizes on other datasets, for example cytology datasets.
If the model does not deteriorate at all even under aggressive quantization, it is probably worth training a much smaller model and quantizing it at 16 bits; this will not merely halve the inference time but will yield a multiple speedup. A model one-fifth the size of the one in the article lost 4% precision when quantized at 16 bits compared to the original large model; however, it went completely blind at 8 bits. An even smaller model could no longer learn the traffic problem.
Another observation was made on the cytology dataset, where the white background gave a falsely high activation. We inverted the images, and the accuracy increased. This stems from the properties of matrix multiplication: high pixel intensity gives a high activation for all filters, and this is corrected by later convolutions. It was also observed that the sky, especially in the early convolutions, gave a high activation.
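As a minimal sketch of this preprocessing step (illustrative only, assuming image tensors normalized to [0, 1]):

```python
import torch

def invert(image: torch.Tensor) -> torch.Tensor:
    """Invert an image tensor normalized to [0, 1], so a bright (white)
    background becomes dark and no longer dominates early conv activations."""
    return 1.0 - image
```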
Overall, a growing number of measurements support that the accuracy loss from quantization is influenced much more by how overparameterized the model is for the given task and by how well it learned the task. Having more layers tends to improve this, as we mentioned in the previous example.
We will upload the article with the requested modifications soon.
Best regards
Authors
Reviewer 3 Report
Comments and Suggestions for Authors
The paper presents a study on quantization of the Faster R-CNN network for size compression and inference speed-up.
I recommend the paper for publication as is. I think the paper is good and comprehensible; although the content may be on the shorter side, the presentation of the method and of the quantization/calibration methods is solid.
Author Response
Dear Reviewer
Thank you for your analysis. We tried to be concise and focused on the results, even if the article is a bit shorter than most. We are glad that it does not require any modifications. We will also answer the questions of the other reviewers and plan to submit the revised article.
Best regards:
Authors
