Peer-Review Record

A Representation of Membrane Computing with a Clustering Algorithm on the Graphical Processing Unit

Processes 2020, 8(9), 1199; https://doi.org/10.3390/pr8091199
by Ravie Chandren Muniyandi * and Ali Maroosi
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 30 July 2020 / Revised: 5 September 2020 / Accepted: 18 September 2020 / Published: 22 September 2020
(This article belongs to the Special Issue Modeling, Simulation and Design of Membrane Computing System)

Round 1

Reviewer 1 Report

The paper proposed a new classification algorithm that organizes dependent objects and membranes based on the communication rate associated with a defined weighted network and assigns them to sub-matrices for execution by the same threads and thread blocks. In this way, it decreases communication between threads and thread blocks. The proposed algorithm was shown to boost processing by 93-fold. This paper has potential for simulating biological processes with an efficient algorithm. However, there are still some concerns that need to be addressed. I would recommend major revision.

Motivation / Novelty 

It seems that the main contribution of this paper is to introduce blocks of matrices generated by communication rates to decrease the invocation workload, so that the processing time could be essentially decreased. The original idea of introducing matrices is not proposed by this paper either. The paper puts a lot of effort into presenting previous works and does not sufficiently highlight its own novelty/contributions. I would suggest trimming the content discussing previous works and perhaps adding some illustrative examples in the introduction section showing the advantages of the proposed approach. Moreover, I am not sure the novelty of this paper is sufficient for the readership of this journal.

Numerical Results 

I have some questions related to the numerical results: 

    1. How are the results related to the structure of the sub-matrices (blocks of matrices)? It would be ideal if the paper could discuss how different matrix structures would affect the processing times/performance. What if the input contains errors and "Balancing_Occup_Sync_Approach" produces a different output than expected? How would the processing time change?
    2. How are the processing-time advantages related to size/scalability? From the results shown in this paper, it seems that the advantages would be strengthened as the size increases. It would be ideal if the authors could dig deeper into understanding the advantages of the proposed algorithm under different scenarios.

Specific concerns:

  1. The writing of the paper could be improved. The formatting of notations is inconsistent. Please correct accordingly.
  2. The tables need to be reformatted.
  3. I would suggest changing Figure 6. The results could be presented as ratios over the other competing approaches, and the y-axis could be changed to scientific notation or to "KB, MB, GB, TB" units to make it more readable.
  4. The references need to be reformatted.

Author Response

REVIEWER 1

  • Motivation / Novelty 
    • It seems that the main contribution of this paper is to introduce blocks of matrices generated by communication rates to decrease the invocation workload, so that the processing time could be essentially decreased.
    • And the original idea of introducing matrices is not proposed by this paper either.
      • We acknowledge that the original idea of introducing matrices in membrane computing was not proposed by this paper, but we have developed a mechanism for implementing matrices on the GPU to maximize the utilization of threads and thread blocks. We have highlighted the contribution of the paper in the introduction section (lines 93-97). Thank you.
  • The paper put lots of efforts in presenting previous works and did not highlight enough the novelty / contributions of this paper.
    • We have presented some of the previous works to reflect the latest developments in the theoretical aspects and real-world practicality of membrane computing, to relate them to the limitations, and to indicate the need for improvements in parallel implementation. We have improved the presentation of this part (lines 32-92). Thank you.
    • The main contribution of this paper is to develop a classification algorithm based on the communication rate to manage dependent objects and membranes in membrane computing implemented on the GPU. We have defined a weighted network based on the communication rate and assigned dependent objects and membranes to sub-matrices for execution by the same threads and thread blocks, in order to improve the communication between thread blocks, reduce kernel invocations, and maximize GPU occupancy. We have highlighted the contribution of the paper in the introduction section (lines 93-97). Thank you.
  • I would suggest trim the contents discussing previous works and maybe add some illustrative examples showing the advantages of using the proposed approach in the introduction section.
    • We have presented some of the previous works to reflect the latest developments in the theoretical aspects and real-world practicality of membrane computing, to relate them to the limitations, and to indicate the need for improvements in parallel implementation. We have improved the presentation of this part (lines 32-92). Thank you.
    • Thank you. We added some examples to show the advantages of our approach (lines 98-108): This approach can be applied to solve NP-hard problems such as the traveling salesman, knapsack, and satisfiability problems. The traveling salesman problem is a combinatorial optimization problem applied to network structure design, machine scheduling, cellular manufacturing, and frequency assignment. The Boolean satisfiability problem is applied to problems in mathematics, artificial intelligence, data mining, and circuit design. The knapsack problem is another combinatorial optimization problem that aims to maximize the value of items under backpack capacity constraints, and it is applied to decision problems. As the number of inputs increases, the time complexity of solving these problems also increases significantly. With the proposed classification algorithm, the efficiency and effectiveness of GPU parallelism will reduce the computational complexity, making it feasible to generate all solutions and then screen out the qualified ones in polynomial or even linear time.
  • Moreover, I am not sure the novelty of this paper is sufficient for the readership of this journal.

We have clarified that the main contribution of this paper is not the introduction of matrices in membrane computing for implementation on the GPU. Rather, our goal is to improve the processing speed of membrane computing on the GPU with our proposed approach. We have highlighted the contribution of the paper in the introduction section (lines 93-97). Thank you.

  • Numerical Results - I have some questions related to the numerical results:
    1. How are the results related to the structure of the sub-matrices (blocks of matrices)? It would be ideal if the paper

The dimensions of the sub-matrices depend on the number of objects in the membrane: when the number of objects increases, the dimensions of the sub-matrices also increase. Table 1 and Figure 8 show that when the number of objects increases, the performance of the proposed approach also improves. Typically, a better parallel speedup is obtained for large problem sizes, and the efficiency of the parallel program increases with the size of the matrix. Added to section 4 (lines 577-581). Thank you.
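
To picture the mapping described above, here is a minimal CUDA sketch (illustrative only, not the paper's actual kernel): one thread block per membrane and one thread per object, so that objects grouped into the same sub-matrix communicate through on-chip shared memory instead of across blocks. The kernel name and data layout are hypothetical, and the sketch assumes the per-membrane object count fits within one block.

```cuda
// Illustrative sketch: one thread block per membrane, one thread per object.
// "objects" is a flattened [numMembranes x objectsPerMembrane] matrix.
__global__ void applyRulesPerMembrane(int *objects, int objectsPerMembrane)
{
    extern __shared__ int local[];      // this membrane's objects only
    int m = blockIdx.x;                 // membrane index
    int o = threadIdx.x;                // object index within the membrane

    if (o < objectsPerMembrane) {
        // Stage the membrane's objects in shared memory: dependent objects
        // then interact within the block, with no inter-block communication.
        local[o] = objects[m * objectsPerMembrane + o];
        __syncthreads();

        // ... apply evolution rules on "local" here ...

        objects[m * objectsPerMembrane + o] = local[o];
    }
}
```

A launch such as `applyRulesPerMembrane<<<numMembranes, objectsPerMembrane, objectsPerMembrane * sizeof(int)>>>(d_objects, objectsPerMembrane);` then grows the per-block work, and hence the sub-matrix dimension, with the number of objects.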

    (cont.) could discuss how different matrix structures would affect the processing times/performance.

The block size is an important factor in attaining high performance in algorithms-by-blocks. Independently of the techniques used to reduce data transfers and to improve data affinity, an incorrect choice of this parameter leads to sub-optimal performance. In our case, the optimal value for the block size is a trade-off between several factors: a smaller block size translates into finer granularity, while a larger block dimension leads to less concurrency.

Given the pool of execution units available in the system, the size of the block greatly influences the degree of parallel execution on the sub-problems and reduces idle times. A small block size translates into a larger number of data transfers of reduced dimension. Conversely, a larger block dimension translates into higher effective bandwidth and thus benefits the final performance of the implementation. Thus, taking into account only the potential-parallelism factor, the optimal block size partially depends on the total matrix dimension.

Added to section 4 (lines 502-506 & lines 520-525). Thank you.
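
As a side note for readers experimenting with this trade-off, the CUDA runtime can suggest a block size that maximizes theoretical occupancy for a given kernel. A minimal sketch follows; the kernel `applyRules` is a hypothetical stand-in:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void applyRules(int *data, int n) { /* ... rule application ... */ }

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Query the block size that maximizes occupancy for this kernel on the
    // current device (0 bytes of dynamic shared memory, no block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       applyRules, 0, 0);
    printf("suggested block size: %d (minimum grid size: %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

The occupancy-based suggestion is only a starting point; as noted above, the best block size also depends on the matrix dimension and the communication pattern.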

    (cont.) What if the input contains errors and "Balancing_Occup_Sync_Approach" produces different outputs than expected?

Once the optimal block size is fixed for a given matrix size, the benefits of the different implementations of the runtime are mostly dictated by the number of objects and membranes necessary to perform a given operation. For "Balancing_Occup_Sync_Approach" (Algorithm 2), the inputs (the numbers of objects, membranes, and rules) are determined when the membrane system is designed. If there are errors in these inputs, especially in the rules, the system could behave abnormally and the expected output would not be generated. As a precaution against errors in the input, we validated the algorithm before it was implemented on the GPU.

Added to section 4 (lines 559-565). Thank you.

    (cont.) How would the processing time change?

In our case, the processing time depends on the communication between objects and membranes that is triggered by the rules. As Table 1 and Figure 8 show, when there is maximum occupancy with minimal communication between thread blocks, time efficiency and performance improve.

Added to section 4 (lines 565-568). Thank you.

    2. How are the processing-time advantages related to size/scalability? From the results shown in this paper, it seems that the advantages would be strengthened as the size increases. It would be ideal if the authors could dig deeper into understanding the advantages of the proposed algorithm under different scenarios.

In section 4, the effects of the number of objects in each membrane and of the number of membranes in the system on GPU performance, in comparison with previous approaches (Figure 3), are discussed. Section 4 has been updated (lines 604-637). Thank you.

  • Specific concerns:
  1. The writing of the paper could be improved. The formatting of notations is inconsistent. Please correct accordingly.

We have corrected accordingly and made the notations consistent. Thank you.

  2. The tables need to be reformatted.

The table has been reformatted. Thank you.

  3. I would suggest changing Figure 6. The results could be presented as ratios over the other competing approaches, and the y-axis could be changed to scientific notation or to "KB, MB, GB, TB" units to make it more readable.

The format has been changed. Thank you.

  4. The references need to be reformatted.

References reformatted. Thank you.

Author Response File: Author Response.pdf

Reviewer 2 Report

(1) Algorithms would be better shown as separate figures rather than simple boxes. Or, you should isolate some "Algorithm" notations.

(2) In the introduction section, you should refer to CUDA explicitly for the GPU applications.

(3) Table 1 should be explained more precisely.

(3.a) There is no detailed explanation of the columns of the table.

(3.b) Also, no analysis of the experimental results is given.

(3.c) In the third column of Table 1, all execution times for the sequential program on the CPU coincide at "1000 sec". It needs to be explained why all the CPU sequential executions mark the same execution time.

(3.d) What is the "previous method" in Table 1? Show it explicitly, and also how it was implemented.

(4) In Table 1, the final execution speeds of both the previous GPU-based method and the newly proposed method are too slow. According to benchmark sites, your CPU, an Intel Core i7-3820, marks 3.86 GFlops per core, and your GPU, a GTX 660, marks 1,981 GFlops per card (with 960 cores). So your massively parallel version can be accelerated at most 513 times (1,981 / 3.86 ≈ 513). You should explain the gap between this theoretical speedup limit and your experimental results.

(5) Some more practical explanations and real-world examples are needed for readers of the article. It was hard to understand the background work and future work.

Author Response

REVIEWER 2

(1) Algorithms would be better shown as separate figures rather than simple boxes. Or, you should isolate some "Algorithm" notations.

Algorithms 1 and 2 have been shown as Figures. Thank you.

(2) In the introduction section, you should refer to CUDA explicitly for the GPU applications.

Referred to CUDA in the second paragraph of the introduction section. Thank you.

(3) Table 1 should be explained more precisely.

Table 1 is reformatted to align with the explanations in section 4. Thank you.

(3.a) There is no detailed explanation of the columns of the table.

We have updated the detailed explanation of the columns. Thank you.

(3.b) Also, no analysis of the experimental results is given.

Section 4 has been improved (lines 459-637). Thank you.

(3.c) In the third column of Table 1, all execution times for the sequential program on the CPU coincide at "1000 sec". It needs to be explained why all the CPU sequential executions mark the same execution time.

The host/CPU part of the code is generally responsible for controlling the program execution flow, allocating memory on the host or device (GPU), and obtaining results from the device. We have set 1000 seconds as the CPU time consumed to complete this task sequentially. Explained in section 4 (lines 461-468). Thank you.
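
For readers unfamiliar with this division of labor, a generic host-side CUDA sketch (not the authors' code; the kernel and data are placeholders) shows the flow being described: allocate device memory, copy inputs, launch the kernel, and copy results back.

```cuda
#include <cuda_runtime.h>

__global__ void step(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;   // placeholder device computation
}

int main()
{
    const int n = 1 << 20;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));                            // device allocation
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    step<<<(n + 255) / 256, 256>>>(d, n);                         // kernel launch
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d);
    delete[] h;
    return 0;
}
```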

(3.d) What is the "previous method" in Table 1? Show it explicitly, and also how it was implemented.

Citations referring to the previous approaches have been included. Thank you.

(4) In Table 1, the final execution speeds of both the previous GPU-based method and the newly proposed method are too slow. According to benchmark sites, your CPU, an Intel Core i7-3820, marks 3.86 GFlops per core, and your GPU, a GTX 660, marks 1,981 GFlops per card (with 960 cores). So your massively parallel version can be accelerated at most 513 times (1,981 / 3.86 ≈ 513). You should explain the gap between this theoretical speedup limit and your experimental results.

This study focused on comparing previous GPU implementations of membrane computing [31-33] with our proposed approach (Figures 4 and 7). Both were implemented on the GPU, but our approach demonstrated that, by managing dependent objects and membranes within a thread block and by maximizing GPU occupancy, the performance of membrane systems can be accelerated. We acknowledge the importance of analyzing the gap between theoretical speedup limits and experimental results, but unfortunately it was not within the scope of this study. Nevertheless, we will consider it in our future work. Added to Conclusions/Future Works (lines 657-659). Thank you.

(5) Some more practical explanations and real-world examples are needed for readers of the article. It was hard to understand the background work and future work.

The introduction section has been updated with the following examples (lines 98-108). Thank you:

This approach can be applied to solve NP-hard problems such as the traveling salesman, knapsack, and satisfiability problems. The traveling salesman problem is a combinatorial optimization problem applied to network structure design, machine scheduling, cellular manufacturing, and frequency assignment. The Boolean satisfiability problem is applied to problems in mathematics, artificial intelligence, data mining, and circuit design. The knapsack problem is another combinatorial optimization problem that aims to maximize the value of items under backpack capacity constraints, and it is applied to decision problems.

As the number of inputs increases, the time complexity of solving these problems also increases significantly. With the proposed classification algorithm, the efficiency and effectiveness of GPU parallelism will reduce the computational complexity. It becomes feasible to generate all solutions and then screen out the qualified ones in polynomial or even linear time.
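
To make the generate-then-screen idea concrete on the knapsack example, here is a toy CUDA sketch in which each thread evaluates one candidate subset of items, encoded as the bitmask of its thread index. This is illustrative only; the paper's membrane-computing encoding is different, and the kernel name and parameters are hypothetical.

```cuda
// Toy brute-force knapsack: thread t screens the item subset with bitmask t.
// Assumes nItems < 32 and *bestValue initialized to 0 on the device.
__global__ void screenSubsets(const int *w, const int *v, int nItems,
                              int capacity, int *bestValue)
{
    unsigned t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= (1u << nItems)) return;   // one thread per candidate subset

    int weight = 0, value = 0;
    for (int i = 0; i < nItems; ++i)
        if (t & (1u << i)) { weight += w[i]; value += v[i]; }

    if (weight <= capacity)
        atomicMax(bestValue, value);   // keep the best feasible value
}
```

All 2^n candidates are generated and screened in parallel; the remaining sequential work is only the O(n) evaluation inside each thread.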

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

I thank the authors for working on my previous comments. I don't have any further questions.
