A Representation of Membrane Computing with a Clustering Algorithm on the Graphical Processing Unit

: Long-timescale simulations of biological processes such as photosynthesis or attempts to solve NP-hard problems such as traveling salesman, knapsack, Hamiltonian path, and satisﬁability using membrane systems without appropriate parallelization can take hours or days. Graphics processing units (GPU) deliver an immensely parallel mechanism to compute general-purpose computations. Previous studies mapped one membrane to one thread block on GPU. This is disadvantageous given that when the quantity of objects for each membrane is small, the quantity of active thread will also be small, thereby decreasing performance. While each membrane is designated to one thread block, the communication between thread blocks is needed for executing the communication between membranes. Communication between thread blocks is a time-consuming process. Previous approaches have also not addressed the issue of GPU occupancy. This study presents a classiﬁcation algorithm to manage dependent objects and membranes based on the communication rate associated with the deﬁned weighted network and assign them to sub-matrices. Thus, dependent objects and membranes are allocated to the same threads and thread blocks, thereby decreasing communication between threads and thread blocks and allowing GPUs to maintain the highest occupancy possible. The experimental results indicate that for 48 objects per membrane, the algorithm facilitates a 93-fold increase in processing speed compared to a 1.6-fold increase with previous algorithms.


Introduction
Membrane systems are a class of computational models that inspired their computation from cell biology. Membrane systems have been applied in different areas [1], including zooplankton migrate vertically in system biology [2], image processing [3,4], robot path planning problem [5], power systems as well as address highly-complex computational problems such as traveling salesman, knapsack, Hamiltonian path and satisfiability [6][7][8][9][10]. The main elements of a membrane system include (i) structure, including its delimiting compartments; (ii) multisets of objects; (iii) biochemically-inspired rules. Membrane systems are based on the parallel and distributed systems found in biological cells, with both the objects and membranes located within the system capable of being processed by the same set of rules [11,12].
While various attempts have been carried out and are currently ongoing to imitate membrane systems, some of these had been implemented on single-processor computers [13], neglecting the intrinsic parallelism of the systems being simulated. Efforts are now focused on increasing implemented on GPU. This study defined a weighted network based on communication rate and assigned them to sub-matrices for execution by the same threads and thread blocks to improve the communication between thread blocks, reduce kernel invocations, and maximize GPU occupancy.
This approach can be applied to solve NP-hard problems such as traveling salesman, knapsack, and satisfiability. The traveling salesman problem is one of the combinatorial optimization problems and applied to the design of network structure, machine scheduling system, the manufacturing of cellular, and the assignment of frequency. Boolean satisfiability problem is used to solve problems such as in mathematics, artificial intelligence, data mining, and circuit design. The knapsack problem is another combinatorial optimization problem that aims to maximize the value of items under backpack capacity constraints, and it is being applied to solve decision problems.
With the increase in the number of inputs, the time complexity of solving these problems also will increase significantly. With the proposed classification algorithm, the efficiency and effectiveness of the parallelism of GPU will reduce the computational complexity. It is feasible to generate all solutions and then screen out the suitable solutions in polynomial time or even linear time.

Active Membrane Systems
Multiple parts, such as a skin and the membranes, are compartmentalized inside an active membrane system. In each membrane, various sets of objects (comparable to biochemical materials) and groups of evolutionary rules (identical to responses) exist inside of membranes' compartments ( Figure 1). (1) m is the preliminary degree of the membrane system, equivalent to the number of membranes, m ≥ 1; (2) O is the object alphabet; (3) H is a predetermined group of membrane labels; (4) µ is a membrane arrangement that includes m membranes labeled 1,…,m, each having a preliminary neutral polarization and tagged with components from H; (5) , … , are strings over O, designating the multisets of objects positioned in the m compartments of µ; (6) R is a predetermined group of rules described as: (1) m is the preliminary degree of the membrane system, equivalent to the number of membranes, m ≥ 1; (2) O is the object alphabet; (3) H is a predetermined group of membrane labels; (4) µ is a membrane arrangement that includes m membranes labeled 1, . . . ,m, each having a preliminary neutral polarization and tagged with components from H; (5) ω 0 , . . . , ω m are strings over O, designating the multisets of objects positioned in the m compartments of µ; (6) R is a predetermined group of rules described as: Evolution rules for the object, [a → u] α h , in which h ∈ H; α ∈ {+, −, 0} are electrical charges, a ∈ O, and u is a string over O that designates a multiset of objects connected with membranes that depend on the label and the charge related with the membranes; (b) "In" communication rules for the object from an environment entering into a membrane, β h , in which h ∈ H; α, β ∈ {+, −, 0}; a, b ∈ O, when an object enters a membrane, it is likely that this object modifies, where the preliminary charge, α, is transformed to β; (c) "Out" communication rules for the object from a membrane entering into an environment, [a] α h → [ ] β h b , in which h ∈ H; α, β ∈ {+, −, 0}; a, b ∈ O, when an object is discharged from a membrane, it is likely that this object modifies, where the preliminary charge, α, is transformed to β; (d) Dissolving rules for membrane, [a] α h → b, in which h ∈ H; α ∈ {+, −, 0}; a, b ∈ O, a membrane with a particular charge is dissolved in a reaction with a probably altered object; (e) Division rules for membrane, γ h , in which h ∈ H; α, β, γ ∈ {+, −, 0}; a, b, c ∈ O, in response to an object, the membrane is divided into two membranes; the label remains unceasing, but the charge could modify, and the objects inside the membrane are duplicated, except for a, which may be altered in each membrane.
These rules are implemented in accordance with the ensuing conceptions: Every rule is implemented in a maximally parallel mode, denote that one membrane object is consumed by at the most one rule in every phase. If an object is able to be consumed by beyond one rule, it must be picked in a non-deterministic means and then consumed by that rule. Whichever object that able to evolve based on any rule should evolve in one step. Rules (b) through (e) are not able to be implemented concurrently in one computational step, and the rules connected with membranes with the label, h, are implemented merely to those membranes. Active membrane systems are explained in detail by Rozenberg et al. [37].

Compute Unified Device Architecture (CUDA)
Processes on threads, thread blocks, and kernels were programmed by using CUDA [34]. Cores are consumed to execute threads, in which groups of thread blocks apportioned to one streaming multiprocessor (SM) and are considered a kernel to be performed on the SMs. Every thread block is implemented on a single SM. The same shared memory able to be consumed by all threads within a thread block. In comparison, mutual access to shared memory is unacceptable between thread blocks. Kernel re-invocation is ensued between thread blocks to apply thread synchronization. Meanwhile, thread synchronization within a thread block transpires through barrier synchronization.
Atomic operations in the same kernel are applied for synchronization between thread blocks [21,38,39] and demonstrated ineffectiveness. Every thread acquires fast memory across registers and slow memory across local memory. Global memory is also slow, though it is reachable through threads in every part of the thread block.

Applied Methods in Previous Research
Suppose that there are m membranes and n objects. Assume that a ij signifies the ith object in the jth membrane. Previously the methods to GPU simulation of active membrane systems represented every membrane to a single thread block and every membrane object to related threads within that thread block (Figures 2 and 3, line 0) [31][32][33]. If the quantity of objects within every membrane is insignificant, then the quantity of active threads in every thread block is negligible relative to the quantity of available threads. This leads to the inefficient use of computational resources and minimizes GPU occupancy.

Figure 2.
A membrane is assigned to one thread block, and objects are assigned to threads within each thread block [31][32][33].
Mapping individual membranes to single thread blocks will develop a mechanism in which the communication among membranes requires interaction among thread blocks. Thread-block synchronization necessitates the invocation of separate kernels. Inter-block synchronization without launching a second kernel has been attempted using atomic operations and the threadfence() function; however, the overhead of this approach is high [21,38,39]. Therefore, distinct kernel invocation is required for thread-block synchronization, making it impossible to perform division and communication rules in a single kernel with no invocation of a second ( Figure 3, lines 9, 11, and 13). Previously, division rules could not be executed without the invocation of a separate kernel, given that a newly produced membrane can be allocated to another thread block ( Figure 3, line 13). The algorithm presented here introduces a new representation of membrane systems in GPUs. The dissolution, communication, and dividing rules are executable within the same kernel without the necessity of invoking a separate one. A comparison between the two algorithms can be found in Section 3.2.

Figure 2.
A membrane is assigned to one thread block, and objects are assigned to threads within each thread block [31][32][33].

Figure 2.
A membrane is assigned to one thread block, and objects are assigned to threads within each thread block [31][32][33].
Mapping individual membranes to single thread blocks will develop a mechanism in which the communication among membranes requires interaction among thread blocks. Thread-block synchronization necessitates the invocation of separate kernels. Inter-block synchronization without launching a second kernel has been attempted using atomic operations and the threadfence() function; however, the overhead of this approach is high [21,38,39]. Therefore, distinct kernel invocation is required for thread-block synchronization, making it impossible to perform division and communication rules in a single kernel with no invocation of a second ( Figure 3, lines 9, 11, and 13). Previously, division rules could not be executed without the invocation of a separate kernel, given that a newly produced membrane can be allocated to another thread block ( Figure 3, line 13). The algorithm presented here introduces a new representation of membrane systems in GPUs. The dissolution, communication, and dividing rules are executable within the same kernel without the necessity of invoking a separate one. A comparison between the two algorithms can be found in Section 3.2. Mapping individual membranes to single thread blocks will develop a mechanism in which the communication among membranes requires interaction among thread blocks. Thread-block synchronization necessitates the invocation of separate kernels. Inter-block synchronization without launching a second kernel has been attempted using atomic operations and the threadfence() function; however, the overhead of this approach is high [21,38,39]. Therefore, distinct kernel invocation is required for thread-block synchronization, making it impossible to perform division and communication rules in a single kernel with no invocation of a second ( Figure 3, lines 9, 11, and 13).
Previously, division rules could not be executed without the invocation of a separate kernel, given that a newly produced membrane can be allocated to another thread block ( Figure 3, line 13). The algorithm presented here introduces a new representation of membrane systems in GPUs. The dissolution, communication, and dividing rules are executable within the same kernel without the necessity of invoking a separate one. A comparison between the two algorithms can be found in Section 3.2.

Proposed Approach
Here, we represent membrane systems as matrices that can be divided into sub-blocks to balance the number of threads used in GPU thread blocks [35,36]. The objects in the membranes are subsequently assigned to matrix entries (Figure 4), thereby increasing the efficiency with which the matrix allocates the threads in the thread blocks. In this condition, the quantity of threads in each thread block is not reliant on the quantity of objects inside each membrane. Still, rather objects in a single membrane can be allocated to threads in dissimilar thread blocks, and objects from distinctive membranes can be allocated to threads in one thread block. The new representation and the proposed method have the following advantages: (1) The computational workload associated with the membranes is distributed between thread blocks such that the workload is balanced, and higher performance is achieved. (2) It is possible to decrease communication between threads within one thread block.
(3) It is probable to automatically allocate membranes that demand to communicate with each other to the same thread block. In some instances, membranes with more significant dependencies are assigned to the same thread blocks to decrease inter-block communication, despite the possibility that this could decrease GPU occupancy.

Proposed Approach
Here, we represent membrane systems as matrices that can be divided into sub-blocks to balance the number of threads used in GPU thread blocks [35,36]. The objects in the membranes are subsequently assigned to matrix entries (Figure 4), thereby increasing the efficiency with which the matrix allocates the threads in the thread blocks. In this condition, the quantity of threads in each thread block is not reliant on the quantity of objects inside each membrane. Still, rather objects in a single membrane can be allocated to threads in dissimilar thread blocks, and objects from distinctive membranes can be allocated to threads in one thread block. The new representation and the proposed method have the following advantages: 1) The computational workload associated with the membranes is distributed between thread blocks such that the workload is balanced, and higher performance is achieved.
2) It is possible to decrease communication between threads within one thread block.
3) It is probable to automatically allocate membranes that demand to communicate with each other to the same thread block. In some instances, membranes with more significant dependencies are assigned to the same thread blocks to decrease inter-block communication, despite the possibility that this could decrease GPU occupancy.   This paper proposes mapping membrane objects to different matrices (Figure 4), allowing several matrices (sub-matrices) to be assigned to threads in thread blocks. It is more efficient to map objects that share dependencies to the same entry positions within different matrices, given that matrix entries at the same position are assigned to the same thread. Therefore, objects that share dependence are mapped to the same thread, decreasing the communication between threads and reducing synchronization time.
An example of a mapping involving three matrices is illustrated in Figure 4 and described as follows. Threads in a thread block can be represented as a two-dimensional matrix with NTx × NTy threads within N Bx × N By thread blocks. Objects a i,j , b i,j , and c i,j (I = 0, . . . , n-1; j = 0, . . . , m-1) share dependence and are mapped to three different matrices in the same entry position, (i,j), for execution within a single thread, (i,j). Membranes, m, having n objects of dependent variables, a i,j , b i,j , and c i,j , and map those objects by assigning them to a single row of a given matrix. Consequently, for m membranes with n dependent objects, a i,j , b i,j , and c i,j , each dependent object is mapped to three m × n matrices in the same entry position, (i,j), for execution on a single thread, (i,j). To avoid initiating multiple matrix operations simultaneously, it is possible to consider these matrices as a single matrix virtually divided into three sub-matrices.
To maximize occupancy, active threads, and the warps per SM should be equal to the maximum warps per SM, MaxWarp sm , and the maximum number of threads per SM, MaxThr sm [34]. The maximum number of resident thread blocks per SM, MaxBlk sm , is also limited depending upon GPU computational capability [34]. Therefore, the minimum number of threads per thread block necessary to achieve high or full occupancy, MinThrBlk HighOccup , is calculated as: With MinThrBlk HighOccup , the threads in each thread block achieve high occupancy; however, when the number of threads per thread block increases, more dependent threads can be allocated to a thread block, decreasing inter-block communication. In this case, the active threads per thread block can be maximized to the number of resident threads per thread block allowed, MaxThrBlk, based on GPU computational capability [34]. Therefore, to achieve high occupancy, the number of active threads per thread block, Thr Blk , equal to N Tx × N Ty (Figure 4) can be determined as follows: The proposed approach for balancing occupancy and synchronization between threads and thread blocks (Balancing_Occup_Sync_Approach) is described below.
Step 1: Assign objects and membranes to a matrix and assign each sub-matrix, N Tx × N Ty , in Equation (2) (Figure 4) to each thread block to achieve high occupancy.
Step 2: After assigning the membrane system to a matrix, if the communication rate between threads or thread blocks is high: According to the object-and membrane-weighted network (described later), dependent objects are assigned to the same entry positions within different sub-matrices, and more than one sub-matrix is assigned to each thread block (Figure 4), allowing objects having shared dependencies to be assigned to the same thread. The quantity of sub-matrices per thread block, NumSubMat Blk , is equal to the quantity of objects per thread, Obj Thr , and can continue to increase without lowering occupancy. This number is represented as: The MaxDep Obj represents the maximum number of objects that share dependence, and ObjThr HighOccup represents the maximum quantity of objects that can be allocated to every thread while maintaining high occupancy. The number of membranes assigned to the thread block should be close to the estimated number of membranes capable of being assigned to the same thread block while maintaining 100% occupancy.
The ObjThr HighOccup and the MembBlk HighOccup will be determined in Equations (5) and (9). To improve the assignment of dependent objects and membranes, the weighted network described here allows determination of the degree of dependency between objects and membranes to identify the best assignment.
Step 3: If the communication rate between thread blocks (in this step, communication between threads within the same thread block is not considered) remains high after the previous two steps: Assign additional objects sharing dependency exceeding ObjThr HighOccup from Equation (3) to the same threads to enable the assignment of more dependent membranes to the same thread blocks and potentially decrease communication rates. Assigning objects exceeding ObjThr HighOccup to each thread decreases occupancy. It decreases the rate of communication and synchronization time between thread blocks. However, this sacrificing occupancy is useful to improve performance because the thread block synchronization is a too time-consuming process. However, decreasing the communications between threads inside the same thread blocks by sacrificing occupancy is not recommended.
Step 4: If Step 3 does not decrease communication rate and synchronization between thread blocks, then kernel re-invocation is required. In this case, it is more advantageous to keep occupancy high, using Step 2 to decrease the communication rate as much as possible without sacrificing occupancy. However, to maintain adequate synchronization and communication between thread blocks, kernel re-invocation is necessary.
The ObjThr HighOccup in Equation (3) represents the maximum number of objects per thread when occupancy remains high and is determined by: The number of objects capable of being assigned per thread to achieve high occupancy (Equation (5)) is constrained by the available shared memory, ObjThr ConsSh , and the limited amount of shared memory per SM, MaxSh sm . The maximum number of threads per SM necessary to achieve high occupancy, MaxThr sm [34], and the average amount of shared memory utilized by each object, AveSh Obj , is represented as: If the objects exceed the number of SM registers (spilled registers) stored in local memory, performance will decrease relative to the slow local-memory access. Therefore, it is necessary to consider the limitations associated with the number of registers per SM. The number of objects capable of being assigned per thread to achieve high occupancy (Equation (5)) is constrained by the available registers, ObjThr ConsReg , and the limited number of registers per SM, MaxReg sm . The number of threads per SM necessary to achieve the high occupancy, MaxThr sm , depends upon GPU computational capability [34]. The average number of registers utilized by each object, AveReg Obj , is represented as: The number of objects capable of being assigned per thread to achieve high occupancy (Equation (5)) is constrained by the available local memory, ObjThr ConsLoc , and the limited amount of local memory per thread, MaxLoc sm [34]. The maximum number of threads per SM necessary to achieve high occupancy, MaxThr sm [40], and the average amount of local memory utilized by each object, AveLoc Obj , is represented as: The AveSh Obj from Equation (6), the AveReg Obj from Equation (7), and the AveLoc Obj from Equation (8) are estimated by assigning each membrane object to a single thread, every membrane to a single thread block, and compiling the code using CUDA-C following management of compiler parameters. These parameters include the number of registers per thread, AveReg Obj , the spilled registers stored in local memory per thread, AveLoc Obj , and the amount of shared memory per SM, which should be divided by the number of objects assigned to each SM to obtain AveSh Obj . For MembBlk HighOccup , the same calculations can be undertaken from Equation (4) through Equation (8) for ObjThr HighOccup from Equation (3) as: The maximum shared memory per SM, MaxSh sm , and the maximum number of registers per SM, MaxReg sm , depend upon GPU computational capability [40]. Blk SM can be obtained by dividing the maximum possible resident threads per SM, MaxThr sm , by the number of active threads per thread block, Thr Blk , giving N Tx × N Ty from Equation (2). The average amount of shared memory consumed per membrane, AveSh Memb , and the average quantity of registers utilized per membrane, AveReg Memb , can be estimated by assigning each membrane object to one thread, every membrane to a single thread block, and compiling the code using CUDA-C following management of compiler parameters.
Object-and membrane-weighted network: A weighted network is employed to appropriately assign objects having shared dependencies and membranes from Steps 2 and 3 of the Balancing_Occup_Sync_Approach. This network, a p → b, represents a reaction on the object a that occurs at the rate, p, and affects the amount of object b. For example, if object a evolves according to the evolution rule, a 100 → b 3 assumes that in each step, the number of objects a increases by one, and after 100 steps (with a rate of 0.01 per step), there would be 100 copies of the object a and the rule would evolve and generates three copies of object b. Therefore, in the weighted network, the transition is written as a 0.01 − − → b. In Step 2, each NumSubMat Blk or Obj Thr of objects from Equation (3) is assigned to each thread to improve performance. The Obj Thr objects with higher levels of shared dependency within the weighted network are assigned to the same entry of sub-matrices and assigned to the same thread ( Figure 5). These parameters include the number of registers per thread, AveRegObj, the spilled registers stored in local memory per thread, AveLocObj, and the amount of shared memory per SM, which should be divided by the number of objects assigned to each SM to obtain AveShObj. For MembBlkHighOccup, the same calculations can be undertaken from Equation (4) through Equation (8) for ObjThrHighOccup from Equation (3) as: The maximum shared memory per SM, MaxShsm, and the maximum number of registers per SM, MaxRegsm, depend upon GPU computational capability [40]. BlkSM can be obtained by dividing the maximum possible resident threads per SM, MaxThrsm, by the number of active threads per thread block, ThrBlk, giving NTx × NTy from Equation (2). The average amount of shared memory consumed per membrane, AveShMemb, and the average quantity of registers utilized per membrane, AveRegMemb, can be estimated by assigning each membrane object to one thread, every membrane to a single thread block, and compiling the code using CUDA-C following management of compiler parameters.
Object-and membrane-weighted network: A weighted network is employed to appropriately assign objects having shared dependencies and membranes from Steps 2 and 3 of the Balancing_Occup_Sync_Approach. This network, → , represents a reaction on the object a that occurs at the rate, p, and affects the amount of object b. For example, if object a evolves according to the evolution rule, → assumes that in each step, the number of objects a increases by one, and after 100 steps (with a rate of 0.01 per step), there would be 100 copies of the object a and the rule would evolve and generates three copies of object b. Therefore, in the weighted network, the transition is written as . ⎯ . In Step 2, each NumSubMatBlk or ObjThr of objects from Equation (3) is assigned to each thread to improve performance. The ObjThr objects with higher levels of shared dependency within the weighted network are assigned to the same entry of sub-matrices and assigned to the same thread ( Figure 5). To organize related objects within the same groups, begin with those objects having the highest communication rates in the weighted network ( Figure 5) and assign them to the same group until the number of objects per thread does not exceed that from Equation (3). For example, according to the rates associated with the objects in Figure 5, a and f (rate 0.95) and f and c (0.9) are assigned to the first group, then b and x (0.9) are assigned to the second group, given that there is no link between b and f. This is followed by d and k (0.85) being assigned to the third group. Subsequently, f and z (0.8) and To organize related objects within the same groups, begin with those objects having the highest communication rates in the weighted network ( Figure 5) and assign them to the same group until the number of objects per thread does not exceed that from Equation (3). For example, according to the rates associated with the objects in Figure 5, a and f (rate 0.95) and f and c (0.9) are assigned to the first group, then b and x (0.9) are assigned to the second group, given that there is no link between b and f. This is followed by d and k (0.85) being assigned to the third group. Subsequently, f and z (0.8) and c and e (0.8) will be assigned to the first group. Up to this point, b and x were assigned to the second group, and d and k were assigned to the third group. Now, object b from the second group and object d from the third group (0.75) are classified in the same group, and the third group is dissolved. This continues until all objects are classified according to their communication rates in the weighted network and the limitation to the number of objects per thread from Equation (3) has been reached.
We have also defined a membrane-weighted network ( Figure 6). Here, [ ] i p → [ ] j denotes a membrane, i, communicating with a membrane, j, at the rate, p. Similar to the object-weighted network, membranes that have higher communication rates are organized within the same group until Equation (4) is satisfied. Each group is then assigned to sub-matrices, N Tx × N Ty , from Equation (2) to be executed in the same thread block (Figure 4). We have also defined a membrane-weighted network ( Figure 6). Here, [ ] → [ ] denotes a membrane, i, communicating with a membrane, j, at the rate, p. Similar to the object-weighted network, membranes that have higher communication rates are organized within the same group until Equation (4) is satisfied. Each group is then assigned to sub-matrices, NTx × NTy, from Equation (2) to be executed in the same thread block (Figure 4).  (4), and the membranes assigned to the sub-matrices, NTx × NTy, will be executed in the same thread block.
Previous approaches mapped one membrane to one thread block without considering the communication between membrane objects [30][31][32][33] (Figure 3). Given that each membrane was assigned to one block, communication among membranes necessitated re-invocation of separate kernels and decreased performance. According to Balancing_Occup_Sync_Approach, a trade-off is made between maintaining high occupancy and reducing communication rates. Object-and membrane-weighted networks are also proposed to automatically classify objects and membranes having shared dependencies within the same threads and thread blocks. These methods enhance algorithm performance by balancing occupancy, decreasing communication rates between thread blocks, and executing dissolution and dividing rules within the same kernel, thereby minimizing the need to re-invoke separate kernels.

Proposed Algorithm
The proposed algorithm for GPU simulation of active membrane systems is presented below (Figure 7). This algorithm represents membrane systems as matrices (Figure 7, line 0). Membrane objects are assigned to one matrix, with sub-matrices, NTx × NTy, from Equation (2), allowing maintenance of high occupancy levels due to their being grouped within individual thread blocks ( Figure 4). The proposed approach uses Balancing_Occup_Sync_Approach to appropriately classify objects and membranes according to their communication rates by applying methods associated with an object-and membrane-weighted networks to the sub-matrices. This allows assignment of membranes having shared dependencies to the same thread block, enabling division and communication rules to be performed within the same kernel without needing to invoke others (Figure 7, lines 9, 14, and 19 are performed in the same kernel for selection and evolution rules, i.e., Kernel_Sel_Exe_allKindRules). When limited resources preclude assignment of all membranes having shared dependence on the same thread block, portions of the membranes can be assigned in ways  (4), and the membranes assigned to the sub-matrices, NTx × NTy, will be executed in the same thread block.
Previous approaches mapped one membrane to one thread block without considering the communication between membrane objects [30][31][32][33] (Figure 3). Given that each membrane was assigned to one block, communication among membranes necessitated re-invocation of separate kernels and decreased performance. According to Balancing_Occup_Sync_Approach, a trade-off is made between maintaining high occupancy and reducing communication rates. Object-and membrane-weighted networks are also proposed to automatically classify objects and membranes having shared dependencies within the same threads and thread blocks. These methods enhance algorithm performance by balancing occupancy, decreasing communication rates between thread blocks, and executing dissolution and dividing rules within the same kernel, thereby minimizing the need to re-invoke separate kernels.

Proposed Algorithm
The proposed algorithm for GPU simulation of active membrane systems is presented below (Figure 7). This algorithm represents membrane systems as matrices (Figure 7, line 0). Membrane objects are assigned to one matrix, with sub-matrices, NTx × NTy , from Equation (2), allowing maintenance of high occupancy levels due to their being grouped within individual thread blocks ( Figure 4). The proposed approach uses Balancing_Occup_Sync_Approach to appropriately classify objects and membranes according to their communication rates by applying methods associated with an object-and membrane-weighted networks to the sub-matrices. This allows assignment of membranes having shared dependencies to the same thread block, enabling division and communication rules to be performed within the same kernel without needing to invoke others (Figure 7, lines 9, 14, and 19 are performed in the same kernel for selection and evolution rules, i.e., Kernel_Sel_Exe_allKindRules). When limited resources preclude assignment of all membranes having shared dependence on the same thread block, portions of the membranes can be assigned in ways that enhance thread-block communication rates. In these cases, inter-block communication occurs using separate kernels (Figure 7, lines 28, 30, and 32), but at decreased rates relative to previous approaches (Figure 3). When q inter-block communications are occurring for MaxIter steps, the time consumed for these processes is kept in an array (Array_Int_Blk_Com_Iter[q] = {Iteration of first inter-block communication, Iteration of the second inter-block communication, . . . , Iteration of the q th inter-block communication}) [41]. Notice that for inter-block communication, the accurate time would be established by tracking the communication, aside from the dissolution and division rules that ensue among thread blocks. Entire thread blocks would be dismissed and synchronized by kernel invocation before initiation of thread-block communication. When it is not possible to determine the number of iterations necessary for thread-block communication, kernels invocation in any iteration are recommended.  GPUs execute threads in warps in which each warp is a group of 32 threads. Thread discrepancy appears while threads in a warp follow dissimilar implementation paths as compared to others. Thus, when allocating objects to groups in which every group allocated to one thread, every group of 32 having a similar path would be allocated to 32 threads in a single warp whenever achievable to prevent branch deviation. In cases where branch deviation is inevitable, several approaches could be implemented to decrease its consequence, including branch distribution [42,43] and runtime data remapping across multiple warps [44]. In the algorithm presented here, dependent objects are assigned to the same thread whenever possible (Figure 7, line 0). This decrease both communication rates and GPUs execute threads in warps in which each warp is a group of 32 threads. Thread discrepancy appears while threads in a warp follow dissimilar implementation paths as compared to others. Thus, when allocating objects to groups in which every group allocated to one thread, every group of 32 having a similar path would be allocated to 32 threads in a single warp whenever achievable to prevent branch deviation. In cases where branch deviation is inevitable, several approaches could be implemented to decrease its consequence, including branch distribution [42,43] and runtime data re-mapping across multiple warps [44]. In the algorithm presented here, dependent objects are assigned to the same thread whenever possible (Figure 7, line 0). This decrease both communication rates and synchronization time between threads compared to previous approaches that failed to consider shared dependence between objects before the thread assignment. Initial parameters are also assessed before execution to ensure that sufficient memory is allocated (Figure 7).
The initial information required concerning the active membrane system, π = (O, H, µ, w 0 ,..., w m , R), includes initial multisets, (w 0 ,...,w m ), the number of regions or membranes, m, membrane structure, µ, a group of rules, R, and MaxIter, a maximum number of iterations,. Halted conditions are encountered when NumIter, the number of iterations achieves MaxIter, or the configuration achieves a termination position (a condition in which no additional rules can be implemented).
Inputs are transmitted from host (CPU) to device (GPU), where the device begins execution of computations in parallel. This process comprises two main steps: rules selection and execution. When rules are selected (Figure 7, lines 5 and 6), the relevance of every rule is examined, including the accessibility of every object participating in the left-hand side of the rule (the reaction object). Additionally, for those rules requiring single or additional mutual objects, a single rule is selected non-deterministically, given that each membrane object can be used by at most one rule. Finally, all rules that are applicable are added to a list, ListOfSelectedRules, for the application in the rule-execution step. In the rule-execution step, all relevant rules are activated, and the quantity of objects, membranes, and other variables are revised consistent with the outcomes associated with rule implementation (while there is no inter-block communication and no demand for kernel invocation, this step implemented from lines 7 to 24 ( Figure 7); else, this step is implemented from lines 7 to 34 (Figure 7).
Using this method, it is feasible to implement membranes and objects having shared dependencies (membranes that interact amongst each other or membranes produced through division rules) from within a similar thread block. Consequently, communication between thread blocks and kernel invocation is unnecessary, allowing the execution of rules incorporating evolution, communication ("in", "out"), dissolving, and division to be implemented using the similar kernel (Figure 7, Kernel_Sel_Exe_allKindRules) and intensifying execution. Both selection and execution steps for all rules can be accomplished in Kernel_Sel_Exe_allKindRules (Figure 7, lines 4 to 24) until a termination criterion is reached. Previous methods required the invocation of separate kernels before executing communication, dissolution, and division rules (Figure 3).
For dissolving rules (Figure 7, lines 18 to 20), the algorithm estimates the number of resources that will be discharged following membrane dissolution. While a single membrane is dissolved, the assigned threads will be idle. If the quantity of unemployed threads following dissolution is sizeable, it is advisable to re-invoke the kernel with no launching of unemployed threads or blocks. Else, the program resumes with no kernel re-invocation. The benefit of the proposed method is that the assignment of multiple membranes to one thread block enables other membranes to continue to exist in a given thread block following the dissolution of others. Previous methods wherein one membrane was assigned per thread block resulted in thread blocks being unused following the dissolution of the associated membrane.

Results and Discussion
The simulations designated in this section were implemented on a computer with an Intel Core i7-3820 and an NVIDIA GeForce GTX680 GPU. The code was designed using NVIDIA CUDA 4.1 and Visual C++ 2008. The program comprises of two parts: the host (CPU) and the device (GPU). The host/CPU portion of the code is responsible for controlling program execution, allocating memory in the host or device, and obtaining results from the device. GPU processing is scaled in seconds, and MaxIter is set high enough to assure transient factors do not affect GPU speed. Simulation timescales for different numbers of objects with the same iterations vary from milliseconds to hours. Therefore, various variable iterations are used to achieve the same CPU (sequential) time for all object combinations that have been considered (Table 1). However, a similar speedup is observed with MaxIter set to 10,000 for object combinations on the GPU, indicating that speed does not depend on MaxIter.
Simulations were undertaken to show the effect of mapping objects to different matrices (Figure 4) to decrease communication rates between threads. To perform this simulation, previous approaches were followed for defining a membrane system having objects with shared dependencies [31]. The membrane system with active membranes is defined as π = (O, H, µ, ω 0 , . . . , ω m , R), with m elementary membranes located within the skin along with the set of rules and: The preliminary 3 × n objects (specifically, a i , b i , c i ; i = 0, . . . , n − 1) inside of the membrane ω i , 0 ≤ i < m, are produced arbitrarily, the quantity of objects within the skin, ω m , is zero, and dependencies exist between objects a i , b i , c i ; i = 0, . . . , n − 1. This membrane system allows control of the number of rules, n, and the number of different types of objects, 3n, in each membrane to measure algorithm performance against these variables. Furthermore, it is also possible to change the number of membranes, m, for simulations. The initial multiplicities for objects, a i , b i , c i ; i = 0, . . . , n − 1, are randomly generated in each membrane, e.g., a 2 1 denotes that two objects from a 1 existing, meaning that the multiplicity of a 1 is 2. The number of each object is a random integer between 0 and U nOb , where U nOb is a user-defined integer (for these simulations, U nOb = 10). For example, given six objects, ., a 0 , b 0 , c 0 , a 1 , b 1 , c 1 ), can be available in each membrane ( Table 1). The initial multiplicity of each object is a random number between zero and U nOb , where a multiplicity equal to zero signifies that the object is not available in the membrane in the initial step. The number of membranes, m, is 16,382 but is capable of being set differently. This system contains dependencies between objects, a i , b i , c i , since producing objects a 2 i b 4 i c 6 i according to the rules, i h , depends upon consuming a i b 2 i c 3 i . These rules can be applied in each step, given that the objects required for their application are available and the communication rates between a 0 , b 0 , c 0 , a 1 , b 1 , c 1 and a i , b i , c i are one. Simulating membrane systems over small, sequential timespans enables discovery of the average application rate associated with each rule per step, which can subsequently be used to obtain the communication rate between objects and measure the classification for parallel implementation.

Comparison Between Previous and Proposed Methods
Comparisons between previous methods (Figures 2 and 3) and the proposed method are undertaken by mapping several matrices (Figures 4 and 7) and using the classifications of dependent objects within the active membrane systems as illustrated (Table 1, Figure 8).
The size of the block is a vital aspect of achieving improved performance in algorithms-by-blocks. Sub-optimal performance occurs when the unconventional techniques are utilized to minimize data transfers, to enhance data affinity and an incorrect selection of parameters. In this study, the maximum value for the block size is a trade-off between several aspects such as the capability of parallelism in which smaller block size is transformed into a better concurrency, and a larger block component results in less concurrency.  (10).

Iterations, Objects, CPU Time
Previous Approaches on the GPU (Figure 3)   According to Table 1, for 6, 12, and 24 objects (threads), one warp was required, for 48 objects, two warps were necessary, for 96 objects, three warps were needed, and for more than 96 objects, four warps were required, resulting in occupancies of 25%, 50%, 75%, and 100%, respectively. With one membrane allocated to each thread block, when the quantity of objects per membrane increases, both the active threads per thread block and the occupancy increase due to the increased speed. Assumed that the collection of implementation units accessible within the system, the size of the block significantly affects the degree of parallel implementation on the sub-problems and decreases the unoccupied times. In data transfer effectiveness, a small block size is transformed into a larger quantity of data transmissions of a condensed dimension. On the contrary, a greater block dimension transforms into better efficient bandwidth and consequently benefits the ultimate performance of the application. Therefore, considering merely the possible parallelism aspect, the optimum block size somewhat relies on the entire matrix dimension.
Since there is no inter-block communication present in the membrane system described in Equation (10), the number of active threads per thread block, ThrBlk, and the dimensionality of the sub-matrices according to Table 1, Equations (1) and (2) is NTx × NTy = ThrBlk = MinThrBlkHighOccup = 128. The number of active threads per thread block does not depend on the quantity of objects per membrane. When the number of objects is low, more than one membrane is assigned to each thread block to achieve MinThrBlkHighOccup = 128 active threads per thread block. For this reason, occupancy, regardless of the quantity of objects is 100% (when adequate membranes are available), as opposed to results seen from previous approaches.
The number of objects per thread, ObjThr, equal to NumSubMatBlk can be obtained from Equation  Previous approaches assigned membrane objects to one thread block. Therefore, the number of assigned membranes per thread block for all objects was 1, and the maximum number of active membranes per thread was equal to the number of objects per membrane (Table 1). For a GPU with a computational capability of 3, the maximum number of threads per SM is 2048 (MaxThr sm = 2048), the maximum warp per SM is 64 (MaxWarp sm = 64), and the maximum number of thread blocks per SM is 16 (MaxBlk sm = 16). Therefore, the MinThrBlk HighOccup from Equation (1) is 128. Each warp contains 32 threads. The occupancy is the ratio of active warps per SM to the highest achievable warps that can be resident to the SM. To achieve maximum occupancy, at least four warps should be active per thread block (MaxWarpsm/MaxBlk sm = 4).
According to Table 1, for 6, 12, and 24 objects (threads), one warp was required, for 48 objects, two warps were necessary, for 96 objects, three warps were needed, and for more than 96 objects, four warps were required, resulting in occupancies of 25%, 50%, 75%, and 100%, respectively. With one membrane allocated to each thread block, when the quantity of objects per membrane increases, both the active threads per thread block and the occupancy increase due to the increased speed. Assumed that the collection of implementation units accessible within the system, the size of the block significantly affects the degree of parallel implementation on the sub-problems and decreases the unoccupied times. In data transfer effectiveness, a small block size is transformed into a larger quantity of data transmissions of a condensed dimension. On the contrary, a greater block dimension transforms into better efficient bandwidth and consequently benefits the ultimate performance of the application. Therefore, considering merely the possible parallelism aspect, the optimum block size somewhat relies on the entire matrix dimension.
Since there is no inter-block communication present in the membrane system described in Equation (10), the number of active threads per thread block, Thr Blk , and the dimensionality of the sub-matrices according to Table 1, Equations (1) and (2) is NTx × NTy = Thr Blk = MinThrBlk HighOccup = 128. The number of active threads per thread block does not depend on the quantity of objects per membrane. When the number of objects is low, more than one membrane is assigned to each thread block to achieve MinThrBlk HighOccup = 128 active threads per thread block. For this reason, occupancy, regardless of the quantity of objects is 100% (when adequate membranes are available), as opposed to results seen from previous approaches.
The number of objects per thread, Obj Thr , equal to NumSubMat Blk can be obtained from Equation (3). In Equation (3), ObjThr HighOccup is calculated from Equation (5). In the simulation, when a membrane with 128 objects is assigned to 128 threads in every thread block, the amount of consumed shared memory per thread block is 2560 bytes (Sh Blk = 2560), the number of used registers per thread is 10 (Reg Thr = 10), and the amount of used local memory per thread is zero (Loc Thr = 0) (values were obtained by compiling the code in CUDA using the PTXAS option). The used shared memory per thread, Sh Thr , is obtained by dividing Sh Blk by the number of threads per thread block, giving Sh Thr = 2560/128 = 20 bytes. For the estimated memory usage and since one thread is assigned per object, AveLoc Obj = Loc Thr = 0, AveSh Obj = Sh Thr = 20 bytes, and AveReg Obj = Reg Thr = 10. The values MaxThr sm = 2048, MaxLoc sm = 512 KB, MaxReg sm = 64 KB, and MaxSh sm = 48 KB were obtained from [40], corresponding to a GPU computational capability of 3. By applying these values to Equations (6)-(8), ObjThr ConsSh = 2.4 KB, ObjThr ConsLoc = ∞ (unconstrained by this factor), and ObjThr ConsReg = 32. According to Equation (5), ObjThr HighOccup = 32. The objects sharing dependencies in the membrane system from Equation (10)  Given these results, n objects from each membrane are assigned to n/3 threads. According to Step 2 of Balancing_Occup_Sync_Approach (Figure 7, line 0), all three dependent objects are assigned to one thread. For a case involving six objects within a membrane, all six would be assigned to N Tx = 2 threads, meaning that N Ty = 64 membranes are needed for assignment to the same thread block, resulting in NTx × NTy ≈ 128 threads according to Equation (2). Communication between threads is a very time-consuming process. Since objects having shared dependencies can be executed on the same thread using the method described here, the communication rate between threads decreases, thereby increasing performance. For example, given 1536 objects, the increase in speed achieved by previous approaches and relative to sequential-computation methods is 8.4-fold. In contrast, the method described in this paper utilizing multiple matrices results in a 130-fold increase.
When the most favorable block size is restored for a particular matrix size, the advantages of the several implementations of the runtime are predominantly determined by the quantity of objects and membranes necessary to perform a given operation. For "Balancing_Occup_Sync_Approach" (Figure 7), the inputs (the number of objects, membranes, and rules) are determined when the membrane system is designed. If there are errors in these inputs, especially on the rules, the system could behave abnormally, and the expected output won't be generated. As a precaution to avoid errors on the input, we validated the algorithm before it was implemented on the GPU. In this study, the processing time depends on communication between objects and membranes that are triggered by the rules. As Table 1 and Figure 8 demonstrated, when there is maximum occupancy with minimal communication between thread blocks, time efficiency and performance improved.
In cases involving large numbers of objects per membrane, occupancies resulting from the application of either method eventually reach 100% (Table 1). In these situations, the impact associated with decreasing communication rates between threads through the use of matrices, wherein objects having shared dependencies are assigned to the same threads can be observed. Given 384 objects, the increase in speed relative to sequential processing methods and using previous algorithms is 8.2-fold compared to a 106-fold increase using the method described here. This translates to a 12.9-fold increase in processing speed achieved by this algorithm relative to the algorithm used previously, mainly based upon improving communication between threads without adversely impacting occupancy. GPUs are low-cost, low-power consumption, and high performance concerning conventional multiprocessors. Many current desktop computers have equipped with the GPU enable graphics cards, which can improve the performance of processing without additional costs [28,45]. Thus achieving speed up even around 5× by GPU can be valuable work.
The results are quantified in the last two columns of Table 1 and confirm our prediction for improvement because of decreasing communication between threads inside each membrane discussed in this paragraph.
The proposed object-weighted network, membrane-weighted network, and classification algorithms are not constrained and limited to the specific type of membranes. Membranes can have different objects with different sizes and various types of rules. Object classification can be applied to each membrane individually. The membrane classification can be applied between membranes with different sizes. Thus, proposed approaches are not only for some particular class and regular membrane systems.
The dimension of sub-matrices depends on the number of objects in the membrane. When the number of objects increases, the dimension of sub-matrices will also increase. Table 1 and Figure 8 shows that when the number of objects increases, the performance of the proposed approach is also improving. Typically, a better parallel speedup is obtained for large problem sizes. The efficiency of the parallel program increases with the size of the matrix.

Effects of the Number of Objects and Membranes
In this section, the significance of the quantity of objects in every membrane and the quantity of membranes in the system on the GPU effectiveness with previously implemented approaches ( Figure 3) are discussed. It is disclosed that while the quantity of objects per membrane and the quantity of membranes in an active membrane system are sizable, the performance in terms of time efficiency attained by the GPU related to a CPU is high. The reason is that once the quantities of objects and membranes increase, the utilization of the computational resource of the GPU also surges. Additionally, by seizing the benefit of the shared memory, the simulation speediness is amplified, because by gaining access to the shared memory, it is significantly quicker compared to the global memory. The quantity of membranes and the quantity of objects in every membrane can be altered.
Consequently, the proportion of the utilization of computational resources could also be altered. To use all the computational resources of the GPU, the quantity of objects and the quantity of membranes should be big. Simulation results (milliseconds) for a defined benchmark for 1000 iterations (MaxIter) on a computer with an Intel Core-i7-3820, 3.60-GHz CPU with 8 GB RAM and an NVIDIA GTX 680 GPU are illustrated in Tables 2 and 3. Table 3. Comparison of CPU and GPU performance for different numbers of membranes for a model with 512 objects [41]. As disclosed in Table 2, while the quantity of objects n in every membrane m, is small, for instance, n = 2 and m = 8192, the CPU demonstrates better performance with shorter implementation time compared to the GPU. When the quantity of objects is increased to, for instance, n = 512, subsequently, the GPU has improved performance compared to CPU. This investigation utilizes the benefit of shared memory that faster compared to when operating merely with global memory. With shared memory, improved performance can be attained. Objects, intermediary outcomes for regulating the relevance of rules, and consequences of rule implementations that have an effect on objects are kept in shared memory. Notice that while accessible shared memory is not sufficient for keeping objects and intermediary outcomes of employing rules in a thread block, it is prone to distribute data in global memory into smaller sub-blocks so that in each phase, one sub-block loaded into shared memory can utilize the fast-memory-access benefit of shared memory. Table 3 illustrates the effect of the number of membranes on GPU performance. When the number of membranes increases, then the usage of GPU computational resources increases also, and the GPU has better performance than the CPU.

Conclusions
Previous studies mapped each membrane and its corresponding objects to one thread block and the associated threads, resulting in decreased efficiency when the number of membrane objects was smaller than the number of threads per thread blocks required to achieve maximum occupancy. Kernel invocation, which is a time-consuming process, was needed when communication and division rules happen between membranes. In previous approaches, each membrane was assigned to one thread block, and synchronization required re-invocation of separate kernels. Here, active membrane systems were modeled as matrices divided into sub-blocks according to the described Balancing_Occup_Sync_Approach and using object-and membrane-weighted networks to match the number of threads in each thread block, ensuring full thread utilization and decreased communication rates between threads and thread blocks. This approach alleviates problems encountered in previous studies by assigning several membranes to one thread block when the number of membrane objects is small. Our method increases processing speeds 93-fold as compared to sequential methods simulating active membrane systems with 48 objects (Table 1). Given the many similarities between membrane systems, the algorithm presented in this paper for active membrane systems involving GPUs should be transferable for application in other membrane-system variations. Moreover, the concepts described and associated with weighted-network classification can also be used in different fields to improve GPU simulations. We acknowledge the essentials of analyzing the gaps between theoretical speedup limitations and experimental results of GPU implementation of membrane computing. Still, unfortunately, it wasn't under the scope of this study, and it will be considered in our future works.