Parallelization of Modified Merge Sort Algorithm

Modern architectures make possible the development of new algorithms for large data sets and distributed computing. The newly proposed versions can benefit both from faster computing on multi-core architectures and from intelligent programming techniques that use efficient procedures available in the latest programming studios. A frequently used algorithm for sorting arrays of data in NoSQL databases is merge sort, where by NoSQL we understand any database without a typical SQL programming interpreter. The author describes how to parallelize the sorting process of the modified merge sort method for large data sets. The subject of this research is the claim that the parallelized sorting method is faster and beneficial for multi-core systems. The presented results show how the number of processors influences the sorting performance. The results are derived from theoretical assumptions and confirmed in practical benchmark tests. The method is compared to other sorting methods, such as quick sort, heap sort, and merge sort, to show its potential efficiency.


Introduction
Computer technology is constantly being developed, and new architectures are introduced to the market. The design of machines with multiple cores involves programming for many logical processors working independently. The software is oriented toward a dedicated separation of concerns for efficient information processing. These aspects make it possible to manage even more information in various databases. Algorithms that are very helpful in any database system are the various sorting methods. Classic versions that began research on possible developments were presented in [1,2]. We can distinguish three main types of sorting algorithms: quick sort, heap sort, and merge sort. Along with new processors, new, modified versions of these algorithms were presented. Among these developments, we can find particular modifications for selected data, improved procedures to avoid deadlocks, and new structures that made it possible to increase the speed of sorting.
Quick sort is based on divisions of the data stack, in which each part is processed to organize elements in relation to a selected divider. Many versions of this method show possible improvements for speeding up the process and preventing deadlocks. A dedicated pivot mechanism for faster stack exchange was presented in [3]. Possibilities of using different partitioning positions were discussed in [4]. Some derivatives of these methods were also introduced. The introduction of a median value for division exchange was presented in [5]. This method was tested on various architectures and chipsets, and some results for a Sun Microsystems, Inc. machine were presented in [6]. Quick sort was examined extensively, and the research resulted in some new versions. A non-quadratic version of this method was proposed in [7]. A modification of the pivot procedure for sort alignments, presented as a new version of quick sort, was given in [8]. A multi-pivot version of quick sort was presented in [9].
Symmetry 2017, 9, 176
Heap sort uses a multilevel structure of data storage, where the relations introduced between successive levels influence the speed of sorting. Each change in the structure requires a procedure to insert elements into the heap. Mathematical models of the relations between levels of the heap were discussed in [10,11]. A discussion of the efficiency of various possibilities for this procedure was presented in [11]. Research on possible changes between levels of the heap was presented in [12], while a two-swap procedure for arranging elements into the heap was discussed in [13]. Tests on access efficiency were presented in [14], and a parallel version of this method was discussed in [15].
Merge sort uses the "divide and conquer" rule, which suggests dividing the input data into smaller parts that are sorted during subsequent operations of merging into one string. Sublinear merging was presented in [16]. A theoretical approach to the composition of the first parallel version of merge sort was presented in [17]. A derivative of this method for partially sorted strings was presented in [18]. Tests on practical implementations were discussed in [19], while tests on memory usage were presented in [20,21]. A derivative for more efficient input-output management was presented in [22], and tests on improved memory management by dynamic division assignment were discussed in [23]. Research on possible improvements in buffering and reading was presented in [24]. Possible enhancements to the merging procedure were discussed in [25]. The research on possible improvements was also based on extensive comparison to other methods. An analysis of merge sort in comparison to bubble sort was presented in [26]. An introduction to some new ideas for implementing merging procedures that can boost the algorithm was given in [27].
These three sorting methods were also mixed and combined to compose new algorithms for dedicated purposes. One very important aspect is memory management. In [28,29] it was discussed how the usage of virtual memory influences the speed of sorting. Benchmark tests on cache usage were discussed in [30]. Research on a dedicated method for skewed strings was presented in [31]. Some propositions of adaptive methods were also introduced [32]. Tests on possible variations in sorting rules were also discussed in [18]. Enhanced information management for big data systems was discussed in [33], while parallel approaches to sorting for big data systems were proposed in [34]. Markov chain rules for adaptive self-sorting were discussed in [35]. A proposition of a comparison-free method for sorting was presented in [36]. Other sorting methods are also continually being developed. Research on the insertion sort method was presented in [37], where a bidirectional insertion method of sorting was proposed instead of the classic version.

Related Works
Our research on faster and more efficient methods for sorting started with a proposition for a new method of assigning divided strings for quick sort. In [38] we have shown that the dynamic assignment of position in quick sort can speed up the method by about 10%. This change also prevents the deadlocks so often visible in the classic version. The results of the improved composition of the heap structure were presented in [39]. The proposed change in the alignment of the levels in the heap speeds up the method by about 5% to 10%. The first results of the improved merge sort were presented in [40]. We have shown that the dynamic application of the "divide and conquer" rule to various parts of the initial string can speed up the process by about 10%. Further research on derivatives of merge sort was discussed in [41]. We have proved that the non-recursive version of sorting is more flexible for implementations on various architectures. The results of our research were examined on Hadoop architectures [42].
In this article I would like to propose a derivative of the newly presented method. A non-recursive fast sort algorithm was described in [43]. This method was shown to sort about 10% to 15% faster than merge sort. The fast sort algorithm was composed as a new method for large data sets. In this work I would like to present the application of some ideas from fast sort to the parallel merge sort algorithm. The novelty of this approach is the separation of concerns for the implementation of sorting on multi-core architectures. Some introduction to this idea was presented in [18]. However, that was only a theoretical proposition of the division of tasks between various processors by the use of binary trees. This article presents a practical separation of concerns for the parallel merge sort algorithm. In [44] it was discussed how to parallelize the merge sort in its classic version; however, that proposition was given for double strings. The parallelized version of the modified merge sort algorithm proposed in this article is developed for dynamically assigned processors. Therefore, the improvements discussed here make the algorithm scalable to various multiprocessor architectures.
The research presented in this article benefits from a model composed for parallel sorting that is practically realized in C# in MS Visual Studio 2015 on MS Windows Server 2012. The algorithm proposed in this work allows the sorting of $n$ elements in time $2n - \log_2 n - 2$ using $n/2$ processors. To analyze the time complexity of this parallel algorithm, a Parallel Random Access Machine (PRAM) model was used in which only a single processor at a time may read or write a given memory cell. In the same way as in [43], tasks will be divided so that each processor performs operations on its allocated memory in the most efficient way.

Data Processing in NoSQL Database and Parallel Sort Algorithms
Big data describe a wide range of data collected in computer networks. Although these data have their own unique features, processing them requires their unification through a corresponding collection pipeline. Standard data are stored in collections in the form of records of fixed or variable length, but for analysis the data set is stored in the form of columns. In memory, columnar data are fragmented into smaller units distributed among the participating cores, so that parallelization is possible when running a query on the overall data, see Figure 1. The modified parallelization of the merge sort algorithm makes it possible to speed up the process of organizing large sets of data. To check the acceleration of this process when using multiple processors, tests were made that measured the Central Processing Unit (CPU) clock cycles (clock rate) and the sorting time. The tests allow for the evaluation of the efficiency of the process.

Statistical Approach to the Research on Algorithm Performance
For the statistical tests on the performance of this parallel method, measures similar to those in other works [38][39][40][41][43] were used. The arithmetic mean of all of the observed measures of CPU clock and sorting time can help to estimate performance. Statistically, this measure is equal to the mean value:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad (1)$$
and the standard deviation is defined by the formula:
$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}, \qquad (2)$$
where $n$ is the number of measurements $x_1, x_2, \ldots, x_n$, and $\bar{x}$ is the arithmetic mean of the sample. The analysis of sorting time and CPU clock was carried out in 100 benchmark tests for each of the fixed input sizes. The algorithm's stability in a statistical sense is best described on the basis of the coefficient of variation. The coefficient of variation is a measure that allows the determination of the value of diversity in the research. It is determined by the formula:
$$V = \frac{\sigma}{\bar{x}}, \qquad (3)$$
where we use the arithmetic mean (1) and the standard deviation (2). The coefficient of variation reflects the stability of the method in a statistical sense. The study was performed on collections of data containing from 100 elements up to 100 million elements, increasing the number of elements ten times with each new comparison. The results are presented in figures and discussed in the following sections.
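As an illustration of the three measures described above, the following minimal Python sketch computes the arithmetic mean, the standard deviation, and the coefficient of variation for a list of measurements. It assumes the sample form of the standard deviation (division by n - 1); the function names and the sample timings are hypothetical and are not taken from the benchmark data.

```python
import math


def arithmetic_mean(samples):
    """Arithmetic mean of the measurements x_1, ..., x_n."""
    return sum(samples) / len(samples)


def standard_deviation(samples):
    """Sample standard deviation (assumes the n - 1 denominator)."""
    mean = arithmetic_mean(samples)
    squared = sum((x - mean) ** 2 for x in samples)
    return math.sqrt(squared / (len(samples) - 1))


def coefficient_of_variation(samples):
    """Standard deviation divided by the arithmetic mean."""
    return standard_deviation(samples) / arithmetic_mean(samples)


# Hypothetical sorting times in ms, for illustration only.
timings_ms = [1.0, 2.0, 3.0]
print(arithmetic_mean(timings_ms))
print(standard_deviation(timings_ms))
print(coefficient_of_variation(timings_ms))
```

A small coefficient of variation means the measured times cluster tightly around the mean, which is the sense in which the method is called statistically stable.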

Parallel Modified Merge Sort Algorithm
Big data require algorithms with low computational complexity that allow tasks to be divided between multiple processors. A special role is played by the sort algorithms used to categorize information. The most frequently used algorithm to sort data in NoSQL databases is merge sort.
The primary issue in NoSQL databases is the ability to quickly organize information, which is necessary for the analysis of the information collected and the compilation of relevant reports. Custom information obtained from equal sources is stored in the form of records of fixed or variable length. Here arises a fundamental problem of data verification, for example, detecting duplicates of the same records and removing them. Figure 2 presents a simplified method of collecting information on the disk and searching for the required records. Information entered by the user is subjected to initial verification, based on the already ordered and saved records. To search for the information desired by the user, the records are sorted using the parallel modified merge sort. Then, the proposed modified division search for the location of the wanted records is performed and a specific report is prepared.
Let us look at the proposed modified merge algorithm for parallel processing. The proposed modification is that at each step of sorting we merge four sorted strings into one ordered sequence of elements, see Figures 3-5. To merge four strings, we use logical indexation of processors. The first processor merges the first two strings and saves the result in a temporary array. The index of the merged string is the same as the index of the first element of the first string. At the same time, a second processor starts merging the third and fourth strings. In the same way as the first processor, the second processor keeps the merged sequence of numbers in the temporary array, beginning to write at the index of the first element of the third string. All of the processors operate independently of each other and do not share the same memory resources. Separation of concerns for the execution of the parallel merge is established by the end-of-cycle condition of the parallel for loop: the loop finishes only when every processor has completed its merge. The duration of the process can therefore be defined by the formula:
$$T = \max_{1 \le i \le p} T_i,$$
where $T_i$ is the sorting time of the $i$-th processor, and $p$ is the number of processors participating in the parallel merge.
An example of the process in the first stage of the first step of sorting is shown in Figure 3. In this step, the merge uses $n/2$ processors for parallel execution of the sorting task.
In the next step of the algorithm, the information stored in the temporary array is merged for every two rows and stored again as a merged row in the input array. Parallelization of the process of merging $n/4$ strings is shown in Figure 4. In the next steps of the algorithm we merge in the same way all of the strings, which are enlarged four times in each iteration, see Figure 5.
Let us first notice that two sequences $x_1 \le \ldots \le x_t$ and $y_1 \le \ldots \le y_t$ of $t$ elements can be merged into one sequence $z_1 \le \ldots \le z_{2t}$ using the merge algorithm, making no more than $2t - 1$ comparisons of the elements of the sequences $X$ and $Y$.
In the first iteration, $t = 1$, the first $n/2$ processors concurrently merge two one-element strings by doing no more than $2 \cdot 1 - 1 = 1$ comparison for each processor. In fact, the time to complete the entire operation is the same as the duration of sorting for one processor. Then $n/4$ processors concurrently merge two two-element strings by doing no more than $2 \cdot 2 - 1 = 3$ comparisons for each processor.
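The comparison bound for merging two sorted strings can be checked with a small counting sketch. The Python function below is a hypothetical illustration (the name `merge_count` is not from the article); it performs a standard two-way merge, takes the element of the first string on ties, and returns the number of comparisons alongside the merged result.

```python
def merge_count(xs, ys):
    """Merge two sorted lists; return (merged list, number of comparisons)."""
    merged = []
    comparisons = 0
    i = j = 0
    while i < len(xs) and j < len(ys):
        comparisons += 1
        if xs[i] <= ys[j]:  # on equal numbers, take from the first string
            merged.append(xs[i])
            i += 1
        else:
            merged.append(ys[j])
            j += 1
    # One string is empty; the rest of the other is copied without comparisons.
    merged.extend(xs[i:])
    merged.extend(ys[j:])
    return merged, comparisons


# Interleaved strings of t = 4 elements need 2t - 1 = 7 comparisons.
print(merge_count([1, 3, 5, 7], [2, 4, 6, 8]))
# Strings that do not overlap need only t = 4 comparisons.
print(merge_count([1, 2, 3, 4], [5, 6, 7, 8]))
```

The loop runs while both strings are non-empty and consumes one element per comparison, so for two strings of t elements it can make at most 2t - 1 comparisons.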
Let us now consider the maximum time-of-operation formula, which can be derived by summing the maximum running time of each iteration of the sorting method. In each iteration $t$, $n/2^{2t-1}$ processors in the first step and $n/2^{2t}$ processors in the second step merge four strings of $4^{t-1}$ elements each by doing no more than:
$$\left(2 \cdot 4^{t-1} - 1\right) + \left(2 \cdot 2 \cdot 4^{t-1} - 1\right) = 6 \cdot 4^{t-1} - 2$$
comparisons. For $n = 4^k$, all operations performed simultaneously can be written in the form
$$T_{\max} = \sum_{t=1}^{k} \left(6 \cdot 4^{t-1} - 2\right) = 2 \cdot 4^{k} - 2k - 2.$$
We have $2k = 2 \log_4 n = \log_2 n$ and $2^{2k} = 4^k = n$, therefore we get:
$$T_{\max} = 2n - \log_2 n - 2,$$
which was to be proved.
The modified parallel merge algorithm was implemented in C# in MS Visual Studio 2015 on MS Windows Server 2012. The implementation uses the Parallel class from the System.Threading.Tasks namespace, which makes it possible to implement parallel loops. The parallel loop automatically assigns tasks to subsequent processors and waits for all tasks to finish before the loop completes. It should be remembered that this is not a usual iterative loop: the iteration index handled by a given processor is assigned in a different way. The algorithm was designed to iteratively allocate information to processors by the implemented modified merge. In Algorithms 1 and 2 we can see the code of the methods, while in Figures 6 and 7 we can see block diagrams that explain the proposed implementation.
The merge of two stacks compares their current numbers and writes the lower number to the merged sequence. In the case of equal numbers, the implementation uses the number from the first stack. After one of the stacks is emptied, the numbers remaining on the other stack are rewritten to the merged sequence. The proposed parallel sorting algorithm uses this merge in each step. Four strings of numbers are merged into one ordered string of numbers, see Figure 7. Each step of the algorithm is divided into two stages. First, all subsequent pairs of numeric strings are merged into the temporary array. Then, the merging is repeated for all subsequent pairs of numeric strings, which are written back to the input array. The merge process has been parallelized in every step in such a way that the first stage of the merge writes information to the temporary array, and this stage is executed by the specified number of processors.
The second stage may also be performed simultaneously by a fixed number of processors. The proposed parallel merge processes four numeric strings in order to use more CPUs than a simple selection algorithm. The largest element of the four strings is simply rewritten to the output string. The practical parallel calculation process uses the Parallel.For loop in C#, which iteratively assigns tasks to subsequent processors. In the implemented algorithm, the available processors are numbered from zero to the total number of allocated processors minus one. In this way, each processor can calculate the indexes of all the stacks and the number of elements on the stacks.
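The two-stage structure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the author's C# implementation: the function names are hypothetical, a `ThreadPoolExecutor` stands in for the Parallel.For loop, and waiting on every future plays the role of the end of the parallel loop. Each task writes only to its own slice of the destination array, so no locking is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_into(src, dst, lo, mid, hi):
    """Merge sorted src[lo:mid] and src[mid:hi] into dst[lo:hi]."""
    i, j, k = lo, mid, lo
    while i < mid and j < hi:
        if src[i] <= src[j]:  # on equal numbers, take from the first string
            dst[k] = src[i]
            i += 1
        else:
            dst[k] = src[j]
            j += 1
        k += 1
    while i < mid:  # copy what is left of the first string
        dst[k] = src[i]
        i += 1
        k += 1
    while j < hi:  # copy what is left of the second string
        dst[k] = src[j]
        j += 1
        k += 1


def parallel_stage(pool, src, dst, width):
    """Merge all neighbouring pairs of strings of the given width in parallel."""
    n = len(src)
    futures = []
    for lo in range(0, n, 2 * width):
        mid = min(lo + width, n)
        hi = min(lo + 2 * width, n)
        futures.append(pool.submit(merge_into, src, dst, lo, mid, hi))
    for future in futures:
        future.result()  # wait for every merge before the next stage starts


def parallel_modified_merge_sort(data, workers=4):
    """Non-recursive sort that merges four strings per step, in two stages."""
    a = list(data)
    tmp = [None] * len(a)
    width = 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while width < len(a):
            parallel_stage(pool, a, tmp, width)      # stage 1: input -> temporary
            parallel_stage(pool, tmp, a, 2 * width)  # stage 2: temporary -> input
            width *= 4                               # strings grow four times
    return a


print(parallel_modified_merge_sort([7, 3, 9, 1, 8, 2, 6, 4, 5]))
```

In CPython, threads do not execute Python bytecode on several cores at once, so this sketch shows the division of work and the synchronization points rather than the measured speed-up; a process pool or a language with native threads would be needed to reproduce the timing behaviour.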

The Study of the Parallelized Modified Merge Sort
A comparative study of the speed of the presented method was performed on MS Windows Server 2012 with an AMD Opteron 8356 8p processor produced by Advanced Micro Devices (AMD), Inc. The algorithm was implemented in C# in Visual Studio Ultimate 2015. Statistical analysis for 100 randomly generated samples was conducted for each input size, starting from an input set of 100 elements and increasing the size of the task 10 times with each test up to 100 million elements. Various layouts of the elements were examined. Similarly to [9], we examined random permutations of the integers from 1 to n, randomized selections of elements from 1 to sqrt(n), decreasing strings of integers from n to 1, increasing strings of integers from 1 to n, strings of n equal integers, and also special layouts that are hard to sort. A deeper analysis of these special examples of critical strings for quick sort was presented and discussed in our previous research [38].
Each sorting operation by an examined method was measured in time [ms] and in CPU usage represented in ticks [ti] of the CPU clock. These results are averaged over 100 sorting samples. The benchmark comparison is described in Tables 1 and 2 and Figures 8 and 9. Only mean times are presented in the tables. These comparisons show the analysis of the sorting time. Worst Case Execution Time (WCET) analysis is not necessary, since the given and proved theorem clearly shows the WCET, which occurs when all the comparison instructions are done for the sequence of integers sorted inversely. The proposed method does not suffer from any critical strings, unlike quick sort [38]. The best sorting time will be for the ordered sequence. The method will then carry out only half of the comparisons that are performed in the worst case. This is due to the proposed implementation of the algorithm that merges two strings of numbers. At maximum, the proposed algorithm performs 2n-1 comparisons for merging two n-element strings, but it must perform at least n comparisons to merge them. Therefore, we have the upper and lower limits of the sorting time of the algorithm, and in this case it is justified to perform statistical surveys on the mean time to show how the algorithm behaves on average.

[Figures 8 and 9 (plots not reproduced): sorting time [ms] and CPU operations [ti] against sample size for 1, 2, 4, and 8 processors.]

Analyzing results we can see that each new processor can give some additional power of computing. Most visible changes in operations are between 2 and 4 processors. With the usage of each new processor, the number of operations in the system is lower. Similarly, the operation time is shorter. These confirm the theoretical assumptions proven in Theorem I.
A comparison of the coefficients of variation for the parallelized modified merge sort is presented in Tables 3 and 4. The studies have shown the statistical stability of the sort algorithm for large data collections. Some changes in the coefficients for small input sizes stem from the fact that the system automatically performs additional operations, which caused longer sorting times for inputs of this cardinality.

Comparison and Analysis
Time analysis of sorting is an important element in identifying the effectiveness of the methods of sorting large data sets and NoSQL data bases.Let us compare the algorithm.For this we can assume that the duration of the method will be compared in a respect to one processor.Let us examine if the duration of sorting is shorter for the method using multiple processors.The results are shown in Figures 10 and 11.
We can see that with the introduction of each new processing unit it is possible to efficiently increase the performance of the method.Just two processors can shorten the time of sorting of about 20% to 30%.Each new processor gives additional possibility to decrease sorting time.This can be very important for multi-core architecture with several processors.This result is very important for large data sets.
If we compare the usage of processing power, similar conclusions can be drawn. Just two processors can shorten the sorting time by about 20% to 30%, and each new processor gives an additional possibility to divide the tasks and therefore speed up the sorting.
Figure 12 presents a comparison of sorting times for various methods. The results are presented for the quick sort from [38], the heap sort from [39], and the merge sort from [40]. The proposed parallel modified merge sort on 8 processors (PMMS8) was selected as the version for comparison. All the methods are presented in relation to the heap sort, and the difference is given as a percentage change of sorting time in [ms]. Comparing the results in Figure 12, we can see that the proposed PMMS8 is much more efficient for sorting large data sets than the other methods. The results show that quick sort is the least efficient of the examined algorithms. An additional disadvantage of the quick sort algorithm is the possibility of deadlocks, as shown in [38]. Merge sort, however, performs very similarly to quick sort for inputs of up to 100,000 elements; above this number it becomes more efficient, and above 1,000,000 elements it is faster than the heap sort algorithm. Additionally, merge sort does not have deadlocks, which makes this method more efficient. The parallel version of merge sort proposed in this article is the most efficient of the compared algorithms. The results show that this algorithm can sort large collections of data about 40% faster if we use just 8 processors. An additional advantage is that there is no possibility of deadlocks, as this method is a parallel derivative of the modified merge sort presented in [40].
We also theoretically estimate the results for additional processors, from which we can conclude the bounds on execution time with each new processor. Sorting time for one processor is $T_{\max} = n \log_2 n$. In the theorem we have proved that the maximum for $n/2$ processors is $T_{\max} = 2n - \log_2 n - 2$. Therefore, we are not able to decrease sorting time freely with each new processor. There is always a question of the computing power of the machine used for sorting operations.
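The two bounds, n log2 n for one processor and 2n − log2 n − 2 for n/2 processors, can be compared numerically. The following is only an illustrative Python sketch; the function names are hypothetical and are not taken from the article's C# implementation. It evaluates both expressions and their ratio, which grows roughly like (log2 n)/2, so even n/2 processors cannot reduce the bound arbitrarily.

```python
import math


def t_single(n: int) -> float:
    """Upper bound stated for one processor: n * log2(n)."""
    return n * math.log2(n)


def t_parallel(n: int) -> float:
    """Upper bound stated for n/2 processors: 2n - log2(n) - 2."""
    return 2 * n - math.log2(n) - 2


def bound_ratio(n: int) -> float:
    """How many times smaller the parallel bound is than the single-processor bound."""
    return t_single(n) / t_parallel(n)


if __name__ == "__main__":
    for n in (2**10, 2**20):
        print(n, t_single(n), t_parallel(n), round(bound_ratio(n), 2))
```

For n = 1024 the bounds are 10240 and 2036 operations, a ratio of about 5; for n = 2^20 the ratio is about 10. The ratio is a statement about the theoretical bounds only, not about measured sorting times.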

Conclusions
Studies have shown the effectiveness of the presented method for large data sets. Reductions in sorting time are clearly visible even for collections of under ten thousand elements, which makes it easier to sort data sets in NoSQL databases. An additional advantage of the proposed method is the absence of deadlocks: the method presented in this article is a parallel version of the modified merge sort, which was proved to have no deadlocks. As we have seen from the results, each additional processor can speed up the method, which is very important for large data sets. Since the proposed implementation uses a separation of concerns, the method can be implemented on architectures with multiple cores. Therefore, its practical efficiency will be visible in cloud computing, big data sets, and NoSQL systems.
The presented results show that the algorithm performs better with each new processor. Compared to other sorting methods, the proposed PMMS is from 0% to 80% faster across the examined input cardinalities. Unfortunately, we are not able to decrease sorting time freely. There is always a question about the performance of the machine. Other important factors are the operating system, the development environment, the compiler, and the programming language itself. All of these influence the algorithm. In the research, MS Windows Server 2012 was used with an AMD Opteron Processor 8356 8p, and the PMMS algorithm was implemented in C# in Visual Studio Ultimate 2015. This software was chosen as the most commonly used on the market. The computer architecture used for the research was the most powerful one within the author's research capabilities.
In the proposed analysis, a probabilistic model was not created and a regression curve was not computed, for a specific reason. Because the division of the information into aggregated blocks and the allocation of processors are predetermined and are not subject to any changes during the operation of the algorithm, a classic statistical analysis of the research is not needed. In our case, the presented discussion is sufficient to draw conclusions. The proposed method is independent of the input, unlike the quick sort method, for which deeper statistical analysis is necessary due to the various deadlocks and sequences that are hard to sort. In our case, as proven in the theorem, for any number of input sequences and any available number of processors there is an upper and a lower estimate of the algorithm's running time. This is confirmed by the study, as the coefficient of variation stabilizes at about 20%, which is in line with the theory of computational complexity for algorithms operating on numerical sequences.
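As a reminder of how a coefficient of variation is obtained from repeated measurements, the short Python sketch below assumes the usual definition: standard deviation divided by the mean. The timing values are hypothetical, not taken from the article's tables, and the choice of the sample standard deviation (rather than the population one) is also an assumption.

```python
import statistics


def coefficient_of_variation(samples):
    """Return the sample standard deviation divided by the arithmetic mean."""
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean


if __name__ == "__main__":
    # Hypothetical sorting times in [ms] for repeated runs on one input size.
    times_ms = [80.0, 100.0, 120.0]
    print(f"{coefficient_of_variation(times_ms):.0%}")  # prints 20%
```

A value that stays near the same level as the input grows indicates that the spread of running times scales together with the mean.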

Final Remarks
The article presents the Parallelized Modified Merge Sort (PMMS) algorithm for rapid sorting of large data sets. The proposed method is based on the PRAM (Parallel Random Access Machine) model, which gives each processor efficient read and write access to memory cells. Therefore, the implementation uses a separation of concerns to increase efficiency and to enable the dynamic assignment of processors to the performed operations. The method was implemented for parallel processing of information on machines with many processors. The PMMS was implemented in C# in MS Visual Studio 2015 on MS Windows Server 2012.
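The general idea of dividing the input into blocks, processing the blocks on separate workers, and merging the results pairwise can be sketched as follows. This is a generic block-based parallel merge sort in Python, not the author's PMMS and not the C# implementation. The function name and the `workers` parameter are hypothetical, blocks are sorted with Python's built-in `sorted`, and pairs are merged with `heapq.merge`.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def parallel_merge_sort(data, workers=4):
    """Sort a list by sorting blocks concurrently, then merging pairs of runs."""
    if len(data) < 2:
        return list(data)
    # Split the input into at most `workers` contiguous blocks.
    size = -(-len(data) // workers)  # ceiling division
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 1: each worker sorts one block independently.
        runs = list(pool.map(sorted, blocks))
        # Step 2: merge neighbouring runs pairwise until one run remains.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(lambda p: list(merge(*p)), pairs))
            if len(runs) % 2:  # carry an unpaired run over to the next round
                merged.append(runs[-1])
            runs = merged
    return runs[0]


if __name__ == "__main__":
    print(parallel_merge_sort([5, 3, 8, 1, 9, 2, 7], workers=3))
```

In standard CPython, threads do not execute Python bytecode in parallel, so this sketch illustrates the task structure rather than the speedups reported in the article.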
The article presented a theoretical analysis of the efficiency together with its practical verification. The proposed algorithm was proved to sort n elements in the maximum time $2n - \log_2 n - 2$ using n processors. Comparison tests have shown that the method is more efficient than other sorting methods, especially for big data sets. The study showed that the statistical stability of the proposed method is at a very good level. The results of benchmark tests confirmed the theoretical computational complexity. The presented parallelized sorting algorithm can be successfully used in database applications, especially in situations where a number of processors can be used to speed up the sorting process.
Future research is planned to further improve sorting performance. It will involve the parallelization of the developed versions of the heap sort and quick sort algorithms.

Figure 1. Sample sketch of data processing in NoSQL databases.

Figure 2. Sample presentation of parallel processing of the request between the user and the server in NoSQL databases.

Figure 3. Parallel merge of the first two numeric strings in the first step of modified merge sort.

Figure 4. Parallel merge of the two strings located in the temporary array.

Figure 7. Sample block diagram of the proposed parallel sorting procedure.

Figure 10. Comparison of the method using multiple processors in terms of operational time [ms].

Figure 11. Comparison of the method using multiple processors in terms of operational time [ti].

Figure 12. Comparison of sorting time for heap sort, quick sort, merge sort, and the parallel modified merge sort proposed in this article on 8 processors.

Table 1. The results of parallelized modified merge sort in [ms].

Table 2. The results of parallelized modified merge sort in [ti].

Table 3. Comparison of the coefficient of variation for [ms].

Table 4. Comparison of the coefficient of variation for [ti].