PCDM and PCDM4MP: New Pairwise Correlation-Based Data Mining Tools for Parallel Processing of Large Tabular Datasets

The paper describes PCDM and PCDM4MP, two new tools and commands capable of exploring large datasets. They select variables by computing the absolute values of Pearson's pairwise correlation coefficients between a chosen response variable and every other variable in the dataset. For each pair, they also report the corresponding significance and the number of non-null intersecting observations, and all this reporting is performed in a record-oriented manner (both source and output). Optionally, by passing threshold values for these three quantities as parameters of PCDM, any user can select the most correlated variables based on high magnitude, significance, and support criteria. The syntax is simple, and the tools show the exploration progress in real time. In addition, PCDM4MP can trigger different instances of Stata, each using a distinct class of variables belonging to the same dataset, obtained by simple name filtering (first letter). This multi-processing (MP) version overcomes the parallelization limitations of the existing parallel module by using vertical instead of horizontal partitions of large flat datasets; by dynamically generating the task pattern, the tasks, and the logs within a single execution of this second command; and by using the existing qsub module to automatically and continuously allocate the tasks to logical processors, thereby emulating a cluster environment with fewer resources. In addition, any user can perform further selections based on the results printed in the console. The paper contains examples of using these tools on large datasets, such as the one belonging to the World Values Survey, based on a simple variable-naming practice. This article includes many recorded simulations and presents performance results. They depend on the different resources and hardware configurations used, including cloud vs. on-premises, large vs. small amounts of RAM and processing cores, and in-memory vs. traditional storage.


Introduction
Recently, many concerns have emerged regarding the replicability of scientific findings reported in various publications as results of experiments and data analysis. In many cases, other researchers have to re-implement and adapt code to validate the findings, or even to replicate the data analysis or the computation using the same data, procedure, methodology, and even code or script sequences [1,2].
Nowadays, many statistical tools (SPSS, R, Matlab, Minitab, SAS, Stata, etc.) encourage replicability through consistent support for data analysis, statistical calculations, visualizations, advanced tests, and automatic reporting of results, as well as aid for community contributions and versioning. The latter concerns both the main software version for which a certain command was written (https://www.stata.com/features/overview/integrated-version-control/, accessed on 1 June 2022) and the release marker telling the program's version in the proprietary tracking scheme (https://www.stata.com/support/faqs/programming/release-marker-versus-version-number/, accessed on 1 June 2022). Stata (https://www.stata.com, accessed on 1 June 2022) benefits from all of these [3][4][5][6][7], and it successfully combines a friendly user interface with support for power users and programmers. Many new Stata programs and commands have been introduced to serve different purposes. Among them are those used for data mining (a crucial component of business intelligence [8,9], or even of the dedicated Cross-Industry Standard Process model, CRISP-DM [10]) and for variable selection, such as stepwise [11], with forward and backward selection, or the LASSO package [12]. The latter has different components: for instance, CVLASSO can perform cross-validations on random subsamples, while RLASSO places a high priority on controlling overfitting [13,14]. In addition, overfit.ado calculates shrinkage statistics to measure overfitting [15]. Moreover, PCA (Principal Component Analysis) [16] allows the estimation of parameters for principal-component models. It is also worth mentioning here Bayesian Model Averaging (BMA) and weighted-average least squares (WALS) for estimating linear regression models with uncertainty about the choice of the explanatory variables [17].
The Boosting technique for decision tree classifiers [18] also has a well-defined place among the exploratory methods. Still, the boost plugin in Stata is too time-consuming in terms of execution, and it has limited capabilities regarding automatic variable selection and treatment of missing values [19]. Also worth mentioning are tools able to compute maximum probability thresholds in some visual representations known as risk-prediction nomograms, generated using the nomolog command [20].
In terms of parallel approaches, we mention early contributions focused on computing the "information gain" using MapReduce jobs executed on Hadoop Clusters [21,22] or the open-source distributed machine learning library, namely MLib [23] and other more recent methods and techniques in Apache Spark [24,25] and Mahout [26][27][28]. In addition, it is worth referring to other new approaches that focus in particular on computing Pearson's correlation coefficients, such as ForkJoinPcc, which uses the parallel MATLAB APIs [29] to mimic the well-known parallel programming model, namely the fork-join model.
In this paper, we describe new exploratory tools, namely PCDM and PCDM4MP. They serve data-mining and variable-selection purposes and are also two new dedicated commands for Stata. They rely on pairwise correlation computation and print easy-to-copy, filterable results in the console. Their design enables them to support the rapid selection of the variables most correlated with the one specified right after the command name (the target), even without knowing and stating the names of the rest of the variables in a dataset. The latter is an advantage that makes them reliable data-mining tools. For PCDM, we also considered a direct but more complex filtering scenario, which takes a set of three most important values as parameters. The first corresponds to a threshold for the correlation coefficients (minimum accepted absolute value). The second describes the minimum accepted number of valid observations at the intersection of every single pair of two variables, meaning the target one and each of the remaining. The third is the maximum accepted p-value [30,31]. For PCDM4MP, the focus was on speed via multi-processing.
A dataset from the World Values Survey (WVS) (the Data Availability Statement at the end of this manuscript and the video instructions in the 1st recorded simulation, namely 1.download test-data from WVS(TS-v1.6).mp4, https://drive.google.com/u/0/uc?id=1wiwHo1gYrmccZYoJB4y1kjgcVQdVfZwE&export=download, accessed on 1 June 2022) proves the usefulness of PCDM in real-world scenarios with large amounts of historical data [34][35][36]. The WVS is one of the biggest cross-national, non-commercial, time-series empirical surveys of human beliefs and values ever conducted. It is also a representative comparative social investigation conducted globally in over 100 countries and includes seven waves applied once every five years (from 1981 to 2020). The WVS has served many kinds of research and studies [37][38][39][40][41][42][43]. The starting point was the entire set of variables (1045) and observations (426,452) in this dataset, which was also loaded and exported as .csv using Stata (line 8, Figure 1). In addition, we performed a simple binary derivation of the variable to analyze (C033, Job satisfaction), considering the symmetric split of the original scale. C033 (original scale of 1 = Dissatisfied up to 10 = Satisfied) was the starting point to generate C033_bin. This binary form has the value of 1 for all non-null initial values greater than or equal to 6 and 0 otherwise (but still for non-null original values; lines 3-5, Figure 1, preProcessingScript.do, https://drive.google.com/u/0/uc?id=1sQNtMANwM3DzP5CAl2u3Io-xD6f5_4xW&export=download, accessed on 1 June 2022). The first thing to do was to intersect the results obtained using both variable selections corresponding to those two forms of the outcome and the method based on PCDM and PCDM4MP in Stata (versions 16.0 and 17.0, MultiProcessing, x64, StataCorp, College Station, TX, USA). It meant computing and filtering on absolute values of pairwise correlation coefficients, their significance, and the corresponding number of observations. Additional filters served the latter after copying the results from the console into a spreadsheet tool. The alternative, also considered, was to demonstrate the use of optional arguments. Further selections relied on the LASSO pack and BMA (in both Stata versions above).
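The binary derivation described above is simple enough to sketch outside Stata. The following Python fragment is an illustration only (not the code of preProcessingScript.do); the cutoff of 6 is the one stated in the text, and None stands in for Stata's missing values.

```python
def derive_binary(c033_values, cutoff=6):
    """Symmetric split of a 1-10 satisfaction scale into a binary variable:
    non-null values >= cutoff map to 1, non-null values below it map to 0,
    and missing stays missing."""
    out = []
    for v in c033_values:
        if v is None:          # null stays null, mirroring Stata's missing handling
            out.append(None)
        elif v >= cutoff:
            out.append(1)
        else:
            out.append(0)
    return out
```

In Stata itself, the same rule is typically a generate/replace pair guarded by !missing(C033), as done in lines 3-5 of the authors' pre-processing script.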
In addition, we tried to find the most resilient predictors by using another method based on the Adaptive Boosting technique and to show which of them are among those obtained using the approach based on PCDM and PCDM4MP. Therefore, we first loaded the .csv dataset into the Rattle (https://rattle.togaware.com, accessed on 1 June 2022) (version 5.4.0) interface from R. Then, we used this technique for decision tree classifiers as an alternative data-mining round, considering the following default settings: Trees-50, Max Depth-6, Min Split-20, Complexity-0.01, Learning Rate-0.3, Threads-2, Iterations-50, Objective-binary logistic/logit. It benefited from the support of a virtual machine available in a private cloud described below.
Other correlation commands (e.g., correlate) further served each tested regression model (logit and OLS, Ordinary Least Squares). This time, they are critical only for the resulting and intersecting predictors, as maximum absolute values from their matrices with correlation coefficients (maxAbsVPMCC). In addition, the highest values of the computed Variance Inflation Factor, or OLSmaxComputVIF, were assessed against (no more than) the maximum acceptable ones (Equation (1)), or OLSmaxAcceptVIF [44][45][46], for each OLS regression model. The measurements also concerned accuracy, as AUCROC (better for larger values), meaning the Area Under the Curve of the Receiver Operating Characteristic [47,48]. The same goes for the information gain and model fitness via AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values [49][50][51], with lower such values meaning more information gain and a better fit.
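For reference, the quantities mentioned above follow their standard textbook definitions: VIF_j = 1/(1 − R_j²), AIC = 2k − 2·lnL, and BIC = k·ln(n) − 2·lnL. A minimal Python sketch (the helper names are illustrative, not part of PCDM or any Stata command):

```python
import math

def vif(r_squared):
    """Variance Inflation Factor for one predictor regressed on the others."""
    return 1.0 / (1.0 - r_squared)

def aic(log_lik, k):
    """Akaike Information Criterion for k parameters: lower means a better fit."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion: penalizes parameters more as n grows."""
    return k * math.log(n) - 2 * log_lik
```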
PCDM (https://drive.google.com/u/0/uc?id=1hRBn0tv5wSXFjUbzVumIfqvRasOCWGcY&export=download, accessed on 1 June 2022) is installable (download and copy to one of the ado directories (https://www.stata.com/manuals13/u17.pdf, accessed on 1 June 2022, Section 17.5.2 of the previous online .pdf manual), e.g., C:\ado\personal). The source script and syntax of PCDM (Listing A1, Appendix A) are easy to understand and allow two main types of use.
Figure 2. Simple usage scenario involving a single logical processing core (PCDM) with the real-time reporting of execution progress for a 1045-variable dataset (WVS). Notes: The asterisk (*) stands for all variables in the dataset. The first dot (.) is automatically generated by Stata after entering the command (pcdm C033 *).
The subsequent occurrences of dots (PCDM's feedback in Stata's console) followed by numerical values indicate zeros (0) followed by their decimal parts (e.g., .065 is actually 0.065, while -.1175 is actually −0.1175). The "e" followed by the minus ("−") and numbers indicates E notation, corresponding to scientific notation (1.4e-16 is actually 1.4 × 10^−16).
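Both conventions are machine-friendly: standard float parsers accept the leading-dot and E-notation forms as-is. A hypothetical Python helper for one copied result row (the four-column layout of name, coefficient, p-value, and observation count is assumed from the description here, not taken from PCDM's source):

```python
def parse_pcdm_line(line):
    """Split one whitespace-separated result row into (variable, r, p, n).
    float() accepts Stata's leading-dot (".065") and E-notation ("1.4e-16")."""
    name, r, p, n = line.split()
    return name, float(r), float(p), int(n)
```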
The first type of use is simple, meaning without optional parameters: specifying only the variable considered for analysis and the rest of the variables available in the dataset, either using a generic symbol (e.g., PCDM C033 *) or explicitly (e.g., PCDM C033 A170 C006 C031). The second means a more complex scenario (e.g., if_plus_mix_of_if_and_3arg.do, https://drive.google.com/u/0/uc?id=17HNpLZypindqT8hZarZjv3O1B80jn0z3&export=download, accessed on 1 June 2022) when benefiting from the use of the if data subset filtering option (supported by PCDM, lines 6 and 52, Listing A1) for filtering the dataset (e.g., on a certain country code: PCDM C033 * if S003==840) and three optional parameters (line 6, Listing A1, between square brackets, and Figure A1, Appendix A), namely:
• minacc, the minimum accepted absolute value (lines 21-29 and 59, Listing A1) of the correlation coefficient (its default value is 0, line 18, Listing A1);
• minn, the minimum accepted number of observations (lines 30-38 and 59, Listing A1) for each response-predictor pair (its default value is 1, line 19, Listing A1);
• maxp, the maximum tolerated p-value (lines 39-47 and 59, Listing A1) for a significance threshold, usually 0.05 or less (therefore, its default value is 0.05, line 20, Listing A1).
A simple use case relies on a single logical processing core. It also involves the real-time reporting of the number of execution steps out of the total (the same as the total number of variables in the dataset), along with the execution percentage (lines 61 and 62, Listing A1), and the printing of all results, or only those satisfying the three constraints above (if specified as arguments), in the Stata console (Figure 2).
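The selection logic can be sketched in a few lines of Python. This is an illustration of the idea only: PCDM itself delegates the correlation to Stata's pwcorr, the helper names here are invented, and the maxp filter is omitted because computing the p-value requires the t distribution.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r over the non-null intersection of two columns (pairwise deletion).
    Returns (r, n); r is None when fewer than two pairs exist or a column is constant."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None, n
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None, n
    return sxy / math.sqrt(sxx * syy), n

def select_variables(target, others, minacc=0.0, minn=1):
    """Keep variables whose |r| with the target and pairwise n meet the thresholds."""
    kept = {}
    for name, col in others.items():
        r, n = pearson_r(target, col)
        if r is not None and abs(r) >= minacc and n >= minn:
            kept[name] = (round(r, 4), n)
    return kept
```

For example, with minacc=0.5 and minn=3, a variable sharing only two non-null rows with the target is dropped on support grounds even if its correlation is perfect.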
In a more advanced scenario, the PCDM command (which should work on many platforms, only depending on the location of the personal .ado directory, https://www.stata.com/support/faqs/mac/personal-ado-directory/, accessed on 1 June 2022) appears as invoked inside another one (PCDM4MP, https://drive.google.com/u/0/uc?id=1_Gz37zgyfkKWoZO0J8JuEZ7ei-Q4mwaG&export=download, accessed on 1 June 2022). The latter was designed for multi-processing purposes (the video instructions in the 2nd recorded simulation, namely https://drive.google.com/u/0/uc?id=14_M-LdWMEtcfw75z1gl8a541VSs7brk6&export=download, accessed on 1 June 2022) in Stata, but only on a Windows physical or virtual machine (reading the number of existing logical cores only considered the case of a Windows OS: local nproc: env NUMBER_OF_PROCESSORS). Using the latter (Figure 3) involves only the target variable and two optional parameters (number of logical cores and destination disk for temporary results), without the rest of the variables and the optional arguments of PCDM. PCDM4MP is optimized for Windows, and it invokes the qsub parallel processing module in Stata [52]. Therefore, qsub is a prerequisite in the sense that it must be installed first (ssc install qsub, replace). PCDM4MP first displays the starting time (Listing A2, Appendix A, lines 7 and 8) and does the same when finishing (Listing A2, lines 152 and 153). In addition, it checks many things. One is the number of total existing (Listing A2, line 33) vs. allocated logical cores (Listing A2, lines 6, 34-43, and 136-138). The latter is optimized (if 'xc' > 'k' local xc = 'k', https://drive.google.com/u/0/uc?id=1_Gz37zgyfkKWoZO0J8JuEZ7ei-Q4mwaG&export=download, accessed on 1 June 2022) in order not to surpass the number of vertical splits of the dataset (k groups of variables, according to the starting letter, upper or lower case, in their names).
Other checks mean verifying whether more than one variable/no variable is used in the command call, or simply checking the number of variables in the dataset, its path, and the path of the Stata tool. PCDM4MP also creates a structure of folders on the root of a specified partition/disk (by default C, Listing A2, lines 6 and 44-53: C:\StataMPtasks, C:\StataMPtasks\queue, and C:\StataMPtasks\logs, Listing A2, lines 56-64) and a template file (C:\StataMPtasks\main_do_file.do, Listing A2, lines 65-93) working with two arguments: (1) the task number (Listing A2, lines 72, 124, and 128), in a maximum of two digits; (2) the starting capital or small letter (Listing A2, lines 80, 89, 124, and 128) for a group of variables to consider in a PCDM correlation command. At runtime, PCDM4MP (Figure 3) will start from this template and dynamically generate as many .do files/tasks (a maximum of 52, in the "queue" subfolder, Listing A2, lines 112-133) as there are variable groups starting with a given letter (upper or lower case). This design was chosen because many other organizations collecting large datasets use category coding of variables that start with a particular letter or combination of letters (e.g., in SHARE-ERIC, http://www.share-project.org/home0.html, accessed on 1 June 2022, all the variables about work quality start with "wq"). All these tasks will be managed by qsub, which is automatically used (Listing A2, line 139) by PCDM4MP. Consequently, there is no need for further user/custom scripts or setups to generate the template and the tasks, as indicated in the documentation of qsub. When generating tasks, PCDM4MP will also include log generation commands (Listing A2, lines 68-76 and 92), which are necessary to retrieve the results obtained in a parallel manner.
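The vertical split itself amounts to bucketing variable names by their first character. A compact sketch of the concept (case-sensitive, hence up to 52 groups as stated above; Python is used here for illustration only, since PCDM4MP does this in Stata):

```python
def partition_by_first_letter(varnames):
    """One group (and hence one generated .do task) per distinct starting letter."""
    groups = {}
    for name in varnames:
        groups.setdefault(name[0], []).append(name)
    return groups
```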
Finally, PCDM4MP will print all the logs (previously generated in the logs subfolder, Listing A2, lines 140-151) in the Stata console. Any user should further copy all this content into a spreadsheet tool, split it into columns using the programmatically generated space separator, and filter it to keep only the correlation results, including additional conditions for minacc, minn, and maxp. This fact (the user already being asked to copy and filter) is the reason why these three were no longer considered arguments (not even optional) when dealing with multi-processing tasks (PCDM4MP). PCDM4MP is not optimized to support filtering on data subsets (if) either, but this option remains easily available with the aid of a simple script pattern, namely use_filtering_script.do (https://drive.google.com/u/0/uc?id=1yjGsW0fwUi-PZgvlnlaKMy9GX9U40SaK&export=download, accessed on 1 June 2022) (just six simple command lines), able to extract, export, and reload only a data subset starting from the initial dataset and depending on one or more conditions.
Figure 3. More advanced usage scenario involving the version for multi-processing (PCDM4MP) and six logical processing cores on the 2nd hardware configuration described in this paper (Table 1, 4th column). Notes: Only the first two commands (the two lines starting with "use" and "pcdm4mp") are the responsibility of the user, while the rest is feedback from the PCDM4MP command in Stata's console. Otherwise, the same notes as in Figure 2.
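Once the copied log content is split into columns, the post-hoc filtering step described before Figure 3 is a plain row filter. An illustrative Python equivalent (rows as (name, r, p, n) tuples; the default thresholds here are the ones used in the worked example later in the paper):

```python
def filter_results(rows, minacc=0.2, minn=10000, maxp=0.001):
    """Keep only rows whose |r|, observation count, and p-value satisfy
    the three thresholds that PCDM accepts as optional arguments."""
    return [(name, r, p, n) for (name, r, p, n) in rows
            if abs(r) >= minacc and n >= minn and p <= maxp]
```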

Results and Discussion
The goal here is to demonstrate the usefulness of the PCDM and PCDM4MP commands, mostly in terms of simplicity and increased support for variable selection. The demonstration is based on the results of some tests with both PCDM and PCDM4MP, intersected with the ones obtained using other tools and techniques.
Although essentially based on pwcorr (https://www.stata.com/manuals/rcorrelate.pdf, accessed on 1 June 2022) (the pairwise correlation command starting from Pearson's product-moment method [53,54], line 52, Listing A1), PCDM has clear advantages over the already existing correlate or pwcorr. This is due to its filterable results in a tabular format (Listing A1, with the space separators programmatically generated at lines 17 and 59 using the display/di command, and Figure 2) vs. matrices with two headers (Figure A3, Appendix A). This applies in all cases, that is, when considering two or more variables for these already existing correlation commands.
Another advantage of PCDM over other selection methods (e.g., Stepwise, CVLASSO, RLASSO, or BMA) is given by its specific way of taking pairs of two variables (the chosen one, e.g., C033 or C033_bin, and each of the remaining ones). By doing this, and by reporting and filtering on the number of non-null intersecting observations, PCDM can avoid an annoying error, namely no observations, r(200), which other methods confront (Figure A2, Appendix A). The latter is clearly due to non-existent cases/observations at the intersection of all included variables; the impossibility of performing statistical computations, and thus the resulting error, is therefore expectable. In such cases, PCDM skips the pairs with such problems by using the error capture clause and error type checking with the aid of the _rc (http://www.stata.com/manuals/perror.pdf, accessed on 1 June 2022) (return code) built-in variable (lines 52 and 53, Listing A1).
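The difference between the all-variable intersection (which can be empty, triggering r(200)) and PCDM's per-pair intersection can be shown with a toy example. The helpers below are hypothetical, with None standing in for Stata's missing values:

```python
def listwise_n(columns):
    """Rows complete across ALL columns at once. If this is 0, commands that
    need the full intersection fail (Stata's 'no observations, r(200)')."""
    return sum(all(v is not None for v in row) for row in zip(*columns))

def pairwise_n(a, b):
    """Rows complete for just this pair of columns: PCDM's unit of work."""
    return sum(x is not None and y is not None for x, y in zip(a, b))
```

With three columns whose missing values never line up, the listwise count is 0 even though every pair still has usable observations.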
The first tests of PCDM concerned a simple scenario (the command used was: pcdm C033 *) with those three hardware configurations already mentioned using only a single logical processor core. The whole exploration of the same WVS dataset took between 85 and 124 s, depending on the hardware used (the second line in Table 1).
PCDM also withstood tests in another, more advanced scenario, with the command invoked inside the other one, which is optimized for multi-processing (PCDM4MP, Figure 3), on the same three hardware configurations above. PCDM4MP uses PCDM many times (in different sessions of Stata) and consequently involves many data loads. This means that PCDM4MP keeps track of the original location and number of variables of the last dataset loaded in the main session of Stata (Listing A2, Appendix A, lines 18-31), and it will send these details (the "main_do_file.do" multi-processing template/pattern, Listing A2, Appendix A, lines 77 and 84) to the automatically triggered sessions. The whole parallel exploration of the same dataset comprised 15 distinct unbalanced tasks/jobs (first column and last line in Table 1 and third column in Table 2), corresponding to the same number of variable groups starting with a distinct letter. It took between 36 and 112 s, between 29 and 38 s, or between 380 and 421 s, depending on the hardware used and the number of logical processor cores allocated (nalc, lines 3-10, and columns 2, 4, and 6 in Table 1). In most cases, this took more than the theoretical 1/nalc part of the amount consumed in the single-core approach (the second line in Tables 1 and 2). The exception was the unexpected speed-up (more than double) when going from one to two logical cores for the first two configurations. However, the parallel processing was fast enough. For instance, when using the first configuration (Xeon Gold 6240 CascadeLake, on a VM, Table 1, second column), the execution in the best performing parallel approach (four or six logical cores) was almost 3.5 times (=124/36) faster than using a single core. A lower ratio (~3) was recorded (=85/29) for the second configuration (Core i7 4710HQ, on a PM, Table 1, fourth column).
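The ratios quoted above are ordinary speedup figures; for completeness, a two-line sketch (illustrative helpers, not part of the commands; the 124 s, 36 s, 85 s, and 29 s figures are those cited from Table 1):

```python
def speedup(t_serial, t_parallel):
    """Observed speedup of a parallel run over the single-core run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, cores):
    """Fraction of the ideal cores-fold speedup actually achieved."""
    return speedup(t_serial, t_parallel) / cores
```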
Moreover, we tried to find out whether the specific optimum of four or six logical cores is also due to lower transfer speeds beyond a certain number of concurrent reads for the SSD, SCSI, or SATA storage devices used in these tests. The dataset used occupies 553 MB on all NTFS partitions, and we previously optimized the algorithm behind PCDM4MP to load only the vertical chunk (group of variables) used for computing the correlation coefficients and not the entire dataset ("use <var.-list> using <path/dataset-file>" (Listing A2, lines 78-85) instead of just "use <path/dataset-file>" for each different job running on a particular logical core). We noticed that for simultaneous uses of the same storage device (when loading a different part of the same data source into RAM) by each logical CPU (in all tested configurations), an unexpectedly increasing processing time corresponds to increasing parallelism (six or more logical processing cores used). This was more pronounced for the VM (lower CPU frequency and storage devices that involve rotating disks, Table 1, second column) than for the second configuration with a PM (higher CPU frequency and SSDs, Table 1, fourth column). For the latter (based on SSD), the load speed (from disk to RAM) is theoretically divided by the number of concurrent reads, while for the former, this division rule does not apply. This is primarily due to the impossibility of a read head simultaneously accessing several areas on a specific platter of the rotating disk, which translates into dramatic decreases in data loading speed and processing delays.
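The same column-pruning idea can be stated outside Stata: each worker reads only the variables its task needs, rather than the whole file. A Python/CSV sketch of the concept only (the actual commands rely on Stata's "use <varlist> using <file>" form):

```python
import csv
import io

def load_columns(csv_text, wanted):
    """Read only the named columns from CSV text: the vertical-chunk load
    each parallel job performs instead of loading the full dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{k: row[k] for k in wanted} for row in reader]
```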
However, in order to eliminate these differences while benefiting from the maximum possible reading speed, we also tested PCDM4MP on the first two configurations using one of the fastest RAM disk tools, namely ImDisk (https://sourceforge.net/projects/imdisk-toolkit, accessed on 1 June 2022), and two so-called "in RAM" partitions (the first, R, of 640 MB, hosting the WVS dataset file and meant for improving the read speed; the second, Z, of 64 MB, hosting the StataMPtasks temporary folder containing the .do task pattern file, the queue subfolder, and the one with log files, meant for improving the write speeds). As expected, some improvements in the processing time are easily noticeable (Table 1, third and fifth columns). However, its evolution with the increase in logical parallelism invalidates, beyond a certain threshold (six logical cores, as reported in Table 1), the inverse relationship between the two. To demonstrate that this evolution is not substantially influenced by the behavior of the qsub command on which PCDM4MP is based, we performed an additional simulation (the 6th recorded simulation, namely 6.pcdm-RaaS-IS(15x)RAMdisks-own MPsim without QSUB(same increased time).mp4, https://drive.google.com/u/0/uc?id=1ij-C4HLXVlAUO-f9yF5Ne4Sr7KrtxLdd&export=download, accessed on 1 June 2022). It used 15 cores simultaneously (the first hardware configuration and ImDisk), one for each of the 15 tasks corresponding to the variable groups.
This time, the corresponding scripts (namely own_sim-autorun15do_files.do, https://drive.google.com/u/0/uc?id=1rcB1MFN5gDMRKaff11KzFwesrMy8k5Qr&export=download, accessed on 1 June 2022, and own_sim-print15logs.do, https://drive.google.com/u/0/uc?id=1UTNFQb75dFn2oOkEmwUP61t9NEBvTPv9&export=download, accessed on 1 June 2022, together with the folder structure to copy on the target disk, in the archive StataMPtasks.zip, https://drive.google.com/u/0/uc?id=1ZXvnGSPQT4Qi-cTkkpfxBMyl3lplsezh&export=download, accessed on 1 June 2022) were generated without relying on qsub. The results were comparable to those (Table 1, the last line for columns 2 and 3) obtained using PCDM4MP, which finally invokes qsub. Still, for this case (15 logical cores working in parallel and covering all 15 tasks in one execution round), they are far from the theoretical optimum (the maximum of 28 s for the most consuming job/the last one that ends, task no. 5, the fourth column and sixth line in Table 2). The closest value when using the same hardware configuration (1st) is obtained with just four cores (Table 1, the third column and fourth line, namely 32 s). The alternative selection stage, based on Adaptive Boosting and some tuning parameters [56,57] in the Rattle library of R, served triangulation [58] as a scientific principle. It discovered, in a ranked way (Figure 4), the most important variables related to the one to analyze in its binary form (C033_bin). Additional filters (Figure 5 and the practical example at the end of the fifth recorded simulation, namely 5.pcdm4mp-RaaS-IS(16x).mp4, https://drive.google.com/u/0/uc?id=1iMdiIwDR_iiVv0C-Le1vF0lROJmvNjJ7&export=download, accessed on 1 June 2022) on the results obtained in the console further served analysis purposes.
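Before turning to the selection results, note that the "theoretical optimum" mentioned above is simply the duration of the longest single task: however many cores are available, a round cannot finish before its heaviest job. A small scheduling sketch (greedy longest-first assignment of task durations to cores; illustrative only, since qsub does the actual dispatching):

```python
import heapq

def makespan(task_seconds, cores):
    """Assign tasks longest-first to the currently least-loaded core and
    return the finishing time; never below max(task_seconds)."""
    loads = [0.0] * cores
    heapq.heapify(loads)
    for t in sorted(task_seconds, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + t)
    return max(loads)
```

With 15 cores and 15 tasks, each task runs alone, so the bound is exactly the longest task; measured times exceed it mainly because of concurrent data loading.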
Such results came after simple invocations of PCDM for both forms of the outcome ("pcdm C033 *" and "pcdm C033_bin *") and were previously copied into a spreadsheet file. The first step was the exclusion of C033 and C033_bin from the list of values for the input (a general common-sense condition). Next came the specification of the first constraint, namely ≥0.2 for ACC [59,60]. Another restriction (≥10,000) was a subjective one for the number of observations, meaning approximately two-thirds or more of the entire support for the variable to analyze (15,968 valid records, top of Figure 6). Moreover, an additional filter for the p-values (≤0.001) followed. After checking the results, only the following list of seven strong intersecting influences and corresponding variables emerged: A008, A170, A173, C006, C031, C034, and D002.
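As a conceptual aid only (PCDM itself is a Stata command), the selection logic above — keep a candidate variable when its non-null intersection with the response is large enough and the absolute Pearson coefficient is high enough — can be sketched in plain Python. All names below are illustrative, and the p-value constraint is omitted for brevity (the standard library has no t-distribution CDF; PCDM also enforces maxp):

```python
# Minimal sketch (not the PCDM source) of threshold-based pair selection.
import math

def pearson(xs, ys):
    """Pearson's r over two equal-length numeric lists (no missing values)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pcdm_like(response, candidates, minacc=0.2, minn=3):
    """Keep candidates whose non-null intersection with the response has at
    least minn observations and |r| >= minacc (maxp handling omitted)."""
    selected = {}
    for name, values in candidates.items():
        # Non-null intersecting observations, as PCDM reports per pair.
        pairs = [(x, y) for x, y in zip(response, values)
                 if x is not None and y is not None]
        if len(pairs) < minn:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if abs(r) >= minacc:
            selected[name] = (round(r, 3), len(pairs))
    return selected

resp = [1, 2, 3, 4, None, 6]
cands = {"A": [2, 4, 6, 8, 10, None], "B": [5, 1, 4, 2, 3, None]}
out = pcdm_like(resp, cands, minacc=0.9, minn=3)
```

Computing each coefficient only on the pairwise non-null intersection is what lets this style of scan avoid the blanket "No observations" failure discussed later.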
The nearest equivalent in PCDM commands ("pcdm C033 *, minacc(0.2) minn(10000) maxp(0.001)" and "pcdm C033_bin *, minacc(0.2) minn(10000) maxp(0.001)") of the user-mode visual filters (Figure 5) is also available and comes together with the corresponding results. For instance, in Figure 6, both resulting lists agree, except for the autocorrelation in the first reported line with values (just below the header, i.e., the second line printed in the console after invoking the PCDM command) and the seventh line with printed values (bottom of Figure 6), which shows the correlation with the source variable, namely between C033_bin and C033. The difference of one unit (1045 vs. 1046) between the total numbers of steps needed (Figure 6, corresponding to the total number of variables) is due to the derivation of the binary form of the outcome (lines 3-5 in Figure 1), performed between the first and the second invocation of PCDM. The reason for an additional test clause (if) to remove autocorrelation, which is not available in the source script of PCDM, is an efficiency-related one: it avoids oversizing the processing time, already large when using just a single logical processing core and large datasets such as this time-series version of WVS.
Mathematics 2022, 10, x FOR PEER REVIEW 14 of 29
Figure 5. Seven intersecting results after two selection rounds using PCDM in its simple format for both forms of the outcome and further visual filters in spreadsheet tools (Microsoft Office Excel). Note: The "E" followed by the minus sign ("−") and a value indicates the E notation corresponding to the scientific one (2.45E-269 is actually 2.45 × 10⁻²⁶⁹).
If further applying CVLASSO and RLASSO selection techniques in many rounds until no loss (2xLASSO.do, https://drive.google.com/u/0/uc?id=1Lw4mjmX1Ua2QDL-aVZxRWxE9dwNi0aRJ&export=download, accessed on 1 June 2022) for both forms of the outcome (C033 and C033_bin), they converge to a shortlist of just five intersecting variables, namely A170, C006, C031, C034, and D002. All five are also present in the list returned after using the Adaptive Boosting technique in the Rattle library of R (Figure 4). If additionally using BMA (2xBMA.do, https://drive.google.com/u/0/uc?id=1j8uK8EGxLEcroLWIrUsxIiPKEZS9h_Yb&export=download, accessed on 1 June 2022) as Bayesian Model Averaging [17] and considering A008 and A173 as auxiliary predictors, the posterior inclusion probability (pip) for these two seems to be lower than 5%, while for the rest of the predictors it is close to 99.99% when considering both forms of the outcome. When considering their pairwise correlation with both formats of the variable to analyze, these two auxiliary predictors are the only ones from the previous list of seven common possible predictors having an absolute value of the correlation coefficient below 0.3 (Figure 5), a value many authors [61–63] consider a low one.
Figure 6.
Similar intersecting results using PCDM on a single logical processing core for both forms of the outcome and all three optional arguments for specifying the minimum/maximum limits. Notes: The same as in Figure 2.
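For readers without Stata, the repeated-shrinkage idea behind the CVLASSO and RLASSO rounds can be mimicked with a plain coordinate-descent LASSO. The fragment below is a generic, self-contained sketch (not the 2xLASSO.do script, and the toy design is invented): coefficients shrunk exactly to zero drop out of the shortlist, which is how LASSO performs variable selection.

```python
# Generic coordinate-descent LASSO sketch (illustrative, not the paper's script).
def soft_threshold(rho, lam):
    """Soft-thresholding operator: shrinks toward zero, exactly zero inside [-lam, lam]."""
    return (rho - lam) if rho > lam else (rho + lam) if rho < -lam else 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)*sum((y - Xb)^2) + lam*sum|b_j| by coordinate descent.
    X is a list of rows; returns the coefficient vector."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Residual excluding feature j's own contribution.
            partial = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                       for i in range(n)]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Orthogonal toy design: the weak second predictor is shrunk exactly to zero.
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [2.0, -2.0, 0.1, -0.1]
beta = lasso_cd(X, y, lam=0.15)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

Running such a selection repeatedly "until no loss" (as done in the paper for both outcome forms) keeps only the predictors that survive every round.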
Of course, for obtaining robust regression models, further checks are required to eliminate reverse causality (Table A3, rev_cause_checks_logit.do, https://drive.google.com/u/0/uc?id=1cWajOE8ylkoy3gzKdpPBwR4mqqBnuO00&export=download, accessed on 1 June 2022) and collinearity issues (Table A4, collin_rem_and_comp_perf_checks.do, https://drive.google.com/u/0/uc?id=181SbannhNIjr9vgrE6JmsycmNxwL_Fgf&export=download, accessed on 1 June 2022), but not before performing additional derivations for these five variables (Table A1, additional_processing_script.do, https://drive.google.com/u/0/uc?id=1WRwIdmiBM3uBC66c6y-WMwAX74BqSkcV&export=download, accessed on 1 June 2022, and Table A2). However, all previous selections are easy to perform with the aid of PCDM, which becomes obvious when comparing this with the scenario of starting from all variables and using only CVLASSO, RLASSO, or BMA in Stata. In the latter case, the corresponding commands will return the same error mentioned above (No observations; BMA_and_LASSO_NoObsErrs.do, https://drive.google.com/u/0/uc?id=1ZHT4Ge7WhPjD8y8qUFK1BYuNK5jyLN7n&export=download, accessed on 1 June 2022). Moreover, PCDM also supports cross-validations on well-established criteria [64] or targeted ones [65,66] via a mix of the if statement for filtering the dataset and the three arguments presented above for filtering the correlation results obtained (if_plus_mix_of_if_and_3arg.do, https://drive.google.com/u/0/uc?id=17HNpLZypindqT8hZarZjv3O1B80jn0z3&export=download, accessed on 1 June 2022).
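The combination of a row-level if filter with the three threshold arguments supports simple split-sample cross-checks. The next fragment is only a hypothetical Python analogue (the record layout, the wave-based condition, and all variable names are invented): the same |r| threshold scan runs on two complementary subsets of rows, and only variables selected in both halves are kept.

```python
# Hypothetical analogue of mixing an 'if' row filter with a minacc threshold.
import math

def abs_corr(xs, ys):
    """Absolute Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return abs(num / den)

def select(rows, response, candidates, cond, minacc):
    """Scan candidates on the rows satisfying cond (the 'if' part) and keep
    those whose |r| with the response meets the minacc threshold."""
    kept = [r for r in rows if cond(r)]
    return {v for v in candidates
            if abs_corr([r[response] for r in kept],
                        [r[v] for r in kept]) >= minacc}

# Invented toy records: 'a' tracks the response, 'b' is noise-like.
rows = [{"wave": w, "y": i, "a": 2 * i, "b": (7 * i) % 5}
        for w, i in enumerate(range(1, 11))]
first = select(rows, "y", ["a", "b"], lambda r: r["wave"] < 5, minacc=0.8)
second = select(rows, "y", ["a", "b"], lambda r: r["wave"] >= 5, minacc=0.8)
stable = first & second  # variables surviving in both halves
```

Only "a" survives both halves here, illustrating how a targeted split can weed out unstable correlations.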
The methodology used in this paper also stands on the scientific principle of triangulation [58,67,68]. The latter means using various methods, techniques, and tools and obtaining results that agree across all of them: for instance, data mining based on pairwise correlation coefficients, Adaptive Boosting, BMA, LASSO variable selection techniques, reverse causality and collinearity checks, different regressions, post-estimations of accuracy and goodness of fit, maximum absolute values for correlation coefficients among influences and predictors, and dynamic thresholds for variance inflation factors.

Conclusions
Although essentially based on pairwise correlations, PCDM and its version for multiprocessing (PCDM4MP) are new tools compared with the existing ones (not just in Stata but also in R or Python). Both bring additional functionalities and serve for selecting the most important influences to include in regression and classification models. They also report the exploration progress in real-time, which depends on the hardware processing power (in most cases, the CPU specifications and the RAM and storage amount and speed) together with the number of variables existing and specified in a dataset. PCDM4MP also supports parallelism and emulates a cluster environment up to a certain level by triggering different instances of Stata, each using a distinct class of variables resulting from intuitive name filtering (first letter). The paper also describes this parallel version, which supports an approach oriented towards time-consuming data-mining tasks in Stata, and some benchmark results against the different hardware configurations used for processing. The description includes the automatic generation of a dynamic task pattern, tasks, and logs. The main consequence is that these tools reduce the time needed to generate filterable tabular results based on the absolute values of correlation coefficients and their corresponding significance and support, all reported in a record-oriented and transparent manner. In addition, thanks to their pairwise, variable-oriented nature, they successfully overcome annoying errors such as "No observations". The paper describes both tools and brings real-world examples of using large datasets to prove the support provided by PCDM and PCDM4MP for exploring reliable influences and even determinants of different variables to analyze.
The "e" followed by the plus sign ("+") and digits indicates the E notation corresponding to the scientific one (4.3e+05 is actually 4.3 × 10⁵).
Table A1. The outcome and the most resilient five possible predictors selected after using PCDM, LASSO, and BMA. Source: Own calculations in Stata. Notes: Robust standard errors are between parentheses; all raw coefficients above the parentheses emphasized using *** are significant at 1‰; green vs. red means better comparative performance and variables that are more likely to be predictors (green in the first column) rather than response ones (red).