An Efficient and Unique TF/IDF Algorithmic Model-Based Data Analysis for Handling Applications with Big Data Streaming

Abstract: As the field of data science grows, document analytics has become a more challenging task for rough classification, response analysis, and text summarization. These tasks are used to analyze text data produced by various intelligent sensing systems. Conventional approaches to data analytics and text processing are not well suited to the big data coming from intelligent systems. This work proposes a novel TF/IDF algorithm combined with a temporal Louvain approach to address this problem. The approach categorizes documents into hierarchical structures that expose the relationships between variables, which supports analysts in making essential decisions. Public corpora, such as Reuters-21578 and 20 Newsgroups, are used for the large-scale experiments. The results show the efficacy of the proposed algorithm in terms of accuracy and execution time across six datasets, validating its value for big text data analysis. Big data handling with Map-Reduce has enabled tasks such as categorization and sentiment analysis to be performed with higher accuracy on the input data, and outperforming the state-of-the-art approaches in accuracy and execution time on the six datasets provides proper validation.


Introduction
Document gathering is the process of collecting relevant data from big data [1][2][3]. Partition-based algorithms, such as K-means, EM, and sGEM, and rule-mining-based algorithms, such as Apriori, FP-Growth, and FP-Bonsai, are useful methods for document gathering; both families are used to group related data into the same cluster. However, these algorithms have drawbacks for document gathering. To rectify these drawbacks, this research paper presents techniques for data grouping using a custom-made TF/IDF algorithm on the Reuters-21578 and 20 Newsgroups datasets [4]. In addition to the categorization and clustering used for appropriate document gathering [5,6], categorization decides which set of predefined categories a document belongs to.
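Since the custom-made TF/IDF algorithm builds on standard term-frequency/inverse-document-frequency weighting, the sketch below shows a minimal TF/IDF computation over a toy corpus with scikit-learn; the sample documents, parameters, and library choice are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal TF/IDF sketch with scikit-learn (illustrative only; the paper's
# custom-made TF/IDF algorithm is not reproduced here).
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for documents gathered from a stream (placeholder text).
documents = [
    "oil prices rise as supply tightens",
    "central bank raises interest rates",
    "oil supply disruption lifts prices",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)   # shape: (n_docs, n_terms)

# Inspect the highest-weighted terms of the first document.
terms = vectorizer.get_feature_names_out()
row = tfidf_matrix[0].toarray().ravel()
top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5]
print(top)
```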

Literature Review
This section describes various Map-Reduce-based document clustering systems along with their techniques and methodologies. All of these systems focus on Map-Reduce-based document clustering over a considerable amount of data that is most likely to be big data [5,[14][15][16]. Dawen et al. proposed a new Map-Reduce-based nearest neighbor approach, called the optimization classifier, which is used for traffic flow prediction on big data [14]. They use the Hadoop platform for traffic flow prediction with offline distributed training (ODT) and online parallel prediction (OPP). Their proposed system is useful for improving data observation and classification. Finally, leave-one-out cross-validation is used to improve the accuracy of the particular dataset under the Map-Reduce-based nearest neighbor approach. The improvements in OPP and ODT also inform our research on text-based prediction. Andriy et al. (2016) ran into issues such as scarce storage, capturing replays, classifying and formatting, inadequate tooling for processing, scalable analysis, and storing log files, and proposed "Challenges and Solutions Behind the Big Data Systems Analysis" [15] to solve them. Their survey went on to discuss improvements in security and time analysis, as well as velocity, volume, variety, veracity, and value, as they apply to the log files; finally, they store and efficiently process the logs. These criteria help us improve the maintenance and handling of streaming data in an efficient way.
A prominent researcher, Ge Song, with his group members, analyzed "Big Data on Map Reduce." They implemented five published algorithms in experimental settings to control the large volume and dimensionality of the data [17], since processing becomes costly and time consuming as volume and dimensionality increase. They overcome the drawbacks of existing Map-Reduce programming by comparing k-nearest neighbor (kNN) implementations on Map-Reduce in terms of time, space complexity, and accuracy, and kNN is used to evaluate classification performance on ten different datasets. The raw data pass through three generic steps, namely preprocessing, partitioning, and computation, evaluated in terms of load balancing, accuracy of results, and overall complexity, using (a) pivots and projections, (b) distance-based and size-based partitioning, and (c) rounds of Map-Reduce. Finally, for large volumes of data, they showed that kNN with Map-Reduce handles the problem through these three steps of preprocessing, partitioning, and actual computation with the kNN approach. Hao Wang et al. introduced the Map-Reduce checkpoint with BeTL for handling massive amounts of data, with contributions including map task checkpointing, a combiner cache, enhanced speculation, resilient checkpoint creation, and a comprehensive evaluation [18].
These contributions enable the authors to build the Map-Reduce framework with the BeTL functions. Finally, the large number of relevant and redundant features is reduced from the large dataset, covering viewpoints such as no failures, diverse densities of failures, and heterogeneous environments. Wasi et al. studied Map-Reduce, focusing on YARN Map-Reduce for high-performance clusters as well as for workloads with multiple concurrent jobs [12]. They performed detailed optimization on the clusters and used priority-based dynamic detection to investigate the applicability of a similar design to big data processing. In 2016, Kun Gao proposed a continuous remembering-and-forgetting function and a deep data stream analysis (DDSA) model based on remembering [19]. They simulated human thinking with various data stream analysis algorithms, such as WIN, the Streaming Ensemble Algorithm, AWE, and ACE. Judged on classification accuracy, data stream analysis efficiency, and prediction stability, their proposed DDSA algorithm is well organized. Iwendi et al. provided a basis for key management techniques that use ant colony optimization for sensor data collection. Their path planning model for wireless sensor network nodes can be used to improve and safeguard the data collected from the base station to the nodes in an intelligent sensor system [20].
Meanwhile, the authors in [21] used intelligent data analysis to solve the problems of eHealth systems. Their research framework explored the influence of socio-technical factors that affect the user's adoption of eHealth functionalities to improve the public health system. The intelligent data-collection system they used shows accurate prediction results for improving eHealth systems for Chinese and Ukrainian users, based on raw data collected in both countries.
Judith discussed distributed storage and analysis of data [1] and proposed a document clustering analysis, namely optimal centroids for K-means clustering based on Particle Swarm Optimization (PSO). It is used to cluster documents accurately using Hadoop and the Map-Reduce framework. The proposed methods are applied to the Reuters and RCV1 document datasets; the final results show that accuracy and execution time are maintained efficiently. Leonidas explains query processing over big data streams, processing large-scale queries through incremental data analysis [22]. A distributed stream processing engine (DSPE) is used throughout the query evaluation lifetime together with a novel incremental evaluation technique. Their proposed technique is intended to handle a massive number of queries, even over sophisticated collections. They implemented the query processing description in MRQL Streaming on Spark for effective query processing.

Implementation for Proposed Research Methodology
The proposed research methodology mainly focuses on the document gathering process and extensive data analysis [10,22,23]. The document gathering process is composed of three main techniques: the frequent item-set-based method, FP-Growth, and FP-Bonsai. All three techniques are applied to massive data. After applying them, an adjacency matrix over the input documents is formed: each input document is connected to other documents through a particular repeated word, and if an input document is a single line, its adjacency entries remain void [24][25][26]. Document similarity is calculated based on document correlation, and document correlation is measured using the input relay streaming data. The relay streaming data use sources such as 20 Newsgroups (20K news items, 20NG), a citation network (6M user articles, ISI database), the LinkedIn social network (21M user ids, LinkedIn), mobile phone networks (4M stations, 100M customers), Reuters-21578 (21,578 documents, Reuters), and the Twitter social network (2.4M communities, 38M user ids, Twitter). All these data are handled by the Louvain method, which is used for finding communities in large networks. In our case, the Louvain method is an efficient way of identifying related documents in the massive data set. Regarding the test case, the result on the streaming data is calculated based on the Louvain community detection method; for Reuters-21578, the detection method is performed to detect particular words and the most relevant results from the data set [3,27].
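To illustrate the adjacency matrix construction described above, the following sketch connects documents that share a repeated word; the toy documents, the two-occurrence threshold for a "repeated" word, and the binary edges are assumptions made only for illustration.

```python
# Sketch: build a document adjacency matrix from shared repeated words.
# The toy documents and the "repeated word" rule (a term appearing at
# least twice in a document) are assumptions.
from collections import Counter
import numpy as np

documents = [
    "oil prices rise rise as oil supply tightens",
    "bank raises rates oil prices oil markets react",
    "single line document",
]

def repeated_terms(text):
    counts = Counter(text.lower().split())
    return {term for term, c in counts.items() if c >= 2}

repeated = [repeated_terms(doc) for doc in documents]
n = len(documents)
adjacency = np.zeros((n, n), dtype=int)

for i in range(n):
    for j in range(i + 1, n):
        # Connect two documents if they share at least one repeated word.
        if repeated[i] & repeated[j]:
            adjacency[i, j] = adjacency[j, i] = 1

print(adjacency)   # a single-line document with no repeated words stays unconnected
```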

Temporal Louvain Method in Proposed Summarization
When we apply the Louvain method to the Reuters-21578 corpus, it manages the individual words by detecting and extracting them from a large amount of data [28].
The reasons behind the Louvain method are:

1. Generally, all detections and extractions are taken in relation to similarity weights.
2. An accurate count of the input streaming data should be obtained through mathematical computation.
3. The computation might not match the approximate number of results.
4. To resolve the difficulties in the processing schemes and to validate the Louvain method, an application verification process is used: first, the methods are applied to the predefined dataset corpus, and then live streaming data are processed, as shown in Figure 1. A minimal sketch of the community detection step is given after this list.
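For concreteness, the following sketch runs Louvain community detection on a small document-similarity graph using networkx; the toy graph, edge weights, and the use of networkx's built-in Louvain routine (available in recent releases) are assumptions and do not reproduce the paper's pipeline.

```python
# Sketch: Louvain community detection on a small document-similarity graph.
# The toy graph and networkx's built-in Louvain implementation (networkx >= 3.0)
# are assumptions; the paper's actual pipeline is not reproduced here.
import networkx as nx

# Nodes are document ids; weighted edges encode shared-word similarity.
G = nx.Graph()
G.add_weighted_edges_from([
    ("doc1", "doc2", 3.0),
    ("doc2", "doc3", 2.5),
    ("doc1", "doc3", 2.0),
    ("doc4", "doc5", 4.0),
    ("doc3", "doc4", 0.2),   # weak link between the two groups
])

# Detect communities; each set of nodes is one community group (CG).
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
for cg_id, members in enumerate(communities):
    print(f"CG-{cg_id}: {sorted(members)}")
```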

This research investigation focuses on Reuters-21578, and the results of our word detection are mapped [18,29] in Table 1. The data set community group column shows how, in Reuters-21578, the total number of finalized documents is scheduled into community groups (CG); each community group is formed from similar streaming data [30,31]. The word occurrence ratio is found from three quantities, namely the documents analyzed per minute, the number of community groups, and the document ids per minute:

\[
\text{Words Occurrence Ratio} = \frac{\text{Total no. of Docs Analyzed per min} - \text{Total Community Groups}}{\text{Total no. of Doc IDs per min} - \text{Total no. of Docs Analyzed per min}} \tag{1}
\]

The result of the words occurrence ratio is then used to calculate the words detection ratio for the streaming data review.
The words detection ratio is calculated by an algorithm that sorts the detections from all the community groups (CG) over the interval between the approximate time per analysis (duration of processing) and the approximate time per detection (duration of detection) [19].
\[
\text{Words Detection Ratio} = \frac{\text{Duration of Processing}\left[\text{Words Occurrence Ratio}\right]}{\text{Duration of Detection}} \tag{2}
\]

It also takes into account the approximation of the final result with respect to the duration of processing and the duration of detection. At this stage, the result automatically manages the realignment with respect to the words detection ratio, as shown in Table 2.
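As a concrete illustration of Equations (1) and (2), the sketch below computes both ratios for a single community-group snapshot; the field names, the sample numbers, and the reading of the bracket in Equation (2) as a simple scaling of the processing duration are assumptions.

```python
# Sketch of Equations (1) and (2) for one community-group snapshot.
# The argument names and the interpretation of "Duration of Processing[...]"
# as a scaling by the occurrence ratio are assumptions for illustration.

def words_occurrence_ratio(docs_analyzed_per_min, total_community_groups,
                           doc_ids_per_min):
    """Equation (1): occurrence ratio from per-minute document counts."""
    numerator = docs_analyzed_per_min - total_community_groups
    denominator = doc_ids_per_min - docs_analyzed_per_min
    return numerator / denominator

def words_detection_ratio(duration_of_processing, occurrence_ratio,
                          duration_of_detection):
    """Equation (2): detection ratio, assuming the processing duration is
    scaled by the occurrence ratio."""
    return (duration_of_processing * occurrence_ratio) / duration_of_detection

# Example with made-up numbers for a single community group.
occ = words_occurrence_ratio(docs_analyzed_per_min=80,
                             total_community_groups=7,
                             doc_ids_per_min=1000)
det = words_detection_ratio(duration_of_processing=60.0,
                            occurrence_ratio=occ,
                            duration_of_detection=15.0)
print(f"occurrence ratio = {occ:.3f}, detection ratio = {det:.3f}")
```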

1. Community group (CG)-7: a smaller amount of data is tested for grouping with respect to the user comments. In this case, 1000 comments are applied to the word detection ratio test, and it yields 26.2% with 80 document identifications.
2. Community group (CG)-9: a larger amount of data is tested for grouping with respect to the user comments. In this case, 9000 comments are applied to the word detection ratio test, and it yields 1.1% with 85 document identifications.
In both cases, the word detection ratio varies slightly between the community groups. The reason is the difference in the duration of processing (via the word occurrence ratio) and the duration of detection. Finally, the community groups (CG) are reshuffled according to the word detection ratio, and the comments are then rearranged by user. This scheme is illustrated in Tables 3 and 4.
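The reshuffling step can be pictured as sorting the community groups by their word detection ratio, as in the minimal sketch below; the group labels and values are taken loosely from the two cases above and are illustrative only.

```python
# Sketch: reshuffle community groups (CG) by their word detection ratio.
# The group labels and ratio values are illustrative examples only.
community_groups = {
    "CG-7": {"comments": 1000, "doc_ids": 80, "detection_ratio": 26.2},
    "CG-9": {"comments": 9000, "doc_ids": 85, "detection_ratio": 1.1},
}

# Order the groups from highest to lowest detection ratio.
reshuffled = sorted(community_groups.items(),
                    key=lambda item: item[1]["detection_ratio"],
                    reverse=True)

for name, stats in reshuffled:
    print(name, stats["detection_ratio"])
```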

Temporal Similarity and Comparison Method
Concerning the community group (CG) illustration, the actual process returns the following comparison result. From Table 4, our research records the access medium, update schema, structure, integrity, and speed as the state of each dataset, together with its possible standards such as occurrence and detection.
For this consideration, Table 4 shows the six datasets; each dataset was analyzed concerning access medium, update schema, structure, integrity, and speed.

1. Each time, the similarity is checked against the streaming data (LinkedIn, Twitter).
2. In another case, the comparison is checked against the source datasets (20 Newsgroups, Reuters-21578).
3. Both the similarity and the comparison are verified on the (mobile phone network) input data.
Based on the above three verification steps, the comparison gives a higher value for both the input streaming data and the fixed data sets. Given this higher similarity and lower streaming data input, the occurrence and detection must be accounted for by each dataset whenever it is reshuffled according to the occurrence in the streaming data. Moreover, the occurrence rate and the amount of data to be analyzed are calculated with the help of Tables 3 and 4.
In the same manner, this research analyzes the data obtained from the other source (streaming data or fixed data).

Datasets and Setup
Reuters-21578 is a collection of 21,578 real-world news stories and news-agency headlines in English, organized under 135 different categories. It is distributed as 22 files, each containing roughly 1000 documents. Citation details are available for each document, including date, topics, author, title, and content; further documentation for Reuters-21578 is available at www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt.
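For readers who want to experiment with this corpus, one common (though not necessarily the paper's) way to load a distribution of the Reuters collection is through NLTK, as sketched below; note that NLTK ships the ModApte-style subset rather than the full 21,578 raw documents.

```python
# Sketch: load a distribution of the Reuters corpus via NLTK (illustrative
# only; the paper does not state which loader or split was used).
import nltk
nltk.download("reuters")       # one-time download of the corpus data
from nltk.corpus import reuters

print(len(reuters.fileids()), "documents")       # training and test news stories
print(len(reuters.categories()), "categories")   # topic labels

# Grab the topics and raw text of a single document.
fid = reuters.fileids()[0]
print(reuters.categories(fid), reuters.raw(fid)[:200])
```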

News Group
The 20 Newsgroups data set is publicly accessible; it has approximately 20,000 documents partitioned across different topics, originally collected and assembled per category. Although 20 Newsgroups is less popular than Reuters-21578, it is still used by many researchers (Baker and McCallum, 1998; McCallum and Nigam, 1998). The articles in this data set were posted to newsgroups, unlike the Reuters-21578 articles, which were taken from the newswire. Another big difference between 20 Newsgroups and Reuters-21578 is that its category set has a hierarchical structure.
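A common way to obtain the 20 Newsgroups corpus is scikit-learn's built-in fetcher, sketched below; the chosen subset and header-stripping options are assumptions and not necessarily the configuration used in this paper.

```python
# Sketch: fetch 20 Newsgroups via scikit-learn (illustrative only; the paper
# does not specify its exact loading pipeline).
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="train",
                                remove=("headers", "footers", "quotes"))

print(len(newsgroups.data), "documents")        # ~11k training posts
print(len(newsgroups.target_names), "topics")   # 20 topic names
print(newsgroups.target_names[:5])
```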
The citation network dataset is collected to provide a global analysis of research. All of it is gathered from DBLP and ACM sources, as shown in Table 5. Nine versions are available, comprising 16,725,563 papers, and those papers have 16,725,563 citations. Each citation record contains the title, authors, year, abstract, and venue. This data set is clustered and mapped to relevant data and arranged according to the network and side information [14,31,33]. This modeling, clustering, and mapping process is useful for discovering significant titles, authors, years, abstracts, and venues of papers [34,35].

LinkedIn Network
The LinkedIn network dataset is social-scientific research data for visualizing and analyzing end-user usage patterns. Users can be verified and shortlisted based on their activities and interactions with others. It builds on visualization and network metrics that connect to sociological research; here, user interactions may be connected to the server for interpretation.

Performance Evaluation and Discussion
An essential concern in cluster analysis [14,17,36] on big data is the evaluation of the clustering and its consequences for dataset handling [5]. Evaluating a colossal amount of data means analyzing the output to understand how well it reproduces the original structure of the data. However, estimating clustering quality on big data is the most complicated task within the whole workflow. To evaluate the performance of the proposed model, three performance metrics are used: analysis of clustering accuracy, analysis of execution time, and comparison of quality with the existing methods.

Investigation of Accuracy and Execution Time
Let "E n(c) " denote the number of elements lying in a selected data set (DS n ) and let "E n(i) " be the number of elements of class (i m ) in the selected data set (DS n ). Then, the purity investigated accuracy (S n ) of the selected data set (DS n ) is defined as follows: Accordingly, the overall accuracy, namely the clustering quality of the selected data set, is defined as follows.
\[
\text{Clustering Quality}(S_n) = \sum_{S=1}^{N} \frac{E_{n(i)} + E_{n(c)}}{n(i) + n(c)} \cdot Acc(S_n)
\]

From these accuracy and clustering quality definitions, the investigated accuracy is higher. The proposed method is then compared with traditional k-means and k-means + PSO on distributed, centralized, and streaming data inputs. When the traditional methods are compared with the proposed method in terms of accuracy, our proposed system gives a higher accuracy rate for big data analysis. Table 6 shows the different techniques for clustering and analysis; our proposed system produces higher accuracy than traditional k-means and PSO. From Figure 2, it is observed that a streaming-based system provides higher accuracy with the proposed method on the different data sets; the proposed algorithm provides 27.24% higher accuracy compared to traditional k-means and PSO on the different data sets.
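The sketch below illustrates a purity-style per-cluster accuracy and a size-weighted overall clustering quality; because the exact formula for Acc(S_n) is not reproduced in the extracted text, the majority-class purity and the size weighting used here are assumptions.

```python
# Sketch of purity-style accuracy and an overall clustering-quality score.
# The majority-class purity used for Acc(S_n) and the weighting by cluster
# size are assumptions, since the exact formula is not shown above.
from collections import Counter

def cluster_purity(labels_in_cluster):
    """Acc(S_n) assumed as: majority-class count / elements in the cluster."""
    counts = Counter(labels_in_cluster)
    return max(counts.values()) / len(labels_in_cluster)

def clustering_quality(clusters):
    """Size-weighted average of per-cluster purity over all clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_purity(c) for c in clusters)

# Example: true class labels of the elements assigned to each cluster.
clusters = [
    ["earn", "earn", "acq"],          # mostly 'earn'
    ["grain", "grain", "grain"],      # pure 'grain'
    ["acq", "earn"],                  # mixed
]
print(round(clustering_quality(clusters), 3))   # 0.75 for this toy example
```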
Table 7 shows the different techniques for clustering and analysis; our proposed system again produces higher accuracy than traditional k-means and PSO. The comparative results for the three techniques, namely k-means, k-means + PSO, and the proposed system, are shown in Figure 3, where accuracy is comparatively high in the streaming-based schemes. From Figure 4, the observed method provides 76.24% faster execution time when compared to k-means and k-means + PSO on all datasets, namely the 20 Newsgroups, citation network, LinkedIn social network, mobile phone networks, Reuters-21578, and Twitter social network datasets. Figure 4 also shows the faster execution time compared to the traditional algorithms; in particular, the execution time is efficient (69.68%) on the Reuters-21578 dataset with the proposed scheme.

Conclusions
The rapid development of research has driven abrupt growth in big data handling using Map-Reduce, where simple conventional methods fall short and categorization, sentiment analysis, and map-reducing remain demanding tasks for deriving higher-quality accuracy from input data. Work on Map-Reduce for big data analysis has mostly concentrated on developing efficient big data management, which contributed to the furtherance of the Louvain method. The proposed temporal Louvain approach allows the analyst to represent the complex structure of streaming data and to turn it into knowledge about the simpler structure of big data handling. Finally, the experimental results show that the proposed algorithm performs significantly better on all six datasets in terms of accuracy and execution time, and both metrics benefited from the custom-made TF/IDF algorithm combined with the temporal Louvain approach. The various stages and dataset models can be parallelized to further improve accuracy and execution time [37][38][39], and further combinations of approaches and datasets can be probed for better big data handling.