Real-World Data Difficulty Estimation with the Use of Entropy

In the era of the Internet of Things and big data, we are faced with the management of a flood of information. The complexity and amount of data presented to the decision-maker are enormous, and existing methods often fail to derive nonredundant information quickly. Thus, the selection of the most satisfactory set of solutions is often a struggle. This article investigates the possibility of using the entropy measure as an indicator of data difficulty. To do so, we focus on real-world data covering various fields related to markets (the real estate market and financial markets), sports data, fake news data, and more. The problem is twofold: first, since we deal with unprocessed, inconsistent data, additional preprocessing is necessary; second, we use an entropy-based measure to capture the nonredundant, noncorrelated core information from the data. The research is conducted using well-known classification algorithms to investigate the quality of solutions derived from the initial preprocessing and the information indicated by the entropy measure. Eventually, the best 25% of attributes (in the sense of the entropy measure) are selected, the whole classification procedure is performed once again, and the results are compared.


Introduction
At present, we are facing the problem of a large amount of data flowing from different sources. In the era of the Internet of Things (IoT) and big data, the challenge is to effectively use and present the acquired data without generating redundant information. Due to the size of the data available to decision-makers, it is nearly impossible to make any complex decisions manually. This difficulty is experienced even in machine learning algorithms, which must manage too many attributes, variables, and additional constraints, making the whole process lengthy and complicated [1]. As such, it is essential to simplify data in cases where decisions must be made very quickly and a decision support system is needed to maintain the decision-maker's sovereignty.
The main drawback of existing datasets is their assumed uniform structure. For data related to a single domain, the distribution of attribute values, the size of the data, and the overall difficulty of classifying a given dataset are expected to be on a similar level. However, in more general approaches, we often face inconsistency in the data, including the need to use additional knowledge from domain experts. In general, data available in repositories are mostly preprocessed and tailored to a particular problem (such as classification or regression), whereas the initially collected data may still be very complex.
The above problem has led to the construction of many complex algorithms and methods intended to decrease the complexity of the data used in the decision process.
Among these methods, we can emphasize approaches for reducing the number of variables included in the algorithm [2,3]. The idea of initially preprocessing the data by selecting features, removing redundant data, or replacing existing attributes with more general ones is not a new concept, and it has been studied in depth in many articles where an initial data limitation was needed. Examples of such feature selection methods can be found, for instance, in extensions of the Principal Component Analysis method; one of the newest review articles on this subject is [4]. A more general approach to feature selection involving swarm methods is presented in [5,6], and one of the newest review articles related to swarm methods is [7]. The second large set of algorithms used for feature selection comprises tree-based methods, in which attributes can be selected based on their importance in the process of building the tree (classifier). An example comparison of such algorithms can be found in [8].
In many cases, data dependencies are not linear, so more elaborate methods of variable elimination should be applied. For example, in the case of periodically important variables, or in situations where linear dependencies between elements are not obvious, different methods must be used to emphasize the crucial variables in the system. To avoid redundancy in the data, the selected variables should exhibit little or no mutual correlation. This requirement relates to the phenomenon of the illusion of validity described in [9]: people place confidence in results that are based on redundant data. Thus, in decision support systems and during attribute selection, the role of decision-makers can also be marginalized.
A method that effectively identifies the crucial variables present in complex data can be essential for the whole system's efficiency. However, when the data structure and its complexity make the data difficult or even impossible to process, the decision-maker faces a two-step problem: first, there is a need to adapt the data to fit the algorithm's input format. This can be achieved by additional preprocessing methods leading to a data format acceptable as the algorithm's input; however, the whole process may be lengthy and complex, often covering concepts such as filling in missing data, discretization, and scalarization. Dealing with missing data cannot be solved with simple methods, and the literature covers various approaches to this problem [10][11][12].
Thus, today we observe many algorithms dedicated to a particular domain, which, in contrast to general approaches, can deal with their problems more efficiently. However, general methods can still be beneficial, even as a starting point for emerging domains related to complex or big data. Our idea was to collect raw data from different fields and prepare it in a uniform, easy-to-analyze format based on decision tables. At the same time, we tried to use tools that were as general as possible, which unfortunately can lead to a decrease in classification quality; however, it maintains a generalized approach for all datasets.
Furthermore, we selected entropy as a concept that allows us to describe the disorder of the data. By disorder, we understand here a measure of complexity, where more complex data (in which fewer dependencies between objects and attributes are visible) are characterized by higher entropy values. Therefore, we assumed that an increase in entropy can be equated with an increase in data difficulty. This assumption is verified by performing the actual classification on various datasets; eventually, the results of classification on the full set of attributes and on the subset generated on the basis of entropy can be compared. It is expected that high entropy should lead to less effective classification.
The entropy measure is considered from the point of view of all attributes. Thus, it is possible to identify the attributes with small disorder values (smaller entropy values). A subset of attributes with small entropy could be used to perform the classification while the data is limited.
In our data, a clear distinction exists between conditional attributes and the decision class. Data from various fields cover different numbers of objects as well as different numbers of attributes; however, the common goal is to perform a classification task on the presented data. The second step of our research focused entirely on estimating the impact of the entropy-based measure on the classification task. First, we tried to determine whether entropy can be effectively used to indicate data difficulty. Eventually, we investigated the results of the classification of the data. We expected that, initially, all conditional attributes analyzed in a dataset could be treated uniformly (i.e., have similar entropy values). Thus, the main questions were: is there a correlation between the entropy values and the quality of classification, and can the entropy-based measure be used to select the best-fitted attributes for the classification problem? To summarize, our research steps were as follows:
• initial preprocessing of real-world data;
• entropy calculation for the different datasets;
• classification on all datasets;
• selection of the best-fitted 25% of attributes based on the lowest entropy measure;
• comparison of the classification results for the full and limited sets of attributes on the different datasets.
To generalize our observations as much as possible, we tried to select data from various fields and describe the whole preprocessing framework with the use of domain knowledge presented by experts from different fields. Moreover, this preprocessing schema allowed us to use a general data format, which can be effectively used in entropy calculation and, finally, in classification problems.
The paper is organized as follows: In the next section, we present the related studies. In Section 3, we discuss the theoretical background related to the subject, including a description of entropy, decision tables, and efficiency measures used in classification tasks. Section 4 contains a description of the real-world data covering different domains. Section 5 presents the results of our experiments based on entropy calculation as well as the classification problem. Eventually, we conclude the study in Section 6.

Related Works
By classical entropy, we understand the measure of uncertainty associated with some data. The idea was introduced by Shannon in 1948 [13] and was further extended, for example, by Rényi and Tsallis [14,15]; Rényi entropy is a generalization of Shannon entropy for specific parameter values.
The classical entropy measure is used as a crucial element in many different algorithms and methods. Among the most prominent examples is the well-known classification algorithm C4.5, developed by Quinlan [16] as an extension of the ID3 algorithm [17]. In both examples, entropy is used as a measure to generate a classifier (a decision tree). In C4.5, entropy is used at every step of the algorithm to calculate the information gain for each attribute available in the dataset. A similar idea is used in the greedy heuristic ID3, where, once again, the attribute used as the split criterion is the one with the highest information gain. Such an approach has been successfully used in machine learning [18] and signal processing [19].
Entropy is often used as an element of broader methods rather than a standalone measure. It has a role in novel metaheuristics such as an extension of classical particle swarm optimization [20]. In [21], it was used as an alternative approach to the concept of fuzzy sets to measure the uncertainty of the task in a task assignment problem. Entropy was used as an extension of the binary classification problem solved by particle swarm optimization [22]. In many articles, entropy has often been used as a replacement for classical measures such as variance [23].
Entropy mixed with the concept of fuzzy sets was included in an outlier detection approach [24]. In [25], entropy was included as a part of the feature selection mechanism based on fuzzy sets. Finally, a more complex approach, including the fuzzy multicriteria approach based on the TOPSIS method, was presented in [26].
Entropy has been used in many different approaches, for example, to measure randomness in a clinical trial [27]. In [28], entropy was introduced to measure the uncertainty of ordered sets. In general, it can be used as a measure in fields such as finance [29,30], chemistry [31], physics [32], and more. However, no works have used entropy as a general measure across different domains simultaneously. A separate direction of research is devoted to various extensions of classical entropy. In [33], the idea of measuring entropy on different scales (multiscale entropy) was presented. In the case of time-series data, the concept of approximate entropy is often used [34]. In [35], approximate entropy was extended into a measure called sample entropy, an idea further extended in [36]. Both methods have been used in different applications to address various dynamic aspects of systems.
Another prevalent extension of the classical measure is permutation entropy, effectively used as a nonlinear measure in fields such as cyber-security [37] and fault diagnosis in systems [38]. Some preliminary comparisons between the classical entropy measure and the Pearson correlation were introduced in [39]; there, the authors focused on spatio-temporal data derived from an Internet of Things system.
The idea of using entropy as a complexity measure is well-known, and it has been recently studied by many researchers. Among interesting examples, we mention [40], where information entropy was used to measure the genetic diversity in colonies. Another example covers the general idea of measuring the complexity of time series [41].
Entropy as a measure of diversity was presented in [42], where the authors used Shannon entropy to measure the urban growth dynamics for a case study related to real-world data from the city of Sheffield in the U.K. More complex examples related to health and perception can be found in [43,44]. In the first case, the authors used entropy-based concepts for knowledge discovery in heart rate variability, whereas in the second example, approximate entropy was used for EEG data. Finally, among the newest works from the medical domain, Coates et al. [45] used entropy in the Parkinson's disease recognition process.

Methodology
For a set of objects X, every element can be described by a vector of n conditional attributes x_atr = {x_atr_1, x_atr_2, . . ., x_atr_n}, where n is the number of conditional attributes. A decision class is denoted as x_class. Thus, every object is described by a pair (x_atr, x_class). For every conditional attribute, we have an attribute-value pair, and every attribute can have a numeric or symbolic value. In the case of attributes with continuous values, a discretization procedure, limiting the number of values of a single attribute, is often performed.
In classification problems, the decision class x class , including information about the decision class for a single object, has one of the values belonging to the decision class set of values.
In this article, we perform the preprocessing of real-world data, which allows transforming the initial raw data into a decision table defined as follows:

DS = (X, x_atr ∪ {x_class})     (1)

All analyzed data differ in terms of the size of the set X and the number of attributes in the vector of conditional attributes x_atr. We did not assume simplifications related to the cardinality of the decision class; thus, for some sets, this attribute is continuous, and an additional discretization procedure is needed. Eventually, for all datasets, the number of values in the decision class x_class is discrete.
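Where the decision attribute is continuous, a simple discretization such as equal-width binning can be applied before building the decision table. The sketch below is our own illustration; the paper does not fix a particular discretization procedure:

```python
def equal_width_bins(values, n_bins=3):
    """Map continuous values to bin indices 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, a continuous decision attribute ranging over [0, 6] split into three bins yields classes 0, 1, and 2, which can then serve as a discrete x_class.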

Entropy as a Measure of Classification Uncertainty
In line with our aim, we wanted to explore the possibility of using entropy as an indicator of data difficulty; therefore, we treated entropy as a measure of classification uncertainty. In addition, we explored how the data can be simplified by using only attributes selected according to their entropy values. For this purpose, we also examined the information attribute to assess the usefulness of entropy for data simplification.
Assuming that several different symbols describe the information, entropy in its basic form can be calculated as follows:

E = − ∑_{i=1}^{|C|} p_i log(p_i)     (2)

where |C| is the number of different decision classes, and p_i is the probability of occurrence of the i-th decision class. With such a definition, entropy can be understood as a measure of data complexity: with an increasing number of decision classes available in the data, the overall complexity increases. In the most trivial case of a single decision class, the p_i value equals 1, and log(p_i) is zero (as is the entropy); any increase in the number of classes leads to higher entropy. The value of the information attribute (Equation (3)) is determined for each conditional attribute to establish how it can change the entropy of the decision table DS. The resulting value determines the entropy that can be obtained by considering that attribute.
The information attribute is thus based on the entropy calculated over decision classes (Equation (2)), but computed over the cases grouped by the values of the attribute being analyzed.
Formally, the information attribute is written as

info_att(k) = ∑_{i=1}^{m} (|DS_i| / |DS|) · E(DS_i)     (3)

where k is the index of the attribute being analyzed, m is the number of possible values of the k-th attribute, and |DS_i| is the number of instances having the i-th attribute value (analogously, DS_i is the subset of the decision table DS restricted to the i-th value of attribute k). Note that this computation is repeated for each attribute.
In our considerations, info_att is crucial for simplifying the dataset. For each decision table DS with n conditional attributes, the values

(info_att(1), info_att(2), . . ., info_att(n))     (4)

are determined. This observation is used for further analysis.
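The measures above can be sketched in a few lines of code. This is our minimal reading of Equations (2) and (3), not the authors' implementation; `best_quarter` mimics the selection of the 25% of attributes with the lowest info_att values, with rows given as (attribute tuple, class) pairs:

```python
import math
from collections import Counter

def entropy(classes):
    """Equation (2): E = -sum p_i * log2(p_i) over decision classes."""
    n = len(classes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(classes).values())

def info_att(rows, k):
    """Equation (3): class entropy weighted over the groups sharing
    the same value on conditional attribute k."""
    groups = {}
    for attrs, cls in rows:
        groups.setdefault(attrs[k], []).append(cls)
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def best_quarter(rows):
    """Indices of the 25% of attributes with the lowest info_att (at least one)."""
    n_attrs = len(rows[0][0])
    ranked = sorted(range(n_attrs), key=lambda k: info_att(rows, k))
    return ranked[:max(1, n_attrs // 4)]
```

For a table where attribute 0 perfectly separates the classes and attribute 1 is uninformative, info_att is 0 for the former and maximal for the latter, so attribute 0 is retained.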

Classification Measures
In our research, we wanted to examine the classification quality using state-of-the-art machine learning algorithms. We chose decision trees (the CART algorithm) and ensemble methods: Random Forest, Bagging, and AdaBoost. To assess the quality of classification, in addition to the classical measure of classification quality (accuracy), we also used precision (also called positive predictive value (PPV)) and recall (also called true positive rate (TPR)). Notably, these are binary classification measures, i.e., for a dataset with only two decision classes. Real datasets often have more decision classes, and several methods can be used to generalize precision and recall. We wanted to provide as much information as possible in our solutions, so we computed precision and recall for each decision class.
Therefore, for PPV, the analyzed decision class is treated as positive and all others as negative, and analogously for TPR. In the definitions of the classification quality measures,

accuracy = (TP + TN) / (TP + TN + FP + FN)     (5)
PPV = TP / (TP + FP)     (6)
TPR = TP / (TP + FN)     (7)

we denote: TP, all correctly classified cases of the analyzed class; TN, all cases outside the analyzed class that were not assigned to this class; FP, all cases outside the analyzed class that were assigned to this class; FN, all misclassified cases of the analyzed class.
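The one-vs-rest counting described above can be sketched as follows (an illustrative helper, not the authors' code):

```python
def per_class_scores(y_true, y_pred):
    """Overall accuracy plus one-vs-rest (precision, recall) per decision class."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    scores = {}
    for c in set(y_true):
        # The analyzed class c is "positive"; every other class is "negative".
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        ppv = tp / (tp + fp) if tp + fp else 0.0
        tpr = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (ppv, tpr)
    return acc, scores
```

Each class thus receives its own precision and recall, matching the per-class reporting used in the experiments.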

Data Preparation and Preprocessing
In this section, we provide details of the real-world data used in further experiments. The data were collected from external sources and cover various fields. We adapted the raw data into a decision table format, described in detail in the previous section, to perform the tests based on the classification problem. All necessary steps for data processing are described in this section.
Regardless of the dataset, however, some general preprocessing steps were applied. Below, we list these steps with short descriptions.
• collect data in the raw format: the first step was to obtain the entire data; note that in some cases, these data were obtained from different sources.
Please note that the last step was applied to both the conditional attributes and the decision attribute (if needed). Moreover, these were general steps adopted for all data; additional steps were performed explicitly for selected data (for example, steps related to natural language processing), as described in detail in the subsections devoted to the different datasets.

Fake News Data
Universal access to the Internet created the possibility of the rapid creation and gaining of knowledge by users, which became a threat through the easy spread of false information in the form of fake news. Fake news aims to present users with a view that is not in line with reality or leads them to make wrong decisions or actions based on false information.
The problem of disinformation is best visible on social networking services and news sites, where fake news is spreading widely in the form of sharing, passing on to friends, or creating documents based on unreliable sources [46]. Therefore, it is essential to quickly classify the documents posted and adequately mark the articles as true or fake news. The subject matter of the documents from the fake news dataset is related to many different fields; in particular, it concerns political, media, and financial content, as well as current events [47,48].
Kannan et al. [49] claimed that preprocessing real text data for analysis using machine learning algorithms is always the longest stage and often amounts to around 80% of the total processing time. Therefore, to transform the fake news dataset into a decision table, we propose applying the statistical approach of natural language processing (NLP).
In the first step of NLP, the tokenization process is carried out, dividing a given text into the smallest unit (e.g., a sequence of words, bytes, syllables, or characters) called a token. The result is the creation of an n-gram model that is used to identify and analyze attributes used in natural language modeling and processing [50]. In our research, we used n-gram to define individual words from document titles, from which we additionally rejected words appearing on the stop word list. An example of a stop words list is presented in Figure 1.
Figure 1 (example stop words): a, an, about, are, be, is, was, will, as, how, by, for, of, from, in, on, at, or, and, the, that, these, this, too, what, when, where, who.
The next step in NLP is to perform the normalization process using two methods: stemming and lemmatization. The stemming method is used to extract the stem and the endings of words; eventually, similar words are replaced by the same base word [51]. Lemmatization consists of reducing a word to its basic form [52]. The purpose of the normalization process is to reduce the variability in the set of terms.
The final step in the NLP pipeline covered in this research is creating a word vector model as the document representation. Our vector model is presented as a matrix (Figure 2), where documents (dok_1-dok_n) are represented as feature vectors over particular attributes (at_1-at_n). In the model, we use a binary representation, where each value from the set {0,1} determines whether the word appears in a given document. In addition, the number of attributes is limited to the most common words in the titles of the documents. On this basis, the fake news dataset was transformed into a decision table consisting of attributes for the most common words and a decision attribute (decision) containing two classes (true or fake). The decision table structure consists of columns with conditional attributes and one decision column, whereas the rows correspond to all documents in the set. Conditional attributes are the words appearing most often in the text. The presence of specific words in the decision table strictly depends on the analyzed dataset; for this reason, the number of attributes is limited. Table 1 shows an example of the frequency of words (selected as conditional attributes) in the titles of true and fake news. Real text datasets are challenging to analyze due to the large number of attributes [53], which for the fake news dataset are single words. The distribution of attribute values across the decision classes (fake and true) is presented in Figure 3.
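The title-to-vector transformation described above can be sketched as follows; the stop-word subset, titles, and labels here are illustrative, not the paper's actual data:

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "in", "on", "and"}  # illustrative subset

def build_decision_table(titles, labels, top_k=5):
    """Binary bag-of-words decision table over the top_k most frequent non-stop words."""
    tokens = [[w for w in t.lower().split() if w not in STOP_WORDS] for t in titles]
    counts = Counter(w for doc in tokens for w in doc)
    vocab = [w for w, _ in counts.most_common(top_k)]
    # Each row: binary presence vector for the vocabulary plus the decision class.
    rows = [([1 if w in doc else 0 for w in vocab], lab)
            for doc, lab in zip(tokens, labels)]
    return vocab, rows
```

A fuller pipeline would apply stemming or lemmatization before counting; this sketch covers only tokenization, stop-word removal, and the binary vector model.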
For each attribute, there is one histogram ( Figure 3) consisting of two columns, which corresponds to the number of values for each attribute. The first column shows the number of objects (article content) in which the selected word does not appear (as an attribute value), while the second column shows the number of objects in which the selected word appears at least once. These numbers are shown in the chart. Additionally, each column shows the assignment of a word to the appropriate class: blue is the true class, and red is the fake class.
From this distribution of attribute values across the decision classes (fake and true), it can be seen that some words (such as word_3, word_7, word_17, and word_19) do not appear at all in the fake class: the right column is entirely blue. In the case of the first column, however, the division into both classes is nearly equal for almost all attributes.

User Websites Navigation Data
Electronic commerce (e-commerce) has become popular as the Internet has grown, with many websites offering online sales, and e-commerce activity is undergoing a significant revolution. The major challenges in research are the collection, identification, and adoption of data supplied by Internet services to provide actionable marketing intelligence.
The main difficulty in web usage mining is the procurement of the desired database, as the only information we can collect from users visiting a website is through tracing the pages they have accessed.
Data collected from log files must be processed before data mining techniques (based on machine learning algorithms) can be used. Then, the personalization process is performed in the six main steps generally used in the field:

1. Data collection: collecting the data from the server or the user side.
2. Data filtering: removing or correcting undesirable data, such as log information generated by crawlers.
3. User identification: identifying the user by IP address, cookies, or direct identification.
4. Session identification: tracking the activity of the same user.
5. Characteristics selection: selecting characteristics that can be useful for user behavior analysis.
6. User behavior analysis: studying the behavior of different users to select the dominant characteristics (i.e., those that change significantly from one behavior to another).
The main idea of analyzing the users' behavior during user navigation was to limit the users' sessions to 10 actions. Each action corresponds to a one-page view by the users. We chose the 10 actions limitation in the session because it was impossible to perform a pertinent clustering using less than 10 actions for the user session; the cluster was not significant enough, and differences between clusters were negligible.
Before the phase of navigation conditional attribute selection, the hierarchy of the website was derived. An example division of the site is as follows: first, we separated thematic websites to create universes, where the websites in each universe concern the same topic. Then, we divided the entire site into seven different universes, including:

1. Store: the main universe (for example, with product lists);
2. Quick order: direct purchases by entering the catalog reference;
3. Condition: terms of sale and shipping;
7. Various: all others, such as home pages.

The Store universe was divided into three levels of hierarchy: section, subsection, and subsubsection. Generally, the final product page corresponds to a subsubsection.
From this hierarchy, we selected the conditional attributes that describe user navigation on our commercial partner's website. The attributes are presented in Table 2 and are described as follows:
• User ID is a unique ID for each user, based on cookies;
• Session ID is the ID of a session within one day; each session is considered closed after 30 min of inactivity;
• Purchase is a binary value that shows whether the user made a purchase during the session;
• Discount code is a boolean value describing the presence of a discount code during the purchase;
• New user describes whether the user was recognized as one who had already made a purchase on the site;
• Source of navigation describes whether the user entered our commercial partner's site voluntarily (for example, via a search engine) or was directed to the site by a mailing campaign;
• Total time describes the length of a session in seconds;
• Total universe (1-7) represents seven attributes describing the time a visitor spends in each universe;
• Total no. of pages seen describes the number of all pages visited by the user during a session;
• No. pages universe (1-7) seen represents seven attributes describing the number of pages visited by the user in each universe;
• No. of universe, section, subsection, and subsubsection changes are four features describing the number of changes the user makes during navigation; if, for example, the user switches universe and then returns to the previous one, the value of this attribute equals 2;
• No. of sections or subsections seen are two attributes describing the number of different sections or subsections seen during the user session;
• No. of product pages seen describes the total number of product pages seen;
• No. of same product seen describes the sum of product pages that were seen several times.
For the decision attribute, we chose the binary attribute purchase, with decision classes "yes" and "no". All the attributes were normalized.
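The paper does not specify the normalization scheme; a common choice, sketched here as an assumption, is min-max scaling of each numeric attribute to [0, 1]:

```python
def min_max_scale(column):
    """Scale a numeric column to [0, 1]; constant columns map to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

Scaling per column keeps attributes such as Total_time (seconds) and page counts on a comparable range before classification.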
The distribution of attribute values across the decision classes for purchase is presented in Figure 4. The two colors correspond to the decision classes: blue indicates sessions not completed with a purchase, and red indicates sessions in which a purchase was made. As Figure 4 shows, some attributes do not discriminate the decision class; for example, the decision class distribution is identical for attributes such as Day/Month/Year, Hour_of_end, and Source_of_navigation. On the other hand, attributes such as Discount_code, Total_time, or New_user clearly indicate the purchase class. According to the presented data distribution, a user session ending with a purchase has the following characteristics: distinctive values of avg._no_of_pages_viewed and average_amount_of_time_spent_on_navigation; the customer is not visiting the website for the first time and has a discount code; and the customer does not spend much time in the Store section but frequently changes subpages within this category.

Real Estate Market Data
The real estate market has grown rapidly in recent years [54]. As a result, both the volume of data and the number of processed details have increased. Investors are looking for attractive properties from which profit can easily be earned. As customer habits change, so do the features of a particular property that are essential for buyers.
The change in investor and end-consumer behavior has led to the inclusion of additional details in property advertisements. Each advertisement is currently filled with much additional information, some of it structured and some provided only in descriptive text. The real estate market data used in this paper originated from actual advertisements presented on multiple Polish market web pages. The details of the adverts are often hidden inside the text describing a particular property. However, many details are presented in a structured form, allowing less sophisticated automatic scrapers to gather the data. For some of the conditional attributes, it is still necessary to perform more advanced processing. For instance, the_floor_number is provided as a number in the vast majority of cases; however, on some occasions, it is stated verbally, as in "ground floor" or "higher than the 10th floor". Most advertising portals do not provide sufficient validation of these data, which is why, during the data acquisition, we had to construct more detailed methods to handle the special types of values and data. A similar process had to be performed for geo-encoding the spatial data. In almost every advertisement, the exact address of the property was not given; only the street name and the city were described. Sometimes the street names had spelling errors, were not correctly placed on a map, or used an old street name from before the recent mandatory change of street names in Poland [55].
Notably, the process of acquiring data from web pages is complicated. The dataset used in the current study consists of the following conditional attributes:
• Build date is the year the property was built. This attribute needed extra preprocessing steps, as some of the records provided a textual representation such as "the late 1980s"; these dates were kept as given, without numerical processing.
• Total number of floors in the whole building. As mentioned before, more advanced NLP methods were applied to clean up the data.
• Building material. As materials change across the decades, a whole dictionary of construction materials was created using both automatic methods and expert knowledge. We also constructed a synonyms dictionary. The provided value was then compared to the dictionaries and cleaned up. This is a categorical attribute.
• Floor number on which the particular property is situated.
• Area of the property. Here, the vast majority of data were provided in square meters, but some of the land properties provided this value in acres, which had to be converted to an SI-derived unit of measure.
• Building type is a categorical attribute denoting the building type (e.g., semi-detached building, loft, etc.). Here, we used preprocessing techniques similar to those used for the building material.
• Condition state describes the overall condition of a property. As this is highly subjective and there are virtually no norms that can standardize this attribute, we used a two-fold approach. First, the value presented in the advertisement was taken as-is, compared to the dictionary of values, and corrected for spelling errors and synonyms. Second, the description text was analyzed to find keywords that could decrease or increase the overall condition of a property. For instance, if the property was marked as "ready to move in", but the description mentioned "painting needed" or "kitchen is not equipped with stove", the overall condition was decreased. Although this is considered a categorical attribute, current work involves introducing an order relation to items from the condition dictionary. Additionally, we are working on an image classifier that will automatically label the state of a property.
• Windows with which the property is equipped (wooden, PVC, etc.).
• Private ad is a dichotomous attribute indicating whether an advert was published by a professional dealer or a private party. As research has shown, these two types are constructed vastly differently. Most of the time, private advertisements have lower-quality photographs, but the description is more accurate and meaningful than in professional ads. The former often include additional costs in the description (such as a mandatory extra-paid parking space).
• Market type has two values: primary and aftermarket.
• Ownership type describes the legal ownership type of a given property.
The last attribute, the decision attribute, denotes the price per square meter. As this value can fluctuate widely, we transformed it using a simple discretization. Because of the nature of scraped data and the frequent necessity of repairing or transforming it (e.g., converting units of measurement between imperial and metric), this data is rather difficult to analyze. Furthermore, the many attributes, all interesting for the end-user, make this processing even more complicated.
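The discretization of the decision attribute can be sketched as follows. The PLN 1000 bucket width is an assumption, chosen only to be consistent with the "PLN 6000-7000 per square meter" bucket mentioned later for Figure 9; the paper does not state the exact bucket boundaries.

```python
def price_bucket(price_per_m2, width=1000):
    """Discretize price per square meter into fixed-width buckets.

    The width of PLN 1000 is an illustrative assumption consistent with
    the "PLN 6000-7000 per square meter" bucket from Figure 9.
    """
    low = int(price_per_m2 // width) * width
    return f"{low}-{low + width}"

print(price_bucket(6450))  # falls into the 6000-7000 bucket
```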
The distribution of attribute values in accordance with the decision classes was created, as shown in Figure 5. Please note that due to the many values in the decision class, there was no visible color distinction for each class. Even though the data were preprocessed extensively, some of the original erroneous values were left intact. This is the case for the area attribute, where one of the flats' areas is set to 349,000 square meters. This is clearly seen in the distribution plot, which is heavily skewed. The same applies to build_date (one building has its date set to 892,007; there are also typos such as 19,000 or 20,014, where an individual probably inserted an additional 0). Because the number of records with such mistakes is relatively small (less than 0.02%), the authors included these outliers in the dataset to determine their influence on the overall entropy and classification results.
It is clearly seen that most of the properties are situated below the fourth floor, which is expected, as such buildings are far easier to build in Poland than skyscrapers due to legal reasons. Owners tend to overestimate the quality of the interior; therefore, the vast majority of apartments have the "ready to move in" condition_state. Most of the analyzed apartments also have modern PVC windows.

Sport Data
Sport is a valuable part of many people's lives, understood both as physical activity and in terms of following individual teams or athletes. Football is the most popular sport in the world, with the European leagues being some of the most famous. Therefore, the top leagues from Germany, Italy, and Spain were selected for our analysis.
Numerous studies based on both expert analysis and machine learning techniques for predicting sports results can be found in the literature [56][57][58][59]. The most popular and accessible are predictions of match results in the form of win/loss/draw; however, both analyses and predictions may concern other elements such as the number of goals scored, the exact score, or the number of yellow cards [56,60].
The dataset was created from the tabular data available on a website [61]. To obtain complete information, the data were extracted using the scraping method from two tables. The first contains data about the league table; the second consists of information about individual matches. The tables were then combined to obtain a full decision table, which was divided into sets for each country. The conditional attributes included in the decision tables are presented below: • Season: the season in which the games were played: a nominal variable covering 10 consecutive seasons from 2011-2012 ("11/12") to 2020-2021 ("20/21"). The same attributes are available for the second team as for the first team; the conditional attributes for the second team are marked with "T2". The last attribute is the decision class (match_result), which can take three values: 1 indicates a win for team 1, 2 indicates a win for team 2, and X is a draw. Team 1 is the team playing on its home field; team 2 is the team playing away.
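The combination of the two scraped tables into one decision table can be sketched as a simple join; all column names and values below are illustrative assumptions, not the actual scraped schema.

```python
# Hypothetical league-table rows, keyed by (season, team), and match rows.
league = {
    ("11/12", "FC Example"): {"points": 34, "goals_scored": 40},
    ("11/12", "SV Sample"): {"points": 28, "goals_scored": 31},
}
matches = [
    {"season": "11/12", "team1": "FC Example", "team2": "SV Sample",
     "match_result": "1"},
]

def build_decision_table(matches, league):
    """Join each match row with the league-table rows of both teams."""
    rows = []
    for m in matches:
        row = dict(m)
        # attach league-table attributes for team 1 and, with a "T2"
        # suffix, for team 2, as described in the text
        row.update(league[(m["season"], m["team1"])])
        row.update({f"{k}_T2": v
                    for k, v in league[(m["season"], m["team2"])].items()})
        rows.append(row)
    return rows

table = build_decision_table(matches, league)
print(table[0]["points"], table[0]["points_T2"], table[0]["match_result"])
```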
Figure 6 shows example distributions for the data of the German Bundesliga. A significant part of the data is characterized by right-hand asymmetry, which is naturally related to the domain specificity of the data. Representative examples include Winnings_T1, Draws_T1, Goals_scored_T1, Goals_conceded_T1, and Points_T1. A team starts with a value of 0 for the number of games won/lost, goals scored/conceded, and the number of points. During the season, teams increase the values of these attributes, or they remain unchanged. This behavior contributes to the right asymmetry in the data. The distribution of Goal_difference_T1 is much closer to the normal distribution. In the decision class distribution, it can be seen that the most common values are related to the home team winning (red), followed by the visiting team winning (cyan) and a draw (blue). The latter two classes have much more similar counts. The same patterns are observed for the "Team2" data and for the other countries' leagues.

Financial Data
Within financial data, two main groups can be highlighted. The first is related to the well-known Markowitz model (and its extensions) and the portfolio selection problem, which is beyond the scope of this study. The second group is related to price and indicator data from various markets. In this group, the most popular data are obtained from the foreign exchange market (also known as the forex market) and concern currency pairs.
A single market indicator (or a group of indicators used jointly) is used in trading systems to generate buy signals. All indicator data were calculated according to market indicator formulas, which can be divided into two separate groups. The first covers trend-following indicators, which include the moving average (MA) market indicator. The MA for time t and s periods, denoted MA_s(t), is calculated as:

MA_s(t) = (1/s) Σ_{i=t−s+1}^{t} price_i,

where price_i is the value of the corresponding instrument at time i. In this context, the period is the number of values considered when calculating the indicator. The second group of indicators covers the oscillators, whose primary purpose is to indicate rising or falling potential for a given currency pair. The indicator value is calculated using the currency value and can include the closing, opening, minimum, or maximum currency pair value from previous sessions (or any combination of the above). As an example, the oscillator Relative Strength Index (RSI) is calculated based on the last s periods at time t as follows:

RSI_s(t) = 100 − 100 / (1 + avg_gain / avg_loss),

where avg_gain is the sum of gains over the past s periods and avg_loss is the sum of losses over the past s periods. All mentioned indicators are calculated based on the currency pair value, which was included in the data. The decision (BUY or SELL) is based on the indicator value at time t and its relation to the indicator value at time t − 1. Therefore, the general rule for opening a trade based on an indicator is that the signal is observed when the indicator crosses its level between two successive readings, i.e., ind_s(t − 1) < c ≤ ind_s(t) (or the symmetric condition for the opposite direction), where ind_s(t) is the value of indicator ind in the present reading t considering the last s readings, ind_s(t − 1) is the value in the previous reading, and c is the indicator level (different for each indicator) that must be crossed to observe the signal. As shown, the crucial aspect of generating a signal with an indicator is the value difference between two successive readings.
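The two formulas above can be sketched directly in code; the helpers below follow the definitions in the text (the RSI variant that sums gains and losses over the window), not any specific trading library.

```python
def moving_average(prices, s, t):
    """MA_s(t): mean of the last s prices up to and including time t."""
    window = prices[t - s + 1 : t + 1]
    return sum(window) / s

def rsi(prices, s, t):
    """RSI_s(t) = 100 - 100 / (1 + avg_gain / avg_loss) over the last s changes."""
    changes = [prices[i] - prices[i - 1] for i in range(t - s + 1, t + 1)]
    avg_gain = sum(c for c in changes if c > 0)
    avg_loss = -sum(c for c in changes if c < 0)
    if avg_loss == 0:
        return 100.0  # no losses in the window
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)

prices = [1.00, 1.02, 1.01, 1.03, 1.05, 1.04]
print(round(moving_average(prices, 3, 5), 4))
```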
Thus, we decided to include this information in our data in a limited way (in the case of the MA indicator). For the remaining indicators, a discretization procedure was performed, because the classification process in the experimental section accepts only a limited number of indicator values. The summary for each indicator is presented in Table 3.
Table 3. Discretization procedure for the market indicators. * In the rare cases where the indicator value exceeds the border value (cases with the word "above" or "below"), the indicator value is set to the border value.
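The clamp-then-discretize step from the footnote of Table 3 can be sketched as follows; the range borders and step used here are placeholder values, since the actual per-indicator settings are those given in the table.

```python
def discretize_indicator(value, low, high, step):
    """Clamp an indicator reading to its stated range, then snap to a grid.

    Mirrors the Table 3 footnote: readings above/below the border value
    are set to the border value. The range and step below are examples.
    """
    clamped = max(low, min(high, value))
    return low + round((clamped - low) / step) * step

# e.g. an RSI-like indicator on [0, 100] with a step of 5
print(discretize_indicator(63.7, 0, 100, 5), discretize_indicator(112.0, 0, 100, 5))
```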

Indicator Name | Range * | Discretization Step

Each of our readings in the data also included the decision, taken as one of the following values: STRONG BUY, BUY, WAIT, SELL, or STRONG SELL. Each set's decision was based on calculating the difference between the present instrument value and the value observed after p readings. This schema is presented in Figure 7. In this study, we examined p equal to 5. The distribution of attribute values in accordance with the decision classes was created, as shown in Figure 8. We selected example data for the AUDUSD instrument; however, a similar distribution of attribute values was noted for the remaining datasets. The blue color in the chart denotes the number of objects for which the STRONG BUY class was observed. The cyan color is related to the STRONG SELL class. Together, these two classes cover the majority of all objects in the data. The red color shows the objects belonging to the SELL class. The two remaining classes are BUY and WAIT.
In general, we can divide the whole attribute set into three categories. The first is related to the instrument price (Close in the chart) and the two moving-average indicators based on the price. For this category, we observe attributes for which several values have a reasonably high number of objects assigned. The second category is related to the same indicators, where the difference between two successive readings was calculated. This gives a distribution close to the normal distribution, where minor differences (close to 0) have a high number of objects assigned. Finally, the last category is related to the oscillator indicators, such as Bulls or OSMA, for which an approximation of the normal distribution is again observed. For these attributes, the relative change between successive readings was also included. The main problem in this data is that slight differences (the middle part of attributes 4 to 11) are frequently observed, while most of the information comes from the relatively significant differences (the tails of the distribution). Thus, the most promising attribute values are the least observed in the data.

Numerical Experiments
In this section, we describe the experiments we performed on different real-world datasets. For every set, the experiments consisted of four steps:
• calculation of the information for each conditional attribute (information attribute);
• classification of the obtained data;
• classification on the limited set of attributes (including the best 25% of the conditional attributes selected based on the information attributes), as well as classification on the set of attributes selected by the correlation-based approach;
• sensitivity analysis on the parameter defining the percentage of attributes included in the limited set.
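The first and third steps above can be sketched as follows, assuming the information attribute is the expected entropy of the decision class after splitting on the attribute (lower values indicate more informative attributes, matching the selection of the 25% of attributes with the lowest information values described later); this is a sketch, not the authors' exact implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_attribute(rows, attr, decision):
    """Expected entropy of the decision class after splitting on attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[decision])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def select_best(rows, attrs, decision, fraction=0.25):
    """Keep the fraction of attributes with the lowest information values."""
    ranked = sorted(attrs, key=lambda a: information_attribute(rows, a, decision))
    k = max(1, int(len(attrs) * fraction))
    return ranked[:k]

rows = [
    {"a": 0, "b": 1, "c": 0, "d": 1, "cls": "yes"},
    {"a": 0, "b": 0, "c": 1, "d": 1, "cls": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": 0, "cls": "no"},
    {"a": 1, "b": 0, "c": 1, "d": 0, "cls": "no"},
]
print(select_best(rows, ["a", "b", "c", "d"], "cls"))
```

The selected subset would then be fed to the same classifiers as the full attribute set, and the two sets of results compared.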
We selected a group of well-known, state-of-the-art classification algorithms: decision tree, Random Forest, Bagging, and AdaBoost. Two measures were used to estimate the quality of classification: the positive predictive value (PPV) and the true positive rate (TPR). Additionally, the accuracy of the classification was measured.
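The two per-class measures can be computed directly from confusion counts; a minimal sketch:

```python
def ppv_tpr(y_true, y_pred, cls):
    """Per-class precision (PPV) and recall (TPR).

    PPV = TP / (TP + FP); TPR = TP / (TP + FN).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return ppv, tpr

y_true = ["yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "no", "yes", "yes"]
print(ppv_tpr(y_true, y_pred, "yes"))
```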

Fake News Data
The fake news detection research was conducted on the ISOT Fake News Dataset provided by the University of Victoria, Canada [63]. This collection includes 44,898 documents, of which 21,417 are real news and 23,481 are fake news. Each document in the set is described with the following attributes. Additionally, to determine the decision class, the main file was divided into two separate files:
• FAKE: documents that were detected and marked by Politifact.com as untrue sources;
• TRUE: real documents from Reuters.com, accessed on 31 August 2021.
In our fake news detection experiments, the dataset was limited to the title and the decision (true or fake news) attributes only. This restriction allowed us to quickly label a document based on its title without analyzing its content. In our previous research [64], we showed that a fake news detection model analyzing only the titles produces accurate results and reduces the runtime of classification algorithms compared to analyzing the entire content of the document.
In the first step of the experiments, we calculated the entropy of the decision class (see Equation (2)) and the information for each conditional attribute, which were the most common words in the documents. Notably, the values of the decision table (frequency of the occurrence of certain words) are strictly dependent on the documents that comprise the set on which the algorithm was trained. For this reason, the number of attributes was limited to 20. The results of this experiment are presented in Table 4. As can be seen, almost all information values for individual attributes are close to the maximum entropy value (1.0), lying in the range of 0.958-0.998. The last row in Table 4 shows the entropy value for the entire dataset.
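As a quick check, the decision-class entropy for the ISOT class counts given above (21,417 real vs. 23,481 fake) is indeed close to the maximum of 1.0 for a binary decision:

```python
from math import log2

# Shannon entropy of the ISOT decision class: 21,417 real vs. 23,481 fake.
real, fake = 21417, 23481
total = real + fake
H = -sum(p * log2(p) for p in (real / total, fake / total))
print(round(H, 4))
```

The near-balanced split is why the per-attribute information values can only be compared against a baseline very close to 1.0.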
In general, it is difficult to determine the set of attributes that most impacts the classification results. Only attribute word_17 has an advantage over the other attributes, because its information value is visibly lower, amounting to 0.83. This means that based on a single attribute (in this case, one word from the document title), it cannot be determined whether the document is true or fake. Moreover, the conditional attributes differ for different sets of documents, which entails the possibility of entirely different entropy values.
In the next step of the experiments, the values of the classification evaluation measures were calculated using selected machine learning algorithms, which were derived for each of two decision classes (true or fake news). Table 5 shows the results for the classification of fake news data by decision class for all twenty attributes.
In the case of the decision class FAKE, PPV values were in the range of 91.38-98.88%, with the best result obtained using the decision tree, while TPR values were in the range of 46.05-58.67%, with the best result obtained with Bagging. In the case of the decision class TRUE, PPV values were in the range of 62.70-67.46% (Bagging was superior), and TPR values were in the range of 94.65-99.43%, with the best result obtained by the decision tree.
We also checked the influence of a limited number of attributes on the classification results. For this purpose, the 25% of attributes with the lowest information attribute values were selected (in this case, the top five attributes). The obtained results are presented in Table 6, where the values are similar to those in Table 5. This shows that with a significantly limited number of attributes (in this case, up to five single words per document), the classification results for the algorithms used are comparable to those for the full set of conditional attributes.
The classification accuracy values for the entire set were calculated for both numbers of attributes (5 or 20), and the results are presented in Table 7. As can be seen, for three of the algorithms, the accuracy was in the range of 74.17-75.49%, while for the decision tree, the accuracy was slightly lower at 71.51%.
When detecting fake news by title only, the classification accuracy measure determined how many documents were correctly classified. However, when using the PPV and TPR measures, it was possible to assess how many documents in a given class were correctly recalled and with what confidence (precision).

User Websites Navigation Data
A vital part of preprocessing the data is converting the raw data into a set of navigation attributes. During our research, we obtained data from our commercial partner covering one entire year, more than 85 GB in size. For our learning base, we used a sample covering one month. We chose April to avoid any marketing actions. The database for one month represents more than one million sessions with more than 10 actions performed. On account of the scale of the database, processing is time-consuming. After applying this limitation, we obtained 211,639 user sessions.
For entropy and classification analyses, we eliminated significantly correlated attributes such as total_amount. In the end, we obtained 31 attributes and one binary decision attribute, purchase.
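The elimination of significantly correlated attributes can be sketched with a greedy Pearson-correlation filter; the 0.95 threshold and the toy columns below are assumptions for illustration (the text only states that attributes such as total_amount were eliminated).

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def drop_correlated(columns, threshold=0.95):
    """Greedily drop each column strongly correlated with an earlier kept one.

    The 0.95 threshold is an assumption for this sketch.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

columns = {
    "total_time": [10, 20, 30, 40],
    "total_amount": [11, 21, 31, 41],   # near-duplicate of total_time
    "pages_viewed": [3, 1, 4, 1],
}
print(drop_correlated(columns))
```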
The dataset for user behavior analysis consists of 211,639 unique rows, each representing a unique user navigation session. First, the entropy of the decision class and the information values of the individual conditional attributes were computed. The results are shown in Table 8, along with the cardinality of the value set for each conditional attribute.
The entropy values for most attributes were near 0.5. For several attributes, the entropy value was lower than 0.5, and for a few, it was less than 0.2. An explanation may be the distribution of values for these attributes, which was strongly unbalanced: in most cases, the value of the attribute was equal to zero, only occasionally taking different values. Examples of such attributes are discount_code and new_user. For the other attributes, the values of entropy were similar, indicating that most attributes carried an equivalent level of information. Intuitively, it seems that some attributes should be more discriminatory, but the analysis of the results did not confirm this. There were no highly biased attributes in the analyzed dataset. Table 9 provides the classification results for the same dataset divided by each decision class value. The efficiency measures indicated relatively accurate results: PPV, TPR, and accuracy values were in the range of 0.89-1. However, both PPV and TPR were better for the decision class equal to "no". The results for the decision tree, Random Forest, and AdaBoost were similar, whereas the results obtained using the Bagging algorithm were visibly worse: its PPV value for the class "yes" was around 0.5. Again, the reason seems to be the uneven distribution of the values of the target class. Finally, we limited the attributes used in classification based on the analysis of the entropy value of each attribute. We selected the 25% most significant conditional attributes and performed the classification with this limited number of attributes. The classification results for the user websites navigation data by decision class values for the 25% of attributes with the lowest information values are presented in Table 10.
The accuracy results for the user websites navigation data are compared in Table 11. The number of attributes participating in the classification process was 31. After limiting the set of attributes to seven, the classifier efficiency increased, which may be counterintuitive. Depending on the classifier used, the improvement in efficiency ranges from 0% (decision tree) to 10% (Bagging). The presented analysis shows the importance of limiting the attributes at the data preprocessing stage and of classification parameterization.

Real Estate Market Data
The goal of the real estate market data experiment presented in this paper was to determine which attributes are crucial for AI model creation based on the presented decision table. To achieve this, the values of the information attributes were computed.
The dataset consisted of 14,344 unique rows. There were 13 conditional attributes (described earlier) and one decision (price bucket). In the first experiment, we computed the entropy of a decision class and the information of individual conditional attributes. The results are shown in Table 12, along with the cardinality of the value set for each attribute.
Because the data were obtained from actual advertisements, the cardinality of the decision class roughly followed a normal distribution (Figure 9). The most frequent price fell into the PLN 6000-7000 per square meter bucket. The far-right side of the histogram shows the luxury properties that are part of the dataset. Note that the property's region heavily influences the real estate market: a property located in the capital is far more expensive than the same property in a less wealthy part of the country. The overall decision entropy is relatively high, as the classification problem is rather difficult. Most of the attributes maintain a similar entropy value, the single exception being the property area. Given the cardinality of this attribute and the fact that the price of a property is usually heavily correlated with the location, this is to be expected. A surprising finding, however, is that the information value of area is still relatively high, which means that the price fluctuation between properties of similar area is also significant. We found no noticeable changes in information for attributes such as market_type or ownership_type, indicating that such features have secondary importance for the selling price.
All attributes except area obtained an information value close to the maximal entropy for the whole dataset. That means that no single conditional attribute was enough to predict the price bucket of a given property. Even the area conditional attribute, with a visibly lower information value equal to 2.07, was insufficient to correctly predict the price range. This agrees with intuition: a large but poorly located and unfurnished ruin might be cheaper than a downtown loft. Table 13 provides the classification results for the same dataset divided by each decision class value. The Bagging algorithm produced the best results by far in nearly every decision class, both in terms of PPV and TPR. When using the limited set of attributes, the results in Table 14 were obtained. Overall accuracy results were also superior with the Bagging algorithm (Table 15). Further research is required to determine whether precise fine-tuning of hyper-parameters would increase the quality of the results produced by the other algorithms.

Attribute Name | Value Count | Entropy
Price bucket (decision) | 16 | 3.579787

Figure 9. Histogram of cardinality of the decision set.

Sport Data
Three datasets with 3362 unique rows for Spain, 2674 for Germany, and 3359 for Italy were analyzed. There was a total of 26 conditional attributes with match_result as a decision class. In the first stage, we calculated the entropy of a decision class and the information attribute values. The results are shown in Table 16, along with the cardinality of the value set for each attribute.
In all three analyzed datasets, the information attribute values were relatively small. The lowest were for Goal difference T1 and Goal difference T2, oscillating between 1.38 and 1.40. The highest information attribute value was recorded for Season, followed by Round and Matches T1 (T2). For the remaining attributes, the values were similar. Table 16 also presents the entropy of the datasets, all of which are similar (1.52-1.55).
Of the selected methods, Random Forest had the highest accuracy, followed by the AdaBoost algorithm. The decision tree performed the worst in the classification. None of the algorithms provided a significant advantage in terms of the efficiency measures. A summary of the results is presented in Table 17.
Tests were also conducted using fewer attributes (from 24 to 6; 25% of the set based on the information attribute values). The obtained results are presented in Tables 18 and 19. As can be observed, similar results were obtained with a limited list of attributes; in some cases, the results obtained with the limited set were better. The best algorithms in this case were AdaBoost and Random Forest, whereas Bagging worked poorly. The results (Table 17) show a problem with the prediction of class X (draw), best exemplified by the complete lack of predictions by the Random Forest algorithm for the data from Germany and Spain; for the remaining cases, this class had poor results. The unbalanced values in the decision class may be the reason for this finding; note that a draw between teams seldom occurs.
The classification accuracy for the three sets and all selected algorithms oscillated between 51.55% and 55.85%, being higher than the random approach (for the three decision classes = 33.33%). The Random Forest algorithm achieved the highest classification accuracy on the Italy dataset and the lowest was achieved by Bagging on the Spain dataset. The exact results are presented in Table 18.

Financial Data Results
We used daily forex data in this study, meaning that every new value was obtained at the beginning of the daily market session. We selected four different currency pairs as separate datasets: AUDUSD, EURUSD, GBPUSD, and NZDUSD, each containing 2865 readings. We used six different oscillator indicators: the Bulls indicator (Bulls), Commodity Channel Index (CCI), DeMarker indicator (DM), Oscillator of Moving Average (OSMA), Relative Strength Index (RSI), and the stochastic oscillator. Additionally, the moving average (MA) indicator, calculated over 14 (MA14) and 50 (MA50) past readings, was included. In the results, we used both the MA indicator value and the absolute difference between two successive readings of the indicator. This gave a total of 10 attributes.
In Table 20, we present the entropy of the decision class along with the information attribute values for the four different datasets. First, there are no visible differences between the entropy values for the different datasets. However, a significant difference exists in the case of the trend-following indicators (the first four attributes, related to the MA indicator), which is most obvious for MA14 and MA50. Note that these attributes were not preprocessed and were used as-is. The small entropy values suggest strong predictive power for these indicators; however, their practical usability is lower due to the large number of different attribute values (in comparison to oscillator indicators such as RSI).
In the case of the oscillators, the information attribute values instead remained at a similar level, and it would not be easy to identify the best (in the sense of information) indicators. However, it is easy to find many articles confirming that the indicators' predictive capabilities are similar. Table 21 presents the results of classification based on the PPV and TPR measures for the complete set of attributes available in the dataset. The decision class values were highly unbalanced, and in some cases, values such as BUY or SELL did not occur even once. In other cases (such as the GBPUSD dataset), the results were of poor quality because the STRONG BUY or STRONG SELL decision was observed for most objects. In general, the AdaBoost algorithm was slightly better than the Bagging algorithm for the rare buying or selling values. For the remaining cases, all four algorithms achieved similar results, oscillating between 30% and 40%. Lower results in some cases (such as STRONG BUY for the EURUSD dataset) could be related to the market situation and the overall advantage of the bearish trend. Next, we performed the classification once again on the limited set of attributes. The results are presented in Table 22. For both measures (PPV and TPR), the quality of classification slightly worsened. However, the results improved for some rare cases (for example, EURUSD and GBPUSD with the TPR measure). This was achieved despite considerably reducing the number of conditional attributes included in the classification process. Eventually, we analyzed the classical accuracy measure for two cases: the full set of conditional attributes and the limited set. These results are presented in Table 23. Surprisingly, the results do not indicate that the full set of attributes yields the highest accuracy values.
These results are ambiguous; in some cases (AUDUSD or EURUSD with the Bagging algorithm), accuracy was higher using the limited number of attributes.
These observations were also confirmed for the remaining sets. Thus, it can be assumed that some core set of attributes can provide a relatively accurate classification. However, the dependencies between these attributes are more sophisticated than simple linear correlations.

Attributes Selection and the Sensitivity Analysis
To test and evaluate our entropy-based attribute selection, we used the well-known correlation-based feature selection (CFS) method implemented in the WEKA system [65]. As a result, a subset of attributes including the essential elements was selected; a comparison of the number of attributes obtained by our method and by the WEKA system can be found in Table 24. As can be noted, for most cases, the number of attributes in our approach is smaller than the number selected by the CFS method. Only for the User Websites Navigation Data did the CFS method select fewer attributes: five instead of seven (out of 31 possible). In the case of the financial data, the number of attributes was the same for both methods. For the remaining datasets, our proposed method allowed us to use a smaller number of attributes; the extreme case is the Real Estate Market Data, where CFS indicated nine attributes instead of three (out of 15).
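CFS scores attribute subsets with a merit function that rewards correlation with the class and penalizes correlation between attributes (Hall, 1999); a minimal sketch of that trade-off, using illustrative correlation values:

```python
from math import sqrt

def cfs_merit(k, avg_feat_class_corr, avg_feat_feat_corr):
    """CFS subset merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    k is the subset size, r_cf the mean feature-class correlation, and
    r_ff the mean feature-feature correlation.
    """
    return (k * avg_feat_class_corr) / sqrt(k + k * (k - 1) * avg_feat_feat_corr)

# A small, non-redundant subset outscores a larger, inter-correlated one.
print(round(cfs_merit(3, 0.6, 0.1), 3), round(cfs_merit(5, 0.6, 0.8), 3))
```

This penalty for redundancy is why CFS, like the entropy-based selection, can settle on small attribute subsets.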

Dataset                         CFS   Proposed Approach   Original
Fake News Data                   8            5              20
User Websites Navigation Data    5            7              31
Real-Estate Market Data          9            3              15
Sport Data (Germany)             8            6              24
Financial Data (GBPUSD)          2            2              11

The smaller number of attributes resulting from the use of our method does not affect the overall quality of classification. The results of classification after the selection are presented in Table 25 (dataset names are written as acronyms). The table shows the difference in classification quality between the attribute set calculated with the CFS method and our proposed approach. As can be observed, despite the smaller number of attributes indicated by the proposed method, the classification quality is similar; the difference mostly does not exceed 0.3%. Only for the Random Forest method on the Real Estate Market Data is an overall improvement close to 1% observed; this is the case where the CFS method selected nine attributes instead of the three selected by our method. Similarly, for the Sport Data an improvement of around 1% is observed. The highest differences, favoring our proposed method, were observed for the Financial Data: in the case of the Random Forest and Bagging algorithms, the attribute selection worsened the results by over 2%. For the Financial Data, in both cases, the classification was performed based on two attributes.
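For context, WEKA's CfsSubsetEval scores a candidate subset with Hall's merit heuristic, which rewards features correlated with the class while penalizing features correlated with each other. The sketch below uses hypothetical average correlation values, not values measured on our datasets:

```python
import math

def cfs_merit(k, avg_feat_class_corr, avg_feat_feat_corr):
    """CFS merit of a k-feature subset (Hall, 1999):
    merit = k * r_cf / sqrt(k + k*(k-1) * r_ff)."""
    return (k * avg_feat_class_corr) / math.sqrt(
        k + k * (k - 1) * avg_feat_feat_corr
    )

# A compact, weakly intercorrelated subset can outscore a larger,
# more redundant one (hypothetical correlations):
print(cfs_merit(3, 0.5, 0.1))
print(cfs_merit(9, 0.5, 0.4))
```

This illustrates why a compact subset (such as the three Real Estate Market attributes indicated by our method) can remain competitive with a considerably larger one.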

(Table 25: rows are the datasets; columns report results for the Decision Tree, Random Forest, Bagging, and AdaBoost algorithms.)
In the proposed method, we used a threshold of 25% of attributes included in the classification. This threshold was chosen to evaluate whether a small subset of attributes allows maintaining relatively high classification quality; the attributes were selected as the most important from the point of view of the entropy measure. The threshold was set experimentally, based on several different indicators. Going below 25% could limit the subset to two attributes, or even a single one, for the analyzed data, while including many more attributes led to a visible decrease in classification quality. An example chart for the Sport Data (Germany) is presented in Figure 10, where the quality of classification (the Y-axis) is plotted against the number of attributes (the X-axis). The vertical line marks the 25% of attributes used in this article.
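The thresholding step itself is straightforward; a minimal sketch with hypothetical attribute scores follows. For a 24-attribute dataset such as the Sport Data (Germany), a 25% threshold keeps six attributes, consistent with Table 24:

```python
import math

def top_quarter(scores):
    """Keep the best 25% of attributes (at least one), given a dict
    mapping attribute name -> information score."""
    k = max(1, math.ceil(0.25 * len(scores)))
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:k]

# Hypothetical decreasing scores for a 24-attribute dataset:
scores = {f"a{i}": 1.0 / (i + 1) for i in range(24)}
print(top_quarter(scores))  # ['a0', 'a1', 'a2', 'a3', 'a4', 'a5']
```

The `max(1, ...)` guard reflects the observation above: for small attribute sets, a lower threshold would leave too few (or even a single) attribute to classify with.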

Conclusions and Future Works
In this study, we investigated the possibilities of using the entropy measure to select the best set of conditional attributes for a classification problem. The general idea of entropy, related works, and the problem background were introduced in the first part of the article. We also selected real-world data covering different fields. These data were retrieved and described with the help of domain experts. Finally, preprocessing was applied to all datasets, which were transformed into decision tables.
The datasets differed in their complexity, number of objects, number of conditional attributes, and number of decision classes. Our goal was to calculate the entropy of the decision classes and the information attribute values. Furthermore, we performed the classification with a set of well-known, state-of-the-art algorithms. To estimate the quality of classification, we used the recall, precision, and accuracy measures. After the initial results, we selected the 25% best attributes (those with the best information attribute values) and performed the classification on this limited set.
For most cases, the algorithms obtained similar results. However, there were some examples, such as the real estate dataset, in which Random Forest produced better results using only the limited attribute set, while the Bagging algorithm showed slightly lower classification accuracy. Due to its random nature, as the name implies, each run of the Random Forest algorithm provides similar but not identical results. The hyperparameters of Random Forest are the most amenable to fine-tuning, but optimizing the parameters of each algorithm for each dataset was beyond the scope of this study. Notably, the value of real estate cannot be classified using the significance of attributes alone; emotions and non-technical factors must also be considered. For instance, we were unable to quantify the "cool" factor of a given property.
For the remaining datasets, the results were not uniform. It was difficult to identify the attributes with the best information attribute values, as the differences in these values among the attributes of a single dataset were often negligible. Eventually, however, we were able to select a subset of attributes with which the classification procedure was performed once again. Surprisingly, the limited set of attributes often allowed obtaining similar classification results. Unfortunately, it was impossible to capture the complex, nonlinear relations among the conditional attributes within a single dataset.
For classification, we used classical algorithms considered state-of-the-art. However, a multicriteria efficiency measure based on different entropy types could provide much more useful information, especially for complex datasets without a uniform structure (such as big data). At the same time, we investigated entropy only in its basic form; an interesting extension could introduce different entropy measures or even derive estimates based on other entropy types.
In this article, we demonstrated some advantages over classical methods; however, the obtained results are not uniform. Therefore, our future work could extend the number of analyzed datasets and emphasize quantitative results rather than focusing on the description of every single dataset used in the experiments.