Application of Information Theory Entropy as a Cost Measure in Automatic Problem Solving †

We study the relation between Information Theory and Automatic Problem Solving to demonstrate that the entropy measure can be used as a special case of the $-Calculus cost function measure. We hypothesize that Kolmogorov Complexity (Algorithmic Entropy) can be useful for standardizing the $-Calculus search (algorithm) cost function.


Introduction
The field of Information Theory, founded by Claude Shannon in 1948 [1], is a branch of statistics that is essentially about uncertainty in communication. Shannon showed that uncertainty can be quantified, linking physical entropy to messages, and defined the entropy of a discrete random variable X as Entropy(X) = −∑i p(i) log2 p(i). A key result of Shannon entropy is that −log2 p(i) gives the length in bits of the optimal prefix code (e.g., Huffman code) for a message i. Similarly to the probabilistic case, conditional entropy has been defined to express mutual information, and entropy has been extended to continuous random variables.
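As a minimal illustration of these two definitions, the following sketch computes the entropy of a discrete distribution and the optimal prefix-code length −log2 p(i) of a single message (all names are ours, chosen for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy Entropy(X) = -sum_i p(i) * log2 p(i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly 1 bit of uncertainty per toss.
print(entropy([0.5, 0.5]))   # 1.0

# A message with probability 1/8 gets a 3-bit optimal prefix code.
print(-math.log2(1 / 8))     # 3.0
```

Note that a Huffman code attains this −log2 p(i) length exactly only when the probabilities are powers of 1/2, as in the example above.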
It appears that Information Theory can be applied to practically anything, including coding theory, communication, data mining, machine learning, physics, and bioinformatics. In this paper, we investigate the relations between Information Theory and Automatic Problem Solving.
Universal problem solving methods are part of AI and Theoretical Computer Science [2][3][4][5]. However, automatic problem solving requires the construction of a universal algorithm, and this is an unsolvable problem for Turing machines. The importance of automatic problem solving methods is so tremendous that even partial solutions are very desirable:


The never-ending dream of universal problem solving methods has resurfaced throughout the history of computer science. For such an approximation, models of computation more expressive than Turing machines are needed. They are called super-Turing or hypercomputational models of computation, or superrecursive algorithms [3,6].


The kΩ-optimization meta-search represents this "impossible to construct" but "possible to approximate indefinitely" universal algorithm, i.e., it approximates the universal algorithm. It is a very general search method that can simulate many other search algorithms, including A*, minimax, dynamic programming, tabu search, and evolutionary algorithms.

$-Calculus Syntax:
 Simplicity: everything is a $-expression; an open system; prefix notation with a potentially infinite number of arguments; $-expressions are either simple or complex.


Search can be offline (n = 0: the complete solution is computed first and executed afterwards without perception) or online (n ≠ 0: action execution and computation are interleaved).

On Problem Solving as an Instance of Multiobjective Minimization:
Given an objective/cost function $: A × X → R, where A is an algorithm operating on its input X and R is the set of real numbers, problem solving can be understood as a multiobjective (total) minimization problem: find a* ∈ AF and x* ∈ XF, where AF ⊆ A are the terminal states of the algorithm and XF ⊆ X are the terminal states of X, such that $(a*, x*) = min{$1($2(a), $3(x)) : a ∈ A, x ∈ X}, where $3 is a problem-specific cost function, $2 is a search algorithm cost function, and $1 is an aggregating function combining $2 and $3.
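This total-minimization view can be sketched as a search over candidate (algorithm, solution) pairs, with $2 pricing the search effort, $3 pricing solution quality, and $1 aggregating the two. All names, costs, and candidates below are hypothetical, chosen only to make the formula concrete:

```python
# Hypothetical sketch of problem solving as total (multiobjective) minimization:
# search_cost plays the role of $2, quality_cost of $3, aggregate of $1.

def total_minimize(candidates, search_cost, quality_cost, aggregate):
    """Return the (a, x) pair minimizing $1($2(a), $3(x)) over all candidates."""
    return min(candidates,
               key=lambda ax: aggregate(search_cost(ax[0]), quality_cost(ax[1])))

# Toy instance: a cheap greedy search with a worse solution vs. an
# expensive exhaustive search with a better one.
candidates = [("greedy", 0.4), ("exhaustive", 0.1)]
search_cost = {"greedy": 1.0, "exhaustive": 10.0}.__getitem__
quality_cost = lambda err: err
aggregate = lambda c2, c3: c2 + c3   # a weighted sum; Pareto optimality would keep them separate

print(total_minimize(candidates, search_cost, quality_cost, aggregate))
# ('greedy', 0.4): 1.0 + 0.4 beats 10.0 + 0.1
```

With $1 as a sum, the cheap search wins here; keeping $2 and $3 separate (the Pareto case mentioned below) would instead report both candidates as incomparable.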


If $1 becomes an identity function, we obtain Pareto optimality, keeping the objectives separate.


For optimization (best quality solutions), $2 is fixed and only $3 is used.


For search optimization (minimal search costs), $3 is fixed and only $2 is used.


For total optimization (best quality solutions with minimal search costs), $1, $2, and $3 are all used. kΩ-optimization (meta-search) is a very general search method that builds dynamically optimal or "satisficing" plans of actions from atomic and complex $-expressions. The kΩ-meta-search is controlled by parameters including n, the depth of execution, and Ω, the alphabet for optimization. kΩ-optimization works by iterating through three phases: select, examine, and execute.


It is a very flexible and powerful method that combines the best of both worlds: deliberative agents for flexibility, and reactive agents for robustness.


The "best" programs are those with minimal cost; each statement in the language has an associated cost $ (this leads to a new paradigm: cost languages).

Cost Performance Measures and Standard Cost Function:
 $-calculus is built around the central notion of cost. Cost functions represent a uniform criterion of search and of the quality of solutions in problem solving. Cost functions have their roots in von Neumann/Morgenstern utility theory, and they satisfy the axioms for utilities [5].
In decision theory, they allow choosing states with optimal utilities on average (the maximum expected utility principle).


In $-calculus, they allow choosing states with minimal costs subject to uncertainty (expressed by probabilities, or by fuzzy set or rough set membership functions).


It is not clear whether it is possible to define a minimal and complete set of cost functions, i.e., one usable "for anything". $-calculus approximates this desire for universality by defining a standard cost function usable for many things (but not for everything; thus a user may define their own cost functions).

Application of Information Theory Entropy as an Instance of $-Calculus Cost Measure
Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees, i.e., the ID3 algorithm by Quinlan and its successor C4.5 [5]. ID3 performs a simple hill-climbing search through the hypothesis space of possible decision trees using the information gain as an evaluation function:

Gain(S, A) = Entropy(S) − ∑v ∈ Values(A) (|Sv|/|S|) Entropy(Sv), where Entropy(S) = −∑i pi log2 pi.
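The gain formula can be implemented directly; the sketch below (our own illustration, with a made-up four-example dataset over attributes a1 and a2 and a binary class c) evaluates both terms:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class frequencies in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute, label):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv)."""
    total = entropy([e[label] for e in examples])
    n = len(examples)
    for v in {e[attribute] for e in examples}:
        sv = [e[label] for e in examples if e[attribute] == v]
        total -= (len(sv) / n) * entropy(sv)
    return total

# Made-up dataset: a1 determines the class perfectly, a2 is pure noise.
data = [
    {"a1": "a11", "a2": "a21", "c": "+"},
    {"a1": "a11", "a2": "a22", "c": "+"},
    {"a1": "a12", "a2": "a21", "c": "-"},
    {"a1": "a12", "a2": "a22", "c": "-"},
]
print(gain(data, "a1", "c"))   # 1.0: splitting on a1 removes all uncertainty
print(gain(data, "a2", "c"))   # 0.0: splitting on a2 is uninformative
```

ID3 would select a1 here, the attribute with maximal gain, which in the cost formulation below corresponds to the action with minimal (most negative) cost.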
Let us consider problem solving (learning + classification) for ID3 expressed as a special case of kΩ-search finding the shortest classification tree by minimizing the sum of negative gains, i.e., maximizing the sum of positive gains. The system consists of one agent, i.e., p = 1, which is interested only in information gain for the alphabet A = {ai, aij}, i, j = 1, 2, i.e., Ω = A; the costs of other actions are ignored (being neutral, in this case having 0 cost). The agent uses the standard cost function $ = $3, where $3 represents the quality of solutions in the form of cumulative negative information gains, the payoff in $-calculus. In other words, total optimization is not performed, only regular optimization as in the original ID3. A weak congruence is used; in other words, empty actions have zero cost. The number of steps in the derivation tree selected for optimization in the examine phase is k = 2, the branching factor is b = ∞, and the number of steps selected for execution in the examine phase is n = 0, i.e., execution is postponed until learning is over. Flags gp = reinf = update = strongcong = 0. The goal of learning/classification is to minimize the sum of negative information gains. The machine learning takes the form of a tree of $-expressions that are built in the select phase, pruned in the examine phase, and passed to the execute phase for classification work. Data are split into training and test data as usual.
Let us assume for simplicity that we have only one decision attribute and two input attributes a1 and a2, with data taking two possible values on them, denoted by a11, a12, a21, a22. Let us assume that the cost of an action equals the entropy of the data associated with this action, i.e., $(ai) = Entropy(ai), $(aij) = Entropy(aij), i, j = 1, 2.
OPTIMIZATION: The goal will be to minimize the sum of costs (negative gains).
0. t = 0, initialization phase init: S0 = ε0: The initial tree consists of an empty action ε0 representing a missing classification tree whose cost is ignored (a weak congruence). Because S0 is not the goal state, the first loop iteration, consisting of the select, examine, and execute phases, replaces the invisible ε0 two steps deep (k = 2) by all offspring (b = ∞).
Let us assume that attribute a1 was selected, i.e., the $-expression starting from a1 is cheaper. Note that due to the appropriate definition of the standard cost function [2,3], this is the negative gain from ID3.
Note that no estimates of future solutions are used (weak congruence, greedy hill-climbing search). Execution is postponed (n = 0), and the follow-up ε11 and ε12 will be selected for expansion in the next loop iteration. Let us assume that ε22 has data from one class only; thus this is a leaf node, and no further splitting of training data is required. examine phase exam: Nothing to optimize/prune: either all attributes were used in the path or the leaf node contained sample data from one class of the decision attribute. This is the end of the learning phase, and the shortest decision tree is designated for execution: execute phase exec: Test data are classified by the decision tree left from the select/examine phases. After that, the kΩ-search re-initializes for a new problem to solve.
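The select-phase choice described above can be sketched as a cost comparison: each attribute's action cost is its negative information gain, so picking the minimal-cost $-expression is the same greedy decision ID3 makes by maximizing gain. The function name and gain values below are hypothetical, for illustration only:

```python
# Hypothetical sketch of the select phase of the kOmega-search simulating ID3:
# the cost of expanding on attribute a is -Gain(S, a), so minimizing the
# summed cost reproduces ID3's greedy max-gain hill climbing.

def select_cheapest(attributes, gain_of):
    """Pick the attribute whose cost $(a) = -Gain(S, a) is minimal."""
    return min(attributes, key=lambda a: -gain_of[a])

gains = {"a1": 0.971, "a2": 0.029}   # made-up gains for the two attributes
print(select_cheapest(["a1", "a2"], gains))   # a1: cost -0.971 < cost -0.029
```

Because only costs already incurred are compared (no estimates of future splits), this matches the weak-congruence, greedy hill-climbing behavior noted above.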
Note that if we change, for example, the values of k (considering a few attributes in parallel), b, or n, or switch from optimization to total optimization, the result will be related to, but no longer be, the ID3 algorithm. This is the biggest advantage and flexibility of $-calculus automatic problem solving: it can modify existing algorithms "on the fly" and design new algorithms, not merely simulate ID3.

Conclusions and Future Work
All we were able to demonstrate so far in this paper is that entropy can be used as a special case of a $-calculus cost function; however, it cannot replace all instances of cost functions. One of the main unsolved problems of $-calculus is the axiomatization of cost functions in the style of Kolmogorov's axiomatization of probability theory [5] (this might be an undecidable problem), or estimating how good the approximation of all cost functions by the $-calculus standard cost function is. We hypothesize that Kolmogorov complexity, also known as algorithmic entropy [7], can be used to standardize the $-calculus search (algorithm) cost function $2, but this is left for future work.