Using Probabilistic Models for Data Compression

Abstract: Our research objective is to improve Huffman coding efficiency by adjusting the data using a Poisson distribution, which also avoids undefined entropies. The scientific value added by our paper consists in minimizing the average length of the code words, which is greater when the Poisson distribution is not applied. Huffman Coding is an error-free compression method, designed to remove the coding redundancy by yielding the smallest number of code symbols per source symbol, which in practice can be represented by the intensity of an image or the output of a mapping operation. We use the images from the PASCAL Visual Object Classes (VOC) data sets to evaluate our methods. In our work we use 10,102 randomly chosen images, half of them for training and the other half for testing. The VOC data sets display significant variability regarding object size, orientation, pose, illumination, position and occlusion. They comprise 20 object classes: aeroplane, bicycle, bird, boat, bottle, bus, car, motorbike, train, sofa, table, chair, tv/monitor, potted plant, person, cat, cow, dog, horse and sheep. The descriptors of different objects can be compared to give a measurement of their similarity. Image similarity is an important concept in many applications. This paper focuses on measuring similarity in the computer science domain, more specifically in information retrieval and data mining. Our approach uses 64 descriptors for each image belonging to the training and test sets; therefore the number of symbols is 64. The data of our information source differ from those of a finite-memory (Markov) source, whose output depends on a finite number of previous outputs. When dealing with large volumes of data, an effective approach to increasing Information Retrieval speed is to use Neural Networks as an artificial intelligence technique.


Introduction
The assessment of similarity or distance between two information entities is crucial for all information discovery tasks (whether Information Retrieval or Data mining). Appropriate measures are required for improving the quality of information selection and also for reducing the time and processing costs [1].
Even if the concept of similarity originates from philosophy and psychology, its relevance arises in almost every scientific field [2]. This paper is focused on measuring similarity in the computer science domain, i.e., Information Retrieval, in images, video and, to some extent, audio. In this domain, a similarity measure "is an algorithm that determines the degree of agreement between entities" [1].
The approaches for computing similarity or dissimilarity between various object representations can be classified [2] into:
(1) Distance-based similarity measures. This class includes the following models: Euclidean distance, Minkowski distance, Mahalanobis distance, Hamming distance, Manhattan/City block distance, Chebyshev distance, Jaccard distance, Levenshtein distance, Dice's coefficient, cosine similarity and soundex distance.
(2) Feature-based similarity measures (the contrast model). This method, proposed by Tversky in 1977, represents an alternative to distance-based similarity measures and computes similarity from the common features of the compared entities. Entities are more similar if they share more common features, and more dissimilar if they have more distinctive features. The similarity between entities A and B can be determined using the formula S(A, B) = θ·f(A ∩ B) − α·f(A − B) − β·f(B − A), where f measures the salience of a feature set and θ, α, β ≥ 0 are weights.
(3) Probabilistic similarity measures. For assessing the relevance among some complex data types, the following probabilistic similarity measures are used: maximum likelihood estimation and maximum a posteriori estimation.
(4) Extended/additional measures. This class includes similarity measures based on fuzzy set theory [3], similarity measures based on graph theory, similarity-based weighted nearest neighbors [4] and similarity-based neural networks [5].
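As an illustration of Tversky's contrast model described above, here is a minimal Python sketch; the weights θ, α, β and the example feature sets are illustrative choices, not values from the paper:

```python
def tversky_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5):
    """Tversky's contrast model: similarity grows with the shared
    features of A and B and shrinks with their distinctive features."""
    a, b = set(a), set(b)
    common = len(a & b)      # f(A intersect B)
    only_a = len(a - b)      # f(A - B)
    only_b = len(b - a)      # f(B - A)
    return theta * common - alpha * only_a - beta * only_b

# Entities sharing more features score higher.
print(tversky_similarity({"red", "round", "small"},
                         {"red", "round", "large"}))  # 2 - 0.5 - 0.5 = 1.0
```

Here f is simply the set cardinality; any salience function over feature sets could be substituted.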
Artificial Neural Networks (ANNs) are widely used in various fields of engineering and science. They provide useful tools for quantitative analysis, due to their unique feature of approximating complex and non-linear equations. Their performance and advantage consist in their ability to model both linear and non-linear relationships.
The Artificial Neural Networks are well-suited for a very broad class of nonlinear approximations and mappings. An Artificial Neural Network (ANN) with nonlinear activation functions is more effective than linear regression models in dealing with nonlinear relationships.
ANNs are regarded as one of the important components in AI. They have been studied [8] for many years with the goal of achieving human-like performance in many branches, such as classification, clustering, and pattern recognition, speech and image recognition and Information Retrieval [9] by modelling the human neural system. IR "is different from data retrieval in databases using SQL queries because the data in databases are highly structured and stored in relational tables, while information in text is unstructured. There is no structured query language like SQL for text retrieval." [10].
Gonzalez and Woods [11] present Huffman Coding as a method to remove coding redundancy, by yielding the smallest number of code symbols per source symbol.
Burger and Burge [12], as well as Webb [13], have described several algorithms for Image Compression using the Discrete Cosine Transformation.
The main objective of this paper consists in improving the performance of the Huffman Coding algorithm by achieving a minimum average length of the code word. Our new approach is important for removing more effectively the Coding Redundancy in Digital Image Processing.
The remainder of the paper is organized as follows. In Section 2 we discuss some general aspects about the Poisson distribution and the data compression.
Then, in Section 3 we introduce and analyze an approach for reducing the document space, namely the Discrete Cosine Transformation, in order to achieve the Latent Semantic Model.
We follow with the Fourier descriptor method in Section 4 to describe the shape of an object by considering its boundaries.
We define some notions from the Information Theory in Section 5 as they are useful to model the information generation like a probabilistic process.
Section 6 presents the Huffman Coding, which is built to remove the coding redundancy and to find the optimal code for an alphabet of symbols.
In Section 7 we introduce an experimental evaluation of the new model on the tasks of computing the image entropy, the average length of the code words and the Huffman coding efficiency.
We conclude in Section 8 by highlighting that the scientific value added by our paper consists in computing the average length of the code words when the Poisson distribution is applied.

Discrete Random Variables and Distributions
By definition [14], a random variable X and its distribution are discrete if X takes finitely many or, at most, countably many values x_1, x_2, x_3, . . ., with the probabilities p_1 = P(X = x_1), p_2 = P(X = x_2), p_3 = P(X = x_3), . . . .

The Poisson Distribution
The Poisson distribution (named after S. D. Poisson) is the discrete distribution which has infinitely many possible values and the probability function P(X = k) = e^(−λ) · λ^k / k!, k = 0, 1, 2, . . . . The Poisson distribution is related to the binomial distribution in that it is obtained as a limiting case of that distribution, for n → ∞ and p → 0, with the product np = λ > 0 kept constant. As it is used for rare occurrences of an event, the Poisson distribution is also called the distribution of rare events: it models the number of successes in a long sequence of independent Bernoulli trials, each with a small success probability.
This distribution is frequently encountered in the study of some phenomena in biology, telecommunications, statistical quality control (when the probability of obtaining a defect is very small), in the study of phenomena that present some agglomerations (in the theory of threads of waiting).
In this limiting case of the binomial scheme, the probability that k of the n drawn balls are white tends to the expression in the Equation (2) [15]. The simulation of X ∼ Po(λ) can be achieved [16,17]: using the Matlab generator, using the inverse transform method, or using a binomial distribution.
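Of the simulation methods listed above, the inverse transform method can be sketched in Python (the paper's generators are in Matlab; this is an illustrative port with λ = 3.0 as an example value):

```python
import math
import random

def poisson_inverse_transform(lam, rng=random.random):
    """Sample X ~ Po(lam) by inverting the CDF: accumulate the pmf
    P(X = k) = exp(-lam) * lam**k / k! until it exceeds a uniform draw."""
    u = rng()
    k = 0
    p = math.exp(-lam)          # P(X = 0)
    cdf = p
    while u > cdf and p > 0:    # p > 0 guards against float underflow
        k += 1
        p *= lam / k            # recurrence: P(X = k) = P(X = k-1) * lam / k
        cdf += p
    return k

random.seed(0)
sample = [poisson_inverse_transform(3.0) for _ in range(10000)]
print(sum(sample) / len(sample))  # sample mean, close to lam = 3
```

The recurrence avoids recomputing factorials, and the sample mean and variance can be compared with the theoretical value λ, as done for Table 1.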
In [17] we generated a selection of volume 1000 on the random variable X, having some continuous distributions (such as normal or exponential distribution) or discrete distributions (as the geometric or Poisson distribution). The methods used in the generation of the random variables X are illustrated in the Table 1.
The means and the variances corresponding to the resulting selections are compared with the theoretical ones. In each case we build both the graphical histogram and the probability density of X. For example, the Figure 1 shows the histograms built for X ∼ Geom(p), using three methods (Matlab generator, inverse transform method, counting the failures), together with the probability density of X. The statistics associated with the concordance tests are also computed in the Table 1.
The data represent the way in which the information is transmitted, such that different amounts of data can be used to represent the same quantity of information.
For example, if n_1 and n_2 are the numbers of bits in two representations of the same information, then the relative data redundancy R_D ∈ (−∞, 1) of the representation with n_1 bits can be defined as:

R_D = 1 − 1/C_R,

where C_R ∈ (0, ∞) signifies the compression ratio and has the expression:

C_R = n_1 / n_2.

There are the following three types of redundancies [11] in Digital Image Processing:
(A) Coding Redundancy. To remove this kind of redundancy, it is necessary to evaluate the optimality of the information coding by the average length of the code words (L_avg):

L_avg = ∑_{k=0}^{L−1} l(r_k) · p_r(r_k),

where: L is the number of intensity values associated with an M × N image; MNL_avg bits are necessary to represent the respective M × N image; the discrete random variable r_k ∈ [0, L − 1] represents the intensities of that M × N image; n_k is the absolute frequency of the kth intensity r_k; l(r_k), k ∈ 0, L − 1, is the number of bits used to represent each value of r_k; p_r(r_k) = n_k / (MN) is the probability of occurrence of the value r_k.
(B) Interpixel Redundancy. This refers to reducing the redundancy associated with spatially and temporally correlated pixels through mappings such as run-lengths, differences between adjacent pixels and so on; a reversible mapping implies reconstruction without error.
(C) Psychovisual Redundancy. This occurs when certain information has relatively little importance for the perception of the image quality; it differs from the Coding Redundancy and the Interpixel Redundancy in that it is associated with real information, which can be removed by a quantization method.
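The coding-redundancy quantities L_avg, C_R and R_D can be sketched in Python on a toy image; the 2 × 2 image, the two intensity levels and the per-symbol code lengths below are illustrative, not data from the paper:

```python
from collections import Counter

def average_code_length(intensities, code_lengths):
    """L_avg = sum_k l(r_k) * p_r(r_k), where p_r(r_k) = n_k / (M*N)."""
    n = len(intensities)                     # M*N pixels
    freq = Counter(intensities)              # absolute frequencies n_k
    return sum(code_lengths[r] * (nk / n) for r, nk in freq.items())

# Toy 2x2 "image" with 2 intensity levels coded with 1 and 2 bits.
pixels = [0, 0, 0, 1]
l_avg = average_code_length(pixels, {0: 1, 1: 2})
c_r = 2 / l_avg        # compression ratio vs. a fixed 2-bit code
r_d = 1 - 1 / c_r      # relative data redundancy
print(l_avg, c_r, r_d)
```

Here the frequent intensity gets the short code word, so L_avg drops below the fixed-length baseline and the redundancy R_D is positive.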

Discrete Cosine Transformation
The appearance based approach constitutes [2] one of the various approaches used to select the features from an image, by retaining the most important information of the image and rejecting the redundant information. This class includes Principal Component Analysis (PCA), Discrete Cosine Transformation (DCT) [29], Independent Component Analysis (ICA) and Linear Discriminant Analysis (LDA).
In case of large document collections, the high dimension of the vector space matrix F generates problems in the text document set representation and induces high computing complexity in Information Retrieval.
The most often used methods for reducing the text document space dimension which have been applied in Information Retrieval are: Singular Value Decomposition (SVD) and PCA.
Our approach is based on using the Discrete Cosine Transformation for reducing the text documents. Thus, the set of keywords is reduced to the much smaller feature set. The resulting model represents the Latent Semantic Model.
The DCT [30] represents an orthogonal transformation, similar to the PCA. The elements of the transformation matrix C = (c_ij) are obtained using the following formula:

c_0j = sqrt(1/n), j = 0, n − 1,
c_ij = sqrt(2/n) · cos((2j + 1)iπ / (2n)), i = 1, n − 1, j = 0, n − 1,

n being the size of the transformation. The DCT requires the transformation of the n-dimensional vectors X_p, p = 1, N (where N denotes the number of vectors that must be transformed), into the vectors Y_p, (∀) p = 1, N, using the formula:

Y_p = C · X_p,

C = (c_ij), i, j = 0, n − 1, meaning the transformation matrix.
We have to choose, among all the components of the vectors Y_p, p = 1, N, a number of m components, corresponding to the positions whose mean squares belong to the first m mean squares in descending order, while the other n − m components will be cancelled.
The vector Ŷ_p, p = 1, N, is defined through the formula (8). The mean square of the jth transformed components is given by λ_j = (1/N) ∑_{p=1}^{N} y_pj², where y_pj denotes the jth component of Y_p. The DCT application consists in determining the vectors Ŷ_p, p = 1, N, corresponding to the m components of the vectors Y_p, p = 1, N, that are not cancelled.
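The transformation matrix and the transform Y_p = C · X_p can be sketched in pure Python; the size n = 4 and the input vector are illustrative (the paper's implementation is in Matlab):

```python
import math

def dct_matrix(n):
    """Orthogonal DCT-II matrix: row 0 holds sqrt(1/n); rows i > 0 hold
    sqrt(2/n) * cos((2j + 1) * i * pi / (2n))."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
            c[i][j] = scale * math.cos((2 * j + 1) * i * math.pi / (2 * n))
    return c

def transform(c, x):
    """Y = C X for one n-dimensional vector."""
    return [sum(c[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

C = dct_matrix(4)
y = transform(C, [1.0, 2.0, 3.0, 4.0])
# Because C is orthogonal, the transform preserves energy: 1+4+9+16 = 30.
print(round(sum(v * v for v in y), 6))
```

The orthogonality is what makes the inverse transform trivial (X_p = C^t · Y_p) and keeps the mean squares of the components directly comparable when selecting the m retained positions.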

Image Compression Algorithm Using Discrete Cosine Transformation
Digital image processing represents [11][12][13]31,32] a succession of hardware and software processing steps, as well as the implementation of several theoretical methods.
The first step of this process involves the image acquisition. It requires an image sensor for achieving a two-dimensional image, such as a video camera (for example, the Pinhole Camera Model, one of the simplest camera models).
The analog signal (which is continuous in time and values) [33], resulting at the output of the video camera, must be converted into a digital signal for processing on a computer. This transformation involves the following steps [12]: Step 1 (Spatial sampling). This step aims to achieve the spatial sampling of the continuous light distribution. The spatial sampling of an image represents the conversion of the continuous signal to its discrete representation and it depends on the geometry of the sensor elements associated with the acquisition device.
Step 2 (Temporal sampling). In this stage, the resulting discrete function is sampled in the time domain to create a single image. The temporal sampling is achieved by measuring at regular intervals the amount of light incident on each individual sensor element. Step 3 (Quantization of pixel values). The pixel values are described by binary words of length k (which defines the depth of the image); therefore, a pixel can represent any of 2^k different values.
As an illustration, the pixels of grayscale images take integer values in the range [0, 2^k − 1]. The result of performing the three steps Step 1-Step 3 is a "description of the image in the form of a two-dimensional, ordered matrix of integers" [12], illustrated in the Figure 2. The CASIA Iris Image Database Ver 3.0 (CASIA-IrisV3 for short) contains three subsets, totalling 22,051 iris images of more than 700 subjects. Figure 3 displays a coordinate system for image processing, which is flipped in the vertical direction, such that the origin, defined by u = 0 and v = 0, lies in the upper left corner.
After achieving the digital image, it is necessary to preprocess it in order to improve it; we can mention some examples of preprocessing image techniques:

1. image enhancement, which assumes the transformation of the images for highlighting some hidden or obscure details, interest features, etc.;
2. image compression, performed for reducing the amount of data needed to represent a given amount of information;
3. image restoration, which aims to correct the errors that appear at image capture.
Among the different methods for image compression, the DCT "achieves a good compromise between the ability of information compacting and the computational complexity" [12]. Another advantage of using the DCT in image compression is that it does not depend on the input data.
The DCT algorithm is used for the compression of a 256 × 256 matrix of integers X = (x_ij), i, j = 1, 256, where x_ij ∈ {0, 1, . . . , 255} are the original pixel values. The Algorithm 1 consists in performing the following steps [2]:
Step 1 Split the initial image into 8 × 8 pixel blocks (1024 image blocks).
Step 2 Process each block by applying the DCT, using the relation (8).
Step 3 Retain the first nine coefficients of each transformed block in a zigzag fashion and cancel the rest of the (64 − 9) coefficients (by making them equal to 0). This stage is illustrated in Figure 6.
Step 4 Apply the inverse DCT to each of the 1024 blocks resulting from the previous step.
Step 5 Achieve the compressed image, represented by the matrix X̂ = (x̂_ij), i, j = 1, 256, where x̂_ij denotes the encoded pixel values; the pixel values are then converted into integer values.
Step 6 Evaluate the performance of the DCT compression algorithm in terms of the Peak Signal-to-Noise Ratio (PSNR), given by [34,35]:

PSNR = 10 · log10(255² / MSE),

where the Mean Squared Error (MSE) is defined as follows [34,35]:

MSE = (1 / (N × N)) · ∑_{i=1}^{N} ∑_{j=1}^{N} (x_ij − x̂_ij)²,

N × N being the total number of pixels in the image (in our case N = 256).
We have performed the compression algorithm based on the DCT, using the Lena.bmp image [2], which has 256 × 256 pixels and 256 levels of grey; it is represented in the Figure 4. The Table 2 and Figure 5 display the experimental results obtained by implementing the DCT compression algorithm in Matlab.
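The evaluation quantities of Step 6 can be sketched in Python; the 2 × 2 image pair below is a toy example, not data from the Lena experiment:

```python
import math

def mse(original, compressed):
    """Mean Squared Error over an N x N image."""
    n = len(original)
    return sum((original[i][j] - compressed[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def psnr(original, compressed, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB; infinite for identical images."""
    e = mse(original, compressed)
    return float("inf") if e == 0 else 10.0 * math.log10(peak * peak / e)

# Toy 2 x 2 example: one pixel off by 4 grey levels.
x = [[100, 100], [100, 100]]
x_hat = [[100, 100], [100, 104]]
print(mse(x, x_hat), psnr(x, x_hat))  # MSE = 4.0, PSNR about 42.1 dB
```

A higher PSNR indicates a reconstruction closer to the original; cancelling more DCT coefficients in Step 3 lowers the PSNR.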

Fourier Descriptors
The descriptors of different objects can be compared for achieving [36] a measurement of their similarity [2].
The Fourier descriptors [37,38] show interesting properties in terms of the shape of an object, by considering its boundaries.
Let γ be [2,35] a closed pattern, oriented counterclockwise, described by the parametric representation: z(l) = (x(l), y(l)), where l denotes the length of a circular arc, along the curve γ, starting from an origin and 0 ≤ l < L, where L means the length of the boundary.
A point lying on the boundary generates the complex function u(l) = x(l) + iy(l). We note that u(l) is a periodic function, with period L.

Definition 2. The Fourier descriptors are the coefficients associated to the decomposition of the function u(l) in a complex Fourier series.
By using a similar approach of implementing a Fourier series for building a specific time signal, which consists of cosine/sine waves of different amplitude and frequency, "the Fourier descriptor method uses a series of circles with different sizes and frequencies to build up a two dimensional plot of a boundary" [36] corresponding to an object.
The Fourier descriptors are computed using the formula [2,35]:

a_n = (1/L) · ∫_0^L u(l) · e^(−i (2π/L) n l) dl,

such that u(l) = ∑_{n=−∞}^{∞} a_n · e^(i (2π/L) n l).
In the case of a polygonal contour, depicted in the Figure 7, an equivalent formula to (12) can be derived [2,35]: the boundary is traversed vertex by vertex (see the Figures 8 and 9), each coordinate pair being regarded as a complex number, and the integral in (12) is evaluated segment by segment; substituting the resulting expressions leads to a closed-form formula (relations (13)-(21)) for the Fourier descriptors of the polygon. The principal advantage of the Fourier Descriptor method for object recognition consists in the invariance to translation, rotation and scale displayed by the Fourier descriptors [36].
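In the discrete setting, the integral for a_n becomes a DFT over uniformly sampled boundary points. The following Python sketch (the sample contour and sampling are illustrative) computes the descriptors directly from the definition:

```python
import cmath

def fourier_descriptors(points):
    """Discrete Fourier descriptors of a closed boundary: treat each
    sample (x, y) as u = x + i*y and compute the DFT coefficients
    a_n = (1/N) * sum_l u(l) * exp(-i * 2*pi * n * l / N)."""
    n_pts = len(points)
    u = [complex(x, y) for x, y in points]
    return [sum(u[l] * cmath.exp(-2j * cmath.pi * n * l / n_pts)
                for l in range(n_pts)) / n_pts
            for n in range(n_pts)]

# Unit square, centred at the origin, sampled at its four corners.
square = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
descs = fourier_descriptors(square)
print(abs(descs[0]))  # a_0 is the centroid: 0 for this centred shape
```

Translation only affects a_0, rotation multiplies all coefficients by a unit complex number, and scaling multiplies their magnitudes uniformly, which is the source of the invariance properties mentioned above.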

Entropy and Information
The fundamental premise of Information Theory is that information generation can be modeled as a probabilistic process [39][40][41][42][43][44][45][46][47]. Hence, an event E which occurs with probability P(E) contains I(E) information units, where [11]:

I(E) = log(1 / P(E)) = −log P(E).

By convention, the base of the logarithm determines the unit used to measure the information. When the base is equal to two, the information unit is called a bit (binary digit).
Assuming a discrete set of symbols {a_1, . . . , a_m}, with the associated probabilities {P(a_1), . . . , P(a_m)}, the entropy of the discrete distribution is [11]:

H(z) = −∑_{j=1}^{m} P(a_j) · log P(a_j),

where z = (P(a_1), . . . , P(a_m))^t. We note that the entropy from the Equation (23) depends only on the probabilities of the symbols and measures the randomness or unpredictability of the respective symbols drawn from a given sequence; in fact, the entropy defines the average amount of information obtained by observing a single output of the primary source.
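The entropy of the Equation (23) can be sketched in a few lines of Python; the two example distributions are illustrative:

```python
import math

def entropy(probs, base=2):
    """H(z) = -sum_j P(a_j) * log P(a_j); terms with P = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair binary source: 1 bit/symbol
print(entropy([0.25] * 4))   # 4 equiprobable symbols: 2 bits/symbol
```

Equiprobable symbols maximize the entropy; any skew in the distribution lowers it, which is exactly what variable-length coding exploits.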
The information transferred to the receiver through an information transmission system is also a random discrete set of symbols {b_1, . . . , b_n}, with the corresponding probabilities {P(b_1), . . . , P(b_n)}, where [11]:

H(v) = −∑_{k=1}^{n} P(b_k) · log P(b_k).

The Equation (24) can be written in the matrix form [11], where v = (P(b_1), . . . , P(b_n))^t is the probability distribution of the output alphabet {b_1, . . . , b_n}, and the matrix Q = (q_kj), with q_kj = P(b_k | a_j), is associated with the information transmission system, such that v = Q · z. One can define [11]: (1) the conditional entropy:

H(z|v) = −∑_{j=1}^{m} ∑_{k=1}^{n} P(a_j, b_k) · log P(a_j | b_k),

where P(a_j, b_k) is the joint probability, namely the probability that b_k occurs at the same time as a_j; (2) the mutual information between z and v, which expresses the reduction of uncertainty about z due to the knowledge of v:

I(z; v) = H(z) − H(z|v),

namely:

I(z; v) = ∑_{j=1}^{m} ∑_{k=1}^{n} P(a_j, b_k) · log ( P(a_j, b_k) / (P(a_j) · P(b_k)) ).

Taking into account the Bayes Rule [48], P(a_j, b_k) = P(a_j) · P(b_k | a_j) = P(b_k) · P(a_j | b_k), and the Equation (24), it results from the Equation (30) that the minimum value of I(z; v) is 0, achieved when the input and the output alphabets are mutually independent [48], i.e., P(a_j, b_k) = P(a_j) · P(b_k).

The Case of the Binary Information Sources
Let a binary information source have the source alphabet A = {a_1, a_2} = {0, 1} and let P(a_1), P(a_2) be the probabilities that the source produces the symbols a_1 and a_2, such that P(a_1) + P(a_2) = 1 [11]. The nth extension of the source has the alphabet A' = {α_1, . . . , α_{m^n}}, with m^n possible values α_i (m = 2 for the binary source), each of them consisting of n symbols from the alphabet A.
As there are the inequalities [11]:

log(1 / P(α_i)) ≤ l(α_i) < log(1 / P(α_i)) + 1,

where l(α_i), introduced by the Equation (5), is the length of the code word used to represent α_i, one achieves, after multiplying by P(α_i) and summing over all i:

n · H(z) ≤ L_avg < n · H(z) + 1, (43)

where L_avg means the average number of code symbols required to represent all n-symbol groups.
From the Equation (43) one deduces Shannon's first theorem (the noiseless coding theorem) [11], which claims that the output of a zero-memory source can be represented with an average of H(z) information units per source symbol:

H(z) ≤ L_avg / n < H(z) + 1/n, (44)

namely:

lim_{n→∞} (L_avg / n) = H(z). (45)

The Equation (45) proves that the expression L_avg / n can be approximated by H(z) by encoding infinitely long extensions of the single-symbol source.
The efficiency of the coding strategy is given by [11]:

η = H(z) / L_avg.

Huffman Coding
Huffman Coding [11] is an error-free compression method, designed to remove the coding redundancy, by yielding the smallest number of code symbols per source symbol, which in practice can be represented by the intensities of an image or the output of a mapping operation.
The Huffman algorithm finds the optimal code for an alphabet of symbols, subject to the constraint that the symbols are coded one at a time.
Our approach, based on Huffman's, consists in the following steps: Step 1 Approximate the given data with a Poisson distribution to avoid the undefined entropies.

Step 2
Create a series of source reductions by sorting the probabilities of the respective symbols in descending order and combining the two lowest-probability symbols into a single symbol, which replaces them in the next source reduction. This process is repeated until a source with only two symbols is reached.
Step 3 Code each reduced source, starting with the smallest source and working back to the original source, taking into account that the symbols 0 and 1 are the minimal-length binary codes for a two-symbol source.
The Huffman coding efficiency can be computed using the formula [11]:

η = H(z) / L_avg,

L_avg being the average length of the code words, defined in the relation (5), and H(z) being the entropy of the discrete distribution, introduced by the Equation (23).
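Steps 2-3 of the Huffman procedure can be sketched in Python with a priority queue; the six symbol probabilities below are a textbook-style illustrative example, not data from our experiments:

```python
import heapq

def huffman_code_lengths(probs):
    """Build a Huffman tree over symbol probabilities and return the
    code length assigned to each symbol (the source-reduction steps)."""
    # Each heap entry: (probability, tie-breaker, member symbol indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two lowest-probability sources
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # every merge adds one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.4, 0.3, 0.1, 0.1, 0.06, 0.04]
lengths = huffman_code_lengths(probs)
l_avg = sum(p * l for p, l in zip(probs, lengths))
print(l_avg)  # average code length L_avg in bits per symbol
```

The most probable symbol receives the shortest code word, and L_avg stays within one bit of the source entropy, so the efficiency η = H(z)/L_avg approaches 1.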

Data Sets
We will use [2] the images from the PASCAL dataset for evaluating the performance of our method using Matlab.
In this paper we have used 10,102 images from the VOC data sets, which contain significant variability in terms of object size, orientation, pose, illumination, position and occlusion.
We have used The ColorDescriptor engine [49] for extracting the image descriptors from all the images.
The Figure 10 shows 50 images from the VOC data base. The PASCAL VOC challenge represents [34] a benchmark in visual object category recognition and detection, as it provides the vision and machine learning communities with a standard data set of images and annotation.
Our approach uses 64 descriptors for each image belonging to the training and test set, therefore the number of symbols is 64.

Experimental Results
For our experiments, we used the descriptors corresponding to some images from the VOC data sets; after approximating our data with a Poisson distribution, we computed the image entropy, the average length of the code words and the Huffman coding efficiency. We applied the following Algorithm: Step 1 Approximate the given data with a Poisson distribution.

Step 2
Create a series of source reductions by sorting the probabilities of the respective symbols in descending order and combining the two lowest-probability symbols into a single symbol, which replaces them in the next source reduction. This process is repeated until a source with only two symbols is reached.

Step 3
Code each reduced source, starting with the smallest source and working back to the original source, taking into account that the symbols 0 and 1 are the minimal-length binary codes for a two-symbol source.
The achieved results are illustrated in the Tables 3-11.
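The three steps above can be sketched end-to-end in Python; the Poisson parameter λ = 8.0 below is an illustrative value (not one fitted to the VOC descriptors), and the paper's experiments were run in Matlab:

```python
import heapq
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Step 1: replace the empirical symbol frequencies of the 64 descriptor
# symbols with Poisson probabilities (lam = 8.0 is illustrative).
n_symbols = 64
raw = [poisson_pmf(k, 8.0) for k in range(n_symbols)]
total = sum(raw)
probs = [p / total for p in raw]   # renormalize the truncated pmf

# Entropy of the adjusted source: H = -sum p * log2 p, all p > 0.
entropy = -sum(p * math.log2(p) for p in probs if p > 0)

# Steps 2-3: Huffman source reductions assign each symbol a code length.
heap = [(p, i, [i]) for i, p in enumerate(probs)]
heapq.heapify(heap)
lengths = [0] * n_symbols
tie = n_symbols
while len(heap) > 1:
    p1, _, s1 = heapq.heappop(heap)
    p2, _, s2 = heapq.heappop(heap)
    for s in s1 + s2:              # every merge adds one bit to members
        lengths[s] += 1
    heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
    tie += 1

l_avg = sum(p * l for p, l in zip(probs, lengths))
efficiency = entropy / l_avg
print(entropy, l_avg, efficiency)
```

Because the truncated Poisson pmf assigns a strictly positive probability to every symbol, the entropy is always defined, and L_avg stays within one bit of H(z), so the efficiency is close to 1.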

Conclusions
This paper proposes to improve the Huffman coding efficiency by adjusting the data using a Poisson distribution, which also avoids undefined entropies. The performance of our method has been assessed in Matlab, using a set of images from the PASCAL dataset.
The scientific value added by our paper consists in applying the Poisson distribution in order to minimize the average length of the Huffman code words.
The PASCAL VOC challenge represents [34] a benchmark in visual object category recognition and detection as it provides the vision and machine learning communities with a standard data set of images and annotation.
In this paper we have used 10,102 images from the VOC data sets, which contain significant variability in terms of object size, orientation, pose, illumination, position and occlusion.
The data of our information source differ from those of a finite-memory (Markov) source, whose output depends on a finite number of previous outputs.