The next step in our framework is the selection of a limited number of relevant tags from the class dictionaries in order to perform query expansion. We consider the following different strategies to determine a list of N tags $E={\left\{{w}_{i}\right\}}_{i=1}^{N}$.

#### Entropy-Based Selection

All the dictionary filters described above follow the same principle for ordering and selecting the best tags: higher frequency of use. However, this principle does not take the interplay between the different tags into account. Simply put, tags are not associated with pictures independently: their presence is often correlated and inter-dependent.

If we model the presence or absence of a tag t as a Bernoulli random variable ${y}_{t}$, where ${y}_{t}=1$ if the tag is associated with a given image, then a simple selection based on tag frequencies corresponds to sorting the tags in descending order of probability $P({y}_{t}=1)$, or, more precisely $P({y}_{t}=1|x)$, the probability conditioned to the search being about a specific class x. This particular conditioning can be dropped from the equations if we deal, as we do, with each class independently.
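Under this model, the frequency baseline amounts to ranking tags by their empirical probability $P({y}_{t}=1)$ within the class. A minimal sketch of that baseline, with made-up tag data:

```python
from collections import Counter

# Hypothetical per-image tag lists for one class; the baseline ranks tags
# by empirical frequency, i.e. by descending P(y_t = 1).
images = [["sunset", "beach"], ["sunset", "sky"], ["beach", "sunset"], ["sky"]]
freq = Counter(tag for tags in images for tag in tags)
ranked = [t for t, _ in freq.most_common()]
print(ranked)  # → ['sunset', 'beach', 'sky']
```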

In fact, each image with multiple tags can be seen as a probabilistic event, representing a sample of the joint probability distribution $p({y}_{{t}_{1}},{y}_{{t}_{2}},\dots )$ of all the possible tags. Considering only the marginal probabilities $p\left({y}_{t}\right)$ neglects to take into account the dependency between tags. For example, in an extreme case, two high probability tags that are always used together would both be chosen by frequency selection, but only one would bring in valuable information, and the other would be superfluous. Since it is not feasible to consider all possible tags, we can limit ourselves to the m tags most associated with a certain class, and thus study the joint probability mass function $p({y}_{{t}_{1}},{y}_{{t}_{2}},\dots ,{y}_{{t}_{m}})$.

The most interesting issue at this point is: which of the m tags brings in the most information? The information-theoretic answer is the random variable with the highest entropy (see [27] for related issues). For the Bernoulli variable ${y}_{t}$, the Shannon entropy is given by

$$H({y}_{t})=-{p}_{t}\,{\mathrm{log}}_{2}\,{p}_{t}-(1-{p}_{t})\,{\mathrm{log}}_{2}(1-{p}_{t}),$$

where ${p}_{t}=P({y}_{t}=1)$ and the entropy is measured in bits. This entropy is a concave function of ${p}_{t}$, with a maximum at ${p}_{t}=0.5$ and zero when ${p}_{t}=0$ or ${p}_{t}=1$. Thus, tags appearing about half of the time carry more information than very frequent or very rare ones. For example, if a tag is always associated with a certain class, then it is generally useless for expanded searches (it will return roughly the same results).
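This behavior is easy to check numerically; the few lines below (function name ours, not from the paper) evaluate the Bernoulli entropy at different tag frequencies:

```python
import math

def bernoulli_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli variable with P(y_t = 1) = p."""
    if p == 0.0 or p == 1.0:
        return 0.0  # a tag always present or always absent is uninformative
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(bernoulli_entropy(0.5))  # → 1.0   (maximum: one full bit)
print(bernoulli_entropy(0.9))  # ≈ 0.469 (frequent tag, less informative)
print(bernoulli_entropy(1.0))  # → 0.0   (always present: no information)
```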

Once we select the first tag ${t}_{1}$ with maximal entropy, the next most informative one is given by the relative entropy:

$$H({y}_{t}\mid {y}_{{t}_{1}})=-\sum _{{y}_{t},{y}_{{t}_{1}}}p({y}_{t},{y}_{{t}_{1}})\,{\mathrm{log}}_{2}\,p({y}_{t}\mid {y}_{{t}_{1}}),$$

where $p({y}_{t},{y}_{{t}_{1}})$ is the joint probability mass function of the selectable variable ${y}_{t}$ and the already chosen variable ${y}_{{t}_{1}}$, and $p({y}_{t}\mid {y}_{{t}_{1}})$ is their conditional probability mass function.
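Given per-image observations of the two tags, this quantity can be estimated directly from joint counts. A minimal sketch, with an illustrative function name and toy data:

```python
import math
from collections import Counter

def relative_entropy(pairs):
    """H(y_t | y_t1) in bits, estimated from observed (y_t, y_t1) pairs.

    Each pair records, for one image, the presence (1) or absence (0)
    of the candidate tag t and of the already selected tag t1.
    """
    n = len(pairs)
    joint = Counter(pairs)                     # counts for p(y_t, y_t1)
    marginal = Counter(y1 for _, y1 in pairs)  # counts for p(y_t1)
    h = 0.0
    for (yt, y1), c in joint.items():
        p_joint = c / n            # p(y_t, y_t1)
        p_cond = c / marginal[y1]  # p(y_t | y_t1)
        h -= p_joint * math.log2(p_cond)
    return h

# Perfectly correlated tags: y_t1 determines y_t, so no extra information.
print(relative_entropy([(1, 1), (0, 0), (1, 1), (0, 0)]))  # → 0.0
# Independent fair tags: knowing y_t1 leaves one full bit of uncertainty.
print(relative_entropy([(0, 0), (0, 1), (1, 0), (1, 1)]))  # → 1.0
```

The two extremes match the motivating example above: a tag that always co-occurs with an already selected one contributes zero relative entropy.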

In general, given a set ${Z}_{k}=\{{y}_{{t}_{1}},{y}_{{t}_{2}},\dots ,{y}_{{t}_{k}}\}$ of k selected tags, we can select the next tag with

$${t}_{k+1}=\underset{t\notin {Z}_{k}}{\mathrm{arg\,max}}\;H({y}_{t}\mid {Z}_{k}),$$

where $H({y}_{t}\mid {Z}_{k})=-\sum _{{y}_{t},{Z}_{k}}p({y}_{t},{Z}_{k})\,{\mathrm{log}}_{2}\,p({y}_{t}\mid {Z}_{k})$.

After every selection, the relative entropy of the remaining variables decreases, so we can stop the selection process either when we reach a given number of selected tags or when the relative entropy drops below a certain threshold. Both stopping criteria address the only real issue with these calculations: the state space of $({y}_{t},{Z}_{k})$ grows exponentially, as ${2}^{k+1}$. Experimentally, we rarely went beyond $k=15$, where ${Z}_{k}$ alone spans ${2}^{15}=32{,}768$ possible combinations of tag presence/absence (it takes a few minutes to calculate on a desktop).
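The greedy loop with both stopping criteria can be sketched as follows; the helper name, default thresholds, and toy data are ours, and the relative entropy is computed via the identity $H({y}_{t}\mid {Z}_{k})=H({y}_{t},{Z}_{k})-H({Z}_{k})$:

```python
import math
from collections import Counter

def joint_entropy(rows, cols):
    """Joint Shannon entropy (bits) of the binary columns `cols` of `rows`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def greedy_select(rows, tags, n_max=15, h_min=0.05):
    """Greedily pick tags maximizing the relative entropy H(y_t | Z_k).

    Stops after `n_max` tags, or when no remaining tag exceeds `h_min`
    bits of relative entropy (both defaults are illustrative).
    """
    selected = []  # column indices of Z_k
    while len(selected) < n_max:
        best, best_h = None, h_min
        for i in range(len(tags)):
            if i in selected:
                continue
            # H(y_t | Z_k) = H(y_t, Z_k) - H(Z_k)
            h = joint_entropy(rows, selected + [i]) - joint_entropy(rows, selected)
            if h > best_h:
                best, best_h = i, h
        if best is None:  # no remaining tag is informative enough
            break
        selected.append(best)
    return [tags[i] for i in selected]

# "sunset" and "beach" always co-occur, so only one of them is kept.
rows = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
print(greedy_select(rows, ["sunset", "beach", "sky"]))  # → ['sunset', 'sky']
```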

An important property of the relative entropy is the chain rule:

$$H({Z}_{k+1})=H({y}_{{t}_{k+1}},{Z}_{k})=H({y}_{{t}_{k+1}}\mid {Z}_{k})+H({Z}_{k}).$$

By this rule, we can calculate the total entropy of the final selection of tags as the sum of all the relative entropies calculated at each step and the initial marginal entropy of the first choice. In the end we have a list of N tags ranked by how informative they are, together with a measure ${h}_{i}=H\left({y}_{{t}_{i}}\right|{Z}_{i-1})/H\left({Z}_{N}\right)$ of their informativeness, obtained by normalizing each tag’s entropy by the total entropy.
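The chain rule is easy to verify numerically. The sketch below (toy data and helper name ours) accumulates the stepwise relative entropies for a given selection order and normalizes them into the informativeness scores ${h}_{i}$:

```python
import math
from collections import Counter

def joint_entropy(rows, cols):
    """Joint Shannon entropy (bits) of the binary columns `cols` of `rows`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy tag/image matrix: rows are images, columns are tags, in selection order.
rows = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
order = [0, 1, 2]

# Stepwise relative entropies H(y_{t_i} | Z_{i-1}) via the chain rule.
steps = [joint_entropy(rows, order[: i + 1]) - joint_entropy(rows, order[:i])
         for i in range(len(order))]
total = joint_entropy(rows, order)          # H(Z_N)
assert abs(sum(steps) - total) < 1e-12      # the steps sum to the total entropy

weights = [h / total for h in steps]        # normalized informativeness h_i
print(weights)  # → [0.5, 0.5, 0.0]
```

The third tag here is fully determined by the first two, so its normalized score is zero, exactly the redundancy the selection procedure is designed to expose.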