1. Introduction
Online binary matrix completion (OBMC) lies at the frontier of research on online matrix completion, which is currently an active field in the machine learning community [1,2,3,4]. Intuitively, the OBMC problem is a sequential game of predicting the entries of an unknown
target binary matrix. More specifically, the problem can be formulated as a repeated game between the algorithm and the adversarial environment as follows: on each round t, (i) the environment announces the location of an entry in the target matrix, (ii) the algorithm predicts its binary label, and then (iii) the environment reveals the true label. The goal of the algorithm is to minimize the total number of mistakes.
This OBMC model is widely applicable in the real world, for example in the “Netflix Challenge” [5], where the rating matrix (the target matrix) has rows representing viewers and columns corresponding to movies; the entry in row i and column j is the rating that viewer i gives to movie j.
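To make the protocol concrete, here is a minimal sketch of the repeated game; the random target matrix and the random predictor are hypothetical placeholders, not the algorithm proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 20, 30, 100
target = rng.choice([-1, 1], size=(m, n))    # unknown binary target matrix (hypothetical)

mistakes = 0
for t in range(T):
    i, j = rng.integers(m), rng.integers(n)  # (i) environment announces an entry location
    y_hat = rng.choice([-1, 1])              # (ii) algorithm predicts a binary label (placeholder)
    y = target[i, j]                         # (iii) environment reveals the true label
    mistakes += int(y_hat != y)

print("total mistakes:", mistakes)
```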
For convenience, we call an underlying matrix the comparator matrix when it is a sufficiently good approximation to the unknown target matrix. To be more precise, assume that it can be factorized into for some matrices and for some . Without loss of generality, we further assume that the rows of and are normalized such that for all i and j, where is the i-th row vector of (interpreted as a linear classifier associated with row i of ) and is the j-th row vector of (interpreted as a feature vector associated with column j of ). Hence, the sign of can be viewed as the classification of the feature by the classifier . Next, to quantify the predictiveness of the comparator matrix , we use the hinge loss for a given margin parameter , where is x if and 0 otherwise. Note that can be considered as the margin of the labeled instance with respect to a hyperplane . On the other hand, the hinge loss converges to the 0–1 loss function as the margin parameter converges to 0.
Recently, Herbster et al. explored the OBMC problem by providing side information to the algorithm in advance [2]. The side information brings some prior knowledge about the target matrix, or, more generally, about a comparator matrix.
Moreover, side information is formally represented according to the columns and the rows of the comparator matrix as two symmetric positive definite matrices
and
. To measure the quality of the side information, Herbster et al. introduced the notion of the quasi-dimension of a comparator matrix,
defined as the minimum of
over all the factorizations of
such that
. Then, they proved a mistake bound given by the total hinge loss of
with an additional term expressed in terms of
,
m,
n, and
. In particular, Herbster et al. offered a bound
, if the total hinge loss of
is zero (in the
realizable case). Moreover, they obtained a mistake bound
, when
has a
-biclustered structure (see
Appendix A for details) and side information
are consistent with this particular structure. However, there still remains a logarithmic gap from the lower bound of
[
1].
In this paper, unlike the definition of the quasi-dimension introduced by Herbster et al., we simplify the quasi-dimension in what follows to , the sum of the trace norms only. Then, we obtain a mistake bound by improving a logarithmic factor in the mistake bound of Herbster et al.; further, our bound matches the lower bound up to a constant factor when the comparator matrix has a
-biclustered structure. The basic idea is to reduce the OBMC problem with side information to an online semi-definite programming (OSDP) problem specified by
. In particular, the symmetric positive definite matrix
is transformed from the side information
in our reduction. Thus, the reduced OSDP problem is formulated as a repeated game over a space of sparse loss matrices and a decision space, which is a set of symmetric positive semi-definite matrices
. Note that our decision set is constrained by the
-norm of the diagonal entries of
and
-trace norm,
, simultaneously. Actually, our reduced OSDP problem is a generalization of the standard OSDP problem [
6,
7], where in the standard form,
. We design and analyze our algorithm for the generalized OSDP problem under the follow-the-regularized-leader (FTRL) framework (see, e.g., [
8,
9,
10]). Note that to guarantee good performance of the proposed algorithm, we choose a specialized regularizer as stated later.
The OSDP problem solved via the FTRL approach is a classical framework for various online matrix prediction problems, such as online gambling [
6,
11], online collaborative filtering [
12,
13,
14], online similarity prediction [
15], and especially a
non-binary version of online matrix completion with no side information [
6,
7]. To measure the performance of the algorithm for the above problems, the notion of regret, i.e., the difference between the cumulative loss of the algorithm and that of the globally optimal comparator matrix in hindsight, is commonly used.
For the aforementioned results about non-binary online matrix completion with no side information, Hazan et al. [
6] first proposed a reduction to the standard OSDP problem and then applied the FTRL algorithm with an
entropic regularizer, obtaining a sub-optimal regret bound. Moridomi et al. [
7] improved the regret bound by deploying a
log-determinant regularizer, exploiting the fact that the loss matrices arising in the reduction are sparse.
Next, for the OBMC problem with side information, Herbster et al. [2] reduced the problem to another variant of the OSDP problem, whose decision space has different or fewer constraints than ours. They then applied an FTRL-based algorithm with the entropic regularizer, similar to that of [6]. Note that instead of a general regret analysis for their OSDP problem, they only gave a specific analysis for the OSDP instance obtained from the reduction. Inspired by the work of [7], we observe that the logarithmic gap in their mistake bound stems from the choice of the entropic regularizer. As in the case treated in [7], the loss matrices in our reduction are sparse, which implies that the log-determinant regularizer can lead to better performance.
As mentioned previously, the OBMC problem with side information is reduced to an OSDP problem whose decision space is parameterized with the side information. This calls for a new form of the log-determinant regularizer, since the standard log-determinant regularizer performs unsatisfactorily, as our analysis shows. One might instead try to reduce our problem to the standard OSDP problem in a straightforward and natural way. Unfortunately, this reduction fails, as we show with a counterexample in a later section: the trivial reduction destroys the sparsity of the loss matrices and the bound on the diagonal entries of the decision space. Therefore, to solve our reduced OSDP problem, a generalized log-determinant regularizer is required, and this specialized regularizer yields the desired regret bound. In conclusion, our reduction and solution not only demonstrate the power of the less explored log-determinant regularizer, compared with the entropic or Frobenius-norm regularizer, in the OSDP framework, but also show that an appropriate choice of the regularizer, depending on the form of the decision set (and hence on the side information) and on the loss space, can effectively improve the theoretical performance of the algorithm. Note that although our derivation is similar to the analysis of Moridomi et al. [
7], it is in fact a non-trivial generalization.
Furthermore, we apply our online algorithm in the statistical (batch) learning setting by the standard online-to-batch conversion framework (see, for example, the work of Mohri et al. [
16]) and derive a generalization error bound with side information. Our generalization error bound is similar to the known margin-based bound of SVMs (e.g., Mohri et al. [16]) with the best kernel when the side information is vacuous. Remarkably, we not only obtain such a bound without knowing the best kernel; the result also implies that the error bound in the batch learning setting can be improved when side information is given to the learner in advance.
Our main contribution is summarized as follows:
Firstly, we generalize the OSDP problem by parameterizing a symmetric and positive definite matrix
in the decision set. This generalization extends the standard OSDP problem and offers a more widely applicable framework. Next, we design an FTRL-based algorithm with a generalized log-determinant regularizer depending on the matrix
from the decision set. Our result recovers the previously known bound [
7] in the case where
is the identity matrix.
We obtain refined mistake bounds for the OBMC problem with side information and for online similarity prediction with side information as consequences of the above results. We reduce these problems to the OSDP framework and encode the side information as a symmetric positive definite matrix that parameterizes the decision space. Compared with the analysis of Herbster et al. [2], our reduction is explicit and easy to follow. Thanks to the results for the generalized OSDP problem, we improve the previously known mistake bounds by logarithmic factors for both problems. In particular, for the former problem, our mistake bound is optimal.
In addition, compared with the preliminary version [17], we add an online-to-batch conversion for the OBMC problem with side information. This standard online-to-batch conversion framework guarantees that our online algorithm performs no worse than the traditional offline algorithm (an SVM with the best kernel). As we demonstrate in a later section, our error bound recovers the best margin-based bound when the side information is vacuous. With the assistance of ideal side information, our proposed algorithm performs even better. On the one hand, we improve the error bound for OBMC with side information in the batch setting; on the other hand, our result implies that there might be a more effective algorithm for the batch setting if the algorithm can make good use of the side information.
This paper is organized as follows. In
Section 2, we give basic notation and formally formulate the generalized OSDP problem. Then, with a toy example, we show that a naive reduction from our problem to the standard OSDP can lead to a worse regret bound, which motivates our generalized log-determinant regularizer. The main algorithm and its regret bound for the generalized OSDP problem are given in
Section 3. In
Section 4, we give the reductions of the OBMC problem and of online similarity prediction with side information to our generalized OSDP problem. Moreover, we show that the mistake bound for the OBMC problem is optimal in the realizable case where the comparator matrix has a biclustered structure. In Section 5, we derive results for the batch setting of the OBMC problem with side information. In Appendix A.1, we collect the lemmata needed for the proofs of our results. Further, we define the
-biclustered structure in
Appendix A.2.
2. Preliminaries
For any positive integer T, a subset of is denoted by . Let , and denote the sets of symmetric matrices, symmetric positive semi-definite matrices and symmetric strictly positive definite matrices, respectively. The identity matrix is denoted as . For an matrix and , we denote the i-th row vector of and the entry of by and respectively. Furthermore, is an -dimensional vector according to and arranges all entries and , in some order. For any matrices , denotes the Frobenius inner product of them. The trace norm of matrix is defined as , where denotes the i-th largest eigenvalue of . Meanwhile, . In addition, we generalize the trace norm of matrix to the -trace norm of as , for some . Note that the -trace norm recovers the trace norm when . For a vector , the -norm of is denoted by .
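As a quick numerical illustration of this notation, the snippet below evaluates the Frobenius inner product and the trace norm, together with a Γ-trace norm in the form Tr(Γ^{1/2} X Γ^{1/2}); this form is our assumption for illustration, and it reduces to the trace norm of a positive semi-definite X when Γ is the identity.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
X = A @ A.T                                        # a symmetric positive semi-definite matrix
B = rng.standard_normal((4, 4)); B = (B + B.T) / 2

frobenius = np.sum(X * B)                          # Frobenius inner product of X and B
trace_norm = np.abs(np.linalg.eigvalsh(X)).sum()   # sum of absolute eigenvalues

Gamma = np.diag([1.0, 2.0, 3.0, 4.0])              # a symmetric strictly positive definite matrix
G_half = sqrtm(Gamma).real
gamma_trace_norm = np.trace(G_half @ X @ G_half)   # assumed form of the Gamma-trace norm

print(frobenius, trace_norm, gamma_trace_norm)
```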
2.1. Generalized OSDP Problem with Bounded -Trace Norm
Our generalized OSDP problem specified with a symmetric positive definite matrix
is formulated by a pair
, where
is called the decision space/set, and
is called the loss space, where
,
and
are parameters. The generalized OSDP problem
is a repeated game between the algorithm and the adversarial environment as described below: On each round
, we have the following:
The algorithm predicts a matrix .
The algorithm receives a loss matrix , returned from the environment.
The algorithm incurs the loss on round t: .
The goal of the algorithm is to minimize the following regret
Note that the standard OSDP problem corresponds to the special case of our setting where .
By the definition of the OSDP problem above, this problem falls into the online linear optimization framework, since the decision space is convex and the loss, given by the Frobenius inner product, is linear. Therefore, as Moridomi et al. [
7] did for the standard OSDP problem, we can apply a standard FTRL algorithm. The FTRL algorithm outputs a matrix
according to
where
is a strongly convex function called the regularizer. The entropic and Euclidean-norm regularizers are classical choices in online linear optimization [
9]. In particular, Moridomi et al. chose the less-studied log-determinant regularizer, defined as
where
is a parameter and derived the following regret bound for the standard OSDP problem.
Theorem 1 ([
7]).
For the standard OSDP problem with , the FTRL algorithm with the log-determinant regularizer achieves
In the next sub-section, we show that the standard log-determinant regularizer performs unsatisfactorily for our generalized OSDP problem, even though the generalized problem can be reduced to the standard one.
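To illustrate the FTRL scheme behind Theorem 1, the following sketch solves one FTRL update for a standard OSDP instance with a generic convex solver. The decision set (positive semi-definite matrices with bounded diagonal entries and bounded trace) and the regularizer −log det(X + εE) follow the description above; the parameter values and the use of cvxpy are our own illustrative choices, not the paper's implementation.

```python
import cvxpy as cp
import numpy as np

def ftrl_logdet_step(cumulative_loss, beta, tau, eta, eps):
    """One FTRL update for the standard OSDP problem:
    argmin_{X in K} <X, sum of past loss matrices> + (1/eta) * (-log det(X + eps*I)),
    where K = {X psd : X_ii <= beta for all i, trace(X) <= tau}."""
    N = cumulative_loss.shape[0]
    X = cp.Variable((N, N), PSD=True)
    objective = cp.sum(cp.multiply(cumulative_loss, X)) \
        - (1.0 / eta) * cp.log_det(X + eps * np.eye(N))
    constraints = [cp.diag(X) <= beta, cp.trace(X) <= tau]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return X.value

# toy usage with a random symmetric cumulative loss matrix
rng = np.random.default_rng(2)
L_cum = rng.standard_normal((5, 5)); L_cum = (L_cum + L_cum.T) / 2
X_next = ftrl_logdet_step(L_cum, beta=1.0, tau=5.0, eta=0.1, eps=1.0)
```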
2.2. A Naive Reduction
Given a generalized OSDP problem as in Equations (1) and (2), there is a natural reduction to a standard OSDP problem
where
for some parameters
and
. For convenience, we denote
as an FTRL-based algorithm with the log-determinant regularizer.
The reduction consists of two transformations: one is to transform the decision matrix produced from to the decision matrix for OSDP . The other one is to transform the loss matrices from the environment of to , which is fed to the algorithm . Note that the loss is preserved under this reduction, that is, . Moreover, the -trace norm of is the trace norm of , i.e., . Therefore, setting and appropriately such that for any and , and , respectively, we have that .
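The two transformations can be instantiated, for example, as X̃ = Γ^{1/2} X Γ^{1/2} and L̃ = Γ^{-1/2} L Γ^{-1/2}; this is our reading of the omitted maps, chosen because it preserves the Frobenius-product loss and turns the Γ-trace norm into the ordinary trace norm, as the following numerical check confirms.

```python
import numpy as np
from scipy.linalg import sqrtm, inv

rng = np.random.default_rng(3)
N = 5
A = rng.standard_normal((N, N)); X = A @ A.T            # a positive semi-definite decision matrix
L = rng.standard_normal((N, N)); L = (L + L.T) / 2       # a symmetric loss matrix
Gamma = np.diag(rng.uniform(0.5, 2.0, size=N))           # a symmetric positive definite Gamma

G_half = sqrtm(Gamma).real
X_std = G_half @ X @ G_half                              # decision matrix for the standard OSDP
L_std = inv(G_half) @ L @ inv(G_half)                    # loss matrix fed to the standard algorithm

# the loss is preserved: <L_std, X_std> = <L, X>
print(np.isclose(np.sum(L_std * X_std), np.sum(L * X)))                 # True
# the Gamma-trace norm of X becomes the ordinary trace norm of X_std (X_std is psd)
print(np.isclose(np.trace(G_half @ X @ G_half),
                 np.abs(np.linalg.eigvalsh(X_std)).sum()))              # True
```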
Hence, by the regret bound for with the log-determinant regularizer [
7], we immediately have
by Theorem 1.
In the following, we give an example showing that the above reduction yields a worse regret bound than our proposed algorithm.
Example 1. Define as and let , and so that Next we define the loss matrix , where if and 0 otherwise, as follows: Then, with a simple calculation, we obtain for all , which implies that we need Meanwhile, we have that , which suggests that . In other words, the regret bound obtained via the naive reduction must be larger than the order of . In contrast, the regret bound for the above example is if we directly utilize the algorithm proposed in the next section, since . Thus, our algorithm improves significantly on the FTRL-based algorithm with the standard log-determinant regularizer.
3. Algorithm for the Generalized OSDP Problem
In this section, we give the main algorithm and its regret bound for the generalized OSDP problem
specified by (
1) and (
2), with respect to some
. We propose the FTRL algorithm (
4) with the Γ-
calibrated log-determinant regularizer:
where
is a parameter.
The following theorem gives a regret bound of our algorithm.
Theorem 2 (Main Theorem).
Given . For the generalized OSDP problem specified by Equations (1) and (2), denoting , running the FTRL algorithm with the Γ-calibrated log-determinant regularizer for T rounds, the regret is bounded as follows:
In particular, letting and , we have Note that we recover the regret bound of Theorem 1 when .
Before we give the proof of our main theorem, we need to introduce strong convexity, which will play a central role in our proof. The definition of strong convexity is as follows.
Definition 1. For a decision space and a real number , a regularizer is said to be s-strongly convex with respect to the loss space if for any , any and any , the following holds This is equivalent to the following condition: for any and , Note that the notion of strong convexity defined above is quite different from the standard one: usually, the strong convexity is defined with respect to some norm
on the decision set [9], but here it is defined with respect to the decision space and the loss space. The main reason is that our decision set is constrained by two norms simultaneously; the same definition is used in [
18].
The following lemma from [
7] gives a general relation between the regret bound for the OSDP problem and the strong convexity of the regularizer used in the FTRL-based algorithm.
Lemma 1 ([
7]).
Let be an s-strongly convex regularizer with respect to a loss space for a decision space . Then the FTRL algorithm with the regularizer R applied to achieves where . Due to the lemma above, it suffices to analyze the strong convexity of our Γ-calibrated log-determinant regularizer with respect to our decision space (
1) and loss space (
2). We show the result in the main proposition.
Proposition 1 (Main proposition). The Γ-calibrated log-determinant regularizer is s-strongly convex with respect to for with , where .
We prove this proposition in the next sub-section. Based on it, we first give a proof sketch of our main theorem; the details follow in the next sub-section.
Proof Sketch of Theorem 2. According to Proposition 1 and Lemma 1, the only remaining step in proving this theorem is to bound . As we show in the following subsection, , due to the definition of R. Note that the matrix size N has no effect on our regret bound in Theorem 2. □
Proofs of the Main Proposition and Theorem
Before we prove Theorem 2, we need some lemmata and notation.
Given a distribution P over , the negative entropy function with respect to P is defined by . We define the characteristic function of P by , where i is the imaginary unit. For two distributions P and Q over , denotes the total variation distance between P and Q.
Lemma 2. Let and be two zero mean Gaussian distributions with covariance matrix and , where , respectively. If there exists such that for some , then the total variation distance between and is at least
Proof. Given
and
as characteristic functions of
and
, respectively, due to Lemma A1, we have
so we only need to show the lower bound of
Let the characteristic functions of and be and , respectively. Let and for some . Moreover, given , for any such that there exists , we define in the same way.
Now, let us show the lower bound of
The second inequality is due to Lemma A4, since .
Due to the assumption in the Lemma, we obtain for some
that
This implies that one of and has an absolute value greater than
Due to the positive definiteness of
and
, we have that for all
and therefore we have that
□
Lemma 3. Let be such that and Γ is a symmetric strictly positive definite matrix. Then the following inequality holds:
Proof. Let and be zero mean Gaussian distributions with covariance matrices and . The total variation distance between and is lower bounded by , by the assumption of this lemma and the result of Lemma 2. Consider the entropy of the following probability distribution of v: with probability , and otherwise. Its covariance matrix is
Due to Lemma A3, we obtain that
By Lemma A2, we further have that
where the second inequality is due to the fact that
Finally, since and are Gaussian distributions by the assumption of this lemma, the statement of Lemma A3 gives
Thus, we have our conclusion. □
Lemma 4 (Lemma 5.4 [
7]).
Let be such that for all and . Then for any there exists such that
Proposition 2 (Main proposition). For the generalized log-determinant regularizer is s-strongly convex with respect to for with . Here is the identity matrix.
Proof. Since is positive definite by the assumption of the proposition, we have that
For any
we obtain that
where
.
Setting in Lemma 4 and combining it with Lemma 3 and Definition 1, our conclusion follows. □
The proof of the main theorem is given as follows:
Proof of Theorem 2. By Lemma 1, we obtain that
By the main proposition, we know that
Thus, we only need to show
Denoting
and
as the minimizer and maximizer of
R, respectively, then we obtain that
where the last equality is due to the fact that
for any
. Further, we have that
where the first inequality is from the inequality
. Plugging
we obtain that
□
4. Application to OBMC with Side Information
In the following, we demonstrate an explicit reduction from OBMC with side information to the aforementioned OSDP problem. The reduction is two-fold. In the first step, we reduce OBMC with side information to an online matrix prediction (OMP) problem with side information, using the hinge loss function and a mistake-driven technique. In the second step, we reduce OMP with side information to the OSDP problem, encoding the side information into the -trace norm.
Before we present the reductions, we formally define some necessary notation and the OBMC problem with side information.
4.1. The Problem Statement
In principle, our problem statement simplifies the setting in the work of Herbster et al. [
2].
Given , let the pair be the side information given to the algorithm, where and .
The online binary matrix completion (OBMC) problem is a repeated game between the algorithm and the adversarial environment, formulated as follows: On each round :
The environment selects .
The algorithm returns a prediction .
The environment reveals the true label .
The goal of the algorithm is to minimize the number of mistakes during the whole learning process . In particular, with the assistance of the side information , the mistake bound is expected to be improved when the side information is informative.
Equivalently, we may describe the selections and true labels from the environment as a sequence . Then the problem can be seen as sequential prediction of the entries of an underlying target matrix: on each round t, the environment announces the location of an entry and the algorithm is required to predict the label of that entry. However, we do not require such an unknown target matrix to be consistent; that is, it can happen that even if .
Instead of the 0–1 loss in the OBMC problem, we first need to introduce a convex surrogate loss function for the FTRL framework. Specifically, for a positive parameter
, we define a hinge loss
with respect to
as follows:
where
is also named the margin parameter.
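A minimal sketch of this surrogate, assuming the standard form h_γ(z) = max(0, 1 − z/γ) applied to the signed margin z = y · (predicted value), is:

```python
def hinge_loss(margin_value: float, gamma: float) -> float:
    """Hinge loss with margin parameter gamma, assuming the standard form
    h_gamma(z) = max(0, 1 - z / gamma), where z is the signed margin y * prediction."""
    return max(0.0, 1.0 - margin_value / gamma)

print(hinge_loss(0.8, gamma=0.5))   # 0.0: prediction is correct with a large enough margin
print(hinge_loss(-0.2, gamma=0.5))  # 1.4: a wrong-signed prediction costs more than 1
```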
Then for the sequence
we factorize the
comparator matrix with
where
and
for some
d. Combining the definition of the hinge loss, we define the hinge loss of the sequence
as
in terms of the factorization pair
and margin parameter
. This hinge loss can be interpreted as a measure of how well the comparator matrix predicts the true labels . In the following, we can, without loss of generality, assume that each row of
and
is normalized as
for every
. Moreover, we sometimes call the pair
the comparator matrix. Moreover, in the following part, we involve the row normalized matrix
according to any matrix
, such that
Now for each pair of the comparator matrix, we define the
quasi-dimension, measuring the quality of the side information. Specifically, the quasi-dimension of a comparator matrix
with respect to the side information
, is defined as
where
and
. Note that
and
are set as identity matrices, when the side information is empty. In this case, the quasi-dimension is
for any comparator matrix. Nevertheless, the quasi-dimension will be smaller, if the rows of
and/or the columns of
are correlated to
and/or
, which implies that the side information is appropriate and reflects useful information about the comparator matrix.
Note that the notion of quasi-dimension is defined in a different way in [
2].
4.2. Reduction from OBMC with Side Information to an Online Matrix Prediction (OMP)
We formulate an OMP problem, to which our problem is first reduced. The OMP problem is specified by a decision space and a margin parameter , and again it is described as a repeated game between the algorithm and the adversary. On each round ,
The algorithm predicts a matrix .
The adversary returns a triple , and
the algorithm suffers the loss defined by .
The goal of the algorithm is to minimize the regret:
Note that unlike the standard setting of online prediction, we do not require
Below we reduce the OBMC problem with side information
to the OMP problem with the following decision space:
where
is an arbitrary parameter. Assume that we have an algorithm
for the OMP problem
. In this reduction, we involve the mistake-driven technique, that is the algorithm
will be only launched when the prediction error appears. The reduced algorithm is as follows.
Run the algorithm and receive the first prediction matrix from . Then, in each round , we do the following:
Observe an index pair .
Predict .
Observe a true label .
If then , and if , then feed to to let it proceed and receive .
Note that due to the mistake-driven technique, we run the algorithm for at most rounds, where M is the number of mistakes of the reduction algorithm above.
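The mistake-driven wrapper can be sketched as follows. Here `omp_algorithm` is a hypothetical stand-in for the algorithm for the OMP problem, and predicting the sign of the (i, j) entry of the current matrix is our reading of the omitted prediction rule.

```python
def run_obmc_via_omp(omp_algorithm, rounds):
    """Mistake-driven reduction from OBMC with side information to OMP.

    `omp_algorithm` is a hypothetical object with two methods:
      first_prediction() -> prediction matrix X
      update((i, j, y))  -> next prediction matrix X (called only after a mistake)
    `rounds` yields triples (i, j, y) coming from the environment."""
    X = omp_algorithm.first_prediction()
    mistakes = 0
    for (i, j, y) in rounds:
        y_hat = 1 if X[i, j] >= 0 else -1     # predict the sign of the (i, j) entry
        if y_hat != y:                         # on a mistake, feed the triple to the OMP algorithm
            mistakes += 1
            X = omp_algorithm.update((i, j, y))
        # on a correct prediction the OMP algorithm is not invoked (mistake-driven)
    return mistakes
```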
The next lemma shows the performance of the reduction.
Lemma 5. Let denote the regret of the algorithm in the reduction above for a competitor matrix , where . Then,
Remark 1. If and are identity matrices, then we have , and thus the decision space is an unconstrained set .
Proof. Let
and
be arbitrary matrices such that
Since
for any
and
, we have
where the second equality follows from the definition of regret, and the third equality follows from the fact that
. Since the choice of
and
is arbitrary, the following inequality holds:
Now, let
and
be the matrices that attain (
26). Hence, our lemma follows from
□
4.3. Reduction from OMP to the Generalized OSDP Problem
Throughout this sub-section, we reduce the OMP problem with side information to the OSDP problem parameterized with . Our reduction is similar to that of [
1,
6]. For convenience, we denote
in the following part.
First of all, we formulate the side information
into
for our generalized OSDP problem in the following equation:
Next we define the decision space
. For any comparator matrix
such that
, we define
Trivially,
is an
symmetric and positive semi-definite matrix. Intuitively, any comparator matrix
such that
, is embedded into the upper right block in
. In addition, since the normalization of
and
,
for all
. Hence, we need to find a convex decision space
which satisfies
First, we need the following lemma:
Lemma 6 (Lemma 8 [
2]).
For any pair of side information matrices where and , and Γ is induced as in Equation (28). Due to this lemma and the definition of
, we can directly define
as follows:
Then, we describe the loss matrix class
. We first define a sparse matrix
with any pair
such that only entries
and
are 1 and all others are 0. More formally,
where
is the
k-th orthogonal basis vector of
. Note that due to the definition of the Frobenius product, we have that
which is what we focus on. Thus,
is defined as
Now we state the reduction from the OMP problem with side information to the OSDP problem specified by . Assume that there is an algorithm for the OSDP problem.
Run the algorithm and receive the first prediction matrix from .
In each round t,
let be the upper right component matrix of .
observe a triple ,
suffer loss where ,
let ,
feed to the algorithm to let it proceed and receive .
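For concreteness, the sparse matrix associated with a pair (i, j) and the extraction of the upper-right block can be written as below; placing the two unit entries at positions (i, m + j) and (m + j, i) is our reading of the omitted definition (the paper may include an additional normalization), and for a symmetric W the Frobenius product with this matrix picks out the upper-right entry W[i, m + j] up to a factor of 2.

```python
import numpy as np

def sparse_loss_matrix(i, j, m, n):
    """Sparse matrix for the pair (i, j): entries (i, m+j) and (m+j, i) are 1, all others 0
    (one plausible reading of the omitted definition)."""
    Z = np.zeros((m + n, m + n))
    Z[i, m + j] = Z[m + j, i] = 1.0
    return Z

def upper_right_block(X, m, n):
    """Upper-right m x n block of an (m+n) x (m+n) decision matrix."""
    return X[:m, m:]

m, n = 3, 4
rng = np.random.default_rng(4)
W = rng.standard_normal((m + n, m + n)); W = (W + W.T) / 2
Z = sparse_loss_matrix(1, 2, m, n)
print(np.isclose(np.sum(Z * W), 2 * W[1, m + 2]))   # Frobenius product hits entry (1, 2)
print(upper_right_block(W, m, n).shape)              # (3, 4)
```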
Due to the convexity of
, a standard linearization argument ([
9]) gives
for any
. Moreover, since in our reduction
and
, the following lemma immediately follows.
Lemma 7. Let denote the regret of the algorithm in the reduction above for a competitor matrix and denote the regret of the reduction algorithm for . Then,
Combining Lemmas 5 and 7, we have the following corollary.
Corollary 1. There exists an algorithm for the OBMC problem with side information with the following mistake bounds.
4.4. Application to Matrix Completion
Combining the previous two reductions, our overall reduction follows immediately. We reduce OBMC with side information to the OSDP problem specified with
. Compared with the analysis of Herbster et al. [
2], our reduction is explicit and easy to follow. Note that in our reduction, the side information
and
for OBMC is parameterized by
. Again, the
is the identity matrix if the side information is vacuous. Finally, running our proposed FTRL-based algorithm with the
-calibrated log-determinant regularizer, we improve the logarithmic factor in the previous mistake bound [
2]. In particular, our mistake bound is optimal.
Remark 2. By the definition of Γ in Equation (28), we have that According to our analysis, the parameters in the reduced OSDP problem
with
defined in Equation (
28), can trivially be set as
then, utilizing Theorem 2, we obtain the following result:
Next, we give our main algorithm for the OBMC problem with side information
in Algorithm 1. Putting the two reductions together and proceeding with the FTRL-based algorithm (
4), the main algorithm is as follows:
Theorem 3. Running Algorithm 1 with parameter , the hinge loss of OBMC with side information is bounded as follows:
Compared with [
2], the logarithmic factor
is improved in our regret bound with respect to the hinge loss. Meanwhile, since the mistake-driven technique is involved in our reduction, the horizon
T is replaced by
M, the number of mistakes, which is unknown in advance. Then, by choosing
independent of
M, we can derive a good mistake bound from the above theorem, resulting in Equation (
32).
Algorithm 1 Online binary matrix completion with side information algorithm.
1: Parameters: , side information matrices and , quasi-dimension estimator . is composed as in Equation (28), and the decision set is given as in (30).
2: Initialize set .
3: for do
4:   Receive .
5:   Let .
6:   Predict and receive .
7:   if then
8:     Let and .
9:   else
10:     Let and .
11:   end if
12: end for
Theorem 4. Algorithm 1 with for some achieves Proof. Combining Corollary 1 and the regret bound (
32), we have
Choosing
for sufficiently small constant
c, we get
from which (
34) follows. □
Again if the side information is vacuous, which means that
are identity matrices, from Remark 1 and Theorem 4, we can set that
and obtain the mistake bound as follows:
Nevertheless, the side information indeed matters in non-trivial cases. When the comparator matrix
contains some latent structure, specifically, when
is
-biclustered, the quasi-dimension estimator
which is strictly smaller than
if the side information
is chosen as a special matrix according to the structure of
(the details are in
Appendix A). This instance shows that the prediction accuracy can be effectively improved when the side information is selected in accordance with the structure of the underlying matrix.
Note that our mistake bound improves on the previous bound, especially in the realizable case. Compared with the bound
in [
2], our mistake bound is
which removes the logarithmic factor
In addition, our bound recovers the known lower bound of Herbster et al. [
1] up to a constant factor. If
contains a
-biclustered structure (
), by setting
, our mistake bound can become
On the other hand, the lower bound of Herbster et al. is
Thus, the mistake bound of Theorem 4 is optimal.
In the next subsection, we show an example, online similarity prediction with side information, where the comparator matrix is -biclustered. With the ideal side information, we can effectively improve the mistake bound.
4.5. Online Similarity Prediction with Side Information
In this subsection, we show the application of our reduction method and generalized log-determinant regularizer to online similarity prediction with side information.
Before we introduce the online similarity prediction problem, we need some notation and basic concepts. Let be an undirected and connected graph, where and . In graph G, if all the vertices are assigned to K different classes, i.e., there is an n-dimensional vector , where and the classification of each vertex is represented by in vector , we denote this graph with respect to the assignment as
Next, we define the cut-edges of and write in abbreviation. The cardinality of is the cut size. For each graph G, we denote the adjacency matrix of G by A, where if and otherwise. The degree matrix of is defined as the diagonal matrix where is the degree of vertex i. The Laplacian is defined as We define the PD-Laplacian , where is an n-dimensional vector whose entries are all Given a graph and its Laplacian , and assuming that each edge of G is a unit resistor, the effective resistance of any pair of vertices is , where is the standard basis of , and is the pseudo-inverse of
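These graph quantities are straightforward to compute; the sketch below builds the Laplacian and the effective resistance exactly as defined above (the PD-Laplacian is left out of the sketch).

```python
import numpy as np

def laplacian(adjacency):
    """Graph Laplacian L = D - A for an undirected graph given by its adjacency matrix."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def effective_resistance(adjacency, i, j):
    """Effective resistance between vertices i and j when every edge is a unit resistor:
    r(i, j) = (e_i - e_j)^T L^+ (e_i - e_j), with L^+ the pseudo-inverse of the Laplacian."""
    L_pinv = np.linalg.pinv(laplacian(adjacency))
    e = np.zeros(adjacency.shape[0])
    e[i], e[j] = 1.0, -1.0
    return float(e @ L_pinv @ e)

# a path graph on three vertices: the resistance between the two endpoints is 2,
# since the two unit resistors are in series
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(effective_resistance(A, 0, 2))   # ~2.0
```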
Now the online similarity prediction problem is formulated as follows. Given a K-classified graph , on each round , we have the following:
- 1.
The environment confirms a pair of vertices
- 2.
The algorithm predicts whether the two vertices belong to the same class: it predicts if it believes they are in the same class, and otherwise.
- 3.
The environment reveals the true answer : if they are in the same class, then ; otherwise .
The goal of the algorithm is to minimize the number of prediction mistakes . Due to this formulation, online similarity prediction is a special case of the OBMC problem. We also denote a sequence for our online similarity prediction problem.
The side information we defined for online similarity prediction is the PD-Laplacian
of graph
G in our paper. Note that the side information concerns only the graph G itself and is independent of the classification vector
. Gentile et al. [
15] explored this problem by involving the graph
G as prior information to the algorithm, which is equivalent to our problem setting. The mistake bound from [
15] is described in the following proposition:
Proposition 3. Let be a labeled graph. If we run Matrix Winnow with G as an input graph, we have the following mistake bound:
From the definitions of online similarity prediction with side information and of OBMC with side information, online similarity prediction is a special case of OBMC, obtained by considering the comparator matrix , where indicates the classification of vertices i and j. More specifically, if the vertices are in the same class, then and otherwise. Meanwhile, the side information unifies the two symmetric positive definite matrices into .
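The comparator matrix induced by a class assignment can be formed directly, as in the short sketch below: the (i, j) entry is +1 when vertices i and j share a class and −1 otherwise, which is exactly the biclustered structure discussed next.

```python
import numpy as np

def similarity_comparator(labels):
    """Comparator matrix for similarity prediction: entry (i, j) is +1 if vertices i and j
    belong to the same class and -1 otherwise."""
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1, -1)

labels = np.array([0, 0, 1, 2, 1])    # a hypothetical assignment of 5 vertices to K = 3 classes
U = similarity_comparator(labels)
# vertices with equal labels give identical rows (and columns), so U has at most K distinct
# row patterns -- the biclustered structure exploited in the mistake bound
print(U)
```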
Moreover, due to [
2,
15] (see details in
Appendix A.2), the comparator matrix
is actually a
-biclustered
binary matrix. According to the aforementioned result from [
2], there exists a matrix
such that
, where
denoting
as a
K-dimensional vector for which all entries are
As same as the previous reductions, the reduced
specified OSDP problem corresponding to the online similarity prediction with side information is as follows. Firstly, we define the side information
parameterized
in the following matrix.
where
Next the decision space
and the loss space
are defined as previously as in Equations (
30) and (
31), respectively.
Thus, the following proposition demonstrates that the online similarity prediction with side information can be reduced to an OSDP problem parameterized with .
Proposition 4. Given an online similarity prediction problem with graph , set the side information as . Then we can reduce this problem to a generalized OSDP problem with bounded Γ
-trace norm such that where Γ is defined as above, and . Therefore, the mistake bound of online similarity prediction is bounded as follows: where γ is the margin parameter. According to [
2], there exists
where
such that
This implies that the hinge loss
, when
, more specifically
.
Remark 3. According to Theorem 3 and Section 4.2 in [2], together with the characterization of online similarity prediction with side information, the quasi-dimension estimator , where is the Laplacian of the corresponding graph, if we set the side information to be . By Theorem 4 and running Algorithm 1, we finally obtain the following mistake bound:
Remark 4. In the work of Herbster et al. [2], the resulting mistake bound is , which recovers the bound of in [15] up to a constant factor. Moreover, our bound improves the logarithmic factor compared with [2].
5. Connection to a Batch Setting
In this section, we employ the well-known online-to-batch conversion technique (see, for example, [
16]) and obtain a batch learning algorithm with generalization error bounds. The results imply that the algorithm performs nearly as well as the support vector machine (SVM) running over the optimal feature space, even when the side information is vacuous. Moreover, with the assistance of the side information, a more refined bound for the batch setting follows from our online analysis.
First, we describe our setting formally. We consider the problem in the standard probably approximately correct (PAC) learning framework [
16,
19,
20]. The algorithm is given the side information matrix
and
and a sample sequence
:
where each triple
is randomly and independently generated according to some unknown probability distribution
over
. Then the algorithm outputs a hypothesis
. The goal is to find, with high probability, a hypothesis
f that has small generalization error
In particular, we consider a hypothesis of the form
where
where
is defined as in Equation (
28).
In Algorithm 2, we give the algorithm obtained by the online-to-batch conversion.
Algorithm 2 Binary matrix completion in the batch setting.
1: Parameter:
2: Input: a sample of size T.
3: Run Algorithm 1 over and get its predictions .
4: Choose from uniformly at random.
5: Output .
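Algorithm 2 is a direct online-to-batch conversion, and its steps can be mirrored as follows; `run_algorithm_1` is a hypothetical callable standing in for Algorithm 1 and returning the sequence of prediction matrices produced on the sample.

```python
import numpy as np

def online_to_batch(run_algorithm_1, sample, seed=0):
    """Online-to-batch conversion in the spirit of Algorithm 2.

    `run_algorithm_1(sample)` is assumed to return the list of prediction matrices
    produced by the online algorithm, one per round of the sample."""
    predictions = run_algorithm_1(sample)              # step 3: run the online algorithm
    rng = np.random.default_rng(seed)
    s = rng.integers(len(predictions))                 # step 4: pick a round uniformly at random
    U_s = predictions[s]
    return lambda i, j: 1 if U_s[i, j] >= 0 else -1    # step 5: output the sign hypothesis
```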
To bound the generalization error, we use the following lemma, which is straightforward from Lemma 7.1 of [
16].
Lemma 8. Let be a function and and be the matrices obtained in Algorithm 2. Then, for any , with probability at least , the following holds:
Applying the lemma with the zero-one loss
combined with the mistake bound (
34) of Theorem 4, we have the following generalization bound.
Theorem 5. For any , with probability at least , Algorithm 2 produces with the following property:
On the other hand, when applying the lemma with the hinge loss
combined with an
regret bound of (
32) with the minimizer
, we have
which is slightly worse than (
40). Note that
, if the side information is vacuous.
Now we examine some implications of our generalization bounds. First, we assume without loss of generality that , because otherwise we can transpose everything.
As explained in the aforementioned sections, we can think of each
as a feature vector of item
j. Assume all feature vectors
are given and consider the problem of finding a good linear classifier
for each user
i independently. A natural way is to use the SVM, which solves
for every
for some constant
. Now if we fix
to be a constant for all
i, then the optimization problems above are summarized as
So, if we further optimize feature vectors, we obtain
which is roughly proportional to our generalization bound (
40), when the side information is vacuous (i.e.,
and
are identity matrices). This result implies that our generalization bound is upper bounded by the objective function value of the SVM running over the optimal choice of feature vectors. Meanwhile, we expect a more refined bound for the batch setting when the side information is not vacuous.
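For comparison, the per-user SVM baseline discussed above can be sketched with scikit-learn; the feature matrix (one row per item) and the labels of a single user are hypothetical, and LinearSVC's regularized hinge-type objective plays the role of the per-row optimization problem.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
n_items, d = 200, 10
Q = rng.standard_normal((n_items, d))        # hypothetical feature vectors, one per item
w_true = rng.standard_normal(d)
y_user = np.where(Q @ w_true >= 0, 1, -1)    # hypothetical +/-1 labels of one user

# one SVM per user: LinearSVC minimizes a regularized hinge-type objective,
# standing in for the per-row optimization problem written above
clf = LinearSVC(C=1.0)
clf.fit(Q, y_user)
print("training accuracy of the user-specific classifier:", clf.score(Q, y_user))
```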
Moreover, a well-known generalization bound for linear classifiers (see, for example, [
16]) gives
for every
, where
is the conditional distribution of
given that the first component is
i, and
is the number of
that satisfies
. Assume for simplicity that
and
for every
i. Then,
Minimizing
over all
and
such that
, the bound obtained is very similar to our bound (
41). This observation implies that our hypothesis has generalization ability competitive with the optimal linear classifiers
over the optimal feature vectors
.