{"title": "Relative Loss Bounds for Multidimensional Regression Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 287, "page_last": 293, "abstract": null, "full_text": "Relative Loss Bounds for Multidimensional \n\nRegression Problems \n\nJyrki Kivinen \n\nDepartment of Computer Science \nP.O. Box 26 (Teollisuuskatu 23) \n\nManfred K. Warmuth \n\nDepartment of Computer Science \nUniversity of California, Santa Cruz \n\nFIN-00014 University of Helsinki, Finland \n\nSanta Cruz, CA 95064, USA \n\nAbstract \n\nWe study on-line generalized linear regression with multidimensional \noutputs, i.e., neural networks with multiple output nodes but no hidden \nnodes. We allow at the final layer transfer functions such as the soft(cid:173)\nmax function that need to consider the linear activations to all the output \nneurons. We use distance functions of a certain kind in two completely \nindependent roles in deriving and analyzing on-line learning algorithms \nfor such tasks. We use one distance function to define a matching loss \nfunction for the (possibly multidimensional) transfer function, which al(cid:173)\nlows us to generalize earlier results from one-dimensional to multidimen(cid:173)\nsional outputs. We use another distance function as a tool for measuring \nprogress made by the on-line updates. This shows how previously stud(cid:173)\nied algorithms such as gradient descent and exponentiated gradient fit \ninto a common framework. We evaluate the performance of the algo(cid:173)\nrithms using relative loss bounds that compare the loss of the on-line \nalgoritm to the best off-line predictor from the relevant model class, thus \ncompletely eliminating probabilistic assumptions about the data. \n\n1 \n\nINTRODUCTION \n\nIn a regression problem, we have a sequence of n-dimensional real valued inputs Zt E R n , \nt = 1, ... ,f, and for each input Zt a k-dimensional real-valued desired output Yt E R\". \nOur goal is to find a mapping that at least approximately models the dependency between \nZt and Yt. Here we consider the parametric case Yt = f (w; Zt) where the actual output Yt \ncorresponding to the input Zt is determined by a parameter vector w E Rm (e.g., weights \nin a neural network) through a given fixed model f (e.g., a neural network architecture). \n\n\f288 \n\n1. Kivinen and M. K Wannuth \n\nThus, we wish to obtain parameters w such that, in some sense, I(w;:z:t} ~ Yt for all \nt. The most basic model 1 to consider is the linear one, which in the one-dimensional \ncase k = 1 means that I(w;:z:t) = w . :Z:t for w E Rfl. In the multidimensional case \nwe actually have a whole matrix 0 E Rkxfl of parameters and 1(0;:z:t} = O:Z:t. The \ngoodness of the fit is quantitatively measured in terms of a loss function; the square loss \ngiven by Lt,j (Yt,j - ilt,j)2 /2 is a popular choice. \nIn generalized linear regression [MN89] we fix a transfer function 4> and apply it on top of a \nlinear model. Thus, in the one-dimensional case we would have I(w;:z:t) = \u00a2(w\u00b7:z:t). Here \n\u00a2 is usually a continuous increasing function from R to R, such as the logistic function \nthat maps z to 1/(1 + e- Z ). It is still possible to use the square loss, but this can lead to \nproblems. In particular, when we apply the logistic transfer function and try to find a weight \nvector w that minimizes the total square loss over f examples (:Z:t, Yt), we may have up to \n\u00a3fl local minima [AHW95, Bud93]. Hence, some other choice of loss function might be \nmore convenient. In the one-dimensional case it can be shown that any continuous strictly \nincreasing transfer function \u00a2 has a specific matching loss function LtP such that, among \nother useful properties, Lt LtP(Yt, \u00a2(w . :z:t}) is always convex in w, so local minima are \nnot a problem [AHW95]. For example, the matching loss function for the logistic transfer \nfunction is the relative entropy (a generalization of the logarithmic loss for continuous(cid:173)\nvalued outcomes). The square loss is the matching loss function for the identity transfer \nfunction (i.e., linear regression). \n\nThe main theme of the present paper is the application of a particular kind of distance func(cid:173)\ntions to analyzing learning algorithms in (possibly multidimensional) generalized linear \nregression problems. We consider a particular manner in which a mapping 4>: Rk -+ Rk \ncan be used to define a distance function D.4> : Rk x Rk -+ R; the assumption we must \nmake here is that 4> has a convex potential function. The matching loss function LtP men(cid:173)\ntioned above for a transfer function \u00a2 in the one-dimensional case is given in terms of \nthe distance function D.tP as LtP(\u00a2(a), \u00a2(ii)) = D.tP(ii, a). Here, as whenever we use the \nmatching loss LtP (y, iI), we assume that Y and iI are in the range of \u00a2, so we can write \nY = \u00a2(a) and iI = \u00a2(ii) for some a and ii. Notice that for k = 1, any strictly increasing \ncontinuous function has a convex potential (i.e., integral) function. In the more interesting \ncase k > 1, we can consider transfer functions such as the softmax function, which is com(cid:173)\nmonly used to transfer arbitrary vectors a E Rk into probability vectors y (i.e., vectors \nsuch that iii ~ 0 for all i and Li iii = 1). The matching loss function for the softmax func(cid:173)\ntion defined analogously with the one-dimensional case turns out to be the relative entropy \n(or Kul1back-Leibler divergence), which indeed is a commonly used measure of distance \nbetween probability vectors. For the identity transfer function, the matching loss function \nis the squared Euclidean distance. \n\nThe first result we get from this observation connecting matching losses to a general notion \nof distance is that certain previous results on generalized linear regression with matching \nloss on one-dimensional outputs [HKW95] directly generalize to multidimensional out(cid:173)\nputs. From a more general point of view, a much more interesting feature of these distance \nfunctions is how they allow us to view certain previously known learning algorithms, and \nintroduce new ones, in a simple unified framework. To briefly explain this framework with(cid:173)\nout unnecessary complications, we restrict the foUowing discussion to the case k = 1, i.e., \nf(w;:z:) = \u00a2(w . :z:) E R with w E Rfl. \n\nWe consider on-line learning algorithms, by which we here mean an algorithm that pro(cid:173)\ncesses the training examples one by one, the pair (:Z:t, Yt) being processed at time t. Based \n\n\fRelative Loss Bounds for Multidimensional Regression Problems \n\n289 \n\non the training examples the algorithm produces a whole sequence of weight vectors Wt, \nt = 1, ... ,f.. At each time t the old weight vector Wt is updated into WtH based on Zt and \nYt. The best-known such algorithm is on-line gradient descent. To see some alternatives, \nconsider first a distance function ll.,p defined on R n by some function ,p: Rn ~ Rn. \n(Thus, we assume that,p has a convex potential.) We represent the update somewhat indi(cid:173)\nrectly by introducing a new parameter vector 6t ERn from which the actual weights Wt \nare obtained by the mapping Wt = ,p{6t ). The new parameters are updated by \n\n(1) \n\nwhere TJ > 0 is a learning rate. We call this algorithm the general additive algorithm with \nparameterization function ,p. Notice that here 6 is updated by the gradient with respect \nto w, so this is not just a gradient descent with reparameterization [JW98]. However, we \nobtain the usual on-line gradient descent when ,p is the identity function. When,p is \nthe softmax function, we get the so-called exponentiated gradient (EG) algorithm [KW97 , \nHKW95]. \n\nThe connection of the distance function ll.,p to the update (1) is two-fold. First, (1) can \nbe motivated as an approximate solution to a minimization problem in which the distance \nll.,p (6t , 6tH ) is used as a kind of penalty term to prevent too drastic an update based on \na single example. Second, the distance function ll.,p can be used as a potential function \nin analyzing the performance of the resulting algorithm. The same distance functions have \nbeen used previously for exactly the same purposes [KW97, HKW95] in important special \ncases (the gradient descent and EG algorithms) but without realizing the full generality of \nthe method. \n\nIt should be noted that the choice of the parameterization function ,p is left completely \nfree, as long as ,p has a convex potential function. (In contrast, the choice of the transfer \nfunction \u00a2 depends on what kind of a regression problem we wish to solve.) Earlier work \nsuggests that the softmax parameterization function (Le., the EG algorithm) is particularly \nsuited for situations in which some sparse weight vector W gives a good match to the \ndata [HKW95, KW97]. (Because softmax normalizes the weight vector and makes the \ncomponents positive, a simple transformation of the input data is typically added to realize \npositive and negative weights with arbitrary norm.) \n\nIn work parallel to this, the analogue of the general additive update (1) in the context of \nlinear classification, i.e., with a threshold transfer function, has recently been developed \nand analyzed by Grove et al. [GLS97] with methods and results very similar to ours. Cesa(cid:173)\nBianchi [CB97] has used somewhat different methods to obtain bounds also in cases in \nwhich the loss function does not match the transfer function. Jagota and Warmuth [JW98] \nview (1) as an Euler discretization of a system of partial differential equations and investi(cid:173)\ngate the performance of the algorithm as the discretization parameter approaches zero. \n\nThe distance functions we use here have previously been applied in the context of expo(cid:173)\nnential families by Amari [Ama85] and others. Here we only need some basic technical \nproperties of the distance functions that can easily be derived from the definitions. For a \ndiscussion of our line of work in a statistical context see Azoury and Warmuth [AW97]. \n\nIn Section 2 we review the definition of a matching loss function and give examples. Sec(cid:173)\ntion 3 discusses the general additive algorithm in more detail. The actual relative on-line \nloss bounds we have for the general additive algorithm are explained in Section 4. \n\n\f290 \n\nJ. Kivinen and M. K. Warmuth \n\n2 DISTANCE FUNCTIONS AND MATCIllNG LOSSES \n\nLet 4>: R k -t R k be a function that has a convex potential function P 4> (i.e., 4> = V' P 4> \nfor some convex P 4>: Rk -4 R). We first define a distance/unction A4> for 4> by \n\n(2) \n\nThus, the distance A4>(a, a) is the error we make if we approximate P 4>(a) by its first(cid:173)\norder Taylor polynomial around a. Convexity of P 4> implies that A4> is convex in its first \nargument. Further, A4>(a, a) is nonnegative, and zero if and only if 4>(a) = 4>( a). \nWe can alternatively write (2) as A4>(a, a) = I: (4)(r) - 4>(a)) . dr where the integral is \na path integral the value of which must be independent of the actual path chosen between \na and a. In the one-dimensional case, the integral is a simple definite integral, and \u00a2 \nhas a convex potential (i.e., integral) function if it is strictly increasing and continuous \n[AHW95, HKW95]. \nLet now 4> have range V 4> ~ Rk and distance function A4>' Assuming that there is a \nfunction L4>: V4> x V4> -4 R such that L4>(4)(a) , 4>(a\u00bb = A4>(a, a) holds for all a and \na, we say that L4> is the matching loss function for 4>. \n\n~ \n\nExample 1 Let 4> be a linear function given by 4>(a) = Aa where A E R kxk is symmetri(cid:173)\ncal and positive definite. Then 4> has the convex potential function P 4> (a) = aT Aa /2, and \n(2) gives A4>(a, a) = Ha - a)T A(a - a). Hence, L4>(Y' y) = t(y - y)T A-l(y - y) \nforally,YERk. \n0 \n\nExample2 Let 0': Rk -4 Rk, O'i(a) = exp(a;)/E7=1 exp(aj), be the softmax function. \nIt has a potential function given by PO'(a) = In E7=1 exp(aj). To see that PO' is convex, \nnotice that the Hessian n2PO' is given by D2PO'(a);j = dijO'i(a) - O'da)O'j(a). Given \na vector z E Rk, let now X be a random variable that has probability O';{a) of taking \nthe value Xi\u00b7 We have zTDO'(a)z = E7=1 O'i{a)xl- E7=1 E7=1 0'; (a)xiO'j(a)xj = \n(EX)2 = VarX ~ O. Straightforward algebra now gives the relative entropy \nEX 2 -\nLO'(y, y) = E;=l Yj In(Yj/Yj) as the matching loss function. (To allow Yj = 0 or Yj = 0, \nwe adopt the standard convention that OlnO = Oln(O/O) = 0 and yln(y/O) = 00 for \ny> 0.) \n\n0 \n\nIn the relative loss bound proofs we use the basic property [JW98, Ama85] \n\nThis shows that our distances do not satisfy the triangle inequality. Usually they are not \nsymmetrical, either. \n\n3 THE GENERAL ADDITIVE ALGORITHM \n\nWe consider on-line learning algorithms that at time t -first receive an input Zt E R n , \nthen produce an output Yt E R k, and finally receive as feedback the desired output Yt E \nRk. To define the general additive algorithm. assume we are given a transfer function \n\n\fRelative Loss Bounds for Multidimensional Regression Problems \n\n291 \n\nl/J: Rk ~ Rk that has a convex potential function. (We wi11later use the matching loss as \na performance measure.) We also require that all the desired outputs Y t are in the range \nof l/J. The algorithm's predictions are now given by Yt = l/J(Ot:et) where Ot E Rkxn is \nthe algorithm's weight matrix at time t. To see how the weight matrix is updated, assume \nfurther we have a parameterization function ..p: R n ~ R n with a distance D....p. The \nalgorithm maintains kn real-valued parameters. We denote by 8 t the k x n matrix of the \nvalues ofthese parameters immediately before trial t. Futher, we denote by 8 t ,i the jth row \nof 8t. and by ..p(8t} the matrix with ..p(8t,i) as its jth row. Given initial parameter values \n8 1 and a learning rate 1] > 0, we now define the general additive (GA) algorithm as the \nalgorithm that repeats at each trial t the following prediction and update steps. \n\nPrediction: Upon recieving the instance :et, give the prediction Yt = l/J(..p(8t):et). \nUpdate: For j := 1, ... , k, set 8t+l,i = 8t,i - fJ(yt,i - Yt ,i ):et . \n\nNote that (2) implies \\7aD..l/J(a, a)) = l/J(a) -l/J(a), so this update indeed turns out to be \nthe same as (1) when we recall that Ll/J(Yt, Yt) = D..l/J(Ot:et, at} where Yt = l/J(at). \nThe update can be motivated by an optimization problem given in terms of t~e loss and \ndistance. Consider updating an old parameter matrix 8 into a new matrix 8 based on \na single input :e and desired output y. A natural goal would be to minimize the loss \nL l/J (y, l/J( ,p (8):e ) ). However, the algorithm must avoid losing too much of the information \nit has gained during the previous trials and stored in the form of the old parameter matrix 8 . \nWe thus set as the algorithm's goal to minimize the sum D..\"p(8, 8) + fJLl/J(Y' l/J(,p(8):e)) \nwhere fJ > 0 is a parameter regulating how fast the algorithm is willing to move its pa(cid:173)\nrameters. Under certain regularity assumptions, the update rule of the GA algorithm can \nbe shown to approximately solve this minimization problem. For more discussion and ex(cid:173)\namples in the special case of linear regression, see [KW97]. An interesting related idea is \nusing all the previous examples in the update instead of just the last one. For work along \nthese lines in the linear case see Vovk [Vov97] and Foster [Fos91]. \n\n4 RELATIVE LOSS BOUNDS \n\n:= \n\n((:el,yd, . .. ,(:el,Yl)) of training examples, and let \nConsider a sequence S \nLossl/J(GA, S) = 2:!=1 Ll/J(Yt, Yt) be the loss incurred by the general additive algorithm \non this sequence when it always uses its current weights Ot for making the tth prediction \nYt\u00b7 Similarly, let Lossl/J(O, S) = 2:!=1 Ll/J(Yt, l/J(O:ed) be the loss of a fixed predictor \nO. Basically, our goal is to show that if some 0 achieves a small loss, then the algorithm is \nnot doing much worse, regardless of how the sequence S was generated. Making additional \nprobabilistic assumptions allows such on-line loss bounds to be converted into more tradi(cid:173)\ntional results about generalization errors [KW97]. To give the bounds for Lossl/J(GA, S) \nin terms of Lossl/J(O, S) we need some additional parameters. The first one is the distance \nD....p(8 1,8) where 0 = \"p(8) and 8 1 is the initial parameter matrix of the GA algorithm \n(which can be arbitrary). The second one is defined by \n\nbx,,,p = sup {:eT n..p(9):e 19 E Rn,:e EX} \n\nwhere X := {:el, ... ,:el} is the set of input vectors and n..p(9) is the Jacobian with \n(D,p(9))ij = 81Pi(9)/80j . The value bx,,p can be interpreted as the maximum norm of \n\n\f292 \n\nJ. Kivinen and M. K. Wannuth \n\nany input vector in a norm defined by the parameterization function ..p. In Example 3 below \nwe show how b x,..p can easily be evaluated when 1/J is a linear function or the softmax \nfunction. The third parameter ctP' defined as \n\nrelates the matching loss function for the transfer function tP to the square loss. In Ex(cid:173)\nample 4 we evaluate this constant for linear functions, the softmax function, and the one(cid:173)\ndimensional case. \n\nExample 3 Consider bounding the value :I: TDO'( 0):1: where 0' is the softmax func(cid:173)\ntion. As we saw in Example 2, this value is a variance of a random variable with the \nrange {Xl, ... ,Xn }. Hence, we have bx,O' ~ maXzex(maxixi - minixd 2/4 ~ \nmaXzex 11:l:11~ where 11:1:1100 = maXi IXil\u00b7 \nIf 1/J is a linear function with 1/J( 8) = A8 for a symmetrical positive definite A, we clearly \nhave bx,..p ~ Amax max:l:ex :1: 2 where Amax is the largest eigenvalue of A. \n0 \n\nExample 4 For the softmax function 0' the matching loss function LO' is the relative en(cid:173)\ntropy (see Example 2), for which it is well known that LO'(y, y) 2: 2(y - y)2. Hence, we \nhave ctP ~ 1/4. \nIf tP is a linear function given by a symmetrical positive semidefinite matrix A, we see from \nExample 1 that C tP is the largest eigenvalue of A. \nFinally, in the special case k = 1, with \u00a2: R -7 R differentiable and strictly increasing, we \ncan show ctP :::; Z if Z is a bound such that 0 < \u00a2'(z) :::; Z holds for all z. \n0 \n\nAssume now we are given constants b 2: bx ,1/J and C 2: ctP. Our first loss bound states that \nfor any parameter matrix 8 we have \n\nwhen the learning rate is chosen as '1 = 1/(2bc). (Proofs are omitted from this extended \nabstract.) The advantage of this bound is that with a fixed learning rate it holds for any \n8, so we need no advance knowledge about a good 8. The drawback is the factor 2 in \nfront of LosstP (..p (8), S), which suggests that asymptotically the algorithm might not ever \nachieve the performance of the best fixed predictor. A tighter bound can be achieved by \niven constants K 2: 0 and R > 0, if we choose the learning \nmore careful tuning. Thus, \nrate as '1 = ( (bcR)2 + KbcR - bcR)/(Kbc) (with '1 = 1/(2bc) if K = 0) we obtain \n\nfor any 8 that satisfies Loss tP ( ..p (8) , S) :::; K and d..p (8 1 , 8) :::; R. This shows that if \nwe restrict our comparison to parameter matrices within a given distance R of the initial \nmatrix of the algorithm, and we have a reasonably good guess K as to the loss of the best \nfixed predictor within this distance, this knowledge allows the algorithm to asymptotically \nmatch the performance of this best fixed predictor. \n\n\fRelative Loss Bounds for Multidimensional Regression Problems \n\n293 \n\nAcknowledgments \n\nThe authors thank Katy Azoury, Chris Bishop, Nicolo Cesa-Bianchi, David Helmbold, and \nNick Littlestone for helpful discussions. Jyrki Kivinen was supported by the Academy of \nFinland and the ESPRIT project NeuroCOLT. Manfred Warmuth was supported by the NSF \ngrant CCR 9700201. \n\nReferences \n\n[Ama85] S. Amari. Differential Geometrical Methods in Statistics. Springer Verlag, \n\nBerlin, 1985. \n\n[AHW95] P. Auer, M. Herbster, and M. K. Warmuth. Exponentially many local minima \nfor single neurons. In Proc. 1995 Neural Information Processing Conference, \npages 316-317. MIT Press, Cambridge, MA, November 1995. \n\n[AW97] K. Azoury and M. K. Warmuth. Relative loss bounds and the exponential family \n\nof distributions. Unpublished manuscript, 1997. \n\n[Bud93] M. Budinich. Some notes on perceptron learning. 1. Phys. A.: Math. Gen., \n\n26:4237-4247, 1993. \n\n[CB97] \n\nN. Cesa-Bianchi. Analysis of two gradient-based algorithms for on-line regres(cid:173)\nsion. In Proc. 10th Annu. Con! on Comput. Learning Theory, pages 163-170. \nACM,1997. \n\n[Fos91] D. P. Foster. Prediction in the worst case. The Annals of Statistics, 19(2):1084-\n\n1090, 1991. \n\n[GLS97] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results \nfor linear discriminant updates. In Proc. 10th Annu. Con! on Comput. Learning \nTheory, pages 171-183. ACM, 1997. \n\n[HKW95] D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for \nsigmoided linear neurons. In Proc. Neural Information Processing Systems \n1995, pages 309-315. MIT Press, Cambridge, MA, November 1995. \n\n[JW98] \n\n[KW97] \n\nA. K. Jagota and M. K. Warmuth. Continuous versus discrete-time nonlinear \ngradient descent: Relative loss bounds and convergence. Presented at Fifth Sym(cid:173)\nposium on Artificial Intelligence and Mathematics, Ft. Lauderdale, FL, 1998. \n\nJ. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient up(cid:173)\ndates for linear prediction. Information and Computation, 132( 1): 1-64, January \n1997. \n\n[MN89] \n\nP. McCullagh and J. A. NeIder. Generalized Linear Models. Chapman & Hall, \nNew York, 1989. \n\n[Vov97] V. Vovk. Competitive on-line linear regression. In Proc. Neural Information \n\nProcessing Systems 1997. MIT Press, Cambridge, MA, 1998. \n\n\f", "award": [], "sourceid": 1478, "authors": [{"given_name": "Jyrki", "family_name": "Kivinen", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}