The training law we have developed up to this point requires
computation of $W^{-T}$. We can avoid this by right-multiplying the
update by $W^T W$. Since $W^{-T} W^T W = W$, the modified update no
longer contains a matrix inverse.
This modification to the gradient, multiplying by $W^T W$, is called the
natural gradient (Amari, 1998). In this section, we examine
this, with an eye to the question: what is natural about it?
Comment on scaling of update formula.
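As a quick numerical check that right-multiplying an update of the form $W^{-T} + M$ by $W^T W$ removes the inverse, the following sketch compares the two forms. The matrix `M` is a hypothetical stand-in for the data-dependent part of the update, which is not reproduced here, and the function names are illustrative only.

```python
import numpy as np

def ordinary_update(W, M):
    """Update direction of the form W^{-T} + M (needs a matrix inverse)."""
    return np.linalg.inv(W).T + M

def natural_update(W, M):
    """The same direction right-multiplied by W^T W, written with no inverse."""
    return (np.eye(W.shape[0]) + M @ W.T) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)  # arbitrary invertible test matrix
M = rng.normal(size=(3, 3))                    # hypothetical data-dependent term

lhs = ordinary_update(W, M) @ W.T @ W          # explicit multiplication by W^T W
rhs = natural_update(W, M)                     # inverse-free form
max_diff = float(np.max(np.abs(lhs - rhs)))
print("largest entrywise difference:", max_diff)
```

The two expressions should agree up to rounding error, which is the practical appeal of the modification: the inverse disappears from the per-step cost.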
We follow Amari (1998) in the following discussion. Suppose
$S = \{\mathbf{w} \in \mathbb{R}^n\}$ is some parameter space (e.g., the space of parameters in
the weighting matrix). Suppose there is some function $L(\mathbf{w})$
defined on it. Consider a parameter value $\mathbf{w}$, and some incremental
change to $\mathbf{w} + d\mathbf{w}$.
If the parameter space is Euclidean,
then the squared length of the increment is
\[ |d\mathbf{w}|^2 = \sum_i (dw_i)^2. \]
However, not all parameter spaces are Euclidean. Consider, for
example, a case where the parameters all lie on a sphere. Then the
appropriate distance measure is not simply the sum of the squares of
the coordinates, especially if $\mathbf{w}$ is measured in spherical
coordinates! So we measure the change differently:
\[ |d\mathbf{w}|^2 = \sum_{i,j} g_{ij}(\mathbf{w})\, dw_i\, dw_j. \]
Here, $g$ is called the Riemannian metric tensor; it describes
the local geometry of the parameter space at the point $\mathbf{w}$.
In terms of vectors, we can write
\[ |d\mathbf{w}|^2 = d\mathbf{w}^T G\, d\mathbf{w}, \]
where $G = (g_{ij})$ (a function of $\mathbf{w}$). $G$ is symmetric.
We see that we are simply
dealing with a weighted distance, induced from a weighted inner
product, defined by
\[ \langle \mathbf{a}, \mathbf{b} \rangle_G = \mathbf{a}^T G\, \mathbf{b}. \]
When $G = I$, we simply get the Euclidean distance.
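The sphere example can be made concrete with a short sketch. It assumes the unit sphere parameterized by polar angle `theta` and azimuth `phi`, for which the metric is $G = \mathrm{diag}(1, \sin^2\theta)$; the function names and the chosen point are illustrative. The weighted length $d\mathbf{w}^T G\, d\mathbf{w}$ is compared against the squared distance actually travelled in three dimensions, and against the naive sum of squared coordinate changes.

```python
import numpy as np

def sphere_metric(theta):
    """Metric tensor G of the unit sphere in (theta, phi) coordinates."""
    return np.array([[1.0, 0.0], [0.0, np.sin(theta) ** 2]])

def squared_length(dw, G):
    """Weighted squared length dw^T G dw of a small increment dw."""
    return float(dw @ G @ dw)

def embed(w):
    """Map (theta, phi) to the corresponding point on the unit sphere in R^3."""
    theta, phi = w
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

w = np.array([0.3, 1.0])       # a point fairly close to the pole
dw = np.array([1e-4, 1e-4])    # small change in both coordinates

naive = squared_length(dw, np.eye(2))                  # assumes G = I
riemann = squared_length(dw, sphere_metric(w[0]))      # uses the metric
chord = float(np.sum((embed(w + dw) - embed(w)) ** 2)) # true squared distance
print(naive, riemann, chord)
```

Near the pole a change in `phi` moves the point very little, so the naive coordinate length overstates the distance, while the metric-weighted length tracks the true one.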
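Now consider the problem of learning by "steepest descent." The
question is: do we really go in the right direction if we take into
account the curvature of the parameter space? We want to decrease
$L(\mathbf{w})$ by moving in a direction $d\mathbf{w}$ to obtain
$L(\mathbf{w} + d\mathbf{w})$,
and do the best possible job with the motion. Let us assume
that we have a fixed step length,
$|d\mathbf{w}|^2 = d\mathbf{w}^T G\, d\mathbf{w} = \epsilon^2$, for some small positive
$\epsilon$.
To first order,
$L(\mathbf{w} + d\mathbf{w}) \approx L(\mathbf{w}) + \nabla L(\mathbf{w})^T d\mathbf{w}$;
minimizing this subject to the step-length constraint (using a Lagrange
multiplier) gives a step proportional to $-G^{-1} \nabla L(\mathbf{w})$.
Observe that the usual "steepest descent" that we deal with always
assumes that $G=I$.
We call
\[ \tilde\nabla L(\mathbf{w}) = G^{-1}(\mathbf{w})\, \nabla L(\mathbf{w}) \]
the natural gradient of $L$ in the Riemannian space. In
Euclidean space, it is the same as the usual gradient.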
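The steepest-direction property can be checked numerically. The sketch below assumes an arbitrary symmetric positive definite metric `G` and an arbitrary gradient vector (both randomly generated, purely for illustration). Among steps of equal length in the metric, $d\mathbf{w}^T G\, d\mathbf{w} = \epsilon^2$, it compares the first-order change in $L$ produced by the natural-gradient direction, the ordinary gradient direction, and a batch of random directions.

```python
import numpy as np

def natural_gradient(grad, G):
    """Return G^{-1} grad by solving G x = grad (no explicit inverse)."""
    return np.linalg.solve(G, grad)

def scale_to_length(d, G, eps):
    """Rescale d so that its length in the metric satisfies d^T G d = eps^2."""
    return d * eps / np.sqrt(d @ G @ d)

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
G = B @ B.T + np.eye(4)     # hypothetical symmetric positive definite metric
grad = rng.normal(size=4)   # hypothetical gradient of L at the current point
eps = 1e-2

nat_step = scale_to_length(-natural_gradient(grad, G), G, eps)
euc_step = scale_to_length(-grad, G, eps)

# First-order change in L for each step: grad^T dw (more negative is better).
nat_change = float(grad @ nat_step)
euc_change = float(grad @ euc_step)
best_random = min(float(grad @ scale_to_length(rng.normal(size=4), G, eps))
                  for _ in range(1000))
print(nat_change, euc_change, best_random)
```

With `G` equal to the identity the two steps coincide, which matches the remark that ordinary steepest descent implicitly assumes a Euclidean parameter space.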
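Now consider the BSS problem in the context of natural gradient. We
first formulate the problem. We have, as before, signal vectors
$\mathbf{s}$ with independent components, so that
$p(\mathbf{s}) = \prod_i p_i(s_i)$, and mixed observations
$\mathbf{x} = A\mathbf{s}$.
The output is $\mathbf{u} = W\mathbf{x}$,
and we update the matrix by some learning rule
\[ W_{t+1} = W_t + \eta_t\, F(\mathbf{x}_t, W_t). \]
Previously, we took the learning update $F$ to be the ordinary gradient,
but this will now change.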
We observe that in order to obtain equilibrium, the function F must
satisfy
\[ E[F(\mathbf{x},W)] = 0 \tag{2} \]
when $W = A^{-1}$ (we stop changing at the correct answer). Now let
$K(W)$ be
an operator that maps a matrix to a matrix, and let
\[ \tilde F(\mathbf{x},W) = K(W)\, F(\mathbf{x},W). \]
Then $\tilde F$ satisfies (2) when $F$ does (same
equilibrium). We want to determine what form the transformation $K$
should take.
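Let $dW$ be a small deviation from a matrix $W$ to $W + dW$. $dW$
constitutes a "vector" starting from the point $W$. Let us define an
inner product at $W$ as
\[ \langle dW, dW \rangle_W = \| dW \|_W^2. \]
(Draw a picture of a curved W surface, and the vector on it.) We
can pull back the point, mapping to another surface, by
right-multiplying by $W^{-1}$. Then $W$ maps to $I$, and $W+dW$ maps
to $I + dX$, where $dX = dW\, W^{-1}$.
A deviation $dW$ at $W$ is equivalent to the deviation $dX$ at $I$ by
this mapping. The key idea is that we want the metric to be invariant
under this mapping: the inner product of $dW$ at $W$ is to be the same
as the inner product of $dW\,Y$ at $WY$ for any invertible $Y$. Thus we impose the
invariance
\[ \langle dW, dW \rangle_W = \langle dW\,Y, dW\,Y \rangle_{WY}. \]
In particular, when $Y=W^{-1}$, we have $WY=I$. We define the inner
product at $I$ by
\[ \langle dX, dX \rangle_I = \sum_{i,j} (dX_{ij})^2 = \mathrm{tr}(dX^T dX), \]
the (unweighted, Euclidean) Frobenius norm. Under our principle of
equivalence (using $dX = dW\, W^{-1}$), we should therefore have
\[ \langle dW, dW \rangle_W = \langle dX, dX \rangle_I = \mathrm{tr}(W^{-T}\, dW^T\, dW\, W^{-1}). \]
It follows that the Riemannian tensor has the form
\[ g_{ij,kl}(W) = \delta_{ik}\, (W^{-1} W^{-T})_{jl}, \]
in the sense that
$\langle dW, dW \rangle_W = \sum_{i,j,k,l} g_{ij,kl}(W)\, dW_{ij}\, dW_{kl}$.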
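The invariance requirement is easy to test numerically. The sketch below assumes the inner product $\langle dA, dB \rangle_W = \mathrm{tr}(W^{-T} dA^T dB\, W^{-1})$ and uses randomly generated matrices `W`, `Y`, and `dW` as illustrative inputs. It checks that the squared length of `dW` at `W` equals that of `dW Y` at `W Y`, and that both equal the Frobenius norm of $dX = dW\, W^{-1}$ at the identity.

```python
import numpy as np

def inner_at(W, dA, dB):
    """Invariant inner product <dA, dB>_W = tr(W^{-T} dA^T dB W^{-1})."""
    Winv = np.linalg.inv(W)
    return float(np.trace(Winv.T @ dA.T @ dB @ Winv))

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)  # arbitrary invertible matrix
Y = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)  # arbitrary invertible right factor
dW = 1e-3 * rng.normal(size=(3, 3))            # small deviation at W

at_W = inner_at(W, dW, dW)                     # length measured at W
at_WY = inner_at(W @ Y, dW @ Y, dW @ Y)        # same deviation moved to W Y
dX = dW @ np.linalg.inv(W)                     # deviation pulled back to I
at_I = float(np.sum(dX ** 2))                  # Frobenius norm at I
euclid = float(np.sum(dW ** 2))                # plain Frobenius norm of dW
print(at_W, at_WY, at_I, euclid)
```

The plain Frobenius norm of `dW` generally differs from the other three values, which is exactly the sense in which the space of matrices is not being treated as Euclidean.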
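We can determine an explicit form for the natural gradient using the
principle of invariance. We interpret the natural gradient
$\tilde\nabla L$ as a vector
applied at $W$, and the ordinary gradient
$\nabla L$ as a vector applied at $I$. Then we
must have
\[ \langle dW, \tilde\nabla L \rangle_W = \langle dW, \nabla L \rangle_I. \]
We thus have (using the definition of the inner product,
$\langle A, B \rangle_W = \mathrm{tr}(W^{-T} A^T B\, W^{-1})$)
\[ \mathrm{tr}(W^{-T}\, dW^T\, \tilde\nabla L\, W^{-1}) = \mathrm{tr}(dW^T\, \nabla L). \]
Using the commuting properties of trace we find
\[ \mathrm{tr}(dW^T\, \tilde\nabla L\, W^{-1} W^{-T}) = \mathrm{tr}(dW^T\, \nabla L). \]
Since this must be true for arbitrary $dW$, we must have
\[ \tilde\nabla L\, W^{-1} W^{-T} = \nabla L, \]
or
\[ \tilde\nabla L = \nabla L\, W^T W. \]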
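The result can be exercised in two ways, sketched below under stated assumptions. The first part checks, for random matrices, that $\nabla L\, W^T W$ satisfies the defining relation $\langle dW, \tilde\nabla L \rangle_W = \mathrm{tr}(dW^T \nabla L)$. The second part runs a toy separation using an update of the form $\Delta W = \eta\,(I - \tanh(\mathbf{u})\mathbf{u}^T)\,W$, a commonly used natural-gradient rule. The `tanh` nonlinearity, the Laplacian sources, the mixing matrix `A`, the step size, and the `crosstalk` measure are all assumptions made for illustration and are not taken from the text above.

```python
import numpy as np

def natural_gradient(grad, W):
    """Natural gradient in matrix space: grad W^T W."""
    return grad @ W.T @ W

def inner_at(W, dA, dB):
    """Invariant inner product <dA, dB>_W = tr(W^{-T} dA^T dB W^{-1})."""
    Winv = np.linalg.inv(W)
    return float(np.trace(Winv.T @ dA.T @ dB @ Winv))

def bss_step(W, X, eta):
    """Batch update W <- W + eta (I - mean[tanh(u) u^T]) W, u = W x (assumed rule)."""
    U = W @ X
    F = np.eye(W.shape[0]) - (np.tanh(U) @ U.T) / X.shape[1]
    return W + eta * (F @ W)

def crosstalk(P):
    """Worst row-wise ratio of second-largest to largest |entry| of P = W A."""
    a = np.sort(np.abs(P), axis=1)
    return float(np.max(a[:, -2] / a[:, -1]))

rng = np.random.default_rng(0)

# Part 1: defining relation of the natural gradient, for arbitrary dW.
W0 = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
grad = rng.normal(size=(3, 3))     # hypothetical ordinary gradient
dW = rng.normal(size=(3, 3))
lhs = inner_at(W0, dW, natural_gradient(grad, W0))
rhs = float(np.trace(dW.T @ grad))

# Part 2: toy separation of two Laplacian sources (illustrative setup).
A = np.array([[1.0, 0.6], [0.4, 1.0]])                     # hypothetical mixing matrix
S = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(2, 5000))  # independent sources
X = A @ S
W = np.eye(2)
before = crosstalk(W @ A)
for _ in range(500):
    W = bss_step(W, X, eta=0.1)
after = crosstalk(W @ A)
print("relation gap:", abs(lhs - rhs), "crosstalk:", before, "->", after)
```

Note that the update in `bss_step` never inverts `W`. The product `W @ A` should approach a scaled permutation matrix, so the crosstalk measure should drop well below its starting value.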
Todd Moon
2000-02-18