One of the more important inequalities we will use throughout
information theory is Jensen's inequality. Before introducing it,
you need to know about convex and concave functions.
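To understand the definition, recall that
$\lambda x_1 + (1-\lambda) x_2$, for $0 \le \lambda \le 1$,
is simply a line segment connecting $x_1$ and $x_2$ (in the $x$
direction) and
$\lambda f(x_1) + (1-\lambda) f(x_2)$
is a line segment
connecting $f(x_1)$ and $f(x_2)$. Pictorially, the function is convex
if the function lies on or below the straight line segment connecting
two points, for any two points in the interval; that is,
$f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$.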
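The chord picture can be tested numerically. The following is only a
minimal sketch: the helper name lies_below_chord, the grid of
$\lambda$ values, and the tolerance are my own choices for
illustration, and a grid check can suggest convexity but not prove it.

```python
import math

def lies_below_chord(f, x1, x2, steps=100, tol=1e-12):
    """Check f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2) on a grid of l in [0, 1]."""
    for i in range(steps + 1):
        lam = i / steps
        point = lam * x1 + (1 - lam) * x2          # point on the x-axis segment
        chord = lam * f(x1) + (1 - lam) * f(x2)    # height of the chord there
        if f(point) > chord + tol:
            return False
    return True

# x^2 is convex, so it stays below its chord.
square_ok = lies_below_chord(lambda x: x * x, -1.0, 3.0)

# log is concave, so it rises above its chord and the check fails.
log_ok = lies_below_chord(math.log, 0.5, 4.0)

print(square_ok, log_ok)
```

The same helper applied to $-f$ gives a check for concavity.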
You will need to keep reminding me of which is which, since when I
learned this, the nomenclature was ``convex $\cup$'' and
``convex $\cap$''.
One reason why we are interested in convex functions is that
over the interval of convexity any local minimum is also a global
minimum (and a strictly convex function has at most one minimizer).
This can strengthen many of the results we might want.
We now introduce Jensen's inequality: if $f$ is a convex function and
$X$ is a random variable, then $E[f(X)] \ge f(E[X])$.
The theorem allows us (more or less) to pull a function outside of a
summation in some circumstances.
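As a quick numerical check of Jensen's inequality in its usual form,
$E[f(X)] \ge f(E[X])$ for convex $f$ (with the inequality reversed for
concave $f$), here is a small sketch. The distribution and the helper
name expectation are made up for illustration.

```python
import math

def expectation(values, probs, f=lambda x: x):
    """E[f(X)] for a discrete random variable X."""
    return sum(p * f(x) for x, p in zip(values, probs))

# An arbitrary example distribution (an assumption, not from the notes).
values = [1.0, 2.0, 4.0, 8.0]
probs = [0.1, 0.2, 0.3, 0.4]

mean = expectation(values, probs)

# Convex case, f(x) = x^2: E[X^2] >= (E[X])^2. The gap is the variance.
mean_of_square = expectation(values, probs, lambda x: x * x)
jensen_gap = mean_of_square - mean ** 2

# Concave case, f(x) = log2(x): E[log2 X] <= log2(E[X]).
mean_of_log = expectation(values, probs, math.log2)
log_of_mean = math.log2(mean)

print(jensen_gap, mean_of_log, log_of_mean)
```

For $f(x) = x^2$ the gap is exactly the variance of $X$, which is one
way to remember which direction the inequality goes.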
There is another inequality that got considerable use (in many of the
same ways as Jensen's inequality) way back in the dark ages when I
took information theory. I may refer to it simply as the information
inequality: $\ln x \le x - 1$, with equality if and only if $x = 1$.
The right-hand side is the line tangent to $\ln x$ at $x = 1$.
This can also be generalized by taking the tangent line at different
points along the function.
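A sketch of the information inequality, taken here in the form
$\ln x \le x - 1$ with the natural logarithm, and of its
generalization to a tangent line at another point $x_0$,
$\ln x \le \ln x_0 + (x - x_0)/x_0$. The helper name tangent_bound
and the grid of test points are assumptions made for illustration.

```python
import math

def tangent_bound(x, x0=1.0):
    """Tangent line to ln at x0, evaluated at x; for x0 = 1 this is x - 1."""
    return math.log(x0) + (x - x0) / x0

# Test points 0.01, 0.02, ..., 5.00.
xs = [0.01 * k for k in range(1, 501)]

# ln is concave, so it lies on or below every one of its tangent lines.
holds_at_one = all(math.log(x) <= tangent_bound(x) + 1e-12 for x in xs)
holds_at_two = all(math.log(x) <= tangent_bound(x, 2.0) + 1e-12 for x in xs)

# Equality holds at the point of tangency.
gap_at_one = tangent_bound(1.0) - math.log(1.0)

print(holds_at_one, holds_at_two, gap_at_one)
```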
With these simple inequalities we can now prove some facts about some
of the information measures we have defined so far.
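Let $\mathcal{X}$
be the set of values that the random variable $X$ takes on
and let $|\mathcal{X}|$
denote the number of elements in the set. For
discrete random variables, the uniform distribution over the
range $\mathcal{X}$
has the maximum entropy: $H(X) \le \log |\mathcal{X}|$, with equality
if and only if $X$ is uniformly distributed over $\mathcal{X}$.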
Note how easily this optimizing value drops in our lap by means of an
inequality. There is an important principle of engineering design
here: if you can show that some performance criterion is
upper-bounded by some function, then show how to achieve that upper
bound, you have got an optimum design. No calculus required!
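The bound-then-achieve argument for the uniform distribution can be
seen in a few lines. This is a sketch only: the alphabet size and the
skewed distribution are arbitrary choices, and entropy is measured in
bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4                                # size of the alphabet (an assumption)
upper_bound = math.log2(n)           # the bound log |X|

uniform = [1.0 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]        # an arbitrary non-uniform example

h_uniform = entropy(uniform)         # achieves the bound
h_skewed = entropy(skewed)           # falls strictly below it

print(upper_bound, h_uniform, h_skewed)
```

Any other distribution on four symbols can be substituted for skewed;
its entropy should never exceed upper_bound.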
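The more we know, the less uncertainty there is: conditioning reduces
entropy, $H(X \mid Y) \le H(X)$, with equality if and only if $X$ and
$Y$ are independent.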
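To see that knowing $Y$ cannot increase the uncertainty about $X$ on
average, that is, $H(X \mid Y) \le H(X)$, here is a small sketch. The
joint distribution is invented for illustration, with rows indexed by
$x$ and columns by $y$.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y): rows are x, columns are y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

num_x = len(joint)
num_y = len(joint[0])

# Marginals p(x) and p(y).
p_x = [sum(row) for row in joint]
p_y = [sum(joint[i][j] for i in range(num_x)) for j in range(num_y)]

h_x = entropy(p_x)

# H(X|Y) = sum over y of p(y) * H(X | Y = y).
h_x_given_y = sum(
    p_y[j] * entropy([joint[i][j] / p_y[j] for i in range(num_x)])
    for j in range(num_y)
)

print(h_x, h_x_given_y)
```

Replacing joint with a product distribution, such as all entries equal
to 0.25, should make the two quantities coincide.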