Suppose there is a r.v. with true distribution p. Then (as we will see) we could represent that r.v. with a code that has average length H(p). However, due to incomplete information we do not know p; instead we assume that the distribution of the r.v. is q. Then (as we will see) a code designed for q would need more bits on average to represent the r.v. This difference in the average number of bits is denoted D(p||q). The quantity D(p||q) comes up often enough that it has a name: it is known as the relative entropy.
Note that the relative entropy, the sum over x of
p(x) log (p(x)/q(x)), is not symmetric in its two arguments, and
q (the second argument) appears only in the denominator.
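As a small numerical check of the coding interpretation, the sketch below assumes the usual definition of relative entropy (the sum over x of p(x) log2 (p(x)/q(x)), in bits) and an idealized code whose codeword lengths are -log2 q(x), ignoring rounding to whole bits. The two distributions and all function names are hypothetical, chosen only for illustration.

```python
import math

def entropy(p):
    """Entropy H(p) in bits; outcomes with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def relative_entropy(p, q):
    """Sum over x of p(x) * log2(p(x) / q(x)), in bits.

    Assumes q(x) > 0 wherever p(x) > 0; otherwise the quantity is infinite.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mismatched_code_length(p, q):
    """Average length under the true p of an idealized code with lengths -log2 q(x)."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions, for illustration only.
p = [0.5, 0.5]     # true distribution
q = [0.25, 0.75]   # assumed (wrong) distribution

extra_bits = mismatched_code_length(p, q) - entropy(p)

print("H(p)                        =", entropy(p))
print("average length using q      =", mismatched_code_length(p, q))
print("extra bits                  =", extra_bits)
print("relative entropy of p vs. q =", relative_entropy(p, q))
print("relative entropy of q vs. p =", relative_entropy(q, p))
```

The extra bits match the relative entropy with p as the first argument, and swapping the arguments gives a different value, which illustrates the lack of symmetry.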
Another important concept is that of mutual information: how much information does one random variable tell about another one? In fact, this is perhaps the central idea in much of information theory. When we look at the output of a channel, we see the outcomes of a r.v. What we want to know is what went into the channel; we want to know what was sent, and the only thing we have is what came out. The channel coding theorem (which is one of the high points we are trying to reach in the class) is basically a statement about mutual information.
Note that when X and Y are independent,
p(x,y) = p(x)p(y)
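(definition of independence), so the ratio p(x,y)/(p(x)p(y)) inside
the logarithm equals 1 for every pair (x,y), each term of the sum is
log 1 = 0, and I(X;Y) = 0. This makes sense: if they are independent
random variables then Y can tell us nothing about X.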
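The independence case can be checked numerically. The sketch below assumes the usual definition of mutual information, the sum over (x,y) of p(x,y) log2 (p(x,y)/(p(x)p(y))), in bits. The joint distributions and the function name are hypothetical, chosen only for illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # zero-probability pairs contribute nothing
                total += pxy * math.log2(pxy / (px[i] * py[j]))
    return total

# Independent case: the joint pmf is the product of the marginals.
marg_x = [0.5, 0.5]
marg_y = [0.25, 0.75]
independent = [[a * b for b in marg_y] for a in marg_x]

# Fully dependent case: Y is always equal to X.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

i_indep = mutual_information(independent)
i_dep = mutual_information(dependent)
print("independent:", i_indep)
print("Y = X:      ", i_dep)
```

For the product distribution every log term vanishes, while for Y = X the mutual information equals the full entropy of X (one bit for a fair binary variable).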
An important interpretation of mutual information comes from the following.
Interpretation: The information that Y tells us about X is the
reduction in uncertainty about X due to the knowledge of Y.
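To make the interpretation concrete, the sketch below compares the reduction in uncertainty H(X) - H(X|Y) with the mutual information computed directly from the joint distribution. It assumes the usual definitions of entropy, conditional entropy, and mutual information (all in bits). The example is a hypothetical binary channel that flips its input with probability 0.1 and has equally likely inputs; the names are illustrative only.

```python
import math

def entropy(p):
    """Entropy in bits of a pmf given as a list."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def conditional_entropy_x_given_y(joint):
    """H(X|Y) = sum over y of p(y) * H(X | Y = y), for joint[x][y]."""
    h = 0.0
    for col in zip(*joint):                  # one column per value of y
        py = sum(col)
        if py > 0:
            h += py * entropy([v / py for v in col])
    return h

def mutual_information(joint):
    """I(X;Y) computed directly from the joint pmf and its marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        v * math.log2(v / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, v in enumerate(row)
        if v > 0
    )

# Hypothetical channel: X is uniform on {0, 1}; Y differs from X with probability eps.
eps = 0.1
joint = [[0.5 * (1 - eps), 0.5 * eps],
         [0.5 * eps, 0.5 * (1 - eps)]]

h_x = entropy([sum(row) for row in joint])           # uncertainty about X before seeing Y
h_x_given_y = conditional_entropy_x_given_y(joint)   # uncertainty about X after seeing Y
reduction = h_x - h_x_given_y

print("H(X)          =", h_x)
print("H(X|Y)        =", h_x_given_y)
print("H(X) - H(X|Y) =", reduction)
print("I(X;Y)        =", mutual_information(joint))
```

The last two printed values agree up to floating-point rounding: observing Y removes part, but not all, of the uncertainty about X.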
Observe that by symmetry