The second in a sequence of introductory articles
on information and probability distributions
beta version <g> of 98-06-10
by Marijke van Gans
< Information: the concept | here | Information in continuous distributions >
The intuitive considerations in the preceding article led us to put a value on the amount of information contained in the knowledge that a system (for example the next symbol in a message) turns out to be in one particular state out of a number of possible states or outcomes (in our example, which symbol it is). If there was a probability p for this particular outcome, then the amount of information associated with knowing that we got this outcome is
-log p
The examples in the preceding article were all about probabilities that were 1/powers-of-two, and the information was recorded in binary digits. The question arises how to encode in bits when the probabilities are not 1/powers-of-two. This is a practical problem, and solutions exist (i will write about Huffman coding and arithmetic encoding on another occasion).
But the theoretical significance of quantifying information goes beyond situations where we want to express things in binary, and the values that arise need not be whole numbers. Shannon [1] shows how the logarithm used here is the only expression (under some very general continuity assumptions) that will be additive, and increasing for increasing information content.
In physics we often don't use bits anyway, but dimensionless units (this amounts to a simple units conversion factor, analogous to inches versus cm say, one bit being 0.693... of these units). For this reason i'm leaving the logs unspecified in the rest of this article: interpreting them as ²logs (base-2 logarithms) will express amounts of information in bits, while using natural logs will express amounts of information in the theoretician's dimensionless units (nats).
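To make the units conversion concrete, here is a minimal Python sketch (the example value p = 0.25 is an illustrative assumption, not taken from the article): it computes the same amount of information once with ²log and once with natural log, and the ratio between the two comes out as 0.693..., i.e. ln 2.

    from math import log, log2

    p = 0.25                  # probability of some outcome (illustrative value)

    bits = -log2(p)           # information using ²log: 2 bits
    nats = -log(p)            # the same information using natural log: about 1.386
    print(bits, nats, nats / bits)   # the ratio nats/bits is 0.693..., i.e. ln 2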
We can also ask ourselves, before we know an outcome, how much information would be contained in knowing it. If (say) the next symbol in a message has probability p₀ of being such-and-such symbol, probability p₁ of being a certain other symbol, and in general probability pᵢ for the i-th possible outcome (where of course the pᵢ are such that Σᵢ pᵢ = 1), then we have a probability p₀ of gaining -log p₀ amount of information, a probability p₁ of gaining -log p₁ amount, and so on. The usual rule for calculating expectation values for any quantity tells us that in this case a symbol will give us on average

-Σᵢ pᵢ log pᵢ
amount of information. Constructs along these lines, sums or integrals of a probability times the log of that probability, abound in information theory, and the above is the reason why. Incidentally, it is exactly this expression we evaluated in the solution to the second coding problem to give us 1.75 bits per character on average there.
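As a small check of that expression, here is a Python sketch. It assumes four symbols with probabilities 1/2, 1/4, 1/8 and 1/8 (the exact distribution is an assumption made here for illustration, chosen to reproduce the 1.75 bits per symbol quoted above).

    from math import log2

    # Assumed distribution: four symbols with probabilities 1/2, 1/4, 1/8, 1/8,
    # chosen to reproduce the 1.75 bits/symbol of the second coding problem.
    probs = [1/2, 1/4, 1/8, 1/8]

    average_information = -sum(p * log2(p) for p in probs)
    print(average_information)   # 1.75 (bits per symbol, since we used ²log)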
Now if a message stream is ergodic, i.e. the probability distribution for the various outcomes for each symbol is the same throughout the message, then this expression for the average information per symbol, multiplied by the number of symbols in the message, gives us immediately the message's information content.
The alternative to an ergodic stream is one where the probabilities for the various outcomes depend on the position of the symbol in the message, and/or on the content of that portion of the message we know already (in other words, correlations between the symbols: certain combinations of symbols being more or less likely than the simple product of the probabilities for the symbols separately would suggest).
The problems near the end of the previous article featured a (rather artificial example of a) language that could be encoded into bits in different ways, one of which takes only 1.75 bits on average for every 2 bits in the simpler encoding that uses a fixed number of bits per symbol. This is an example of redundancy. Redundancy happens when you use a representation of your message that takes more bits than necessary. You can't help feeling that the language from our second problem example "really" contained only 1.75 N bits of information per N characters, and that coding it using 2 N bits is just plain wasteful.
There is a real-life counterpart of our artificial example: for practical reasons ASCII text uses a fixed number of bits per character, even though the various characters have vastly different frequencies (in any language).
We could define redundancy as the difference between the "real" information content and the amount of information needed by the message encoded in its present form. But what is the "real" information content? In our example, it might even be less than 1.75 N bits if we can find a way to squeeze further.
With natural text this is certainly the case: taking character frequency into account can reduce the number of bits needed for a piece of English to almost half of its 8-bit character representation ...but taking frequency of character combinations and even word combinations into account as well can bring it down to a third of its original size.
Computers have made us familiar with data compression, zip files and the like. Good compression software can compact 8-bit text to typically one third of its regular size; bitmapped images can often be compressed considerably more, and executables somewhat less than text. In practice we find that if two different compression programs each reduce a text to 1/3 of its original size, carrying out both compressions in succession, in either order, does not produce a file 1/9 of the size. Once data is compressed, another compression algorithm cannot usually make a significant dent in it anymore (the same is true for graphics stored in a compressed format such as JPEG, PNG or GIF). Insofar as subsequent compression does make a small improvement, this is just a case of the first compression being less than perfect (and all such software is, by necessity, to keep a reasonable execution speed). It certainly looks like there is, for each file, a certain nonzero incompressible core of real information content, and that once this core is approached no algorithm can squeeze beyond that limit. But is this so, or does it only look that way (for instance, because all compression algorithms employ similar techniques), and could a way be found tomorrow to compress files a hundred times smaller?
Can any message be squeezed as small as we want, or is there an incompressible core of real information content? The whole point of information theory is that there really is such a thing as an amount of real information in a message, and that you cannot squeeze the message any smaller than that; this last point gets at the heart of information theory. A few points are worth going through, though.
One way to see that we cannot arbitrarily diminish file sizes is as follows: suppose we're compressing pieces of English prose, and suppose that, after taking into account that not all jumbled letter combinations make words, not all arrangements of words make sentences, and not all juxtapositions of sentences make a meaningful text, there are still one million possible different messages below a certain message size. We must be able to reconstruct from a compressed file the information that tells us which of the million possible messages was intended. So there must be at least a million different compressed files. Five-bit files, for example, won't cut it, because there are only 32 different ones of them. We must go to twenty bits simply to be able to express a million different things!
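The counting argument amounts to a one-line calculation; the Python sketch below is only an illustration, with the million messages as its example input.

    from math import ceil, log2

    def min_bits(distinct_messages):
        """Smallest n such that 2**n codes can cover this many distinct messages."""
        return ceil(log2(distinct_messages))

    print(min_bits(1_000_000))   # 20: five bits give only 2**5 = 32 codes, twenty give 2**20 = 1048576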
Now while this is basically the right idea, it is a fairly blunt tool. Rather than make this sharp distinction between possible and impossible messages, we should take a more nuanced approach, judging how likely any message is. This is how successful data compression works: out of 2¹⁰⁰ possible 100-bit messages, mostly highly unlikely, a compressor selects (for example) 2³⁰ likely messages that are then encoded in some 30 bits each, leaving the rest to be encoded in slightly more than their uncompressed size. This means that almost all messages we will actually encounter are encoded in far fewer bits than their uncompressed size.
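To see roughly how such a scheme pays off, here is a back-of-the-envelope sketch in Python. All the specific numbers in it (a one-bit flag marking likely versus unlikely messages, and the likely ones carrying 99.9% of the probability) are illustrative assumptions, not taken from any particular compressor.

    # Illustrative two-tier scheme; every number here is an assumption:
    # one flag bit says "likely" or "unlikely", likely messages get a 30-bit code,
    # anything else is stored verbatim (100 bits) after the flag.
    p_likely = 0.999            # assumed total probability of the 2**30 "likely" messages

    expected_bits = p_likely * (1 + 30) + (1 - p_likely) * (1 + 100)
    print(expected_bits)        # about 31.07 bits on average, versus 100 bits uncompressed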
We saw probability enter right at the top of this article, in the definition of information content. But why? Why is it necessary to drag probability into this at all? Recall the solution to the second problem. Crucially, it showed that the encoding used was only an improvement given the distribution of frequencies of the various symbols. If, for example, the frequency had been 25% for each of the symbols, the solution would have been worse than using 2 bits per symbol.
There are actually two issues here, corresponding to the two occurrences of pᵢ in -Σᵢ pᵢ log pᵢ.
On the one hand, the best encoding is one where the number of bits used for each outcome is matched to the probability of that outcome: in our example 1, 2 and 3 bits respectively for symbols with probability 1/2¹, 1/2² and 1/2³ (similar considerations apply when the probabilities are not neat powers of 1/2; arithmetic encoding, for instance, can match any probability distribution exactly).
On the other hand, even after we decide how many bits will be used to encode each particular outcome, this still does not tell us what average number of bits per symbol it amounts to unless we know the probabilities of the outcomes: we cannot calculate a weighted sum unless we know the weights!
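Both issues show up in a short Python sketch like the one below. The distribution 1/2, 1/4, 1/8, 1/8 and the code lengths 1, 2, 3, 3 bits are assumptions made to match the earlier example (a fourth 3-bit symbol is assumed so the probabilities sum to 1), and the flat 25% case shows the same code lengths doing worse than 2 bits per symbol.

    from math import log2

    def average_code_length(probs, lengths):
        """Expected bits per symbol: each code length weighted by that symbol's probability."""
        return sum(p * n for p, n in zip(probs, lengths))

    lengths = [1, 2, 3, 3]      # bits per symbol, matched to probabilities 1/2, 1/4, 1/8, 1/8

    print(average_code_length([1/2, 1/4, 1/8, 1/8], lengths))   # 1.75: lengths match -²log p
    print(average_code_length([1/4, 1/4, 1/4, 1/4], lengths))   # 2.25: worse than a flat 2 bits/symbol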
There are a number of subtle points involved with probability, though. For one thing, probabilities are not carved in stone: the probabilities a receiver expects are ultimately rooted in what they (rightly or wrongly) believe and expect.
It's no surprise of course that the amount of (new) information in a message depends on what the receiver already knows. In fact, if we knew everything we would know exactly what message we were going to get and it would contain zero information! This, BTW, is why a million digits of pi or a prime number table contain no information: you could reconstruct them yourself, given enough time...
In practice, the calculation of amount of information assumes we know certain things (for example, letter frequency in a typical message) and don't know certain other things (such as what the message is going to be).
There are no hard and fast rules about what we know in advance. If you archive emails from me and know that half of them start with "sorry i didn't get back to you earlier", you will be able to express the presence or absence of that phrase in one bit <g>. If you think up a coding scheme for the English language you may or may not want to take letter frequency into account, and if you do, you may or may not want to take combination frequencies into account as well. The point is, amount of information is only defined with reference to certain probabilities. And the probabilities you believe in don't even have to be true, whatever that means <g>. You can still calculate the amount of information with reference to them.
It's time to distinguish between the formal calculation of information, which is with reference to a certain set of probabilities, on the one hand...
...and practical applications, where of course it matters what set of probabilities we're going to believe in. If they are true in some sense, we can get at this elusive "real" information content. One problem is that there is no telling how much a priori information you may have about the messages, with the extreme case being that if you know everything there's no surprise and no information left. OTOH the situation isn't really hopeless: if you really have some prior knowledge of roughly what sort of message you're going to get, you could count this as having already received part of the information. Together with what you didn't know yet about the message, this gives the total information content, now unrelated to how much you knew already.
This idea is applicable to the following question. Doesn't the "meta information" about which symbol frequencies are being assumed count as real information, to be added to the message? Or, to put it another way, how does a compression program know what the symbol frequencies are going to be in a message? The answer is that good compression algorithms don't assume a fixed symbol frequency distribution, but get the frequencies from the message itself. The compressed file then contains a header detailing the specific encoding used (and therefore the frequencies encountered), so this "meta information" is in a very real sense a finite amount of ordinary information. At least one compression algorithm uses binary codes that correspond to the order in which symbols or symbol combinations are encountered in the plaintext. This way the compressed file becomes its own catalog, content and "header" now being inextricably linked as a single bit stream. It shows quite forcefully how the distinction between information (content) and "meta information" (i.e. how the file is compressed) is ultimately a false one. All is bits.
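As a minimal illustration of "getting the frequencies from the message itself" (a sketch of the idea only, not of any particular compression program), one can count the symbols in the message and evaluate the familiar sum with respect to those measured frequencies:

    from collections import Counter
    from math import log2

    def measured_bits_per_symbol(message):
        """Average information per symbol, with respect to frequencies found in the message itself."""
        counts = Counter(message)
        total = len(message)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    # Any string will do; the frequencies come from the data, not from assumptions made in advance.
    print(measured_bits_per_symbol("abracadabra"))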
< Information: the concept | here | Information in continuous distributions >
______________To be added:
- examples of redundancy through symbol combination frequency (e.g. Fibonacci bits).
- examples of encoding: Huffman, arithmetic.
- problems/puzzles/exercises.
- graphix.
The classical text in information theory is
[1] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1949.
Sadly out of print, but you will find it in many libraries.
Started May '98, major re-write 98-06-10, by
Marijke van Gans
marijke@maxwellian.demon.co.uk