Cierra Castillo

Answered

2022-07-04

I was reading about limits of data compression and I had this question that kept me thinking.

Alphabet set: Set of all the unique symbols in the input data

Q. If the distribution over the alphabet set is uniform, is it theoretically possible to compress the data? Standard compression methods (like Huffman codes) exploit bias in the frequency distribution to obtain codes with a lower average length than the input text. Is it possible to compress without such a frequency bias?
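To make the question concrete, here is a minimal sketch (the distributions are made-up examples): Shannon entropy gives the lower bound on average code length, and a uniform distribution maximizes it, leaving Huffman-style codes nothing to exploit.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: the lower bound on average code length per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform over 4 symbols: no code can average below 2 bits/symbol.
print(entropy([0.25] * 4))                  # 2.0
# A biased distribution has lower entropy, leaving room for compression.
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75
```

With the biased distribution, a Huffman code (lengths 1, 2, 3, 3) achieves exactly 1.75 bits/symbol; with the uniform one, 2 bits/symbol is unbeatable.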

Idea 1: Since the distribution over the symbols is uniform, any assignment of codes would require the same number of distinct codewords in the range of the mapping, so we would need the same number of bits to represent that information. Hence the average length of the compressed data should be greater than or equal to the original average length.

Idea 2: Another idea is that if the distribution of symbols is uniform, why not create a new symbol set S' consisting of all bi-grams of symbols? But this distribution must also be uniform, since the probability of each pair of symbols is the same for every possible pair.
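Idea 2 can be checked directly. Assuming the symbols are independent (the hidden assumption the answer below points out), each bi-gram has probability p·p, so the bi-gram distribution stays uniform:

```python
from itertools import product

symbols = ["a", "b", "c", "d"]
p = 1 / len(symbols)  # uniform marginal; we also assume symbols are independent

# Under independence, every bi-gram has the same probability p * p.
bigram_probs = {x + y: p * p for x, y in product(symbols, repeat=2)}
print(len(bigram_probs))           # 16 bi-grams
print(set(bigram_probs.values()))  # {0.0625} -- still uniform
```

So grouping into bi-grams does not create any bias to exploit, exactly as argued.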

These ideas lead me to conclude that it is not possible to compress data if the distribution is uniform.

Please correct me if I am wrong at any point; I would also like to know your views on the problem.

Edit: If you can also provide a formal proof or a reference book, that would be great.

Answer & Explanation

Tucker House

Expert

2022-07-05

Your intuitive arguments are correct, but only under an additional assumption (which, I suspect, you also have in mind but don't explicitly state):

The source generates symbols that are not only uniformly distributed over the alphabet but also independent.

In that case, there is no (statistical) redundancy in the generated symbol sequence and hence no room for compression.

Note that if the source symbols are correlated, then one could consider "super-symbols" (such as those you propose in Idea 2) that would not be uniformly distributed and would therefore be compressible.
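A hypothetical example of such a correlated source: a binary sequence whose marginal distribution is uniform (each symbol is 0 or 1 with probability 1/2), but where each symbol repeats the previous one with probability 0.9. The per-symbol entropy given the previous symbol is then far below 1 bit, which is the structure super-symbols can exploit:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

stay = 0.9  # hypothetical: probability that a symbol repeats the previous one

# Viewed in isolation, each symbol looks uniform: 1 bit of entropy.
print(entropy([0.5, 0.5]))        # 1.0
# Given the previous symbol, the next is highly predictable (~0.469 bits),
# so pairing symbols into super-symbols exposes a non-uniform, compressible
# distribution even though the single-symbol marginal is uniform.
print(entropy([stay, 1 - stay]))  # ~0.469
```

This is why the independence assumption is essential: with correlation, "uniform marginal" no longer implies "incompressible".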