Saturday, 25 October 2014

The Order of Data Contains Information, But How Much?

For some sampling designs, especially adaptive ones like those in Thompson and Seber (1996) , the order of the values in the sample matter. Statistics derived from the tuple of observed values (3,10,4,9) may be different than statistics derived from the tuple (9,3,10,4). The order in the round parentheses implies the order in which these values were observed.

In other cases, like simple random sampling, only the observation values and not their order matters. In simple random sampling, the samples {3,10,4,9} and {9,3,10,4} are the same. The curly braces imply set notation, where the order doesn't matter.

The unordered dataset can be produced from the ordered data, but the converse isn't true. There is an implication that ignoring the order of a tuple (ordered set) of numbers is destroying some information about that tuple. The unordered set {3.10,4,9} contains less information than the ordered tuple (3,10,4,9), but this isn't obvious because both are described as four numbers. Other sufficient statistics like the total, a sufficient statistic for the simple mean, 26, are obvious data reductions because they describe something using fewer than all of the values from the set or tuple. However, removing ordering doesn't reduce the dimensionality, so how does it reduce data, and by how much?


Example one - Sample of 8 Binary Responses

We can show that some information is lost, and find a lower bound to how much, but describing an unordered set of numbers in fewer bits than a tuple of those same number could be described. First, consider a data set where the only possible measured values are 0 and 1. The ordered data of n such observations (uncompressed, as we're making no assumptions about the data, yet) requires n bits to store.

Let a tuple of eight binary numbers, such as (0,1,1,0,0,1,0,1) come from a distribution of eight-tuples. There are 2^8 = 256 possible tuples that could come from this distribution. Let's assume the maximum entropy case, where all 256 tuples are equally likely to come from this distribution. In this case, tuples require an average of exactly 8 bits to store and transmit.

Now consider the unordered sets derived from these tuples. The set {0,1,1,0,0,1,0,1} is the same as {1,1,0,0,0,0,1,1} or {1,1,1,1,0,0,0,0}. It can be expressed as the number of ones in the data set without loss. That leaves only eleven possibilities, ones = 0, 1, 2, 3, ... ,7, 8, and that information takes at most 4 bits to express. If each tuple was equally likely, then we could use a code like this to reduce the average transmission length to about 2.8 bits:

Output Code Length Probability
4 ones 00 2 70/256
3 ones 010 3 56/256
5 ones 011 3 56/256
2 ones 100 3 28/256
6 ones 101 3 28/256
1 ones 1100 4 8/256
7 ones 1101 4 8/256
0 ones 1110 4 1/256
8 ones 1111 4 1/256

Average length: Sum(bits X probability) = 2.796


Example two - Sample of 10 Integer Responses

Consider a sample of ten values from a discrete uniform (min=0,max=255) distribution. All the values from 0 to 255 are equally possible. In an ordered data set, the shortest expected length that data from this set can be expressed is 80 bits (8 bits of entropy per observation times 10 observations, see Cover and Thomas (2006) ).

However, if this were an unordered set of integers, they can be always be expressed in fewer than 80 bits. Here is one example of how:

X, the number of observed values 0-127 inclusive (4 bits, representing 0-10)
X observations within [0,127] with the most significant bit removed (7X bits)
10-X observations within [128,255] with the most significant bit removed. (70-7X bits)

74 bits in total.

If there are X observations from 0 to 127, all of them have 0 as their most significant bit. Since order doesn't need to be preserved, these X observations are stored first and the most significant bit is assumed to be 0 for each of them. The remaining 10-X observations must begin with a 1, so their most significant bit is redundant as well.

This doesn't work with a tuple where order matters because knowing how many values are within [0,127] isn't enough; we also need to know where they are in the tuple. We would require a collection of ten zeroes and ones in order to describe the first bit. As shown above, this requires 10 bits, making the total length 80 bits and saving nothing.

For large data sets, further savings can be rendered by using 4, 8, or 16 bins and removing the first 2, 3, or 4 significant bits from each value respectively. In signed integers (those that include negatives), the most significant bit is the sign and it can be handled exactly the same way.

Continuous data can be treated similarly if it is quantized to a fixed precision.

By ignoring the order of a collection of 10 integers uniformly from 0 to 255, we destroy at least 6 bits of information, or 7.5%. We can also find an upper bound of the information destroyed by looking at the number of possible mappings.

Example: If all 10 values from 0 to 255 in the above example were unique, then one unordered set of numbers could be the unordered version of any one of 10!, or 3,268,800 tuples with equal likelihood. log2 of 10! is approximately 21.791, so the information lost from ignoring ordering cannot be more than 21.8.

Also, not every element is unique, so for many cases there are fewer than 10! tuples mapping to an unordered set. An extreme case where all the values are the same, no information is lost by ignoring the order. Finding the distribution of the number of bits lost to ordering is tedious, but we can approximate it with a Monte Carlo Simulation

R-code (Version 2.14.1)

set.seed(12345)   # arbitrary seed for reproducability
Nruns = 100000   # sets the number of runs
bitslost = rep(NA,Nruns)  # records the number of bits lost in each case
l2fac10 = log2(factorial(10))   # bits lost with all unique elements

for(k in 1:Nruns)  # for each run...
 tuple = floor(runif(n=10) * 256)    # generate a 10-tuple, discrete unif[0,255] 
 tab = table(tuple)   # get the number of instances of each value
 bitslost[k] = l2fac10 - log2(prod(factorial(tab)))   #calucates log2 of # of mappings


I got the following results:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.21   21.79   21.79   21.61   21.79   21.79

Bits Lost Relative Frequency
17.206 .00003
18.206 .00008
18.791 .00018
19.206 .00165
19.791 .00866
20.791 .15426
21.791 .83514

Steven K. Thompson and George A. F. Seber, 1996, Adaptive Sampling, Wiley.

Thomas M. Cover and Joy A. Thomas, 2006, Elements of Information Theory, Wiley, 2nd ed.

No comments:

Post a Comment