To
make the discussion more concrete, let us assume that a machine word is 32 bits.
In this case, a WAH tail contains exactly 31 bits from the input bitmap and a WAH
fill contains a multiple of 31 bits that are all the same (either 0 or 1). Because the
bitmap index is known to be efficient for low cardinality attributes, we further restrict
our discussion to high cardinality attributes only, say, c > 100. In an equalityencoded
bitmap index, there are c keys (distinct values of the attribute) and thus
c bitmaps. We do not know exactly how many bits are set to 1 in each individual
bitmap. However, we know that the total number of bits that are 1 is exactly N (the
number of rows in the dataset). In the worst case, there are (N+c) WAH runs in the
bitmaps, where N refers to maximum number of tail words (each containing a single
bit set to 1) and c refers to the maximum number of runs at the end of each bitmap
that are not terminated with a tail word. Each WAH run is encoded by two machine
words. Therefore, we need a total of 2(N+c) words to represent the bitmaps. Assuming
each key is encoded by one word along with one additional word to associate
the key with the bitmap, the total index size is 2N+4c words. In most cases, the attribute
cardinality c is much smaller than N. In these cases, the WAH compressed
equality-encoded bitmap index size is at worst 2N words.
Pages:
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337