On the other hand, very small relations
can be replicated to avoid the need to repartition other very large datasets that may
need to be joined with them. In practice the decision on replication vs. partitioning
for each relation can be taken by a cost-based optimizer that evaluates alternative
execution plans and partitioning scenarios to determine the best one. Horizontally partitioned
relations can typically be divided using a round-robin, random, range,
or hash-based scheme. We assume horizontal hash-partitioning, as this approach
facilitates key-based tuple location for parallel operation. Partitioning is closely
tied to how queries are processed. We therefore first describe generic query processing
over the NPDW, and then focus on parallel join and partitioning alternatives.
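
As an illustration of why hash-partitioning helps key-based tuple location, the following minimal Python sketch places two relations on nodes by their join key and then joins them node by node. It is not taken from the chapter: the relation contents, the helper names (node_for_key, hash_partition, local_join) and the use of a simple modulo as the hash function are all assumptions made for the example.

```python
from collections import defaultdict

NUM_NODES = 4  # assumed number of nodes in the illustration


def node_for_key(key: int, num_nodes: int) -> int:
    """Hypothetical placement function: a plain modulo stands in for the hash."""
    return key % num_nodes


def hash_partition(rows, key_index, num_nodes):
    """Assign each row to a node according to the value of its partitioning key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[node_for_key(row[key_index], num_nodes)].append(row)
    return partitions


# Made-up sample relations, both partitioned on the order key (column 0).
orders = [(101, "2007-01-03"), (102, "2007-01-04"), (105, "2007-01-09")]
lineitems = [(101, 5.0), (101, 7.5), (102, 1.0), (105, 2.5)]

orders_parts = hash_partition(orders, 0, NUM_NODES)
lineitem_parts = hash_partition(lineitems, 0, NUM_NODES)


def local_join(node):
    """Join the two fragments stored on one node; no other node is consulted."""
    orders_by_key = {o[0]: o for o in orders_parts.get(node, [])}
    return [
        (key, orders_by_key[key][1], amount)
        for (key, amount) in lineitem_parts.get(node, [])
        if key in orders_by_key
    ]


# The union of the per-node joins equals the join over the whole relations.
joined = [row for node in range(NUM_NODES) for row in local_join(node)]
```

Because both relations use the same placement function on the join key, matching tuples always land on the same node, so each node can join its fragments without repartitioning. A relation partitioned on a different attribute would not have this property, which is the case where replication or repartitioning has to be considered.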
Generic Processing over the NPDW
Query processing over a parallel shared-nothing database, and in particular over the
NPDW, follows roughly the steps in Figure 2(b). Figure 2(a) illustrates a simple sum
query example over the NPDW. In this example the task is divided among all nodes:
each node applies exactly the same initial query to its partial data, and the
merging node then applies a merge query to the partial results returned by the
processing nodes. If the datasets could be
Efficient and Robust Node-Partitioned Data Warehouses 2
Copyright © 2007, Idea Group Inc.