This process is called partitioning if the relation is not yet partitioned, or
repartitioning if the relation is already partitioned but must be reorganized.
Both operations can be costly because they may require heavy data exchange over
the network connecting the nodes. In this work, partitioning (and placement)
refers not to the operation of partitioning while processing a join but to an
initial placement and sporadic reorganization task that decides which relations
are to be divided among or replicated across nodes and which partitioning
attributes are to be used. Williams and Zhou (1998) review five major data placement
strategies (size-based, access frequency-based, and network traffic-based) and
conclude experimentally that the way data is placed in a shared-nothing environment
can have a considerable effect on performance. Hua and Lee (1990) use variable partitioning
(size- and access frequency-based) and conclude that partitioning increases
throughput for short transactions, whereas for complex transactions involving
several large joins, throughput decreases as partitioning increases.
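
To make the cost of repartitioning discussed above more concrete, the following minimal Python sketch hash-partitions a small relation across nodes and then counts how many rows would have to cross the network if the partitioning attribute were changed. It is an illustration under stated assumptions, not a method taken from the works cited: the relation `orders`, the attributes `order_id` and `customer_id`, the four-node setup, and the modulo placement function are all hypothetical.

```python
# Illustrative sketch only: the relation, attribute names, node count, and
# modulo-based placement function are assumptions made for this example.

def node_for(value, num_nodes):
    """Map an integer partitioning-attribute value to a node index."""
    return value % num_nodes


def partition(rows, attribute, num_nodes):
    """Split rows (dicts) into num_nodes fragments by hashing one attribute."""
    fragments = [[] for _ in range(num_nodes)]
    for row in rows:
        fragments[node_for(row[attribute], num_nodes)].append(row)
    return fragments


def rows_moved_by_repartitioning(fragments, new_attribute):
    """Count rows whose node changes when the partitioning attribute changes.

    Each such row must be shipped over the network to its new node.
    """
    num_nodes = len(fragments)
    moved = 0
    for current_node, fragment in enumerate(fragments):
        for row in fragment:
            if node_for(row[new_attribute], num_nodes) != current_node:
                moved += 1
    return moved


# Hypothetical relation: 12 orders, two orders per customer.
orders = [{"order_id": i, "customer_id": i // 2} for i in range(12)]

# Initial placement: partition on order_id across 4 nodes.
by_order = partition(orders, "order_id", 4)

# Reorganization: switch the partitioning attribute to customer_id.
moved = rows_moved_by_repartitioning(by_order, "customer_id")
print(f"{moved} of {len(orders)} rows change node")
```

In this toy setting most rows change node when the partitioning attribute changes, which is why the choice of partitioning attributes is treated here as an initial placement decision with only sporadic reorganization, rather than something redone for every join.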
Some of the most promising partitioning and placement approaches focus on query
workload-based partitioning choice (Rao, Zhang, & Megiddo, 2002; Zilio, Jhingran,
& Padmanabhan, 1994). These strategies use the query workload to determine the
Efficient and Robust Node-Partitioned Data Warehouses 20
Copyright © 2007, Idea Group Inc.