Simitsis, Vassiliadis, Skiadopoulos, & Sellis
Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of
Idea Group Inc. is prohibited.
In addition, these data-intensive workflows are complex, involving
dozens of sources, cleaning and transformation activities, and loading facilities.
Bouzeghoub, Fabret, and Matulovic (1999) mention that the data warehouse refreshment
process can consist of many different subprocesses, such as data cleaning,
archiving, transformations, and aggregations, interconnected through a complex
schedule. For instance, Adzic and Fiore (2003) report a case study for mobile network
traffic data involving around 30 data flows and 10 sources, with a data volume of
about 2 TB and a main fact table containing about 3 billion records. The
throughput of the (traditional) population system is 80 million records per hour for
the entire process (compression, FTP of files, decompression, transformation, and
loading), on a daily basis, with a loading window of only 4 hours. The request for
performance is so pressing that some processes are hard-coded in low-level DBMS
calls to avoid the extra step of storing data in a target file that is then loaded into the data
warehouse through the DBMS loader. In general, Strange (2002a) notes that the
complexity of the ETL process, as well as the staffing required to implement it, depends
mainly on the following variables: (a) the number and variety of data sources;
(b) the complexity of transformation; (c) the complexity of integration; and (d) the
availability of skill sets.
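
The figures quoted above from Adzic and Fiore (2003) can be combined into a rough capacity check. The sketch below is only an illustration: it assumes that the reported throughput is sustained for the whole loading window, and it compares the resulting daily capacity with the size of the main fact table. Neither assumption is stated in the case study, and the function names are hypothetical.

```python
# Back-of-the-envelope capacity check using the figures quoted in the text.
# Assumption (not from the source): throughput is constant over the window.

THROUGHPUT_RECORDS_PER_HOUR = 80_000_000   # entire process, as reported
LOADING_WINDOW_HOURS = 4                   # daily loading window
FACT_TABLE_RECORDS = 3_000_000_000         # approximate size of main fact table


def records_per_window(throughput_per_hour: int, window_hours: int) -> int:
    """Upper bound on records processed in one loading window."""
    return throughput_per_hour * window_hours


def windows_to_load(total_records: int, throughput_per_hour: int,
                    window_hours: int) -> int:
    """Number of full loading windows needed to process total_records."""
    per_window = records_per_window(throughput_per_hour, window_hours)
    # Ceiling division with integers, to avoid floating-point rounding.
    return -(-total_records // per_window)


if __name__ == "__main__":
    daily = records_per_window(THROUGHPUT_RECORDS_PER_HOUR, LOADING_WINDOW_HOURS)
    print(f"Records per daily window: {daily:,}")
    print("Windows needed for a volume equal to the fact table:",
          windows_to_load(FACT_TABLE_RECORDS, THROUGHPUT_RECORDS_PER_HOUR,
                          LOADING_WINDOW_HOURS))
```

Under these assumptions, one 4-hour window handles at most 320 million records, so a volume equal to the whole fact table would take about ten such windows. This gives a sense of why shortening individual steps, such as skipping the intermediate target file, matters in practice.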