Dividing a job into many sequential stages, each with its input and output
on disk, is a good technique that simplifies coding and debugging, but
reading and writing the same data many times is very expensive. Processing
(or extracting) many files sequentially is the simplest approach, but it does
not make good use of computational resources.
To achieve high performance, there are only two ways:
- execute the minimum number of machine instructions and avoid useless I/O
- do not waste the wait time that occurs in I/O operations
These elementary rules imply a balance, not always easy to strike, between
readability and maintainability on one hand and efficient but complex code
on the other.
To avoid reading and writing the same data many times, split the workload in
accordance with the application logic so that it exploits the parallel features
of the machines. Make pipelining and parallelism the main objectives in ETL.
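
As an illustration of pipelining, the following is a minimal sketch in Python,
not a definitive implementation. It assumes a hypothetical two-field,
comma-separated record format, and the function names extract, transform, and
load are chosen only for this example. Each stage is a generator, so a record
flows through all stages as soon as it is read and no intermediate result is
written to disk between stages.

```python
import io


def extract(lines):
    """Yield one (name, amount) record per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            name, amount = line.split(",")
            yield name, amount


def transform(records):
    """Apply a per-record transformation without buffering the whole input."""
    for name, amount in records:
        yield name.upper(), int(amount) * 2


def load(records, sink):
    """Write each record to the sink and return the number of records written."""
    count = 0
    for name, amount in records:
        sink.write(f"{name},{amount}\n")
        count += 1
    return count


# In-memory stand-ins for the source and target files (assumption for the demo).
source = io.StringIO("alice,10\nbob,20\n")
sink = io.StringIO()

# The stages are chained: each record passes through all three in a single pass.
loaded = load(transform(extract(source)), sink)
```

The same chaining works with real file objects in place of the in-memory
buffers. The data is read once and written once, however many transformation
stages are added in between.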
Parallelism is the ability to split a workload into many tasks that work
concurrently and synchronize with each other. Splitting the workload is obviously
useful when we have more than one processor, but it is useful even on a
single-processor machine; in the latter case, we can use the I/O idle time
(orders of magnitude longer than CPU time) to do other useful work. In ETL,
parallelism primarily means the full utilization of multiprocessor machines and
the minimization of time wasted on I/O operations.
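
The benefit of using I/O idle time can be sketched as follows. This is a
hedged illustration in Python under stated assumptions: read_source is a
hypothetical function, time.sleep stands in for the wait on a real I/O
operation, and the source names are invented for the example. While one thread
waits on its simulated I/O, the others proceed, so the waits overlap rather
than add up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_source(name, delay=0.05):
    """Simulate reading one source; the sleep stands in for I/O wait time."""
    time.sleep(delay)
    return name, len(name)


# Hypothetical source names, used only for this demonstration.
sources = ["orders", "customers", "items", "stores"]

# Sequential processing: the I/O waits add up.
start = time.perf_counter()
sequential = [read_source(s) for s in sources]
sequential_time = time.perf_counter() - start

# Concurrent processing: the I/O waits overlap across threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    concurrent = list(pool.map(read_source, sources))
concurrent_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.3f}s, concurrent: {concurrent_time:.3f}s")
```

Threads are enough here because the tasks spend their time waiting, not
computing. CPU-bound work that should occupy several processors would
typically use separate processes instead.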