To cover most of
these cases, the main-memory support must be read/write (with
minimal but effective concurrency control) and must support not only "select" but also
"insert," "delete," and "update." This does not mean that main-memory support for
ETL purposes must be a commercial main-memory database. An SQL interface,
ACID properties, a full set of data types, and so forth are rarely useful in our context and
are expensive; an ad hoc main-memory support often performs better than a well-structured
main-memory DBMS.
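
As a rough illustration of what such ad hoc support might look like, the sketch
below keeps rows in a dictionary and guards the four operations with a single lock.
The class name LookupStore, its method names, and the sample row are hypothetical and
are not taken from the text; one coarse lock is only one possible reading of "minimal
but effective" concurrency control.

```python
import threading


class LookupStore:
    """Hypothetical ad hoc in-memory table: a dict of rows guarded by one lock."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # assumption: one coarse lock is enough

    def select(self, key):
        # Return a copy so callers cannot modify stored rows outside the lock.
        with self._lock:
            row = self._rows.get(key)
            return dict(row) if row is not None else None

    def insert(self, key, row):
        with self._lock:
            if key in self._rows:
                raise KeyError(f"duplicate key: {key!r}")
            self._rows[key] = dict(row)

    def update(self, key, **changes):
        # Raises KeyError if the row does not exist.
        with self._lock:
            self._rows[key].update(changes)

    def delete(self, key):
        # Raises KeyError if the row does not exist.
        with self._lock:
            del self._rows[key]


store = LookupStore()
store.insert("C001", {"name": "Acme", "city": "Rome"})
store.update("C001", city="Milan")
print(store.select("C001"))
store.delete("C001")
```

There is no SQL parser, no transaction log, and no type system here, which is the
trade-off the paragraph above describes: less generality in exchange for less overhead.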
In a data warehouse, the fact tables are big (often millions or billions of records)
because they contain detail records: consider each single carton, can of food, and
so forth bought at the supermarket. Such detail is useful for certain types of analysis
(basket analysis in this case) but useless for others. Accessing the fact table every
time could be too expensive, so we need some form of summary or aggregation.
These aggregations are the datasets resulting from an SQL "group by" performed on
the fact table. The simplest way to do this is to run an SQL query at the end of the loading
phase, but this approach implies serialization and a double scan of the data (the
first read for loading, the second for aggregation). How can one avoid it? In the
loading phase, when one processes a record and the corresponding aggregate row
does not yet exist in memory, one can create a new one and copy the record data into
it; otherwise, one updates the existing aggregate row with its calculation (sum,
count, etc.).
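
The single-pass approach described above can be sketched as follows. This is only an
illustration under stated assumptions: the Sale record layout, the grouping columns
(store, product), and the function name load_with_aggregates are hypothetical, and the
fact table is represented by a plain list standing in for the real load step.

```python
from collections import namedtuple

# Hypothetical detail record; amounts are integer cents to keep sums exact.
Sale = namedtuple("Sale", "store product quantity amount")


def load_with_aggregates(records, fact_table, aggregates):
    """Load each record and maintain the in-memory aggregates in the same pass."""
    for rec in records:
        fact_table.append(rec)          # stand-in for the real fact-table load
        key = (rec.store, rec.product)  # plays the role of the "group by" columns
        agg = aggregates.get(key)
        if agg is None:
            # No aggregate row yet: create one from the record data.
            aggregates[key] = {
                "count": 1,
                "quantity": rec.quantity,
                "amount": rec.amount,
            }
        else:
            # Aggregate row exists: update its running count and sums.
            agg["count"] += 1
            agg["quantity"] += rec.quantity
            agg["amount"] += rec.amount
    return aggregates


facts, aggs = [], {}
batch = [
    Sale("S1", "beans", 2, 300),
    Sale("S1", "beans", 1, 150),
    Sale("S2", "milk", 1, 120),
]
load_with_aggregates(batch, facts, aggs)
print(aggs[("S1", "beans")])
```

Each record is read once, so the aggregates are ready as soon as loading ends and the
second scan of the fact table is avoided.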