
Robert Wrembel and Christian Koncilia

"Data Warehouses and OLAP: Concepts, Architectures and Solutions"

For this particular purpose, there are specialized
tools that apply ad hoc algorithms and maintain their own topographic database
in which to store and look up the correct name of every street, and so forth. In this case,
cleaning is a specific stage in the ETL process.
Nevertheless, the problem of assuring data correctness may be considered from
different perspectives. Data sources do not all carry the same risk of error: one
may be a well-defined, certified file produced by another system; another may be data
entered directly by users; yet another may be data sent over an unstable line or coming
from unreliable systems, and so forth. The risk of data corruption in these examples differs
in both quantity and quality. A certified file can be processed
without any check; Web form data must be checked and interpreted according
to the application logic; and in the last case, we must watch for hardware
corruption and for the loss, inconsistency, or duplication of information.
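
The idea of matching the level of checking to the reliability of each source can be sketched in Python as follows. This is only an illustration under stated assumptions: the source categories mirror the three examples above, while the record fields ("id", "street") and the individual rules are hypothetical and not taken from the text.

```python
from enum import Enum


class Source(Enum):
    CERTIFIED_FILE = "certified_file"
    WEB_FORM = "web_form"
    UNRELIABLE_LINK = "unreliable_link"


def check_none(records):
    # Certified file: accepted as delivered, no checks applied.
    return list(records)


def check_application_logic(records):
    # Web form data: keep only records that satisfy an application rule.
    # Hypothetical rule: the "street" field must be present and non-blank.
    return [r for r in records if (r.get("street") or "").strip()]


def check_transmission(records):
    # Unstable line: drop records that lost their key and drop duplicates.
    # Hypothetical key field: "id".
    seen = set()
    kept = []
    for r in records:
        key = r.get("id")
        if key is None or key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept


# One check routine per source category.
CHECKS = {
    Source.CERTIFIED_FILE: check_none,
    Source.WEB_FORM: check_application_logic,
    Source.UNRELIABLE_LINK: check_transmission,
}


def clean(source, records):
    """Apply the checks configured for the given source category."""
    return CHECKS[source](records)
```

Keeping the mapping in one table makes it easy to tighten or relax the checks for a source without touching the rest of the load.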
It is impossible to find a general rule for "cleaning" due to the wide spectrum of
possible types of errors. Apart from specific cases where a dedicated processing
phase is mandatory, is it better to process the files twice (a first scan for cleaning
and a second for transforming) or to incorporate the optimal level of checks in
the transformation phase? In our experience, with large volumes and strict time
constraints, the second solution is better because it saves computational and I/O
time, although the implementation loses the conceptual separation between the
cleaning and transformation processes.
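
The trade-off between the two approaches can be made concrete with a minimal Python sketch. It assumes records are dictionaries held in memory and uses a hypothetical "amount" field that must parse as a number; a real load would stream the files from disk, which is where the saved scan matters.

```python
def is_valid(record):
    # Hypothetical cleaning rule: "amount" must parse as a number.
    try:
        float(record["amount"])
        return True
    except (KeyError, TypeError, ValueError):
        return False


def transform(record):
    # Hypothetical transformation: cast "amount" to float.
    return {**record, "amount": float(record["amount"])}


def two_pass(records):
    # First scan cleans, second scan transforms: clear separation of
    # concerns, but the data is read twice.
    cleaned = [r for r in records if is_valid(r)]
    return [transform(r) for r in cleaned]


def single_pass(records):
    # One scan: the check is embedded in the transformation, so each
    # record is read once, at the cost of mixing the two concerns.
    out = []
    for r in records:
        try:
            out.append({**r, "amount": float(r["amount"])})
        except (KeyError, TypeError, ValueError):
            continue  # reject the record
    return out
```

Both functions produce the same result; the single-pass version simply folds the validity check into the conversion it would have to perform anyway.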

