For this particular purpose, there are specialized
tools that apply ad hoc algorithms and maintain their own topographic database
in which to store and look up the correct name of every street, and so forth. In this case,
cleaning is a specific stage in the ETL process.
Nevertheless, the problem of assuring data correctness may be considered from
different perspectives. Data sources do not all carry the same risk of errors: one
may be a well-defined, certified file produced by another system, another may be data
entered directly by users, and yet another may be data sent over an unstable line or coming from unreliable
systems, and so forth. The risk of data corruption in these examples differs
in quantity and quality: a certified file can be processed
without any check; Web form data must be checked and interpreted according
to the application logic; and in the last case we must watch for hardware
corruption, loss, inconsistency, or duplication of information, and so on.
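As a rough illustration of matching the level of checking to the source, the following minimal sketch applies no checks to a certified file, an application-level check to Web form data, and additional loss and duplication checks to data from an unreliable line. The `Source` categories, the `id` and `amount` fields, and the rule that an amount must be a non-negative number are all assumptions made for this example, not part of any particular system.

```python
from enum import Enum


class Source(Enum):
    # Hypothetical source categories mirroring the three cases in the text.
    CERTIFIED_FILE = "certified_file"
    WEB_FORM = "web_form"
    UNRELIABLE_LINE = "unreliable_line"


def check_record(record, source, seen_ids):
    """Return a list of problems found in `record`; an empty list means accept."""
    problems = []
    if source is Source.CERTIFIED_FILE:
        # Trusted input: processed without any check.
        return problems

    # Application-level check (assumed rule: amount is a non-negative number).
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("invalid amount")

    if source is Source.UNRELIABLE_LINE:
        # Extra checks for lost or duplicated information.
        record_id = record.get("id")
        if record_id is None:
            problems.append("missing id")
        elif record_id in seen_ids:
            problems.append("duplicate id")
        else:
            seen_ids.add(record_id)
    return problems
```

The caller keeps one `seen_ids` set per input stream, so duplicates are detected only within that stream.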
It is impossible to find a general rule for "cleaning" due to the wide spectrum of
possible types of errors. Apart from specific cases where a dedicated processing
phase is mandatory, is it better to process the files twice (a first scan for cleaning
and a second for transforming) or to incorporate the optimal level of checks in
the transformation phase? In our experience, with large volumes and strict time
constraints, the second solution is better because it saves computational and I/O
time, although the implementation loses the conceptual separation between the
cleaning and transformation processes.
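The two alternatives can be contrasted with a small sketch. It works on in-memory rows for brevity; with real files, the two-pass variant would read the input, write an intermediate cleaned file, and read that file again, which is where the extra I/O comes from. The row layout (a name and an integer quantity) and the function names are hypothetical.

```python
def is_clean(row):
    """Assumed check: the row has two fields and the second is an integer."""
    if len(row) != 2:
        return False
    try:
        int(row[1])
    except ValueError:
        return False
    return True


def transform(row):
    """Assumed transformation: normalize the name and convert the quantity."""
    return (row[0].strip().upper(), int(row[1]))


def two_pass(rows):
    # First scan: cleaning only; the result is materialized in full.
    cleaned = [row for row in rows if is_clean(row)]
    # Second scan: transformation of the cleaned data.
    return [transform(row) for row in cleaned]


def single_pass(rows):
    # Checks are embedded in the transformation loop: each row is read once,
    # at the cost of mixing cleaning and transformation logic in one place.
    result = []
    for row in rows:
        if is_clean(row):
            result.append(transform(row))
    return result
```

Both functions produce the same output; they differ only in how many times the data is scanned and in how cleanly the two concerns are separated in the code.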