methodology: Given a representative sample of the data in the
real world, we define the accuracy of the data source empirically as:
Accuracy = MAX[ (X - Xreal)^2 / Xreal ] * 100
where X and Xreal are the data in the sample and in the real world, respectively.
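
As an illustration, the accuracy measure can be sketched in Python. This is a minimal sketch under assumptions not stated in the text: the sample and the real-world values are paired numeric sequences of equal length, the maximum is taken over all pairs, and every real-world value is nonzero. The function name accuracy_score and the example values are hypothetical. Under this reading, a larger result indicates a larger deviation from the real world.

```python
from typing import Sequence


def accuracy_score(sample: Sequence[float], real: Sequence[float]) -> float:
    """Largest value of (X - Xreal)^2 / Xreal over the sample, times 100.

    Assumption: sample[i] and real[i] describe the same data element.
    """
    if len(sample) != len(real):
        raise ValueError("sample and real must have the same length")
    if not sample:
        raise ValueError("sample must not be empty")
    if any(r == 0 for r in real):
        raise ValueError("real-world values must be nonzero")
    return max((x - r) ** 2 / r for x, r in zip(sample, real)) * 100


# Hypothetical example: only the second value deviates from the real world.
sample_values = [10.0, 20.0, 30.0]
real_values = [10.0, 25.0, 30.0]
print(accuracy_score(sample_values, real_values))  # 100.0, from (20 - 25)^2 / 25 * 100
```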
Regarding consistency, if the condition is mandatory for the data element under consideration,
we require a 100% level of fulfillment. Consistency in the data sources
is measured by drawing a sample from each source and counting the records that are
inconsistent with respect to a user query. This presupposes that the user knows
in advance the answer to this query over the sample. Analogously, completeness is
specified as in the previous case and measured from a data sample, posing a set of
queries over this sample and applying the following formula:
(# of queries with incomplete answers / # of queries) * 100
where an incomplete answer is one in which a record (or part of a record) is missing
(recall that we know in advance all the records in the sample that satisfy the
query). We proceed analogously for the other quality dimensions. This allows us to determine
which data sources are suitable for developing the DSS: if a data source does not
meet the minimum bound for a quality dimension, either data cleaning methods
are applied or the data source is discarded.
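
The completeness formula and the minimum-bound check can be sketched as follows. This is a hedged illustration, not the authors' implementation. It assumes each query answer is modeled as a set of hashable records, so a record with a missing part simply fails to match its expected counterpart. It also assumes that measured values are converted to fulfillment levels (100 minus the percentage of incomplete answers) before being compared with a bound. The names incomplete_answer_percent and failing_dimensions, and all example data and bounds, are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Set

Record = Hashable


def incomplete_answer_percent(
    expected: Sequence[Set[Record]], returned: Sequence[Set[Record]]
) -> float:
    """(# of queries with incomplete answers / # of queries) * 100.

    expected[i] holds the records known in advance to satisfy query i;
    returned[i] holds the records the data source actually returned.
    """
    if len(expected) != len(returned):
        raise ValueError("one returned answer is needed per query")
    if not expected:
        raise ValueError("at least one query is required")
    # An answer is incomplete when some expected record is absent from it.
    incomplete = sum(1 for exp, got in zip(expected, returned) if not exp <= got)
    return incomplete / len(expected) * 100


def failing_dimensions(
    fulfillment: Dict[str, float], minimum_bounds: Dict[str, float]
) -> List[str]:
    """Quality dimensions whose fulfillment level is below its minimum bound."""
    return sorted(
        dim for dim, bound in minimum_bounds.items()
        if fulfillment.get(dim, 0.0) < bound
    )


# Hypothetical sample: four queries, the first answer lacks one record.
expected_answers = [{("a", 1), ("b", 2)}, {("c", 3)}, {("d", 4)}, {("e", 5)}]
returned_answers = [{("a", 1)}, {("c", 3)}, {("d", 4)}, {("e", 5)}]

incomplete = incomplete_answer_percent(expected_answers, returned_answers)
fulfillment = {"completeness": 100 - incomplete, "consistency": 100.0}
bounds = {"completeness": 90.0, "consistency": 100.0}

print(incomplete)                               # 25.0
print(failing_dimensions(fulfillment, bounds))  # ['completeness']
```

A source with a non-empty result here would be a candidate for data cleaning or for being discarded, as described above.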