Saliency Score-Based Visualization for Data Quality Evaluation

  • Kim, Yong Ki (Department of Computer Science, Chungbuk National University)
  • Lee, Keon Myung (Department of Computer Science, Chungbuk National University)
  • Received : 2015.09.17
  • Accepted : 2015.09.25
  • Published : 2015.12.25

Abstract

Data analysts explore collections of data to search for valuable information using various techniques and tricks. "Garbage in, garbage out" is a well-recognized idiom that emphasizes the importance of data quality in data analysis. It is therefore crucial to validate data quality in the early stages of data analysis, which requires an effective method of evaluating that quality. In this paper, a method to visually characterize the quality of data using the notion of a saliency score is introduced. The saliency score is a measure comprising five indexes that captures certain aspects of data quality. Experimental results are presented to show the applicability of the proposed method.
