A Study on Improving Classification Performance for Manufacturing Process Data with Multicollinearity and Imbalanced Distribution

다중공선성과 불균형분포를 가지는 공정데이터의 분류 성능 향상에 관한 연구

  • 이채진 (LG Electronics, HE Business Division) ;
  • 박정술 (Department of Industrial Management Engineering, Korea University) ;
  • 김준석 (Department of Industrial Management Engineering, Korea University) ;
  • 백준걸 (Department of Industrial Management Engineering, Korea University)
  • Received : 2014.01.20
  • Accepted : 2014.11.03
  • Published : 2015.02.15

Abstract

From the viewpoint of manufacturing applications, data mining is a useful method for discovering meaningful knowledge or information about the states of processes. However, data from manufacturing processes usually have two characteristics: multicollinearity and an imbalanced distribution. These two characteristics are the main causes of biased classification rules and of the wrong variables being selected as important. In this paper, we propose a new data mining procedure to solve these problems. First, to determine candidate variables, we propose a multiple hypothesis test. Second, to build unbiased classification rules, we propose a decision tree learning method with different weights for each category of the quality variable. Experimental results with real PDP (plasma display panel) manufacturing data show that the proposed procedure produces better information than other data mining procedures.
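The variable-screening step described in the abstract relies on multiple hypothesis testing; the Benjamini-Hochberg step-up rule (reference 3 below) is one standard way to control the false discovery rate when testing many process variables at once. The sketch below is illustrative only, not the authors' exact implementation; the function name and thresholds are assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Given one p-value per candidate process variable, return a boolean
    mask of rejected null hypotheses (i.e., variables kept as candidates),
    controlling the false discovery rate at level alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)          # indices that sort p ascending
    ranked = p[order]              # sorted p-values p_(1) <= ... <= p_(m)
    # Find the largest k with p_(k) <= (k / m) * alpha.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True   # reject all hypotheses up to rank k
    return reject
```

For the second stage, a class-weighted tree learner in the spirit of the abstract could be fit with, for example, scikit-learn's `DecisionTreeClassifier(class_weight=...)`, so that minority categories of the quality variable carry more influence on the splits.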

References

  1. Allison, P., Altman, M., Gill, J., and McDonald, M. P. (2004), Convergence problems in logistic regression, Numerical issues in statistical computing for the social scientist, 238-252.
  2. Banks, D. L. and Parmigiani, G. (1991), Preanalysis of Superlarge Industrial Datasets, ISDS, Duke University, USA.
  3. Benjamini, Y. and Hochberg, Y. (1995), Controlling the false discovery rate : A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society : Series B(Methodological), 57, 289-300.
  4. Boeuf, J. P. (2003), Plasma display panels : physics, recent developments and key issues, Journal of physics D : Applied physics, 36(6), R53. https://doi.org/10.1088/0022-3727/36/6/201
  5. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Wadsworth, California, USA.
  6. Byeon, S. K., Kang, C. W., and Sim, S. B. (2004), Defect Type Prediction Method in Manufacturing Process Using Data Mining Technique, Journal of Industrial and Systems Engineering, 27(2), 10-16.
  7. Cunningham, S. P., Spanos, C. J., and Voros, K. (1995), Semiconductor yield improvement : results and best practices, IEEE Transactions on Semiconductor Manufacturing, 8(2), 103-109. https://doi.org/10.1109/66.382273
  8. Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003), Multiple hypothesis testing in microarray experiments, Statistical Science, 18(1), 71-103. https://doi.org/10.1214/ss/1056397487
  9. Farcomeni, A. (2008), A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion, Statistical Methods in Medical Research, 17(4), 347-388. https://doi.org/10.1177/0962280206079046
  10. Fernandez, G. (2010), Statistical Data Mining Using SAS Applications, 2nd edition, CRC Press, New York, USA.
  11. Gibbons, J. D. (1993), Nonparametric statistics : An introduction Vol. 90, Sage, California, USA.
  12. Hall, M. A. (1999), Correlation-based feature selection for machine learning, Ph.D. Thesis, The University of Waikato.
  13. Hochberg, Y. and Tamhane, A. (1987), Multiple Comparison Procedures, Wiley, New York, USA.
  14. Jang, Y. S., Kim, J. W., and Hur, J. (2008), Combined application of data imbalance reduction techniques using genetic algorithm, Journal of Intelligence and Information Systems, 14(3), 133-154.
  15. Jang, W. C. (2013), Multiple testing and its applications in high-dimension, Journal of the Korean data & information science society, 24(5), 1063-1076. https://doi.org/10.7465/jkdi.2013.24.5.1063
  16. John, G. H., Kohavi, R., and Pfleger, K. (1994), Irrelevant features and the subset selection problem, ICML, 94, 121-129.
  17. Kim, J. H. and Jeong, J. B. (2004), Classification of class-imbalanced data : Effect of over-sampling and under-sampling of training data, The Korean Journal of Applied Statistics, 17(3), 445-457. https://doi.org/10.5351/KJAS.2004.17.3.445
  18. Kubat, M., Holte, R., and Matwin, S. (1997), Learning when negative examples abound, Proceedings of the 9th European Conference on Machine Learning, ECML-97, 146-153.
  19. Koksal, G., Batmaz, I., and Testik, M. C. (2011), A review of data mining applications for quality improvement in manufacturing industry, Expert Systems with Applications, 38(10), 13448-13467. https://doi.org/10.1016/j.eswa.2011.04.063
  20. Lemon, S. C., Roy, J., Clark, M. A., Friedmann, P. D., and Rakowski, W. (2003), Classification and regression tree analysis in public health : methodological review and comparison with logistic regression, Annals of Behavioral Medicine, 26(3), 172-181. https://doi.org/10.1207/S15324796ABM2603_02
  21. Lin, W. J. and Chen, J. J. (2012), Class-imbalanced classifiers for high-dimensional data, Briefings in bioinformatics, 14(1), 13-26. https://doi.org/10.1093/bib/bbs006
  22. Little, R. J. and Rubin, D. B. (2002), Statistical Analysis with Missing Data, 2nd edition, John Wiley and Sons, New York.
  23. Park, J. H. and Byun, J. H. (2002), An analysis method of superlarge manufacturing process data using cleaning and graphical analysis, Journal of the Korean Society for Quality Management, 30(2), 72-85.
  24. Polo, J. L., Berzal, F., and Cubero, J. C. (2006), Taking class importance into account, In Hybrid Information Technology, ICHIT'06. International Conference on, 1, 1-6.
  25. Pyle, D. (1999), Data preparation for data mining, Morgan Kaufmann, San Francisco, USA.
  26. Shmueli, G., Patel, N. R., and Bruce, P. C. (2011), Data Mining for Business Intelligence : Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd edition, Wiley, New York, USA.
  27. Storey, J. D. (2002), A direct approach to false discovery rates, Journal of the Royal Statistical Society : Series B (Statistical Methodology), 64(3).
  28. Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., and Zeileis, A. (2008), Conditional variable importance for random forests, BMC bioinformatics, 9(1), 307. https://doi.org/10.1186/1471-2105-9-307
  29. Van Hulse, J., Khoshgoftaar, T. M., and Napolitano, A. (2007), Experimental perspectives on learning from imbalanced data, In Proceedings of the 24th international conference on Machine learning, 935-942.
  30. Weiss, G. M. and Provost, F. (2001), The effect of class distribution on classifier learning : an empirical study, Technical Report ML-TR-44, Department of Computer Science, Rutgers University.
  31. Zeng, H. and Cheun, T. (2008), Feature selection for clustering high dimensional data, Lecture Notes in Artificial Intelligence, 5351, 913-922.

Cited by

  1. Short-term Wind Farm Power Forecasting Using Multivariate Analysis to Improve Wind Power Efficiency vol.29, pp.7, 2015, https://doi.org/10.5207/JIEIE.2015.29.7.054