A Study on Improving Classification Performance for Manufacturing Process Data with Multicollinearity and Imbalanced Distribution

다중공선성과 불균형분포를 가지는 공정데이터의 분류 성능 향상에 관한 연구

  • 이채진 (LG Electronics, HE Business Division) ;
  • 박정술 (Department of Industrial Management Engineering, Korea University) ;
  • 김준석 (Department of Industrial Management Engineering, Korea University) ;
  • 백준걸 (Department of Industrial Management Engineering, Korea University)
  • Received : 2014.01.20
  • Accepted : 2014.11.03
  • Published : 2015.02.15

Abstract

From the viewpoint of manufacturing applications, data mining is a useful method for discovering meaningful knowledge or information about the states of processes. However, data from manufacturing processes usually have two characteristics: multicollinearity and an imbalanced distribution. These two characteristics are the main causes of biased classification rules and of the wrong variables being selected as important. In this paper, we propose a new data mining procedure to solve these problems. First, to determine candidate variables, we propose a multiple hypothesis test. Second, to build unbiased classification rules, we propose a decision tree learning method with different weights for each category of the quality variable. Experimental results with real PDP (plasma display panel) manufacturing data show that the proposed procedure produces better information than other data mining procedures.
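The variable-screening step described in the abstract relies on multiple hypothesis testing; the Benjamini-Hochberg step-up rule (reference 3 below) is one standard way to control the false discovery rate when testing many process variables at once. The sketch below is illustrative only, not the authors' exact implementation; the function name and thresholds are assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Given one p-value per candidate process variable, return a boolean
    mask of rejected null hypotheses (i.e., variables kept as candidates),
    controlling the false discovery rate at level alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)          # indices that sort p ascending
    ranked = p[order]              # sorted p-values p_(1) <= ... <= p_(m)
    # Find the largest k with p_(k) <= (k / m) * alpha.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True   # reject all hypotheses up to rank k
    return reject
```

For the second stage, a class-weighted tree learner in the spirit of the abstract could be fit with, for example, scikit-learn's `DecisionTreeClassifier(class_weight=...)`, so that minority categories of the quality variable carry more influence on the splits.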

References

  1. Allison, P., Altman, M., Gill, J., and McDonald, M. P. (2004), Convergence problems in logistic regression, Numerical issues in statistical computing for the social scientist, 238-252.
  2. Banks, D. L. and Parmigiani, G. (1991), Preanalysis of Superlarge Industrial Datasets, ISDS, Duke University, USA.
  3. Benjamini, Y. and Hochberg, Y. (1995), Controlling the false discovery rate : A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society : Series B(Methodological), 57, 289-300.
  4. Boeuf, J. P. (2003), Plasma display panels : physics, recent developments and key issues, Journal of physics D : Applied physics, 36(6), R53. https://doi.org/10.1088/0022-3727/36/6/201
  5. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Wadsworth, California, USA.
  6. Byeon, S. K., Kang, C. W., and Sim, S. B. (2004), Defect Type Prediction Method in Manufacturing Process Using Data Mining Technique, Journal of Industrial and Systems Engineering, 27(2), 10-16.
  7. Cunningham, S. P., Spanos, C. J., and Voros, K. (1995), Semiconductor yield improvement : results and best practices, IEEE Transactions on Semiconductor Manufacturing, 8(2), 103-109. https://doi.org/10.1109/66.382273
  8. Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003), Multiple hypothesis testing in microarray experiments, Statistical Science, 18(1), 71-103. https://doi.org/10.1214/ss/1056397487
  9. Farcomeni, A. (2008), A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion, Statistical Methods in Medical Research, 17(4), 347-388. https://doi.org/10.1177/0962280206079046
  10. Fernandez, G. (2010), Statistical Data Mining Using SAS Applications, 2nd edition, CRC Press, New York, USA.
  11. Gibbons, J. D. (1993), Nonparametric statistics : An introduction Vol. 90, Sage, California, USA.
  12. Hall, M. A. (1999), Correlation-based feature selection for machine learning, Ph.D. Thesis, The University of Waikato.
  13. Hochberg, Y. and Tamhane, A. (1987), Multiple Comparison Procedures, Wiley, New York, USA.
  14. Jang, Y. S., Kim, J. W., and Hur, J. (2008), Combined application of data imbalance reduction techniques using genetic algorithm, Journal of Intelligence and Information Systems, 14(3), 133-154.
  15. Jang, W. C. (2013), Multiple testing and its applications in high-dimension, Journal of the Korean data & information science society, 24(5), 1063-1076. https://doi.org/10.7465/jkdi.2013.24.5.1063
  16. John, G. H., Kohavi, R., and Pfleger, K. (1994), Irrelevant features and the subset selection problem, ICML, 94, 121-129.
  17. Kim, J. H. and Jeong, J. B. (2004), Classification of class-imbalanced data : Effect of over-sampling and under-sampling of training data, The Korean Journal of Applied Statistics, 17(3), 445-457. https://doi.org/10.5351/KJAS.2004.17.3.445
  18. Kubat, M., Holte, R., and Matwin, S. (1997), Learning when negative examples abound, Proceedings of the 9th European Conference on Machine Learning, ECML-97, 146-153.
  19. Koksal, G., Batmaz, I., and Testik, M. C. (2011), A review of data mining applications for quality improvement in manufacturing industry, Expert Systems with Applications, 38(10), 13448-13467. https://doi.org/10.1016/j.eswa.2011.04.063
  20. Lemon, S. C., Roy, J., Clark, M. A., Friedmann, P. D., and Rakowski, W. (2003), Classification and regression tree analysis in public health : methodological review and comparison with logistic regression, Annals of Behavioral Medicine, 26(3), 172-181. https://doi.org/10.1207/S15324796ABM2603_02
  21. Lin, W. J. and Chen, J. J. (2012), Class-imbalanced classifiers for high-dimensional data, Briefings in bioinformatics, 14(1), 13-26. https://doi.org/10.1093/bib/bbs006
  22. Little, R. J. and Rubin, D. B. (2002), Statistical Analysis with Missing Data, 2nd edition, John Wiley and Sons, New York.
  23. Park, J. H. and Byun, J. H. (2002), An analysis method of superlarge manufacturing process data using cleaning and graphical analysis, Journal of the Korean Society for Quality Management, 30(2), 72-85.
  24. Polo, J. L., Berzal, F., and Cubero, J. C. (2006), Taking class importance into account, In Hybrid Information Technology, ICHIT'06. International Conference on, 1, 1-6.
  25. Pyle, D. (1999), Data preparation for data mining, Morgan Kaufmann, San Francisco, USA.
  26. Shmueli, G., Patel, N. R., and Bruce, P. C. (2011), Data Mining for Business Intelligence : Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd edition, Wiley, New York, USA.
  27. Storey, J. D. (2002), A direct approach to false discovery rates, Journal of the Royal Statistical Society : Series B (Statistical Methodology), 64(3).
  28. Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., and Zeileis, A. (2008), Conditional variable importance for random forests, BMC bioinformatics, 9(1), 307. https://doi.org/10.1186/1471-2105-9-307
  29. Van Hulse, J., Khoshgoftaar, T. M., and Napolitano, A. (2007), Experimental perspectives on learning from imbalanced data, In Proceedings of the 24th international conference on Machine learning, 935-942.
  30. Weiss, G. M. and Provost, F. (2001), The effect of class distribution on classifier learning : an empirical study, Technical Report ML-TR-44, Department of Computer Science, Rutgers University.
  31. Zeng, H. and Cheun, T. (2008), Feature selection for clustering high dimensional data, Lecture Notes in Artificial Intelligence, 5351, 913-922.

Cited by

  1. Short-term Wind Farm Power Forecasting Using Multivariate Analysis to Improve Wind Power Efficiency vol.29, pp.7, 2015, https://doi.org/10.5207/JIEIE.2015.29.7.054