Improving the Accuracy of Document Classification by Learning Heterogeneity

  • Received : 2018.07.16
  • Accepted : 2018.08.28
  • Published : 2018.09.30

Abstract

In recent years, the rapid development of internet technology and the popularization of smart devices have produced massive amounts of text data, generated and distributed through media platforms such as the World Wide Web, internet news feeds, microblogs, and social media. This enormous amount of easily obtained information, however, lacks organization, which has drawn many researchers to the problem of managing it and created a need for methods that can classify relevant information automatically; hence text classification was introduced. Text classification, a challenging task in modern data analysis, assigns a text document to one or more predefined categories or classes. Various techniques are available for it, such as K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machines, Decision Trees, and Artificial Neural Networks. When dealing with huge amounts of text data, however, model performance and accuracy become a challenge: the performance of a text classification model varies with the type of words used in the corpus and the type of features created for classification. Most previous attempts propose a new algorithm or modify an existing one, a line of research that has arguably reached its limits. In this study, rather than proposing or modifying an algorithm, we search for a way to modify the use of the data. It is widely known that a classifier's performance is influenced by the quality of the training data on which it is built, and real-world datasets usually contain noise, i.e., noisy data, which can affect the decisions made by classifiers built from them. We consider that data from different domains, that is, heterogeneous data, may carry noise-like characteristics that can be utilized in the classification process. A machine learning algorithm builds a classifier under the assumption that the characteristics of the training data and the target data are the same or very similar. For unstructured data such as text, however, the features are determined by the vocabulary of the documents; if the viewpoints of the training data and the target data differ, the features appearing in the two may differ as well. We therefore attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into the process of constructing it. Because data coming from various sources are likely to be formatted differently, they pose difficulties for traditional machine learning algorithms, which were not developed to recognize different types of data representation at once and combine them in the same generalization. To utilize heterogeneous data in the learning process of the document classifier, we apply semi-supervised learning. Unlabeled data, however, may degrade the performance of the document classifier.
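
To make the noise-injection idea above concrete, the following is a minimal, hypothetical sketch, not the authors' implementation: terms sampled from the vocabulary of a heterogeneous source (say, tweets) are appended to labeled training documents before a standard TF-IDF classifier is fit. The toy data, the function name, and the injection rate are all illustrative assumptions.

```python
# Hypothetical sketch of heterogeneity injection: terms from a
# different-domain vocabulary are appended to training documents as
# artificial "noise" before fitting a standard text classifier.
# Toy data, names, and the injection rate are illustrative assumptions.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def inject_heterogeneous_terms(docs, hetero_vocab, rate=0.1, seed=42):
    """Append randomly sampled different-domain terms to each document."""
    rng = random.Random(seed)
    noisy = []
    for doc in docs:
        n_extra = max(1, int(len(doc.split()) * rate))
        extras = rng.sample(hetero_vocab, min(n_extra, len(hetero_vocab)))
        noisy.append(doc + " " + " ".join(extras))
    return noisy

# Labeled news documents plus a vocabulary drawn from tweets.
train_docs = ["stocks rally on strong earnings", "team wins championship game"]
train_labels = ["economy", "sports"]
tweet_vocab = ["lol", "breaking", "score", "market", "fans"]

noisy_train = inject_heterogeneous_terms(train_docs, tweet_vocab)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(noisy_train, train_labels)
print(model.predict(["market opens higher after earnings report"]))
```

The intent is that the classifier learns a decision boundary that tolerates out-of-domain vocabulary rather than overfitting to the wording of the source domain.
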
We therefore further propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA) to select only the documents that contribute to improving the classifier's accuracy. RSESLA creates multiple views by manipulating the features with different types of classification models and different types of heterogeneous data; the most confident classification rules are then selected and applied for the final decision. In this paper, three different types of real-world data sources are used: news, Twitter, and blogs.
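
As a rough, generic illustration of the confidence-based selection at the heart of RSESLA, the sketch below implements plain self-training with a confidence threshold: only unlabeled documents that the current model classifies with high confidence are promoted into the training set. This is a textbook baseline under assumed parameter choices (threshold, round count, model), not the paper's multi-view algorithm, which additionally combines several classifier types and heterogeneous sources.

```python
# Generic confidence-filtered self-training sketch. RSESLA's rule
# selection is analogous in spirit but operates over multiple views;
# the threshold, round count, and model here are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(labeled_docs, labels, unlabeled_docs,
               threshold=0.9, max_rounds=5):
    docs, ys = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(vec.fit_transform(docs), ys)  # refit on the grown set
        if not pool:
            break
        proba = clf.predict_proba(vec.transform(pool))
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is classified confidently enough; stop
        preds = clf.classes_[proba.argmax(axis=1)]
        # Promote only the confidently classified documents.
        docs += [d for d, keep in zip(pool, confident) if keep]
        ys += [p for p, keep in zip(preds, confident) if keep]
        pool = [d for d, keep in zip(pool, confident) if not keep]
    return vec, clf
```

Discarding low-confidence pseudo-labels is what keeps unlabeled data from dragging accuracy down, the failure mode noted at the end of the first paragraph of the abstract.
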

In recent years, with the development of internet technology and the popularization of smart devices, vast amounts of text data have been pouring out, produced and distributed through various media such as news, blogs, and social media. As it has become easy to acquire such large volumes of information, the need for document classification to manage documents more efficiently has grown sharply. Document classification means assigning a text document to one or more predefined categories or classes, and various techniques are used for it, including K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machines (SVM), Decision Trees, and Artificial Neural Networks. In particular, the performance of a document classification model varies with the words used in the text and the features extracted for classification, and it also depends heavily on the quality of the training data used to build the classifier. Most real-world data, however, contain substantial noise, and the accuracy of a classification model trained on such data is affected by the degree of that noise. This study therefore proposes a way to strengthen the robustness of a document classifier by artificially injecting noise and thereby improve classification accuracy. Specifically, features extracted from a heterogeneous data source with characteristics entirely different from those of the original documents are injected into the original documents as a form of noise, heterogeneity learning is performed, and only those derived classification rules that contribute to improving the classifier's accuracy are selected and applied. Through this rule-selection-based ensemble semi-supervised learning, we aim to improve document classification performance.
