Document Classification using Recurrent Neural Network with Word Sense and Contexts

  • Jong-Min Joo (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Nam-Hoon Kim (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Hyung-Jeong Yang (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Hyuk-Ro Park (School of Electronics and Computer Engineering, Chonnam National University)
  • Received : 2018.03.08
  • Accepted : 2018.05.12
  • Published : 2018.07.31

Abstract

In this paper, we propose a method for classifying documents with a Recurrent Neural Network (RNN), using features that reflect both word sense and context. Word2vec is adopted to represent each word in a document as a vector, capturing the meaning and order of the words. Doc2vec is then applied to extract document-level features that take context into account. For classification we use an RNN, which feeds the output of the previous node in as input to the next node; among neural network classifiers it is well suited to sequence data and therefore performs well on document classification. Specifically, we apply the GRU (Gated Recurrent Unit) model, which resolves the vanishing gradient problem of plain RNNs and is also faster to compute. Experiments on one Korean document set and two English document sets show that the GRU-based document classifier improves performance by about 3.5% over a CNN-based document classifier.
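To make the feature-extraction step concrete, the following is a minimal sketch of the idea, not the authors' implementation. It assumes gensim 4.x and a toy tokenized corpus; all parameter values are illustrative.

```python
# Minimal sketch (assumptions: gensim 4.x, toy corpus, illustrative parameters).
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [["stocks", "rose", "on", "earnings"],
        ["team", "wins", "the", "final"]]  # toy tokenized documents

# word2vec: each word gets a dense vector that reflects its meaning,
# learned from the words surrounding it.
w2v = Word2Vec(docs, vector_size=50, window=2, min_count=1, seed=1)
print(w2v.wv["stocks"].shape)  # (50,)

# doc2vec: each whole document additionally gets its own vector, so
# document-level context is captured alongside the word vectors.
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20, seed=1)
print(d2v.dv[0].shape)  # (50,)
```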

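The classification step can be sketched in the same spirit: a document's padded sequence of word vectors is fed to a GRU, whose gating mitigates the vanishing-gradient problem of a plain RNN while needing fewer parameters, and hence less computation, than an LSTM. This sketch assumes TensorFlow/Keras; the layer sizes and random toy inputs are illustrative, not the paper's configuration.

```python
# Minimal GRU-classifier sketch (assumptions: TensorFlow/Keras, toy data).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

# Toy input: 2 documents, each a padded sequence of 4 word vectors of
# dimension 50 (e.g., the word2vec vectors from the sketch above).
x = np.random.rand(2, 4, 50).astype("float32")
y = np.array([0, 1])  # toy class labels

model = Sequential([
    GRU(32, input_shape=(4, 50)),    # recurrent layer over the word sequence
    Dense(2, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, epochs=3, verbose=0)
print(model.predict(x).shape)  # (2, 2): one probability pair per document
```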
