Sentence Filtering Dataset Construction Method about Web Corpus

  • Nam, Chung-Hyeon (Department of Computer Engineering, Korea University of Technology and Education)
  • Jang, Kyung-Sik (Department of Computer Engineering, Korea University of Technology and Education)
  • Received : 2021.08.16
  • Accepted : 2021.09.04
  • Published : 2021.11.30

Abstract

Pre-trained models, which achieve high performance on a variety of natural language processing tasks, have the advantage that they learn the linguistic patterns of sentences from a large corpus during training, so that each token in an input sentence can be represented by an appropriate feature vector. One way to construct the corpus required for training such a pre-trained model is to collect sentences with a web crawler. However, because sentences on the web have diverse patterns, some or all of a collected sentence may consist of unnecessary words. In this paper, we propose a method for constructing a dataset used to filter sentences containing unnecessary words from a web-crawled corpus with neural network models. As a result, we constructed a dataset containing a total of 2,330 sentences. We also trained and evaluated neural network models on the constructed dataset; the BERT model showed the highest performance, with an accuracy of 93.75% on the evaluation data.
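The filtering step described in the abstract can be viewed as a binary sentence classifier applied to crawled text. The following is a minimal sketch, not the authors' released code: it assumes a hypothetical BERT checkpoint fine-tuned on a keep/filter labeled dataset like the one described here, and uses the Hugging Face Transformers library cited in the references; the checkpoint path and label scheme are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): filter crawled sentences with a
# fine-tuned BERT binary classifier via Hugging Face Transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint fine-tuned on a keep/filter sentence dataset.
MODEL_DIR = "path/to/bert-sentence-filter"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)  # assumed labels: 0 = keep, 1 = filter
model.eval()

def filter_sentences(sentences):
    """Keep only sentences the classifier predicts as clean (label 0)."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    preds = logits.argmax(dim=-1).tolist()
    return [s for s, p in zip(sentences, preds) if p == 0]

crawled = [
    "An example sentence extracted from the body of a web page.",
    "Share | Print | Login | Copyright (c) all rights reserved",
]
print(filter_sentences(crawled))  # expected to keep only the first sentence
```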

Acknowledgement

This work was supported by the 2020 sabbatical year research grant of KoreaTech.

References

  1. J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, pp. 4171-4186, 2019.
  2. Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," in Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, pp. 5754-5764, 2019.
  3. Wikipedia, Wikipedia Dump Data [Online]. Available: https://www.wikipedia.org/.
  4. P. W. Park, "Text-CNN Based Intent Classification Method for Automatic Input of Intent Sentences in Chatbot," Journal of Korean Institute of Information Technology, vol. 18, no. 1, pp. 19-25, Jan. 2020. https://doi.org/10.14801/jkiit.2020.18.1.19
  5. J. M. Kim and J. H. Lee, "Text Document Classification Based on Recurrent Neural Network Using Word2vec," Journal of Korean Institute of Intelligent Systems, vol. 27, no. 6, pp. 560-565, Dec. 2017. https://doi.org/10.5391/JKIIS.2017.27.6.560
  6. H. J. Jeon and C. Koh, "Text Extraction Algorithm using the HTML Logical Structure Analysis," The KDCS Transactions, vol. 16, no. 3, pp. 445-455, Jun. 2015.
  7. N. Utiu and V. S. Ionescu, "Learning Web Content Extraction with DOM Features," in Proceedings of the 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, pp. 1724-1734, 2018.
  8. B. D. Nguyen-Hoang, B. T. Pham-Hong, Y. Jin, and P. T. V. Le, "Genre-Oriented Web Content Extraction with Deep Convolutional Neural Networks and Statistical Methods," in Proceedings of 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong, pp. 476-485, 2018.
  9. Hugging Face, Transformers [Online]. Available: https://www.github.com/huggingface/.