HTML Text Extraction Using Frequency Analysis

  • Kim, Jin-Hwan (Department of Computer Science & Engineering, Korea University of Technology and Education)
  • Kim, Eun-Gyung (School of Computer Science & Engineering, Korea University of Technology and Education)
  • Received : 2021.06.25
  • Reviewed : 2021.07.19
  • Published : 2021.09.30

Abstract

Text collection with web crawlers has become common in big data analysis. Web pages, however, are made up of many tags and text fragments, so collecting only the needed text requires specifying in the crawler the HTML tags and style attributes that contain the main text. This paper proposes a method that extracts the main text from the frequency with which text appears across web pages, without specifying HTML tags or style attributes. The method extracts text from the DOM trees of all collected web pages, analyzes how often each text appears, and obtains the main text by excluding text with a high frequency of appearance. Comparing the accuracy of the proposed method with that of existing methods verified the superiority of the proposed method.
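The frequency-based idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the unit of text, the frequency measure, or the cutoff. The sketch assumes that the unit is a single stripped text node, that frequency means the number of collected pages in which that exact text occurs, and that a text is treated as boilerplate when it occurs on more than a given share of pages. The names `TextNodeCollector`, `text_nodes`, `extract_main_texts`, and `max_page_ratio` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect the stripped, non-empty text nodes of one HTML page."""

    # Assumption: script and style elements never hold main text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def text_nodes(html):
    """Return the text nodes of one page in document order."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def extract_main_texts(pages, max_page_ratio=0.5):
    """For each page, keep the text nodes that occur on at most
    max_page_ratio of all pages (hypothetical cutoff)."""
    per_page = [text_nodes(html) for html in pages]

    # Count the number of pages in which each distinct text occurs.
    page_freq = Counter()
    for texts in per_page:
        page_freq.update(set(texts))

    limit = max_page_ratio * len(pages)
    return [[t for t in texts if page_freq[t] <= limit] for texts in per_page]


if __name__ == "__main__":
    pages = [
        "<html><body><div>Home</div><p>Article one body.</p><div>Copyright</div></body></html>",
        "<html><body><div>Home</div><p>Article two body.</p><div>Copyright</div></body></html>",
        "<html><body><div>Home</div><p>Article three body.</p><div>Copyright</div></body></html>",
    ]
    for texts in extract_main_texts(pages):
        print(texts)
```

In this toy input the menu and footer strings occur on every page and are dropped, while each article sentence occurs once and is kept. A cutoff on page share is only one possible reading of "high frequency"; an absolute count or a per-site threshold would fit the abstract equally well.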

Funding Information

This paper was supported by the Education and Research Promotion Program of KOREATECH in 2021.
