Design and Implementation of a Web Crawler System for Collection of Structured and Unstructured Data

  • Bae, Seong Won (Dept. of Division of Computer Engineering, Dongseo University)
  • Lee, Hyun Dong (Industry Academy Cooperation Foundation, Dongseo University)
  • Cho, DaeSoo (Dept. of Division of Computer Engineering, Dongseo University)
  • Received : 2018.01.12
  • Accepted : 2018.01.23
  • Published : 2018.02.28

Abstract

Recently, consumer services such as low-price shopping, customized advertising, and product recommendation are increasingly being combined with big data. As big data has grown in importance, so has the web crawler that collects data from the web. However, existing web crawlers have two problems. First, when the URL behind a link is hidden, the target page cannot be reached by URL. Second, they are inefficient because they fetch more data than the user wants. In this paper, we therefore use Casper.js, which can control the DOM in a headless browser, to generate DOM events and thereby reach the pages behind hidden links. We also propose an intelligent web crawler system that lets users define steps to fine-tune the collection of both structured and unstructured data, so that only the data they want is retrieved. Finally, we demonstrate the advantages of the proposed crawler system through a performance evaluation that compares it with an existing web crawler.
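
The hidden-link problem can be illustrated with a minimal sketch. The Python snippet below is not the paper's Casper.js implementation. It only assumes that a "hidden" link is an anchor whose `href` is empty, `#`, or a `javascript:` pseudo-URL, and it separates such anchors from links that a plain URL-based crawler can fetch. The class name, function name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Split <a> tags into URL-reachable links and links that need a DOM event."""

    def __init__(self):
        super().__init__()
        self.by_url = []       # hrefs a plain URL-based crawler can fetch directly
        self.needs_event = []  # anchors whose target is only reachable via a DOM event

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = (attr.get("href") or "").strip()
        # Assumption: these three href patterns mark a "hidden" URL.
        hidden = href in ("", "#") or href.lower().startswith("javascript:")
        if hidden:
            # Keep the attributes so a headless browser could later
            # locate the element and fire a click event on it.
            self.needs_event.append(attr)
        else:
            self.by_url.append(href)


def classify_links(html):
    """Return (url_links, event_links) for the given HTML string."""
    parser = LinkClassifier()
    parser.feed(html)
    return parser.by_url, parser.needs_event


# Hypothetical sample page fragment.
SAMPLE = """
<a href="/list?page=2">next</a>
<a href="#" onclick="goDetail(17)">detail</a>
<a href="javascript:void(0)" id="more">more</a>
"""

by_url, needs_event = classify_links(SAMPLE)
```

In this sketch only the first anchor can be followed by URL. The other two would have to be handed to a headless browser that triggers their click handlers, which is the role the abstract assigns to Casper.js.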
