Design and Implementation of a Search Engine based on Apache Spark

  • Received : 2016.10.10
  • Accepted : 2016.10.28
  • Published : 2017.01.31

Abstract

Recently, as the value of data has grown, research on data has been actively conducted. Representative programs for collecting, storing, and exploiting data include web crawlers, databases, and distributed processing systems; among these, the web crawler has recently drawn particular attention because it can be applied in a wide variety of fields. A web crawler can be defined as a tool that traverses web servers in an automated manner, analyzes web pages, and collects URLs. To process the large volume of web pages generated every day as Internet use increases, distributed web crawlers based on Hadoop MapReduce are widely used. MapReduce, however, is difficult to use and imposes constraints on performance. Apache Spark, an in-memory computing platform proposed to overcome these limitations of MapReduce, has become an alternative. A search engine, one of the main applications of a web crawler, displays the results matching a given keyword from the information the crawler has collected. If a search engine is implemented with a Spark-based web crawler instead of a conventional MapReduce-based one, faster data collection can be expected.
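The pipeline the abstract describes — traverse pages, extract URLs, then answer keyword queries over the collected text — can be sketched in plain Python, independently of Spark or MapReduce. Everything here is illustrative and not from the paper: the in-memory `PAGES` dictionary stands in for real HTTP fetches, and `crawl`/`build_index` are hypothetical helper names.

```python
from html.parser import HTMLParser
from collections import deque, defaultdict

# Illustrative in-memory "web": URL -> HTML (a stand-in for real HTTP fetches).
PAGES = {
    "http://a.example/": '<a href="http://b.example/">spark crawler</a>',
    "http://b.example/": '<a href="http://a.example/">mapreduce engine</a>',
}

class LinkParser(HTMLParser):
    """Collects href targets and visible text tokens from one page."""
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]
    def handle_data(self, data):
        self.words += data.lower().split()

def crawl(seeds):
    """BFS traversal: visit each reachable URL once, return url -> word list."""
    seen, frontier, texts = set(seeds), deque(seeds), {}
    while frontier:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:          # unreachable / unfetchable page
            continue
        parser = LinkParser()
        parser.feed(html)
        texts[url] = parser.words
        for link in parser.links:
            if link not in seen:  # enqueue each new URL exactly once
                seen.add(link)
                frontier.append(link)
    return texts

def build_index(texts):
    """Inverted index for the search engine: keyword -> set of URLs."""
    index = defaultdict(set)
    for url, words in texts.items():
        for word in words:
            index[word].add(url)
    return index

index = build_index(crawl(["http://a.example/"]))
print(sorted(index["crawler"]))  # prints ['http://a.example/']
```

In the Spark-based design the paper proposes, the sequential frontier queue above would instead be distributed across workers (e.g. as an RDD of URLs processed per partition), which is where the speed-up over a single-machine or MapReduce-based crawl would come from.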

