Implementation of Efficient Distributed Crawler through Stepwise Crawling Node Allocation

  • Kim, Hyuntae (Cognitive Intelligence Lab., Department of Computer Engineering, Kumoh National Institute of Technology) ;
  • Byun, Junhyung (Undergraduate student, Department of Computer Engineering, Kumoh National Institute of Technology) ;
  • Na, Yoseph (Undergraduate student, Department of Computer Engineering, Kumoh National Institute of Technology) ;
  • Jung, Yuchul (Cognitive Intelligence Lab., Department of Computer Engineering, Kumoh National Institute of Technology)
  • Received : 2020.12.04
  • Accepted : 2020.12.28
  • Published : 2020.12.31

Abstract

The growth of the Internet has produced a large number of websites, and the number of documents distributed through them has grown accordingly. However, collecting newly updated documents quickly is not easy. Web crawling is widely used to continuously collect and manage new documents, but existing single-node crawling systems offer limited performance. Distributed crawlers, in turn, face the problem of managing crawling nodes effectively. This study proposes an efficient distributed crawler based on stepwise crawling node allocation, which identifies the properties of each website and establishes crawling policies from those properties in order to collect a large number of documents from multiple websites. The proposed crawler calculates the number of documents a website contains; by repeatedly visiting the website, it compares data collection time and the amount of data collected as a function of the number of nodes allocated to that website; it then automatically allocates the optimal number of nodes to each website. In an experiment applying the proposed and single-node methods to 12 different websites, the proposed crawler's data collection time decreased significantly compared with that of the single-node crawler. This result is attributed to the proposed crawler applying a data collection policy tailored to each website. The work rate of the proposed model was also confirmed to have increased.
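The idea of allocating nodes step by step can be illustrated with a minimal sketch. It is not the authors' implementation: the stopping rule (stop adding nodes once throughput improves by less than `min_gain`), the `measure` callback, and the `simulated_site` helper are all assumptions introduced here for illustration. In a real system, `measure` would run a trial crawl of one website with the given number of nodes and report the documents collected and the time taken.

```python
from typing import Callable, Tuple

# Hypothetical measurement callback: given a node count, run a trial crawl
# and return (documents_collected, seconds_elapsed).
Measure = Callable[[int], Tuple[int, float]]


def allocate_nodes(measure: Measure, max_nodes: int, min_gain: float = 0.10) -> int:
    """Add nodes one step at a time and stop when throughput
    (documents per second) improves by less than min_gain."""
    docs, secs = measure(1)
    best_nodes, best_rate = 1, docs / secs
    for nodes in range(2, max_nodes + 1):
        docs, secs = measure(nodes)
        rate = docs / secs
        if rate < best_rate * (1 + min_gain):
            break  # the extra node did not pay off; keep the previous count
        best_nodes, best_rate = nodes, rate
    return best_nodes


def simulated_site(saturation: int) -> Measure:
    """Hypothetical stand-in for a real website: collection time shrinks
    with more nodes until `saturation` nodes, then stays flat."""
    def measure(nodes: int) -> Tuple[int, float]:
        effective = min(nodes, saturation)
        return 1000, 100.0 / effective
    return measure


if __name__ == "__main__":
    # A site that stops benefiting after three nodes.
    print(allocate_nodes(simulated_site(3), max_nodes=8))
```

Running the allocation once per website would give each site its own node count, which is the per-site policy the abstract describes; the threshold and the trial-crawl cost are design choices that would need tuning in practice.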

Acknowledgement

This research was supported by Kumoh National Institute of Technology (202001890001).
