Implementation of Efficient Distributed Crawler through Stepwise Crawling Node Allocation

Kim, Hyuntae; Byun, Junhyung; Na, Yoseph; Jung, Yuchul

doi:10.14801/JAITC.2020.10.2.15

Journal of Advanced Information Technology and Convergence (한국정보기술학회 영문논문지)

Volume 10, Issue 2 / Pages 15-31 / 2020 / 2234-1072 (pISSN) / 2234-0963 (eISSN)

Korean Institute of Information Technology (한국정보기술학회)

Kim, Hyuntae (Cognitive Intelligence Lab., Department of Computer Engineering, Kumoh National Institute of Technology) ;
Byun, Junhyung (Undergraduate student, Department of Computer Engineering, Kumoh National Institute of Technology) ;
Na, Yoseph (Undergraduate student, Department of Computer Engineering, Kumoh National Institute of Technology) ;
Jung, Yuchul (Cognitive Intelligence Lab., Department of Computer Engineering, Kumoh National Institute of Technology)

Received : 2020.12.04
Accepted : 2020.12.28
Published : 2020.12.31

https://doi.org/10.14801/JAITC.2020.10.2.15

Abstract

The growth of the Internet has produced a wide variety of websites, and the number of documents distributed through them has increased accordingly. Collecting newly updated documents quickly, however, is not easy. Web crawling is used to collect and manage new documents continuously, but existing crawling systems that run on a single node offer limited performance, and distributed crawlers face the problem of managing crawling nodes effectively. This study proposes an efficient distributed crawler based on stepwise crawling node allocation. It identifies the properties of each website and establishes crawling policies based on those properties, so that a large number of documents can be collected from multiple websites. The proposed crawler calculates the number of documents a website contains, compares, by repeatedly visiting the website, the data collection time and the amount of data collected for different numbers of nodes allocated to it, and automatically allocates the optimal number of nodes to each website. In an experiment applying the proposed method and a single-node method to 12 different websites, the proposed crawler's data collection time decreased significantly compared with that of the single-node crawler, because the proposed crawler applies a data collection policy suited to each website. The work rate of the proposed model was also confirmed to have increased.
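The idea of comparing collection time and collected volume across node counts can be illustrated with a minimal sketch. This is not the authors' implementation: the stopping rule (stop adding nodes once throughput improves by less than a threshold), the names `allocate_nodes`, `run_trial`, `min_gain`, and `simulated_site`, and the toy throughput model are all assumptions made for illustration only.

```python
from typing import Callable, Tuple

# A trial crawl with n nodes returns (documents collected, seconds elapsed).
TrialFn = Callable[[int], Tuple[int, float]]


def allocate_nodes(run_trial: TrialFn, max_nodes: int, min_gain: float = 0.10) -> int:
    """Add nodes one step at a time and stop when throughput stops improving.

    min_gain is a hypothetical threshold: the relative throughput
    improvement that an extra node must deliver to be kept.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")

    docs, seconds = run_trial(1)
    best_nodes, best_rate = 1, docs / seconds

    for n in range(2, max_nodes + 1):
        docs, seconds = run_trial(n)
        rate = docs / seconds  # documents per second with n nodes
        if rate < best_rate * (1 + min_gain):
            break  # the extra node did not pay off; keep the previous count
        best_nodes, best_rate = n, rate

    return best_nodes


def simulated_site(capacity: int) -> TrialFn:
    """Toy stand-in for a website whose throughput saturates at `capacity` nodes."""
    def run_trial(n: int) -> Tuple[int, float]:
        effective = min(n, capacity)
        # Assumed model: 10 documents per second per effective node.
        return 1000, 1000 / (10.0 * effective)
    return run_trial


if __name__ == "__main__":
    print(allocate_nodes(simulated_site(capacity=3), max_nodes=8))
```

For the simulated site that saturates at three nodes, the call returns 3. In a real deployment `run_trial` would launch an actual trial crawl, and the threshold would need tuning per website.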

Acknowledgement

This research was supported by Kumoh National Institute of Technology (202001890001).

