Effective Web Crawling Orderings from Graph Search Techniques

Korean title: Efficient Web Crawling Methods Using Graph Search Techniques

  • Published : 2010.02.15

Abstract

Web crawlers are fundamental programs that iteratively download web pages by following links, starting from a small set of initial URLs. Several crawling orderings have been proposed to crawl popular pages ahead of other pages, but some graph search techniques whose properties and efficient implementations have been studied in the graph theory community have not yet been applied to web crawling. In this paper we consider several graph search techniques, including lexicographic breadth-first search, lexicographic depth-first search, and maximum cardinality search, alongside the well-known breadth-first search and depth-first search, and select effective web crawling orderings that have linear time complexity and crawl popular pages early. In particular, for maximum cardinality search and lexicographic breadth-first search, whose linear-time implementations are non-trivial, we propose linear-time web crawling orderings based on the partition refinement method. Experimental results show that maximum cardinality search has desirable properties in both time complexity and the quality of crawled pages.

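A web crawler is a program that automatically downloads web pages by following links on the web, and it is mainly used to study the web environment or to build search engines. Previous studies proposed several methods that let a web crawler crawl popular web pages first, but some graph search techniques studied in graph theory have not yet been considered as web crawling methods. In this paper we consider not only the well-known breadth-first search and depth-first search but also lexicographic breadth-first search, lexicographic depth-first search, and maximum cardinality search as web crawling methods, and look for those that have linear time complexity while still collecting popular web pages efficiently. In particular, for maximum cardinality search and lexicographic breadth-first search, whose linear-time implementations are not straightforward, we present linear-time web crawling methods that use partition refinement. Experimental results show that, compared with the other graph search methods, maximum cardinality search has desirable properties in terms of time complexity and the quality of the crawled pages.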
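To make the idea of a maximum cardinality search (MCS) crawl ordering concrete, here is a minimal sketch in Python. It is not the partition refinement implementation the paper proposes; it assumes a simpler bucket-based formulation in which each uncrawled URL is weighted by the number of already crawled pages that link to it, and the crawler always picks a URL of maximum weight next. The function name `mcs_crawl_order`, the `fetch_links` callback, and the toy `links` graph are hypothetical names introduced only for this illustration, and ties are broken arbitrarily (here, by taking the most recently updated URL).

```python
from collections import defaultdict


def mcs_crawl_order(seeds, fetch_links):
    """Return URLs in a maximum-cardinality-search crawl order.

    fetch_links(url) is a hypothetical callback that returns the
    out-links of a page; a real crawler would download and parse it.
    """
    weight = {}                  # uncrawled URL -> number of crawled pages linking to it
    buckets = defaultdict(dict)  # weight -> insertion-ordered set of URLs (dict keys)
    crawled = set()
    order = []

    for url in seeds:
        if url not in weight:    # ignore duplicate seeds
            weight[url] = 0
            buckets[0][url] = None
    top = 0                      # upper bound on the largest non-empty bucket

    while weight:
        while not buckets[top]:  # move down to the largest non-empty bucket
            top -= 1
        url, _ = buckets[top].popitem()  # ties: most recently inserted URL
        del weight[url]
        crawled.add(url)
        order.append(url)

        # Count each distinct out-link once, keeping page order.
        for target in dict.fromkeys(fetch_links(url)):
            if target in crawled:
                continue
            old = weight.get(target)
            if old is None:
                new = 1          # newly discovered URL
            else:
                del buckets[old][target]
                new = old + 1
            weight[target] = new
            buckets[new][target] = None
            top = max(top, new)
    return order


# Toy link graph (an assumption for illustration only).
links = {"a": ["b", "c"], "c": ["d", "e"], "e": ["d"], "d": [], "b": []}
order = mcs_crawl_order(["a"], lambda url: links.get(url, []))
print(order)  # prints ['a', 'c', 'e', 'd', 'b']
```

In this toy graph, page `d` is crawled before page `b` because two crawled pages link to `d` while only one links to `b`, which is the sense in which the ordering favors popular pages. Each link raises a weight by exactly one, so the bucket pointer moves up at most once per link and the total work stays proportional to the number of pages plus links.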
