The impact of inter-host links in crawling important pages early

  • Alam, Hijbul (College of Information and Communications, Korea University) ;
  • Ha, Jong-Woo (College of Information and Communications, Korea University) ;
  • Sim, Kyu-Sun (College of Information and Communications, Korea University) ;
  • Lee, Sang-Keun (College of Information and Communications, Korea University)
  • Published : 2010.06.30

Abstract

The dynamic nature and exponential growth of the World Wide Web make crawling important pages early a continuing challenge. State-of-the-art crawl scheduling algorithms require a large amount of running time to prioritize web pages during crawling. In this research, we propose crawl scheduling algorithms that are not only fast but also download important pages early. The algorithms assign high importance to pages that have good linkage, such as inlinks from different domains or hosts. The proposed algorithms were evaluated on publicly available large datasets. The experimental results show that propagating more importance along inter-host links makes crawl scheduling more effective than current state-of-the-art crawl scheduling algorithms.
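
The idea of favoring inter-host inlinks can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it is a simplified weighted in-link count in which a link arriving from a different host adds more to a page's priority than a link from the same host. The function names (host_of, crawl_order), the weights 2.0 and 1.0, and the in-memory link_graph standing in for actual page downloads are all assumptions made for illustration.

```python
import heapq
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL (empty string if it has none)."""
    return urlsplit(url).hostname or ""


def crawl_order(seeds, link_graph, inter_host_weight=2.0, intra_host_weight=1.0):
    """Return the order in which pages would be fetched.

    link_graph maps a URL to the list of URLs it links to; it stands in
    for downloading and parsing pages. Both weights are illustrative
    assumptions, not values taken from the paper.
    """
    score = {url: 0.0 for url in seeds}
    # Max-heap via negated scores; the counter breaks ties in discovery order.
    heap = [(0.0, i, url) for i, url in enumerate(seeds)]
    heapq.heapify(heap)
    counter = len(seeds)
    fetched = set()
    order = []

    while heap:
        neg_score, _, url = heapq.heappop(heap)
        # Skip pages already fetched and stale entries with an outdated score.
        if url in fetched or -neg_score < score[url]:
            continue
        fetched.add(url)
        order.append(url)

        for target in link_graph.get(url, []):
            if target in fetched:
                continue
            # A link that crosses hosts passes on more importance.
            if host_of(target) != host_of(url):
                weight = inter_host_weight
            else:
                weight = intra_host_weight
            score[target] = score.get(target, 0.0) + weight
            heapq.heappush(heap, (-score[target], counter, target))
            counter += 1

    return order
```

For example, if http://a.example/ links to both http://a.example/x and http://b.example/, the sketch schedules http://b.example/ before http://a.example/x, because the inter-host link carries the larger weight. With equal weights the two pages tie and are fetched in discovery order.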

Keywords