An Improved Approach to Ranking Web Documents

  • Gupta, Pooja (Dept. of Computer Science and Engineering, I.P. University)
  • Singh, Sandeep K. (Dept. of Computer Science and Engineering, JIIT (deemed to be Univ.))
  • Yadav, Divakar (Dept. of Computer Science and Engineering, JIIT (deemed to be Univ.))
  • Sharma, A.K. (Dept. of Computer Science and Engineering, Y.M.C.A. Univ.)
  • Received : 2012.10.29
  • Accepted : 2013.02.14
  • Published : 2013.06.29

Abstract

Ranking thousands of web documents so that the best matches are returned in response to a user query is a challenging task. Search engines apply ranking mechanisms to the apparently related documents in a result set to decide the order in which they are displayed. Existing ranking mechanisms order a web page based on the number and popularity of the links pointing to it and emerging from it. As a result, search engines sometimes place less relevant documents in the top positions for a user query, so there is a strong need to improve the ranking strategy. In this paper, a novel mechanism is proposed for ranking web documents that considers both the HTML structure of a page and the contextual senses of the keywords present within the page and its back-links. The approach has been tested on data sets of URLs and their back-links covering different topics. The experimental results show that the overall search results returned for user queries are improved. The ordering of links obtained by the proposed mechanism is compared with the ordering produced by the PageRank score. The comparison shows that the proposed mechanism places more contextually related web pages in the top positions than the PageRank score does.
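To make the idea of using HTML structure concrete, the following is a minimal Python sketch, not the authors' implementation. It covers only the structural half of the approach: it scores a page for a keyword by weighting each occurrence according to the tag that encloses it, then orders pages by that score. The tag weights, the names `TAG_WEIGHTS`, `structure_score`, and `rank_pages`, and the sample pages are all illustrative assumptions; the abstract does not state the actual weights, and the contextual-sense and back-link analysis are omitted here.

```python
from html.parser import HTMLParser

# Hypothetical tag weights (assumption): the abstract gives no actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "a": 2.0}
DEFAULT_WEIGHT = 1.0  # weight for text outside any weighted tag


class StructureScorer(HTMLParser):
    """Accumulates a keyword score weighted by the enclosing HTML tag."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.open_tags = []  # stack of currently open weighted tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Count exact whitespace-separated matches of the keyword.
        hits = data.lower().split().count(self.keyword)
        weight = TAG_WEIGHTS[self.open_tags[-1]] if self.open_tags else DEFAULT_WEIGHT
        self.score += hits * weight


def structure_score(html, keyword):
    """Return the tag-weighted occurrence score of keyword in html."""
    scorer = StructureScorer(keyword)
    scorer.feed(html)
    scorer.close()
    return scorer.score


def rank_pages(pages, keyword):
    """Return URLs ordered by descending structure score."""
    return sorted(pages, key=lambda url: structure_score(pages[url], keyword), reverse=True)


# Illustrative pages (made up for this sketch).
pages = {
    "a": "<html><title>jaguar cars</title><body><p>cars</p></body></html>",
    "b": "<html><body><p>jaguar jaguar</p></body></html>",
}
ranking = rank_pages(pages, "jaguar")
```

In this sketch a single occurrence in the title outweighs two occurrences in body text, which is the kind of structural signal a link-only score cannot see. A fuller version would also need to disambiguate keyword senses, for example with a lexical database such as WordNet, and to score the back-link pages.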

Cited by

  1. SAW Classification Algorithm for Chinese Text Classification vol.7, pp.3, 2015, https://doi.org/10.3390/su7032338
  2. A blog ranking algorithm using analysis of both blog influence and characteristics of blog posts vol.18, pp.1, 2015, https://doi.org/10.1007/s10586-013-0337-9