Combining Multiple Sources of Evidence to Enhance Web Search Performance

  • Yang, Kiduk (Department of Library and Information Science, Kyungpook National University)
  • Received : 2014.08.25
  • Accepted : 2014.09.12
  • Published : 2014.09.30

Abstract

The Web is rich in sources of information that go beyond document content, such as hyperlinks and manually classified directories of Web documents (e.g., Yahoo). Past fusion IR studies have repeatedly shown that combining multiple sources of evidence (i.e., fusion) can improve retrieval performance. This research extends those studies by investigating the effects of combining three distinct retrieval approaches for Web IR: the text-based approach that leverages document texts, the link-based approach that leverages hyperlinks, and the classification-based approach that leverages Yahoo categories. Retrieval results of text-, link-, and classification-based methods were combined using variations of the linear combination formula to produce fusion results, which were compared to individual retrieval results using traditional retrieval evaluation metrics. Fusion results were also examined to ascertain the significance of overlap (i.e., the number of systems that retrieve a document) in fusion. The analysis suggests that the solution spaces of text-, link-, and classification-based retrieval methods are diverse enough for fusion to be beneficial. It also reveals important characteristics of the fusion environment, such as the effects of system parameters and the relationship among overlap, document ranking, and relevance.

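The Web is rich in sources of information beyond document content, such as hyperlinks and manually classified Web directories like Yahoo. By fusing a text-based retrieval approach that uses the content of Web documents, a link-based approach that uses hyperlinks, and a classification-based approach that uses Yahoo categories, this study extends earlier fusion retrieval research showing that combining multiple sources of information can improve retrieval performance. Fusion results, produced by combining text-, link-, and classification-based retrieval results with several linear combination formulas, were compared with the individual retrieval results using conventional retrieval evaluation metrics, and the significance of overlap in retrieval results was also examined. Along with the conclusion that the diversity of the solution spaces of text-, link-, and classification-based retrieval points to the suitability of fusion retrieval, the study analyzes important characteristics of the fusion environment, such as the effects of system parameters and the interrelationships among overlap, document ranking, and relevance.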
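
The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which variations of the linear combination formula were used, so the version below is an assumption: a weighted sum of min-max normalized scores, with overlap counted as the number of systems that retrieved each document. The function names, weights, and toy scores are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] so systems are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every retrieved document as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def linear_fusion(runs, weights):
    """Weighted linear combination of per-system scores (assumed formula).

    runs:    {system_name: {doc_id: raw_score}}
    weights: {system_name: weight}
    Returns (doc_id, fused_score, overlap) tuples, best first, where overlap
    is the number of systems that retrieved the document.
    """
    fused = defaultdict(float)
    overlap = defaultdict(int)
    for system, scores in runs.items():
        for doc, norm in min_max_normalize(scores).items():
            fused[doc] += weights[system] * norm
            overlap[doc] += 1
    # Sort by descending fused score; break ties by document id.
    ranked = sorted(fused, key=lambda d: (-fused[d], d))
    return [(d, fused[d], overlap[d]) for d in ranked]


# Hypothetical toy scores for illustration only, not data from the study.
runs = {
    "text": {"d1": 12.0, "d2": 9.0, "d3": 3.0},
    "link": {"d2": 0.8, "d4": 0.5, "d1": 0.2},
    "class": {"d2": 4.0, "d5": 2.0},
}
weights = {"text": 0.5, "link": 0.25, "class": 0.25}

results = linear_fusion(runs, weights)
for doc, score, n in results:
    print(f"{doc}\tscore={score:.3f}\toverlap={n}")
```

In this toy run, the document retrieved by all three systems ranks first even though it is not the top text-based result, which is the kind of relationship between overlap and ranking that the abstract refers to. Documents missing from a system simply contribute nothing for that system; other treatments of missing scores are possible.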
