An Experimental Study on Topic Distillation Using Web Site Structure

웹 사이트 구조를 이용한 토픽 검색 연구

  • Published: 2007.09.29

Abstract

This study proposes a topic distillation algorithm that ranks the relevant sites selected from retrieved web pages, and evaluates the algorithm's performance. The algorithm calculates the topic score of a site using the site's hierarchical structure. The TREC .GOV test collection and a set of TREC-2004 queries for the topic distillation task were used for the experiment. In the experiment, the algorithm returned at least two relevant sites among the top ten retrieval results. We performed an in-depth analysis of the list of relevant sites provided by TREC-2004 and found that the definition of topic distillation had not been strictly applied in selecting relevant sites. When we re-evaluated the retrieved sites and sub-sites using a revised list of relevant sites, the performance of the proposed algorithm improved significantly.

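Following the definition of topic distillation given by TREC, this study proposes an effective topic distillation algorithm that retrieves web sites relevant to a query, and evaluates its performance experimentally. The algorithm first selects relevant web sites from the web page retrieval results for a query, then uses the structure of each selected site to compute its relevance score for the query. In retrieval experiments using the TREC .GOV test collection and the queries and relevance list of the TREC-2004 experiments, the algorithm retrieved at least two relevant sites within the top ten results, a relatively high level of performance. An analysis of the TREC-2004 relevance list also confirmed that the definition of topic distillation was not always strictly applied when relevant documents were selected. When topic distillation performance was re-evaluated using a revised relevance list, the performance of the proposed algorithm improved markedly.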
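The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea: group page-level retrieval results by site, score each site from its pages using their position in the site hierarchy, and check how many relevant sites appear in the top ten. Treating the host name as the "site", measuring depth as the number of URL path segments, and the `decay` discount are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_key(url):
    """Return the lower-cased host name, used here as a stand-in for 'site' (assumption)."""
    return urlparse(url).netloc.lower()


def page_depth(url):
    """Count non-empty URL path segments; the site root has depth 0 (assumption)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def rank_sites(page_results, decay=0.5):
    """Rank sites by a depth-discounted sum of page scores.

    page_results: iterable of (url, page_score) pairs from any page-level retriever.
    decay: hypothetical per-level discount; pages deeper in the hierarchy count less.
    """
    totals = defaultdict(float)
    for url, score in page_results:
        totals[site_key(url)] += score * (decay ** page_depth(url))
    # Highest-scoring site first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def precision_at_k(ranked_sites, relevant_sites, k=10):
    """Fraction of the top-k ranked sites that appear in the relevant set."""
    top = [site for site, _ in ranked_sites[:k]]
    return sum(1 for site in top if site in relevant_sites) / k


if __name__ == "__main__":
    # Made-up page scores, for illustration only.
    results = [
        ("http://a.gov/", 2.0),
        ("http://a.gov/x/y.html", 4.0),
        ("http://b.gov/z.html", 3.0),
    ]
    ranked = rank_sites(results)
    print(ranked)
    print(precision_at_k(ranked, {"a.gov"}, k=2))
```

Under these assumptions a site whose high-scoring pages sit near its root outranks a site that matches only on deep pages, which is one simple way a hierarchy-aware topic score could behave.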
