XML Document Retrieval Models for Heterogeneous Data Set using Independent Regular paths

An XML Document Retrieval Model for Searching Heterogeneous Documents Using Independent Query Paths

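  • 유신재 (School of Electrical Engineering and Computer Science, Seoul National University) ;
  • 민경섭 (Cognitive Science, Seoul National University) ;
  • 김형주 (School of Computer Science and Engineering, Seoul National University)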
  • Published : 2003.02.01

Abstract

An XML document may have an irregular structure, which end users find difficult to understand exactly. As a result, they have trouble writing structured queries against such documents and tend to formulate queries with no structure or with only a little structural information. In this context, we propose new retrieval models that use structural information for ranking and compensate for the difference between the structure of the user query and the structure of the document. To simplify querying, we assume independence among the query paths that represent structural constraints. Because this assumption reduces the expressive power of the query language, we also propose a model that overcomes this problem. As there were no test collections for XML documents, we built a small test collection from the TIPSTER collection of TREC and ran experiments on it without structured queries. The experiments showed that our models improve average precision by about 67% over the conventional vector-space model.

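An XML document has tags, and the nesting of these tags expresses its structure. When XML documents have no DTD, or when they are collected from several sources, their structure may be irregular. Users find it hard to know such irregular structure well, and even when they do, they easily make mistakes. Writing an exact structured query over documents with irregular structure is especially difficult. Users therefore write general queries that carry no structure or only a small amount of structural information. For this setting, we propose retrieval models that use structural information to rank documents and compensate for the difference between the structure of the user query and the structure of the documents. To simplify query processing, we assume independence among query paths. Because this assumption can reduce the expressive power of the query language, we also present a query model that resolves this problem. Since no test collection for XML documents existed, we extracted some documents from the TIPSTER collection to build a small test collection and ran unstructured queries on it to show the usefulness of the proposed retrieval models. In the experiments, precision improved by an average of 67% over the vector model.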
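The idea of scoring each query path independently while tolerating structural mismatch can be illustrated with a toy sketch. This is not the authors' model, whose formulas the abstract does not give. It assumes, purely for illustration, that a document is a list of (tag path, text) pairs, that a query is a list of (tag path, term) pairs, and that structural mismatch is penalized with a normalized edit distance over tag names. The names `edit_distance`, `path_similarity`, and `score_document` are hypothetical.

```python
from typing import List, Sequence, Tuple

Path = Tuple[str, ...]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def path_similarity(query_path: Path, doc_path: Path) -> float:
    """Similarity in [0, 1]; an empty query path means no structural constraint."""
    if not query_path:
        return 1.0
    longest = max(len(query_path), len(doc_path))
    return 1.0 - edit_distance(query_path, doc_path) / longest


def score_document(query: List[Tuple[Path, str]],
                   doc: List[Tuple[Path, str]]) -> float:
    """Sum, over independently treated query paths, of the best structural match
    among the document elements that contain the query term."""
    total = 0.0
    for q_path, term in query:
        best = 0.0
        for d_path, text in doc:
            if term.lower() in text.lower().split():
                best = max(best, path_similarity(q_path, d_path))
        total += best  # independence assumption: contributions simply add up
    return total


doc_a = [(("article", "title"), "XML retrieval"),
         (("article", "body", "para"), "vector model")]
doc_b = [(("doc", "title"), "XML news")]
query = [(("article", "title"), "xml")]

print(score_document(query, doc_a))
print(score_document(query, doc_b))
```

In this sketch a document whose structure differs from the query path still receives partial credit rather than being excluded, and a query with empty paths degrades to plain term matching, which mirrors the unstructured queries used in the experiment.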
