XML Document Retrieval Models for Heterogeneous Data Set using Independent Regular paths

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 30, Issue 1_2 / Pages 140-152 / 2003 / 1229-6848 (pISSN)

Korean Institute of Information Scientists and Engineers (한국정보과학회)

독립적인 질의 경로들을 사용하여 이질적인 문서들을 검색하는 XML 문서 검색 모델

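유신재 (School of Electrical Engineering and Computer Science, Seoul National University) ;
민경섭 (Cognitive Science, Seoul National University) ;
김형주 (School of Computer Science and Engineering, Seoul National University)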

Published : 2003.02.01

Abstract

An XML document may have an irregular structure, which is difficult for end users to comprehend exactly. For such documents, end users have difficulty writing structured queries, so they formulate either unstructured queries or queries that carry only a little structural information. In this context, we propose new retrieval models that use structural information for ranking and compensate for the difference between the structure of the user query and the structure of the document. To ease querying, we assume independence among the query paths that represent structural constraints. Since this assumption reduces the expressive power of the query language, we also propose a model that overcomes this problem. Because no test collection for XML documents existed, we built a small test collection from the TIPSTER collection of TREC and ran experiments on it with unstructured queries. The experiments show that our models improve average precision by about 67% over the conventional vector space model.

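XML documents contain tags, and the nesting of these tags expresses structure. When XML documents have no DTD, or when they are collected from multiple sources, their structure can be irregular. Users find it hard to know such an irregular structure well, and even when they do, they easily make mistakes. Writing an exact structured query against documents with irregular structure is especially difficult. Users therefore write general queries that carry no structure or only a small amount of structural information. For this setting, we propose retrieval models that use structural information to rank documents and that compensate for differences between the structure of the user's query and the structure of the documents. To simplify query processing, we assume independence among query paths. This assumption can reduce the expressive power of the query language, so we also present a query model that resolves this problem. Because no test collection for XML documents has existed so far, we extracted some documents from the TIPSTER collection to build a small test collection and ran unstructured queries against it to show the usefulness of the proposed retrieval models. In the experiments, we obtained an average precision improvement of 67% over the vector model.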
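The reported figure compares average precision between two ranked retrieval runs. As a minimal sketch of how such a comparison can be computed, the Python below implements non-interpolated average precision and a relative-improvement ratio. The abstract does not say which variant of average precision was used, nor whether the 67% figure is a relative gain, so both are assumptions here. The function names, document identifiers, and rankings are hypothetical and are not the paper's data.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of one ranked result list."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative gain of `proposed` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (proposed - baseline) / baseline


# Made-up rankings for illustration only; these are not the paper's data.
relevant_docs = {"d1", "d4"}
baseline_run = ["d2", "d1", "d3", "d4"]  # hypothetical vector-space ranking
proposed_run = ["d1", "d4", "d2", "d3"]  # hypothetical structure-aware ranking

ap_baseline = average_precision(baseline_run, relevant_docs)
ap_proposed = average_precision(proposed_run, relevant_docs)
gain = relative_improvement(ap_baseline, ap_proposed)
```

In a full evaluation the per-query values would be averaged over all queries in the test collection before the two models are compared.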
